AutoML Survey

An (in-progress) AutoML survey focusing on practical systems.


This project is a community effort to build and maintain an up-to-date, beginner-friendly introduction to AutoML, focusing on practical systems. AutoML is a large field that keeps growing, so we cannot hope to describe every interesting idea or approach. Instead, we focus on practical AutoML systems and spread outwards from there into the methodologies and theoretical concepts that power them. Our intuition is that, although many interesting ideas are still at the research stage, the most mature and battle-tested concepts are those that have been successfully applied to build practical AutoML systems.

To this end, we are building a database of qualitative criteria for all AutoML systems we've heard of. We define an AutoML system as a software project that non-experts in machine learning can use to build effective ML pipelines on at least some common domains and tasks. It doesn't matter whether it is open-source or commercial, or whether it is a library, an application with a GUI, or a cloud service. What matters is that it is intended to be used in practice, as opposed to, say, a reference implementation of a novel AutoML strategy in a Jupyter Notebook.

Features of an AutoML system

For each system, we are creating a system card that describes what are, in our opinion, its most relevant features, from both the scientific and the engineering points of view. Each card is a YAML-based definition. Most of the features are self-explanatory.

💡 Check data/systems/_template.yml for a starting template.

Basic information

Basic facts about the system as a software product.

  • name (str): Name of the system.
  • description (str): A short (2-4 sentences) description of the system.
  • website (str): The URL of the main website or documentation.
  • open_source (bool): Whether the system is open-source.
  • institutions (list[str]): List of businesses or academic institutions that directly support the development of the system, and/or hold intellectual property over it.
  • repository (str): If it's open-source, a link to a public source code repository, otherwise null.
  • license (str): If it's open-source, a license key, otherwise null.
  • references (list[str]): List of links to relevant papers, preferably DOIs or other universal handles, although links to arxiv.org or other repositories are also accepted. Sort the list by relevance, not by date.
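
As a rough illustration, the basic-information fields of a card might look like the fragment below. The system, institution, and URLs are invented placeholders, and the sketch assumes these keys sit at the top level of the file; data/systems/_template.yml remains the authoritative layout.

```yaml
# Hypothetical system: every value here is a placeholder, not a real project.
name: ExampleAutoML
description: >
  ExampleAutoML is a made-up library used only to illustrate the card format.
  It does not exist.
website: https://example.org/example-automl
open_source: true
institutions:
  - Example University
repository: https://example.org/example-automl/source
license: MIT  # assumed format for a "license key"
references:
  - https://example.org/example-automl/paper
```

For a closed-source system, repository and license would both be null, as described in the list above.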

User interfaces

Characteristics describing how the users interact with the system.

  • cli (bool): Whether the system has a command line interface.
  • gui (bool): Whether the system has a graphical user interface.
  • http (bool): Whether the system can be used through an HTTP RESTful API.
  • library (bool): Whether the system can be linked as a code library.
  • programming_languages (list[str]): List of programming languages in which the system can be used, i.e., it is either natively coded in that language or there are maintained bindings (as opposed to using language X's standard way to call code from language Y).
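
A minimal sketch of this section for a hypothetical Python library with a command line tool but no GUI or HTTP API. The spelling of the language name is an assumption; the list above does not fix a naming convention.

```yaml
# Hypothetical values for illustration only.
cli: true
gui: false
http: false
library: true
programming_languages:
  - python  # assumed spelling
```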

Domains

Characteristics describing the domains in which the system can be applied, which roughly correspond to the types of input data that the system can handle.

  • domains (list[str]): Domains in which the system can be deployed. Valid values are:
    • images
    • nlp
    • tabular
    • time_series
  • multi_domain (bool): Whether the system supports multiple domains in a single workflow, e.g., by allowing multiple inputs of different types simultaneously.
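
For instance, a hypothetical system that handles tabular data and text, but only one kind of input per workflow, could be described as follows:

```yaml
# Hypothetical values; only the domain names come from the list above.
domains:
  - tabular
  - nlp
multi_domain: false
```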

Techniques

Characteristics describing the actual models and techniques used in the system, and the underlying ML libraries where those techniques are implemented.

  • techniques (list[str]): List of high-level techniques that are available in the system, broadly classified according to model families. Valid values are:
    • linear_models
    • trees
    • bayesian
    • kernel_machines
    • graphical_models
    • mlp
    • cnn
    • rnn
    • pretrained
    • ensembles
    • ad_hoc ( 📝 indicates non-ML algorithms, e.g., tokenizers)
  • distillation (bool): Whether the system supports model distillation.
  • ml_libraries (list[str]): List of ML libraries that support the system, i.e., where the techniques are actually implemented, if any. Values are free-form strings. Some examples are:
    • scikit-learn
    • keras
    • pytorch
    • nltk
    • spacy
    • transformers
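
A possible techniques section for a hypothetical system built on classical models, with no distillation support, might read:

```yaml
# Hypothetical values; technique names are taken from the list above.
techniques:
  - linear_models
  - trees
  - ensembles
distillation: false
ml_libraries:
  - scikit-learn
```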

Tasks

Characteristics describing the types of tasks, or problems, in which the system can be applied, which roughly correspond to the types of outputs supported.

  • tasks (list[str]): List of high-level tasks the system can perform automatically. Valid values are:
    • classification
    • structured_prediction
    • structured_generation
    • unstructured_generation
    • regression
    • clustering
    • imputation
    • segmentation
    • feature_preprocessing
    • feature_selection
    • data_augmentation
    • dimensionality_reduction
    • data_preprocessing ( 📝 domain-agnostic data preprocessing such as normalization and scaling)
    • domain_preprocessing ( 📝 refers to domain-specific preprocessing, e.g., stemming)
  • multi_task (bool): Whether the system supports multiple tasks in a single workflow, e.g., by allowing multiple output heads from the same neural network.
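
Sketched for a hypothetical system that solves supervised problems and selects features automatically, one task at a time. Treating multi_task as a boolean is an assumption based on its description.

```yaml
# Hypothetical values; task names are taken from the list above.
tasks:
  - classification
  - regression
  - feature_selection
multi_task: false  # assumed to be a boolean
```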

Search strategies

Characteristics describing the optimization/search strategies used for model search and/or hyperparameter tuning.

  • search_strategies (list[str]): List of high-level search strategies that are available in the system. Valid values are:
    • random
    • evolutionary
    • gradient_descent
    • hill_climbing
    • bayesian
    • grid
    • hyperband
    • reinforcement_learning
    • constructive
    • monte_carlo
  • meta_learning (list[str]): If the system includes meta-learning, list of broadly classified techniques used. Valid values are:
    • portfolio
    • warm_start
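
As an illustration, a hypothetical system that combines Bayesian optimization with random search, and warm-starts from earlier runs, might declare:

```yaml
# Hypothetical values; strategy names are taken from the lists above.
search_strategies:
  - bayesian
  - random
meta_learning:
  - warm_start
```

How a system without meta-learning should fill in meta_learning (an empty list or null) is not stated here, so check the template before choosing.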

Search space

Characteristics describing the search space, the types of hyperparameters that can be optimized, and the types of ML pipelines that can be represented in this space.

  • search_space: High-level characteristics of the hyperparameter search space.
    • hierarchical (bool): If there are hyperparameters that only make sense conditioned on others.
    • probabilistic (bool): If the hyperparameter space has an associated probabilistic model.
    • differentiable (bool): If the hyperparameter space can be used for gradient descent.
    • automatic (bool): If the global structure of the hyperparameter space is inferred automatically from, e.g., type annotations or the model's documentation, as opposed to explicitly defined by the developers or the user.
    • hyperparameters (list[str]): Types of hyperparameters that can be optimized. Valid values are:
      • continuous
      • discrete
      • categorical
      • conditional
    • pipelines: Types of pipelines that can be discovered by the AutoML process. Each of the following keys is boolean.
      • single (bool): A single estimator (or model in general)
      • fixed (bool): A fixed pipeline with several, but predefined, steps
      • linear (bool): A variable-length pipeline where each step feeds on the immediately previous output
      • graph (bool): An arbitrarily graph-shaped pipeline where each step can feed on any of the previous outputs
    • robust (bool): Whether the search space contains potentially invalid pipelines that are only discovered when evaluated, e.g., allowing a dense-only estimator to precede a sparse transformer.
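
This is the only deeply nested part of a card, so a sketch of its shape may help. The nesting below follows the indentation of the list above (everything under search_space, with hyperparameters as a list and pipelines as a mapping of booleans); the values describe a hypothetical system.

```yaml
# Hypothetical values; the nesting mirrors the list above.
search_space:
  hierarchical: true
  probabilistic: false
  differentiable: false
  automatic: false
  hyperparameters:
    - continuous
    - discrete
    - categorical
    - conditional
  pipelines:
    single: true
    fixed: true
    linear: false
    graph: false
  robust: false
```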

Software architecture

Other characteristics describing general features of the system as a software product.

  • extensible (bool): Whether the system is designed to be extensible, in the sense that a user can easily add a new type of model, search algorithm, etc., without modifying any part of the system.
  • accessible (bool): Whether the models obtained from the AutoML process can be freely inspected by the user up to the level of individual parameters (e.g., neural network weights).
  • portable (bool): Whether the models obtained can be exported out of the AutoML system, either in a standard format or, at least, in a format native to the underlying ML library, such that they can be deployed on another platform without depending on the AutoML system itself.
  • computational_resources: Computational resources that, if available, can be leveraged by the system.
    • gpu (bool): Whether the system supports GPUs.
    • tpu (bool): Whether the system supports TPUs.
    • cluster (bool): Whether the system supports cluster-based parallelism.
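
A sketch of this last section for a hypothetical CPU-only system that can spread its search across a cluster:

```yaml
# Hypothetical values for illustration only.
extensible: true
accessible: true
portable: true
computational_resources:
  gpu: false
  tpu: false
  cluster: true
```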

How to contribute

If you are an author or a user of any practical AutoML system that roughly fits the previous criteria, we would love to have your contributions. You can add new systems, add information for existing ones, or fix anything that is incorrect.

To do this, either create a new file in data/systems or modify an existing one. Once done, run make check to ensure that your changes are valid with respect to the schema defined in scripts/models.py. If you need to add new fields, or new values to any of the enumerations, feel free to modify the schema as well (and update both data/systems/_template.yml and this README to match).

Once validated, you can open a pull request.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Owner
AutoGOAL
Democratizing Machine Learning