The Fuzzy Labs guide to the universe of open source MLOps

Overview

Open Source MLOps

This is the Fuzzy Labs guide to the universe of free and open source MLOps tools.

Contents

What is MLOps anyway?

MLOps (machine learning operations) is a discipline that helps people to train, deploy and run machine learning models successfully in production environments. Because this is a new and rapidly-evolving field, there are a lot of tools out there, and new ones appear all the time. If we've missed any, then please do raise a pull request!

Data version control

Just like code, data grows and evolves over time. Data versioning tools help you to keep track of these changes.

You might wonder why you can't just store data in Git (or equivalent). There are a few reasons this doesn't work, but the main one is size: Git is designed for small text files, and typical datasets used in machine learning are just too big. Some tools, like DVC, store the data externally, but also integrate with Git so that data versions can be linked to code versions.

  • DVC - one of the most popular general-purpose data versioning tools.
  • Delta Lake - data versioning for data warehouses.
  • LakeFS - Transform your object storage into a Git-like repository.
  • Git LFS - while this doesn't specialise in machine learning use-cases, it's another popular way to version datasets.

Experiment tracking

Machine learning involves a lot of experimentation. We end up training a lot of models, most of which are never intended to go into production, but represent progressive steps towards having something production-worthy. Experiment tracking tools are there to help us keep track of each experiment. What exactly do we need to track? typically this includes the code version, data version, input parameters, training performance metrics, as well as the final model assets.

Model training

Feature stores

Model deployment and serving

Model serving is the process of taking a trained model and presenting it behind a REST API, and this enables other software components to interact with a model. To make deployment of these model servers as simple as possible, it's commonplace to run them inside Docker containers and deploy them to a container orchestration system such as Kubernetes.

Model monitoring

Full stacks

More resources

Here are some more resources for MLOps, both open-source and proprietary.

Comments
  • Data Catalogues

    Data Catalogues

    Just a short snippet on Data Catalogues. Almost started trying to build out a custom one because I didn't know they existed. Therefore feel it's important to make some noise about them.

    Open to feedback, can add more detail once I know if you guys think they fit the MLOps category. Only didn't put them under the Data Governance bin because those that focus on Data Discovery don't have so much to do with it. Happy to shuffle the content over though.

    opened by GeorgePearse 3
  • Add feathr (feature store by Linkedin)

    Add feathr (feature store by Linkedin)

    This PR adds Feathr

    About Feathr:

    Feathr is an open source enterprise-grade, high performance feature store, hosted in incubation in the LF AI & Data Foundation.

    @archena Please let me know if any changes are needed for the description.

    opened by SangamSwadiK 1
  • add new section for Model validation, with trubrics

    add new section for Model validation, with trubrics

    Hello, we have recently launched Trubrics and I'd love to add it to your great list!

    Trubrics helps Data Scientists validate their models, by providing them with a framework to write validations. Validations can be built purely with data science knowledge, or with feedback collected from business users on ML models.

    Thanks :grinning:

    opened by jeffkayne 1
  • Added MLEM and put description for CML

    Added MLEM and put description for CML

    Hello! We have a new MLOPs tool we'd love to add to the awesome-open-mlops!

    You can find the repo here: https://github.com/iterative/mlem Read the blog post: https://iterative.ai/blog/MLEM-release Watch the video: https://youtu.be/7h0fiZNwCnA

    Let me know if you need anything else or would like to collaborate in some way! Best regards...

    opened by mertbozkir 1
  • Added few notes on YDNBB, plus two more MIT-licensed repos

    Added few notes on YDNBB, plus two more MIT-licensed repos

    Added few hopefully interesting repos, all OS focused. I used monitoring for RecList even if (interestingly) is probably in his own category (with CheckList for example): "model testing", or something like that!

    opened by jacopotagliabue 1
  • Update the open source MLOps repo with data annotation

    Update the open source MLOps repo with data annotation

    Picked out the most interesting tools from this repo built by ZenML. Added a new section to our repo for data annotation. Include a link to ZenML’s repo as well.

    opened by osw282 1
  • chore: Add envd

    chore: Add envd

    I'd like to share envd with the community!

    envd is a machine learning development environment for data science and AI/ML engineering teams.

    🐍 No Docker, only Python - Focus on writing Python code, we will take care of Docker and development environment setup.

    🖨️ Built-in Jupyter/VSCode - First-class support for Jupyter and VSCode remote extension.

    ⏱️ Save time - Better cache management to save your time, keep the focus on the model, instead of dependencies.

    ☁️ Local & cloud - envd integrates seamlessly with Docker so that you can easily share, version, and publish envd environments with Docker Hub or any other OCI image registries.

    🔁 Repeatable builds & reproducible results - You can reproduce the same dev environment on your laptop, public cloud VMs, or Docker containers, without any change in setup.

    Signed-off-by: Ce Gao [email protected]

    opened by gaocegege 1
  • Adds a section for model registries

    Adds a section for model registries

    👋 Hello!

    This PR adds a new section for model registries and I've added both ML Flow and modelstore to it. Disclosure: I'm the author of the latter.

    When reading through the other sections, I saw that this might overlap slightly with experiment tracking, which is described as:

    What exactly do we need to track? typically this includes the code version, data version, input parameters, training performance metrics, as well as the final model assets.

    Happy to change this around as you see fit.

    Thank you for considering this contribution 🙏

    opened by nlathia 1
  • Adds Hamilton and feature engineering section

    Adds Hamilton and feature engineering section

    Hamilton was created to help wrangle a feature engineering code base. It forces decoupling of feature transform logic from materialization, and results in code that is always unit testable, reusable, and documentation friendly.

    I didn't see an appropriate section to add it, so I created a feature engineering section - putting it in feature stores wouldn't be the right place.

    opened by skrawcz 1
  • Image Analysis Tools

    Image Analysis Tools

    I've synced the forks. I have no idea why my previous data catalogue addition is listed again as a new change? Will correct that if anyone can tell me where I've gone wrong.

    I've hit my limit of GitHub lists so will be making greater use of this repo. Adding in image analysis tools (fiftyone - the dominant player, and dendromap - looks very powerful).

    Will add links, and try to get round to adding descriptions for data catalogues soon

    opened by GeorgePearse 1
  • Deepchecks library

    Deepchecks library

    Deepchecks is open source tool for testing and validating machine learning models and data. The 3 checks supported in different phases of ML pipeline are

    • Data Integrity (between ingestion and preprocessing step)
    • Train-Test Validation (Distribution and Methodology Checks) (between preprocessing and training step)
    • Model Performance Evaluation (evaluation step)
    opened by dudeperf3ct 0
Releases(v0.1.0-alpha)
  • v0.1.0-alpha(Dec 13, 2021)

    Initial release of the Awesome Open Source MLOps list.

    • Data version control
    • Experiment tracking
    • Model training
    • Feature stores
    • Model deployment and serving
    • Model monitoring
    • Full stacks
    Source code(tar.gz)
    Source code(zip)
Owner
Fuzzy Labs
MLOps done right
Fuzzy Labs
Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

(intron I nterrogator and C lassifier) intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), usi

Graham Larue 4 Jul 26, 2022
ml4ir: Machine Learning for Information Retrieval

ml4ir: Machine Learning for Information Retrieval | changelog Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe ml4ir is an open source li

Salesforce 77 Jan 06, 2023
scikit-multimodallearn is a Python package implementing algorithms multimodal data.

scikit-multimodallearn is a Python package implementing algorithms multimodal data. It is compatible with scikit-learn, a popul

12 Jun 29, 2022
Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

2 Jun 14, 2022
LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022
Scikit-Garden or skgarden is a garden for Scikit-Learn compatible decision trees and forests.

Scikit-Garden or skgarden (pronounced as skarden) is a garden for Scikit-Learn compatible decision trees and forests.

260 Dec 21, 2022
Given the names and grades for each student in a class N of students, store them in a nested list and print the name(s) of any student(s) having the second lowest grade.

Hackerank-Nested-List Given the names and grades for each student in a class N of students, store them in a nested list and print the name(s) of any s

Sangeeth Mathew John 2 Dec 14, 2021
Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.

Time series analysis today is an important cornerstone of quantitative science in many disciplines, including natural and life sciences as well as eco

Christoph Mark 129 Dec 24, 2022
Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas.

Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas. Its objective is to ex

Taylor G Smith 54 Aug 20, 2022
fMRIprep Pipeline To Machine Learning

fMRIprep Pipeline To Machine Learning(Demo) 所有配置均在config.py文件下定义 前置环境(lilab) 各个节点均安装docker,并有fmripre的镜像 可以使用conda中的base环境(相应的第三份包之后更新) 1. fmriprep scr

Alien 3 Mar 08, 2022
Flightfare-Prediction - It is a Flightfare Prediction Web Application Using Machine learning,Python and flask

Flight_fare-Prediction It is a Flight_fare Prediction Web Application Using Machine learning,Python and flask Using Machine leaning i have created a F

1 Dec 06, 2022
flexible time-series processing & feature extraction

A corona statistics and information telegram bot.

PreDiCT.IDLab 206 Dec 28, 2022
A repository for collating all the resources such as articles, blogs, papers, and books related to Bayesian Statistics.

A repository for collating all the resources such as articles, blogs, papers, and books related to Bayesian Statistics.

Aayush Malik 80 Dec 12, 2022
Python package for concise, transparent, and accurate predictive modeling

Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. 📚 docs • 📖 demo notebooks Modern

Chandan Singh 983 Jan 01, 2023
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Horovod Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make dis

Horovod 12.9k Jan 07, 2023
An easier way to build neural search on the cloud

Jina is geared towards building search systems for any kind of data, including text, images, audio, video and many more. With the modular design & multi-layer abstraction, you can leverage the effici

Jina AI 17k Jan 01, 2023
CrayLabs and user contibuted examples of using SmartSim for various simulation and machine learning applications.

SmartSim Example Zoo This repository contains CrayLabs and user contibuted examples of using SmartSim for various simulation and machine learning appl

Cray Labs 14 Mar 30, 2022
Python module for performing linear regression for data with measurement errors and intrinsic scatter

Linear regression for data with measurement errors and intrinsic scatter (BCES) Python module for performing robust linear regression on (X,Y) data po

Rodrigo Nemmen 56 Sep 27, 2022
Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

formulae formulae is a Python library that implements Wilkinson's formulas for mixed-effects models. The main difference with other implementations li

34 Dec 21, 2022
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by

Robustness Gym 115 Dec 12, 2022