The Fuzzy Labs guide to the universe of open source MLOps

Last update: Dec 29, 2022

Overview

Open Source MLOps

This is the Fuzzy Labs guide to the universe of free and open source MLOps tools.

What is MLOps, anyway?
Data version control
Experiment tracking
Model training
Feature stores
Model deployment and serving
Model monitoring
Full stacks
More resources

What is MLOps, anyway?

MLOps (machine learning operations) is a discipline that helps people to train, deploy and run machine learning models successfully in production environments. Because this is a new and rapidly-evolving field, there are a lot of tools out there, and new ones appear all the time. If we've missed any, then please do raise a pull request!

Data version control

Just like code, data grows and evolves over time. Data versioning tools help you to keep track of these changes.

You might wonder why you can't just store data in Git (or equivalent). There are a few reasons this doesn't work, but the main one is size: Git is designed for small text files, and typical datasets used in machine learning are just too big. Some tools, like DVC, store the data externally, but also integrate with Git so that data versions can be linked to code versions.

DVC - one of the most popular general-purpose data versioning tools.
Delta Lake - data versioning for data warehouses.
LakeFS - transforms your object storage into a Git-like repository.
Git LFS - while this doesn't specialise in machine learning use-cases, it's another popular way to version datasets.
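
To make the "store the data externally, link it from Git" idea concrete, here is a minimal Python sketch. It is not how DVC or any other tool listed above is implemented; the function names (file_digest, track, restore) and the ".pointer.json" file format are hypothetical, chosen only for illustration. The large file is copied into an external store under its content hash, and only a small pointer file, which is cheap to commit to Git, stays next to the code.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_path, store_dir):
    """Copy the data into an external store and write a small pointer file.

    The pointer file (hypothetical format) is what would be committed to Git.
    """
    data_path, store_dir = Path(data_path), Path(store_dir)
    digest = file_digest(data_path)
    store_dir.mkdir(parents=True, exist_ok=True)
    stored = store_dir / digest
    if not stored.exists():  # identical content is stored only once
        shutil.copyfile(data_path, stored)
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    meta = {"sha256": digest, "size": data_path.stat().st_size}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer_path, store_dir, target_path):
    """Recreate the data file that a given pointer file refers to."""
    meta = json.loads(Path(pointer_path).read_text())
    shutil.copyfile(Path(store_dir) / meta["sha256"], target_path)
    return Path(target_path)
```

Checking out an older commit brings back the older pointer file, and restore() then fetches the matching data version from the store. That is the sense in which a data version is linked to a code version.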

Experiment tracking

Machine learning involves a lot of experimentation. We end up training many models, most of which are never intended to go into production but represent progressive steps towards something production-worthy. Experiment tracking tools help us keep track of each experiment. What exactly do we need to track? Typically this includes the code version, data version, input parameters, training performance metrics, and the final model assets.
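
As a rough illustration of what a single tracked experiment might contain, the Python sketch below records the items listed above in one structure. It does not use the API of any particular tracking tool; ExperimentRun, log_metric and best_run are hypothetical names invented for this example.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRun:
    """One training run and everything needed to reproduce or compare it."""
    code_version: str                 # e.g. a Git commit hash
    data_version: str                 # e.g. a dataset tag or content hash
    params: dict                      # input parameters / hyperparameters
    metrics: dict = field(default_factory=dict)    # training performance metrics
    artifacts: list = field(default_factory=list)  # paths to final model assets
    started_at: float = field(default_factory=time.time)

    def log_metric(self, name, value):
        self.metrics[name] = value

    def to_json(self):
        """Serialise the run so it can be stored alongside other runs."""
        return json.dumps(asdict(self), sort_keys=True)


def best_run(runs, metric, higher_is_better=True):
    """Pick the run with the best value for a metric, ignoring runs without it."""
    scored = [run for run in runs if metric in run.metrics]
    pick = max if higher_is_better else min
    return pick(scored, key=lambda run: run.metrics[metric])
```

Real tracking tools add storage, a UI and comparison features on top, but the record they keep per run is broadly of this shape.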

Model training

Feature stores

Feast

Model deployment and serving

Model serving is the process of taking a trained model and presenting it behind a REST API so that other software components can interact with it. To make deploying these model servers as simple as possible, it's common to run them inside Docker containers and deploy them to a container orchestration system such as Kubernetes.
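
The sketch below shows the basic shape of a model server using only the Python standard library. It is an illustration under stated assumptions, not a production setup: predict() is a hypothetical stand-in for a trained model (it just averages its inputs), and the /predict route and JSON payload format are made up for this example. Dedicated serving tools handle batching, scaling and model loading, which this sketch ignores.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: the mean of the inputs."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_model(port, features):
    """What another software component would do: POST JSON, read JSON back."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/predict", body=json.dumps({"features": features}),
                 headers={"Content-Type": "application/json"})
    result = json.loads(conn.getresponse().read())
    conn.close()
    return result
```

For example, after server = start_server(), calling call_model(server.server_address[1], [1.0, 2.0, 3.0]) sends the features to the model and returns the JSON response. In a containerised deployment, a process like this would be the container's entry point.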

Model monitoring

Full stacks

More resources

Here are some more resources for MLOps, both open-source and proprietary.

Top 10 Open Source MLOps Tools
Awesome MLOps - a mixture of open source and proprietary tools and platforms.
Best open source MLOps tools

Comments

Data Catalogues

Just a short snippet on Data Catalogues. Almost started trying to build out a custom one because I didn't know they existed. Therefore feel it's important to make some noise about them.

Open to feedback, can add more detail once I know if you guys think they fit the MLOps category. Only didn't put them under the Data Governance bin because those that focus on Data Discovery don't have so much to do with it. Happy to shuffle the content over though.

opened by GeorgePearse 3
Add feathr (feature store by Linkedin)

This PR adds Feathr

About Feathr:

Feathr is an open source enterprise-grade, high performance feature store, hosted in incubation in the LF AI & Data Foundation.

@archena Please let me know if any changes are needed for the description.

opened by SangamSwadiK 1
add new section for Model validation, with trubrics

Hello, we have recently launched Trubrics and I'd love to add it to your great list!

Trubrics helps Data Scientists validate their models, by providing them with a framework to write validations. Validations can be built purely with data science knowledge, or with feedback collected from business users on ML models.

Thanks :grinning:

opened by jeffkayne 1
Added MLEM and put description for CML

Hello! We have a new MLOps tool we'd love to add to the awesome-open-mlops!

You can find the repo here: https://github.com/iterative/mlem Read the blog post: https://iterative.ai/blog/MLEM-release Watch the video: https://youtu.be/7h0fiZNwCnA

Let me know if you need anything else or would like to collaborate in some way! Best regards...

opened by mertbozkir 1
Added few notes on YDNBB, plus two more MIT-licensed repos

Added a few hopefully interesting repos, all OS focused. I used monitoring for RecList, even though (interestingly) it probably belongs in its own category (with CheckList, for example): "model testing", or something like that!

opened by jacopotagliabue 1
Update the open source MLOps repo with data annotation

Picked out the most interesting tools from this repo built by ZenML. Added a new section to our repo for data annotation. Included a link to ZenML’s repo as well.

opened by osw282 1
chore: Add envd

I'd like to share envd with the community!

envd is a machine learning development environment for data science and AI/ML engineering teams.

🐍 No Docker, only Python - Focus on writing Python code, we will take care of Docker and development environment setup.

🖨️ Built-in Jupyter/VSCode - First-class support for Jupyter and VSCode remote extension.

⏱️ Save time - Better cache management to save your time, keep the focus on the model, instead of dependencies.

☁️ Local & cloud - envd integrates seamlessly with Docker so that you can easily share, version, and publish envd environments with Docker Hub or any other OCI image registries.

🔁 Repeatable builds & reproducible results - You can reproduce the same dev environment on your laptop, public cloud VMs, or Docker containers, without any change in setup.

Signed-off-by: Ce Gao [email protected]

opened by gaocegege 1
Adds a section for model registries

👋 Hello!

This PR adds a new section for model registries and I've added both ML Flow and modelstore to it. Disclosure: I'm the author of the latter.

When reading through the other sections, I saw that this might overlap slightly with experiment tracking, which is described as:

What exactly do we need to track? typically this includes the code version, data version, input parameters, training performance metrics, as well as the final model assets.

Happy to change this around as you see fit.

Thank you for considering this contribution 🙏

opened by nlathia 1
Adds Hamilton and feature engineering section

Hamilton was created to help wrangle a feature engineering code base. It forces decoupling of feature transform logic from materialization, and results in code that is always unit testable, reusable, and documentation friendly.

I didn't see an appropriate section to add it, so I created a feature engineering section - putting it in feature stores wouldn't be the right place.

opened by skrawcz 1
Image Analysis Tools

I've synced the forks. I have no idea why my previous data catalogue addition is listed again as a new change? Will correct that if anyone can tell me where I've gone wrong.

I've hit my limit of GitHub lists so will be making greater use of this repo. Adding in image analysis tools (fiftyone - the dominant player, and dendromap - looks very powerful).

Will add links, and try to get round to adding descriptions for data catalogues soon

opened by GeorgePearse 1
Deepchecks library
Deepchecks is an open source tool for testing and validating machine learning models and data. It supports three kinds of check, each for a different phase of the ML pipeline:

Data Integrity (between the ingestion and preprocessing steps)

Train-Test Validation (distribution and methodology checks) (between the preprocessing and training steps)

Model Performance Evaluation (the evaluation step)
opened by dudeperf3ct 0

Releases (v0.1.0-alpha)

v0.1.0-alpha (Dec 13, 2021)
Initial release of the Awesome Open Source MLOps list.

Data version control

Experiment tracking

Model training

Feature stores

Model deployment and serving

Model monitoring

Full stacks


Owner

Fuzzy Labs

MLOps done right

GitHub Repository
