A benchmark of data-centric tasks from across the machine learning lifecycle.

Overview
banner

GitHub Workflow Status GitHub Documentation Status pre-commit PyPI - Python Version codecov

A benchmark of data-centric tasks from across the machine learning lifecycle.

Getting Started | What is dcbench? | Docs | Contributing | Website | About

⚡️ Quickstart

pip install dcbench

Optional: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like pip install dcbench[dev] instead. See setup.py for a full list of optional dependencies.

Installing from dev: pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/[email protected]"

Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:

import dcbench
dcbench.tasks

To learn more, follow the walkthrough in the docs.

💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. This includes feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data – not training models. dcbench supports a growing list of them:

dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data cleaning task. However, we think it important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse, but related, tasks. In dcbench all of these tasks are structured in a similar manner and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.

✉️ About

dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu if you would like to get involved or contribute!)

You might also like...
Data science, Data manipulation and Machine learning package.
Data science, Data manipulation and Machine learning package.

duality Data science, Data manipulation and Machine learning package. Use permitted according to the terms of use and conditions set by the attached l

Data Version Control or DVC is an open-source tool for data science and machine learning projects
Data Version Control or DVC is an open-source tool for data science and machine learning projects

Continuous Machine Learning project integration with DVC Data Version Control or DVC is an open-source tool for data science and machine learning proj

A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.
A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

A library of extension and helper modules for Python's data analysis and machine learning libraries.
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Comments
  •  No module named 'dcbench.tasks.budgetclean.cpclean'

    No module named 'dcbench.tasks.budgetclean.cpclean'

    After installing dcbench in Google colab environment, the above error was thrown for import dcbench. Full error traceback,

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-8-a1030f6d7ef9> in <module>()
          1 
    ----> 2 import dcbench
          3 dcbench.tasks
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/dcbench/__init__.py in <module>()
         13 )
         14 from .config import config
    ---> 15 from .tasks.budgetclean import BudgetcleanProblem
         16 from .tasks.minidata import MiniDataProblem
         17 from .tasks.slice_discovery import SliceDiscoveryProblem
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/__init__.py in <module>()
          3 from ...common import Task
          4 from ...common.table import Table
    ----> 5 from .baselines import cp_clean, random_clean
          6 from .common import Preprocessor
          7 from .problem import BudgetcleanProblem, BudgetcleanSolution
    
    /usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/baselines.py in <module>()
          6 from ...common.baseline import baseline
          7 from .common import Preprocessor
    ----> 8 from .cpclean.algorithm.select import entropy_expected
          9 from .cpclean.algorithm.sort_count import sort_count_after_clean_multi
         10 from .cpclean.clean import CPClean, Querier
    
    ModuleNotFoundError: No module named 'dcbench.tasks.budgetclean.cpclean'
    

    !pip install dcbench gave the following log

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
    flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.0.3 which is incompatible.
    datascience 0.10.6 requires coverage==3.7.1, but you have coverage 6.2 which is incompatible.
    datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
    coveralls 0.5 requires coverage<3.999,>=3.6, but you have coverage 6.2 which is incompatible.
    Successfully installed SecretStorage-3.3.1 aiohttp-3.8.1 aiosignal-1.2.0 antlr4-python3-runtime-4.8 async-timeout-4.0.2 asynctest-0.13.0 black-21.12b0 cfgv-3.3.1 click-8.0.3 colorama-0.4.4 commonmark-0.9.1 coverage-6.2 cryptography-36.0.1 cytoolz-0.11.2 dataclasses-0.6 datasets-1.17.0 dcbench-0.0.4 distlib-0.3.4 docformatter-1.4 flake8-4.0.1 frozenlist-1.2.0 fsspec-2021.11.1 future-0.18.2 fuzzywuzzy-0.18.0 fvcore-0.1.5.post20211023 huggingface-hub-0.2.1 identify-2.4.1 importlib-metadata-4.2.0 iopath-0.1.9 isort-5.10.1 jeepney-0.7.1 jsonlines-3.0.0 keyring-23.4.0 livereload-2.6.3 markdown-3.3.4 mccabe-0.6.1 meerkat-ml-0.2.3 multidict-5.2.0 mypy-extensions-0.4.3 nbsphinx-0.8.8 nodeenv-1.6.0 omegaconf-2.1.1 parameterized-0.8.1 pathspec-0.9.0 pkginfo-1.8.2 platformdirs-2.4.1 pluggy-1.0.0 portalocker-2.3.2 pre-commit-2.16.0 progressbar-2.5 pyDeprecate-0.3.1 pycodestyle-2.8.0 pyflakes-2.4.0 pytest-6.2.5 pytest-cov-3.0.0 pytorch-lightning-1.5.7 pyyaml-6.0 readme-renderer-32.0 recommonmark-0.7.1 requests-toolbelt-0.9.1 rfc3986-1.5.0 sphinx-autobuild-2021.3.14 sphinx-rtd-theme-1.0.0 torchmetrics-0.6.2 twine-3.7.1 typed-ast-1.5.1 ujson-5.1.0 untokenize-0.1.1 virtualenv-20.12.1 xxhash-2.0.2 yacs-0.1.8 yarl-1.7.2
    WARNING: The following packages were previously imported in this runtime:
      [pydevd_plugins]
    You must restart the runtime in order to use newly installed versions.
    

    python version : 3.7.12 platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

    opened by mathav95raj 2
  • Slice discovery problem p_72411 misses files

    Slice discovery problem p_72411 misses files

    Hi,

    Thanks for this great tool!

    I'm loading slice discovery problems, however, the problem p_72411 misses files. Can you fix this SD problem?

    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.dcbench/slice_discovery/problem/artifacts/p_72411/test_predictions.mk/meta.yaml'
    
    opened by duguyue100 0
Releases(v-0.0.1-beta)
Implementations of Machine Learning models, Regularizers, Optimizers and different Cost functions.

Linear Models Implementations of LinearRegression, LassoRegression and RidgeRegression with appropriate Regularizers and Optimizers. Linear Regression

Keivan Ipchi Hagh 1 Nov 22, 2021
Fourier-Bayesian estimation of stochastic volatility models

fourier-bayesian-sv-estimation Fourier-Bayesian estimation of stochastic volatility models Code used to run the numerical examples of "Bayesian Approa

15 Jun 20, 2022
An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models

Seldon Core: Blazing Fast, Industry-Ready ML An open source platform to deploy your machine learning models on Kubernetes at massive scale. Overview S

Seldon 3.5k Jan 01, 2023
ArviZ is a Python package for exploratory analysis of Bayesian models

ArviZ (pronounced "AR-vees") is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, model checking, comparison and diagnostics

ArviZ 1.3k Jan 05, 2023
A Pythonic framework for threat modeling

pytm: A Pythonic framework for threat modeling Introduction Traditional threat modeling too often comes late to the party, or sometimes not at all. In

Izar Tarandach 644 Dec 20, 2022
QML: A Python Toolkit for Quantum Machine Learning

QML is a Python2/3-compatible toolkit for representation learning of properties of molecules and solids.

176 Dec 09, 2022
The code from the Machine Learning Bookcamp book and a free course based on the book

The code from the Machine Learning Bookcamp book and a free course based on the book

Alexey Grigorev 5.5k Jan 09, 2023
PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

Lensa 1.1k Jan 04, 2023
Predicting India’s COVID-19 Third Wave with LSTM

Predicting India’s COVID-19 Third Wave with LSTM Complete project of predicting new COVID-19 cases in the next 90 days with LSTM India is seeing a ste

Samrat Dutta 4 Jan 27, 2022
A repository to work on Machine Learning course. Select an algorithm to classify writer's gender, of Hebrew texts.

MachineLearning A repository to work on Machine Learning course. Select an algorithm to classify writer's gender, of Hebrew texts. Tested algorithms:

Haim Adrian 1 Feb 01, 2022
A complete guide to start and improve in machine learning (ML)

A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2021 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art

Louis-François Bouchard 3.3k Jan 04, 2023
A single Python file with some tools for visualizing machine learning in the terminal.

Machine Learning Visualization Tools A single Python file with some tools for visualizing machine learning in the terminal. This demo is composed of t

Bram Wasti 35 Dec 29, 2022
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023
ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions

ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions, in particular, the posterior distributions of Bayesian models in

Computational Data Science Lab 182 Dec 31, 2022
Python package for causal inference using Bayesian structural time-series models.

Python Causal Impact Causal inference using Bayesian structural time-series models. This package aims at defining a python equivalent of the R CausalI

Thomas Cassou 219 Dec 11, 2022
Iris-Heroku - Putting a Machine Learning Model into Production with Flask and Heroku

Puesta en Producción de un modelo de aprendizaje automático con Flask y Heroku L

Jesùs Guillen 1 Jun 03, 2022
AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker

Data Science on AWS - O'Reilly Book Get the book on Amazon.com Book Outline Quick Start Workshop (4-hours) In this quick start hands-on workshop, you

Data Science on AWS 2.8k Jan 03, 2023
A Python package to preprocess time series

Disclaimer: This package is WIP. Do not take any APIs for granted. tspreprocess Time series can contain noise, may be sampled under a non fitting rate

Maximilian Christ 57 Dec 17, 2022
jaxfg - Factor graph-based nonlinear optimization library for JAX.

Factor graphs + nonlinear optimization in JAX

Brent Yi 134 Dec 21, 2022
Interactive Parallel Computing in Python

Interactive Parallel Computing with IPython ipyparallel is the new home of IPython.parallel. ipyparallel is a Python package and collection of CLI scr

IPython 2.3k Dec 30, 2022