Lightweight Python library for fast and reproducible experimentation 🔬

Overview

Steppy

What is Steppy?

  1. Steppy is a lightweight, open-source Python 3 library for fast and reproducible experimentation.
  2. Steppy lets data scientists focus on data science, not on software development issues.
  3. Steppy's minimal interface does not impose constraints, yet it enables clean machine learning pipeline design.

What problem does steppy solve?

Problems

In the course of a project, a data scientist faces two problems:

  1. Difficulties with reproducibility in data science / machine learning projects.
  2. Lack of the ability to prepare or extend experiments quickly.

Solution

Steppy addresses both problems by introducing two simple abstractions: Step and Transformer. We consider them a minimal interface for building machine learning pipelines.

  1. Step is a wrapper over the transformer that handles multiple aspects of pipeline execution, such as saving intermediate results (if needed), checkpointing the model during training, and much more.
  2. Transformer, in turn, is a purely computational, data scientist-defined piece that takes some input data and produces some output data. Typical Transformers are neural networks, machine learning algorithms, and pre- or post-processing routines.
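
The split between the two abstractions can be illustrated with a small, self-contained sketch. It deliberately does not import steppy: MeanCenterer, SketchStep, the 'outputs' subdirectory and the persist_output flag as used here are hypothetical stand-ins chosen for illustration, so steppy's real classes and signatures may differ.

```python
import pickle
import tempfile
from pathlib import Path


class MeanCenterer:
    """Transformer role: pure computation, no knowledge of files or caching."""

    def __init__(self):
        self.mean = None

    def fit(self, X):
        # Learn the only parameter this transformer has.
        self.mean = sum(X) / len(X)
        return self

    def transform(self, X):
        # Return a dictionary so downstream consumers can pick outputs by key.
        return {'X': [x - self.mean for x in X]}


class SketchStep:
    """Step role: wraps a transformer and takes care of execution concerns."""

    def __init__(self, name, transformer, experiment_directory, persist_output=False):
        self.name = name
        self.transformer = transformer
        # Hypothetical layout: one file per step under <experiment_directory>/outputs.
        self.output_dir = Path(experiment_directory) / 'outputs'
        self.persist_output = persist_output

    def fit_transform(self, data):
        output = self.transformer.fit(**data).transform(**data)
        if self.persist_output:
            # Saving intermediate results is the wrapper's job, not the transformer's.
            self.output_dir.mkdir(parents=True, exist_ok=True)
            with open(self.output_dir / self.name, 'wb') as f:
                pickle.dump(output, f)
        return output


experiment_directory = tempfile.mkdtemp()
step = SketchStep('centering', MeanCenterer(), experiment_directory, persist_output=True)
result = step.fit_transform({'X': [1.0, 2.0, 3.0]})
print(result)
```

Because the transformer only computes, it can be swapped for another one without touching the persistence logic in the wrapper.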

Start using steppy

Installation

Steppy requires Python 3.5 or above.

pip3 install steppy

(You will probably want to install it in a virtualenv.)

Resources

  1. 📒 Documentation
  2. 💻 Source
  3. 📛 Bug reports
  4. 🚀 Feature requests
  5. 🌟 Tutorial notebooks (their repository)

Feature Requests

Please send us your ideas on how to improve the steppy library! We are looking for your comments here: Feature requests.

Roadmap

At this point, steppy is an early-stage library that has been heavily tested on multiple machine learning challenges (data-science-bowl, toxic-comment-classification-challenge, mapping-challenge) and educational projects (minerva-advanced-data-scientific-training).

We are developing steppy into a practical tool that lets data scientists run their experiments easily and change their pipelines with just a few modifications to the code.

Related projects

We are also building steppy-toolkit, a collection of high-quality implementations of the top deep learning architectures, all of them with the same intuitive interface.

Contributing

You are welcome to contribute to the Steppy library. Please check CONTRIBUTING for more information.

Terms of use

Steppy is MIT-licensed.

Comments
  • Concat features

    How can the following Step be written in the new version (it uses pandas_concat_inputs)?

        transformer=GroupbyAggregationsFeatures(AGGREGATION_RECIPIES),
        input_steps=[df_step],
        input_data=['input'],
        adapter=Adapter({
            'X': ([('input', 'X'),
                   (df_step.name, 'X')],
                  pandas_concat_inputs)
        }),
        cache_dirpath=config.env.cache_dirpath)
    opened by denyslazarenko 8
  • Docs3

    Pull Request template

    Doc contributions

    Contributing.html FAQ.html intro.html testdoc.html

    Tested by running in docs/:

    >>> (Steppy) sphinx-apidoc -o generated/ -d 4 -fMa ../steppy
    >>> (Steppy) clear;make clean;make html
    

    Regards Bruce

    core contributors to the minerva.ml

    opened by bcottman 6
  • How to evaluate each step only once?

    I have the following structure of steps. The problem is that many steps are called more than once, which makes training very slow. Is it possible to simplify it somehow? More precisely, how can I optimize this part? I would like to compute input_missing just once. (screenshot: selection_105)

    opened by denyslazarenko 4
  • Difference between cache and persist

    I do not really get the difference between these two things. Both of them cache the result of execution on disk. (screenshot: selection_114) Is it a good idea to add cache_output to all the Steps so that nothing is executed twice? In some of your examples you use both cache and persist at the same time; I think it would be better to use just one of them... (screenshot: selection_115)

    opened by denyslazarenko 2
  • ENH: Adds id to support output caching

    Fixes https://github.com/neptune-ml/steppy/issues/39

    This PR adds an optional id field to the data dictionary. When cache_output is set to True, the id field is appended to step.name to distinguish between output caches produced by different data dictionaries.

    For example:

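    import numpy as np
    # Import path assumed from steppy's base module.
    from steppy.base import Step, IdentityOperation

    data_train = {
        'id': 'data_train',  # optional id used to name the output cache
        'input': {
            'features': np.array([
                [1, 6],
                [2, 5],
                [3, 4]
            ]),
            'labels': np.array([2, 5, 3]),
        }
    }
    step = Step(
        name='test_cache_output_with_key',
        transformer=IdentityOperation(),
        input_data=['input'],
        experiment_directory='/exp_dir',
        cache_output=True
    )
    step.fit_transform(data_train)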
    

    This will produce an output cache file at /exp_dir/cache/test_cache_output_with_key__data_train.
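
The naming rule described above can be sketched as a small standalone function. This is not the PR's code: output_cache_path is a hypothetical helper, the '__' separator is inferred from the example path, and falling back to the bare step name when no id is present is an assumption.

```python
import posixpath


def output_cache_path(experiment_directory, step_name, data):
    """Hypothetical helper: build the cache file path for a step's output."""
    data_id = data.get('id')
    if data_id is None:
        # Assumption: without an id, the cache file is named after the step alone.
        filename = step_name
    else:
        # Separator '__' inferred from the example path in the PR description.
        filename = '{}__{}'.format(step_name, data_id)
    return posixpath.join(experiment_directory, 'cache', filename)


path_with_id = output_cache_path(
    '/exp_dir', 'test_cache_output_with_key', {'id': 'data_train'})
path_without_id = output_cache_path(
    '/exp_dir', 'test_cache_output_with_key', {})
print(path_with_id)
print(path_without_id)
```

With this rule, two data dictionaries with different ids passed to the same step would not overwrite each other's cached output.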

    opened by thomasjpfan 2
  • Simplified adapter syntax

    This is my idea for simplifying adapter syntax. The benefit is that importing the extractor E from the adapter module is no longer needed. On the other hand, the rules for deciding if something is an atomic recipe or part of a larger recipe or even a constant get more complicated.

    feature-request API-design 
    opened by mromaniukcdl 2
  • refactor adapter.py

    Problem: Currently the user must write from steppy.adapter import Adapter, E in order to use adapters.

    Refactor so that:

    • The user does not have to import E
    • Add an example to the docstrings

    The refactor should be comprehensive, so also:

    • correct the code
    • correct tests
    • correct docstrings
    feature-request API-design 
    opened by kamil-kaczmarek 2
  • PyTorch model is never saved as checkpoint after first epoch

    Look here: https://github.com/minerva-ml/gradus/blob/dev/steps/pytorch/callbacks.py#L266 If self.epoch_id is equal to 0, then loss_sum is equal to self.best_score and the model is not saved. I think this should be fixed, because sometimes we want the model to be saved after the first epoch.
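
One way the condition could be changed is sketched below. This is not the code from callbacks.py: CheckpointSketch and saved_epochs are hypothetical names, and the assumption is that the original compares loss_sum to best_score with a strict "less than" right after initialising best_score in epoch 0.

```python
class CheckpointSketch:
    """Hypothetical callback: records the epochs at which a checkpoint would be saved."""

    def __init__(self):
        self.best_score = None
        self.saved_epochs = []

    def on_epoch_end(self, epoch_id, loss_sum):
        is_first_epoch = epoch_id == 0
        if is_first_epoch:
            # Mirrors the reported behaviour: best_score starts out equal to loss_sum.
            self.best_score = loss_sum
        # Saving unconditionally in the first epoch avoids the case where
        # loss_sum < best_score can never be true in epoch 0.
        if is_first_epoch or loss_sum < self.best_score:
            self.best_score = loss_sum
            self.saved_epochs.append(epoch_id)


checkpoint = CheckpointSketch()
for epoch_id, loss_sum in enumerate([0.9, 1.1, 0.7]):
    checkpoint.on_epoch_end(epoch_id, loss_sum)
print(checkpoint.saved_epochs)
```

Here the first epoch is always saved, the second is skipped because the loss got worse, and the third is saved because it improves on the best score.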

    bug feature-request 
    opened by apyskir 2
  • Unintuitive adapter syntax

    Current syntax for adapters has some peculiarities. Consider the following example.

            step = Step(
                name='ensembler',
                transformer=Dummy(),
                input_data=['input_1'],
                adapter={'X': [('input_1', 'features')]},
                cache_dirpath='.cache'
            )
    

    This step basically extracts one element of the input. It seems redundant to write both brackets and parentheses. Writing adapter={'X': ('input_1', 'features')} should be sufficient.

    Moreover, to my surprise, adapter={'X': [('input_1', 'features'), ('input_2', 'extra_features')]} is incorrect and currently leads to ValueError: too many values to unpack (expected 2)

    My suggestions to make the syntax consistent are:

    1. adapter={'X': ('input_1', 'features')} should map X to extracted features.
    2. adapter={'X': [...]} should map X to a list of extracted objects (specified by elements of the list). In particular adapter={'X': [('input_1', 'features')]} should map X to a one-element list with extracted features.
    3. adapter={'X': ([...], func)} should extract appropriate objects and put them on the list, then func should be called on that list, and X should map to the result of that call.
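
The three suggested rules can be sketched as a small resolver. This illustrates the proposal only, not steppy's adapter implementation: extract and adapt are hypothetical names, and data is assumed to be a dictionary of dictionaries keyed by input name.

```python
def extract(data, key):
    """Pull one object out of the inputs, e.g. ('input_1', 'features')."""
    source, item = key
    return data[source][item]


def adapt(recipe, data):
    """Resolve a single adapter value according to the three suggested rules."""
    # Rule 3: ([...], func) extracts a list of objects and calls func on it.
    if (isinstance(recipe, tuple) and len(recipe) == 2
            and isinstance(recipe[0], list) and callable(recipe[1])):
        keys, func = recipe
        return func([extract(data, key) for key in keys])
    # Rule 2: [...] maps to a list of extracted objects.
    if isinstance(recipe, list):
        return [extract(data, key) for key in recipe]
    # Rule 1: a bare (source, item) tuple maps to the extracted object itself.
    return extract(data, recipe)


data = {
    'input_1': {'features': [1, 2]},
    'input_2': {'extra_features': [3]},
}
single = adapt(('input_1', 'features'), data)
as_list = adapt([('input_1', 'features')], data)
merged = adapt(
    ([('input_1', 'features'), ('input_2', 'extra_features')],
     lambda parts: sum(parts, [])),
    data,
)
print(single, as_list, merged)
```

Rule 3 has to be checked first, because a bare (source, item) tuple and a (list, func) pair are both two-element tuples and are told apart only by the types of their elements.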
    API-design 
    opened by grzes314 2
  • 2nd version docs for steppy

    Pull Request template

    Doc contributions

    This represents 0.01, where we/you were at 0.0. As you should be able to see, I was able to use 95% of what was there previously. Redid index.rst, redid conf.py, added directory docs.nbdocs.

    Needs more work, about days' worth, before pushing out to Read the Docs.

    I found the docstrings very strong.

    I (not very strongly) suggest steppy-toolkit and steppy-examples be merged into one project.

    I see you use Google docstring style. I will switch from numpy style.

    Regards Bruce

    opened by bcottman 1
  • FAQ DOC

    Started. I intend on the first pass to fill it with my (naive/embarrassing) discoveries and really good (i.e. incredibly stupid) questions and enlightening answers from gaggle.

    opened by bcottman 1
  • Let's make it possible to transform based on checkpoints

    Hi! Let's assume I'm training a huge network for a lot of epochs and it saves checkpoints in the checkpoints folder. I suggest making it possible to run transform on a pipeline when the transformer is not in experiment_dir/transformers but a checkpoint is available in the checkpoints folder. What do you think?

    opened by apyskir 0
  • Structure of steps - ideas for making it cleaner

    @kamil-kaczmarek, @jakubczakon I know it is a bunch of different ideas and suggestions clustered in one issue. Let me know which of those are compatible with the current roadmap. (I am happy to contribute/collaborate on some.)

    • default data folder (e.g. ./.steppy/step_name/) or to be configurable if needed; overriding only when strictly necessary
    • no input_data; it complicates things for no obvious reason!
    • names optional, automatically generated from class names + number
    • more explicit job structure (steps = Sequence([step1, step2])); vide Keras API
    • adapters as inheriting from BaseTrainers, step = Rename({'a': 'aaa', 'b': 'bbb'}), vide rename in Pandas
    • how to separate persist-data vs persist-parameters? (e.g. for image preprocessing, it may be time-saving to save once processed images)
    • built-in data tests (e.g. len(X) == len(Y)), in def test
    • built-in test if persist->load is correct (i.e. loaded data is the same as saved)
    opened by stared 2
  • Do all Steps execute in parallel?

    Is it necessary to run the computations inside my class in separate threads, or is it enough to split them between Steps? For example, I can fit KNN and PCA in one class method and parallelize them myself, or create two separate classes for them...

    opened by denyslazarenko 2
  • Maybe load_saved_input?

    Hi, I have a proposal: let's make it possible to dump the adapted input of a step to disk. It's very handy when you are working on the 5th or 10th step in a pipeline that has 2, 3 or more input steps. Right now you have to set the flag load_saved_output=True on each of the input steps to be able to work on your beloved step. If you could just set load_saved_input=True (adapted or not adapted, I think that is worth discussing) on the step you are currently working on, it would be much easier. What do you think?

    opened by apyskir 0
Releases(v0.1.16)
Owner
minerva.ml