BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Overview

BatchFlow

For more details see the documentation and tutorials.

Main features:

  • flexible batch generation
  • deterministic and stochastic pipelines
  • dataset and pipeline joins and merges
  • data processing actions
  • flexible model configuration
  • within-batch parallelism
  • batch prefetching
  • ready-to-use ML models and proven NN architectures
  • convenient layers and helper functions to build custom models
  • a powerful research engine with parallel model training and extended experiment logging.

Basic usage

my_workflow = (my_dataset.pipeline()
               .load('/some/path')
               .do_something()
               .do_something_else()
               .some_additional_action()
               .save('/to/other/path'))

The trick here is that all the processing actions are lazy. They are not executed until their results are needed, e.g. when you request a preprocessed batch:

my_workflow.run(BATCH_SIZE, shuffle=True, n_epochs=5)

or

for batch in my_workflow.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=5):
    # only now are the actions fired and the data changed by the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
    ...  # use the batch here

or

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    processed_batch = my_workflow.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    # only now the actions are fired and data is changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
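
The lazy behaviour described above can be illustrated with a toy sketch. It does not use BatchFlow and is not how BatchFlow is implemented: the LazyPipeline class and the add, double and add_one names are hypothetical and exist only for this example. The point is that chaining merely records the actions, and they run only when a batch is requested.

```python
# Toy illustration of lazy action chaining; this is NOT BatchFlow's implementation.
class LazyPipeline:
    def __init__(self, data):
        self.data = list(data)
        self.actions = []  # actions are recorded here, not executed

    def add(self, func):
        """Record an action and return self so that calls can be chained."""
        self.actions.append(func)
        return self

    def gen_batch(self, batch_size):
        """Run the recorded actions on each batch only when it is requested."""
        for start in range(0, len(self.data), batch_size):
            batch = self.data[start:start + batch_size]
            for func in self.actions:
                batch = func(batch)
            yield batch


calls = []  # log of executed actions, used to show when they actually run


def double(batch):
    calls.append("double")
    return [x * 2 for x in batch]


def add_one(batch):
    calls.append("add_one")
    return [x + 1 for x in batch]


workflow = LazyPipeline(range(6)).add(double).add(add_one)
nothing_ran_yet = (calls == [])  # True: defining the workflow executed nothing

batches = list(workflow.gen_batch(batch_size=4))
```

Only the final line triggers any computation: each of the two batches passes through double and then add_one, in the order in which the actions were chained.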

Train a neural network

BatchFlow includes ready-to-use proven architectures such as VGG, Inception, ResNet and many others. To apply one of them to your data, choose a model, specify the inputs (such as the number of classes or the image shape) and call train_model. You can also choose a loss function, an optimizer and many other parameters.

from batchflow import B
from batchflow.models.tf import ResNet34

my_workflow = (my_dataset.pipeline()
               .init_model('dynamic', ResNet34, config={
                           'inputs/images/shape': B('image_shape'),
                           'labels/classes': 10,
                           'initial_block/inputs': 'images'})
               .load('/some/path')
               .some_transform()
               .another_transform()
               .train_model('ResNet34', images=B('images'), labels=B('labels'))
               .run(BATCH_SIZE, shuffle=True))

For more advanced cases and detailed API see the documentation.

Installation

BatchFlow module is in the beta stage. Your suggestions and improvements are very welcome.

BatchFlow supports Python 3.5 or higher.

Stable python package

With modern pipenv

pipenv install batchflow

With old-fashioned pip

pip3 install batchflow

Development version

With modern pipenv

pipenv install git+https://github.com/analysiscenter/batchflow.git#egg=batchflow

With old-fashioned pip

pip3 install git+https://github.com/analysiscenter/batchflow.git

After that just import batchflow:

import batchflow as bf

Git submodule

In many cases it might be more convenient to install batchflow as a submodule in your project repository than as a python package.

git submodule add https://github.com/analysiscenter/batchflow.git
git submodule init
git submodule update

If your python file is located in another directory, you might need to add a path to batchflow:

import sys
sys.path.insert(0, "/path/to/batchflow")
import batchflow as bf

What is great about using a submodule is that every commit in your project can be linked to its own commit of the submodule. This is extremely convenient in a fast-paced research environment.

Relative import is also possible:

from .batchflow import Dataset

Projects based on BatchFlow

Citing BatchFlow

Please cite BatchFlow in your publications if it helps your research.

Roman Khudorozhkov et al. BatchFlow library for fast ML workflows. 2017. doi:10.5281/zenodo.1041203
@misc{roman_kh_2017_1041203,
  author       = {Khudorozhkov, Roman and others},
  title        = {BatchFlow library for fast ML workflows},
  year         = 2017,
  doi          = {10.5281/zenodo.1041203},
  url          = {https://doi.org/10.5281/zenodo.1041203}
}
Comments
  • 0 full project tutorial

    This PR contains a new tutorial with the project from scratch using Batchflow. This tutorial was intended to be a Batchflow intro containing links to helpful pages in documentation and almost every other tutorial.

    Plus, I made some fixes (ResNet configs), upgrades (SEBlock, fontsize in plot_images) and additions (show_confusion_matrix, model weights initialization with kaiming normal, branch arguments parsing, new datasets and their parsing) for it.

    Almost all improvements are in https://github.com/analysiscenter/batchflow/pull/624 (except batchflow/models/utils.py and the tutorial).

    opened by HollowPrincess 36
  • Image examples from dataset/examples/simple_but_ugly/ don't work out of the box

    Hi! I've tried to run several examples from the directory. For example, trying to run https://github.com/analysiscenter/dataset/tree/master/examples/simple_but_ugly fails with a bunch of errors. It looks like several files with definitions are missing (random_scale, random_rotate, convert_to_pil, etc).

    opened by mikhailkin 7
  • Learning rate features

    Changelist:

    • [x] New decay interface;
    • [x] Ability to fetch learning rate: fetches= 'lr';
    • [x] Saving learning rate into iter_info;
    • [x] Notebook with decay experiments.

    Fixes:

    • Docstrings for optimizer, decay, loss;
    • The bug of building several models occurring when prefetch is enabled, and model_config does not have an input shape.
    opened by Dimonovez 6
  • Incremental Torch improvements

    • refactor BasePool: it can be split into multiple classes with clear functionality (done in #469)

    • improve Encoder/Decoder modules

    • refactor n_iters and decay configurations: no need to pass n_iters in the root configuration (done)

    • make so every block is sent to device: can be helpful with pre-trained models (done in #461)

    • refactor pyramid layers so they use common base

    • make Xception out of XceptionBlocks

    opened by SergeyTsimfer 6
  • Eager Torch

    This PR proposes to add EagerTorch model that:

    • can build off of batch_data during first call of the train method

    • allows for better usage of native torch modules

    • does not use redundant tf-like methods (make_inputs, has_classes, etc)

    opened by SergeyTsimfer 6
  • Sampler classes

    Main changes

    • Remove inner functions from Sampler.sample for multiprocessing (pickle, really) to work with Sampler-objects. Now all operations on samplers (&, |,..., +, -, ..., %) are implemented in corresponding subclasses of Sampler.
    • Add multiprocessing-example into the Sampler-tutorial.
    opened by akoryagin 5
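
The pickling constraint mentioned in this change can be shown with a short standalone sketch. It uses only the standard pickle module; make_scaled_sampler and ScaledSampler are hypothetical names and are not BatchFlow's Sampler classes. The sketch assumes it is run as an ordinary script or module so that pickle can find the class by name.

```python
import pickle


def make_scaled_sampler(scale):
    # Inner function: it lives only inside this call, so pickle cannot
    # locate it by a module-level name.
    def sample(x):
        return x * scale
    return sample


class ScaledSampler:
    """Same behaviour as the closure, written as a module-level class."""

    def __init__(self, scale):
        self.scale = scale

    def sample(self, x):
        return x * self.scale


def can_pickle(obj):
    """Return True if pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


closure_ok = can_pickle(make_scaled_sampler(2))          # the closure is rejected
restored = pickle.loads(pickle.dumps(ScaledSampler(2)))  # the instance round-trips
```

Because multiprocessing sends objects to worker processes by pickling them, moving the logic from inner functions into classes is what lets such objects cross process boundaries.
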
  • Regression metrics

    This PR proposes to add regression metrics and tests. No tests provided yet.

    Metrics are compatible with multi output task i.e. when targets have shape (n_samples, n_outputs). In this case aggregation among outputs is available.

    opened by nikita-klsh 5
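
As a rough illustration of aggregation among outputs, here is a plain-Python sketch of mean absolute error for targets of shape (n_samples, n_outputs). The function name and the aggregate parameter are assumptions made for this example and are not taken from the PR; the averaging across outputs is likewise only one possible aggregation.

```python
def mean_absolute_error(targets, predictions, aggregate=True):
    """MAE for nested lists of shape (n_samples, n_outputs).

    With aggregate=False, one value per output is returned;
    with aggregate=True, the per-output values are averaged.
    """
    n_samples = len(targets)
    n_outputs = len(targets[0])
    per_output = [
        sum(abs(t[j] - p[j]) for t, p in zip(targets, predictions)) / n_samples
        for j in range(n_outputs)
    ]
    if aggregate:
        return sum(per_output) / n_outputs
    return per_output


targets = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
predictions = [[1.5, 10.0], [2.5, 22.0], [3.5, 34.0]]

per_output = mean_absolute_error(targets, predictions, aggregate=False)
overall = mean_absolute_error(targets, predictions)
```

Here the first output has an error of 0.5 on every sample and the second has errors of 0, 2 and 4, so the per-output values are 0.5 and 2.0 and their average is 1.25.
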
  • Fix pylint

    This PR fixes pylint output in the latest docker image analysiscenter1/ds-py3.

    A lot of missing-function-docstring and abstract-method warnings appear in the batchflow/models/torch folder that we can't fix, because they occur on the torch side. They are fixing this and this.

    So we can:

    1. Turn a blind eye to these warnings.
    2. Temporarily set additional pylint restrictions for the torch folder only, as is currently done.

    @roman-kh

    opened by nikita-klsh 4
  • Create release gh action

    Make release actions:

    • build docs
    • calc test coverage
    • create python package and upload it to pypi.

    Actions should fire at various statuses: created/edited, published, etc.

    opened by roman-kh 4
  • Trackers

    This PR proposes to add:

    • monitoring tools (namely, a new ResourceMonitor class, as well as context managers for convenience) to better understand resource (CPU, GPU, memory) utilization, for example:

    with monitor_resource('gpu', frequency=0.5) as monitor:
        ...  # train model

    • tracking tools (namely, new Notifier class) to provide better utilization of tqdm progress-bars in Pipeline, as well as to plot graphs of, for example, loss values, dynamically during model training

    • Notifier class is also capable of using any of the resource monitoring utilities provided by ResourceMonitors. Old functionality of just passing n/True is working too

    opened by SergeyTsimfer 3
  • TFModel improvements

    This PR proposes to:

    • Make ConvBlock able to chain multiple layers, just like Torch version can

    • Simplify logic of letter parsing, as well as adding capability of using R letter as separate Branch with complex parameters

    • Swap all arguments inside calls to conv_block to keyword ones

    • Add squeeze-and-excitation versions of ResNet

    • Re-check all the tests of model compilation

    • Add various attention modules: some of them are available through S (stands for self-attention) letter, some of them are mods of Combine

    opened by SergeyTsimfer 3
  • Refactor `inbatch_parallel`

    As inbatch_parallel is no longer supposed to be used on its own, we can refactor it with the following goals in mind:

    • [ ] remove _use_self args
    • [ ] remove init/post functions: the container with init should be passed directly from Batch.apply_parallel, and the results should be post-processed in the Batch.apply_parallel as well
    • [ ] make inbatch_parallel a class: that would allow for easier introspection and parameter changes on the fly, for example, target to any other.
    opened by SergeyTsimfer 2
  • Initialize random seed for processes

    Each python process starts with the same random seed initialization which results in no randomness across processes. Batch or pipeline action is required to provide random seed.

    opened by roman-kh 0
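
One common way to address this kind of problem is to derive a distinct seed for every worker from a single base seed. The sketch below uses only the standard library; worker_seed and worker_task are hypothetical helpers and do not describe how BatchFlow handles seeding. For simplicity the tasks run in one process here, whereas in practice worker_task would be called inside each worker process.

```python
import hashlib
import random


def worker_seed(base_seed, worker_id):
    """Derive a distinct, reproducible seed for each worker from one base seed."""
    digest = hashlib.sha256(f"{base_seed}:{worker_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def worker_task(base_seed, worker_id, n=3):
    # Each worker builds its own generator instead of relying on inherited state.
    rng = random.Random(worker_seed(base_seed, worker_id))
    return [rng.random() for _ in range(n)]


# Two generators with the same seed produce the same numbers: this is the problem.
same_state = [random.Random(0).random() for _ in range(2)]

# Per-worker seeds give different streams that are still reproducible.
per_worker = [worker_task(base_seed=0, worker_id=i) for i in range(2)]
```

Hashing the base seed together with the worker index keeps runs reproducible while avoiding the identical streams that arise when every process starts from the same state.
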
  • Minimize requirements

    BatchFlow should require numpy only as it is used everywhere. All other packages should be optional and modules / functions should provide a clear message if needed reqs are not installed.

    opened by roman-kh 0
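
A minimal sketch of what a clear message for a missing optional requirement could look like is shown below. The require helper and its feature parameter are assumptions made for illustration and are not part of BatchFlow; only importlib.import_module from the standard library is used.

```python
import importlib


def require(module_name, feature):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc


# An available module is simply returned.
json_module = require("json", feature="JSON export")

# A missing module produces a message that names the feature and the package.
try:
    require("surely_missing_package_xyz", feature="Plotting")
    message = None
except ImportError as exc:
    message = str(exc)
```

Calling such a helper inside the function that needs the package, rather than at module import time, keeps the core library importable when only numpy is installed.
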
Releases(0.8.0)
  • 0.8.0(Dec 30, 2022)

    This release fixes crop behavior of TorchModel, as well as adds new blocks and methods:

    • InternBlock with deformable convolutions
    • separate BottleneckBlock that extends the functionality of ResBlock
    • method for getting a reference to the current TorchModel instance inside train/predict contexts
    • mode parameter for train and predict methods to control nn.Module behavior.

    Also, this is the first version released after NumPy deprecated the automatic casting of misshapen (ragged) arrays to dtype=object, so the affected code is fixed in some places.

  • 0.7.7(Nov 7, 2022)

  • 0.7.6(Oct 21, 2022)

    This release changes the way Batch.apply_parallel works: it now accepts both init and post functions, and should be the preferable way to decorate batch methods (by marking them with decorators.apply_parallel).

    Other than that, there are a few new building blocks for TorchModel, parameter to pad the last microbatches to full microbatch_size, and small bug fixes.

  • 0.7.5(Jul 7, 2022)

    Models

    • added gradient clipping and new layers

    Plot

    • refactored existing plots across the library to rely on plot, introduced in the previous release

    Research improvements

    • modified stored configs to use aliases instead of actual values: that fixes some pickling problems
  • 0.7.0(Jun 20, 2022)

    Models

    • refactored model building procedure: split modules into separate entities like EncoderModule and DecoderModule
    • introduced new modules that import ready-to-use networks from other libraries: currently, we support TIMM and HuggingFace libraries
    • better module repr
    • check #645 for other changes

    Plot

    • introduced plot module with utilities for displaying images and curves
    • plot has a few tutorials with lots of examples: refer to them to get a more in-depth understanding of plot usages

    Research improvements

    • added separate Storage class, that manages output streams of research results.
    • various fixes and QoL changes
  • 0.6.0(Feb 17, 2022)

    Named expressions

    Added BA

    Models

    Removed tensorflow

    Research

    Added research module for massive parallel model training / evaluation.

    Tutorials and examples

    New tutorials.

    Also added some tests.

  • v0.5.0(Jun 10, 2021)

  • v0.5.0beta3(Mar 5, 2021)

  • v0.5.0beta1(Mar 5, 2021)

  • v0.5.0beta2(Mar 5, 2021)

  • 0.3.0(Jan 20, 2018)

    Bug fixes and a lot of refactoring.

    Batch

    Components can be added dynamically during execution. Parameters order is changed in apply_transform and apply_transform_all.

    Named expressions:

    • B() returns the batch itself.
    • F takes args and kwargs.
    • added R (random) and L (lambda).

    Pipeline

    Refactored models directory and variables directory. Added print. Removed print_variable.

    Tensorflow

    Layers

    Added:

    • 1d and 3d bilinear resize
    • 3d depth to space
    • separable transposed convolutions
    • subpixel convolutions
    • bilinear additive resize
    • upsample
    • alpha dropout
    • universal pooling and global_pooling

    Changed:

    • conv_block supports residuals (with sum and concat) and upsample layers.

    TFModel:

    • new methods: upsample, Pyramid Pooling module, Atrous Spatial Pyramid Pooling module
    • model predictions can be an output of predefined operations (sigmoid, softmax, argmax, etc)

    Model zoo

    Added DenseNetFC, ResNetAttention, VNet, RefineNet, Faster-RCNN, Global Convolution Network, Encoder-decoder, Inception-ResNet v2, MobileNet v2.

  • 0.2.2(Nov 23, 2017)

    • Changed model structure and configuration (with default_config() and build_config())

    • Added ready to use TensorFlow models: VGG, Inception v1, v3, v4, ResNet, MobileNet, SqueezeNet, DenseNet, FCN32, FCN16, FCN8, UNet, LinkNet.

    • Added new layers: fractional_max_pooling.

    • Dimensionality for all layers is now inferred from the input tensor shape.

    • Added fake njit decorator for environments without numba installed.

  • 0.2.0(Nov 3, 2017)

Clean APIs for data cleaning. Python implementation of R package Janitor

pyjanitor pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data. Why janitor? Originally a port of

Eric Ma 1.1k Jan 01, 2023
BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

BatchFlow BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflo

Data Analysis Center 185 Dec 20, 2022
A Python toolkit for processing tabular data

meza: A Python toolkit for processing tabular data Index Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installat

Reuben Cummings 401 Dec 19, 2022
dplyr for python

Dplython: Dplyr for Python Welcome to Dplython: Dplyr for Python. Dplyr is a library for the language R designed to make data analysis fast and easy.

Chris Riederer 754 Nov 21, 2022
Directions overlay for working with pandas in an analysis environment

dovpanda Directions OVer PANDAs Directions are hints and tips for using pandas in an analysis environment. dovpanda is an overlay companion for workin

dovpandev 431 Dec 20, 2022
Easy pipelines for pandas DataFrames.

pdpipe ˨ Easy pipelines for pandas DataFrames (learn how!). Website: https://pdpipe.github.io/pdpipe/ Documentation: https://pdpipe.github.io/pdpipe/d

694 Jan 05, 2023
functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

Coursera 188 Nov 24, 2022
Microsoft Azure provides a wide number of services for managing and storing data

Microsoft Azure provides a wide number of services for managing and storing data. One product is Microsoft Azure SQL. Which gives us the capability to create and manage instances of SQL Servers hoste

Riya Vijay Vishwakarma 1 Dec 12, 2021
Pandas integration with sklearn

Sklearn-pandas This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames. In particular, it provides

2.7k Dec 27, 2022
Tools for parsing messy tabular data.

Parsing for messy tables A library for dealing with messy tabular data in several formats, guessing types and detecting headers. See the documentation

Open Knowledge Foundation 382 Nov 10, 2022
Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

Prodmodel 53 Nov 29, 2022