Data Augmentation with Variational Autoencoders

Overview



Documentation 	Status Downloads 	Status

Documentation

Pyraug

This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging contexts such as high dimensional and low sample size data.

Installation

To install the library from pypi.org run the following using pip

$ pip install pyraug

or alternatively you can clone the github repo to access to tests, tutorials and scripts.

$ git clone https://github.com/clementchadebec/pyraug.git

and install the library

$ cd pyraug
$ pip install .

Augmenting your Data

In Pyraug, a typical augmentation process is divided into 2 distinct parts:

  1. Train a model using the Pyraug's TrainingPipeline or using the provided scripts/training.py script
  2. Generate new data from a trained model using Pyraug's GenerationPipeline or using the provided scripts/generation.py script

There exist two ways to augment your data pretty straightforwardly using Pyraug's built-in functions.

Using Pyraug's Pipelines

Pyraug provides two pipelines that may be used to either train a model on your own data or generate new data with a pretrained model.

note: These pipelines are independent of the choice of the model and sampler. Hence, they can be used even if you want to access to more advanced features such as defining your own autoencoding architecture.

Launching a model training

To launch a model training, you only need to call a TrainingPipeline instance. In its most basic version the TrainingPipeline can be built without any arguments. This will by default train a RHVAE model with default autoencoding architecture and parameters.

>>> from pyraug.pipelines import TrainingPipeline
>>> pipeline = TrainingPipeline()
>>> pipeline(train_data=dataset_to_augment)

where dataset_to_augment is either a numpy.ndarray, torch.Tensor or a path to a folder where each file is a data (handled data formats are .pt, .nii, .nii.gz, .bmp, .jpg, .jpeg, .png).

More generally, you can instantiate your own model and train it with the TrainingPipeline. For instance, if you want to instantiate a basic RHVAE run:

>>> from pyraug.models import RHVAE
>>> from pyraug.models.rhvae import RHVAEConfig
>>> model_config = RHVAEConfig(
...    input_dim=int(intput_dim)
... ) # input_dim is the shape of a flatten input data
...   # needed if you did not provide your own architectures
>>> model = RHVAE(model_config)

In case you instantiate yourself a model as shown above and you did not provide all the network architectures (encoder, decoder & metric if applicable), the ModelConfig instance will expect you to provide the input dimension of your data which equals to n_channels x height x width x .... Pyraug's VAE models' networks indeed default to Multi Layer Perceptron neural networks which automatically adapt to the input data shape.

note: In case you have different size of data, Pyraug will reshape it to the minimum size min_n_channels x min_height x min_width x ...

Then the TrainingPipeline can be launched by running:

>>> from pyraug.pipelines import TrainingPipeline
>>> pipe = TrainingPipeline(model=model)
>>> pipe(train_data=dataset_to_augment)

At the end of training, the model weights models.pt and model config model_config.json file will be saved in a folder outputs/my_model/training_YYYY-MM-DD_hh-mm-ss/final_model.

Important: For high dimensional data we advice you to provide you own network architectures and potentially adapt the training and model parameters see documentation for more details.

Launching data generation

To launch the data generation process from a trained model, run the following.

>>> from pyraug.pipelines import GenerationPipeline
>>> from pyraug.models import RHVAE
>>> model = RHVAE.load_from_folder('path/to/your/trained/model') # reload the model
>>> pipe = GenerationPipeline(model=model) # define pipeline
>>> pipe(samples_number=10) # This will generate 10 data points

The generated data is in .pt files in dummy_output_dir/generation_YYYY-MM-DD_hh-mm-ss. By default, it stores batch data of a maximum of 500 samples.

Retrieve generated data

Generated data can then be loaded pretty easily by running

>>> import torch
>>> data = torch.load('path/to/generated_data.pt')

Using the provided scripts

Pyraug provides two scripts allowing you to augment your data directly with commandlines.

note: To access to the predefined scripts you should first clone the Pyraug's repository. The following scripts are located in scripts folder. For the time being, only RHVAE model training and generation is handled by the provided scripts. Models will be added as they are implemented in pyraug.models

Launching a model training:

To launch a model training, run

$ python scripts/training.py --path_to_train_data "path/to/your/data/folder" 

The data must be located in path/to/your/data/folder where each input data is a file. Handled image types are .pt, .nii, .nii.gz, .bmp, .jpg, .jpeg, .png. Depending on the usage, other types will be progressively added.

At the end of training, the model weights models.pt and model config model_config.json file will be saved in a folder outputs/my_model_from_script/training_YYYY-MM-DD_hh-mm-ss/final_model.

Launching data generation

Then, to launch the data generation process from a trained model, you only need to run

$ python scripts/generation.py --num_samples 10 --path_to_model_folder 'path/to/your/trained/model/folder' 

The generated data is stored in several .pt files in outputs/my_generated_data_from_script/generation_YYYY-MM-DD_hh_mm_ss. By default, it stores batch data of 500 samples.

Important: In the simplest configuration, default configurations are used in the scripts. You can easily override as explained in documentation. See tutorials for a more in depth example.

Retrieve generated data

Generated data can then be loaded pretty easily by running

>>> import torch
>>> data = torch.load('path/to/generated_data.pt')

Getting your hands on the code

To help you to understand the way Pyraug works and how you can augment your data with this library we also provide tutorials that can be found in examples folder:

Dealing with issues

If you are experiencing any issues while running the code or request new features please open an issue on github

Citing

If you use this library please consider citing us:

@article{chadebec_data_2021,
	title = {Data {Augmentation} in {High} {Dimensional} {Low} {Sample} {Size} {Setting} {Using} a {Geometry}-{Based} {Variational} {Autoencoder}},
	copyright = {All rights reserved},
	journal = {arXiv preprint arXiv:2105.00026},
  	arxiv = {2105.00026},
	author = {Chadebec, Clément and Thibeau-Sutre, Elina and Burgos, Ninon and Allassonnière, Stéphanie},
	year = {2021}
}

Credits

Logo: SaulLu

You might also like...
Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners

Unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners This repository is built upon BEiT, thanks very much! Now, we on

PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

MAE for Self-supervised ViT Introduction This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-sup

 An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners
An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners

An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners This is a coarse version for MAE, only make the pretrain model, the fine

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks without the use of any outside machine learning libraries - all from scratch.

Autoencoders pretraining using clustering

Autoencoders pretraining using clustering

Re-implememtation of MAE (Masked Autoencoders Are Scalable Vision Learners) using PyTorch.

mae-repo PyTorch re-implememtation of "masked autoencoders are scalable vision learners". In this repo, it heavily borrows codes from codebase https:/

ConvMAE: Masked Convolution Meets Masked Autoencoders
ConvMAE: Masked Convolution Meets Masked Autoencoders

ConvMAE ConvMAE: Masked Convolution Meets Masked Autoencoders Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1, 1 Shanghai AI Laboratory, 2 M

Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders
Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

MultiMAE: Multi-modal Multi-task Masked Autoencoders Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir Website | arXiv | BibTeX Official PyTo

This is the official Pytorch implementation of
This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et al. 2020

README This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et a

Comments
  • It takes a long time to train the model

    It takes a long time to train the model

    I am trying to train a RHVAE model for data augmentation and the model starts training but it takes a long time training and do not see any results. I do not know if is an error from my dataset, computer or from the library. Could you help me?

    opened by mikel-hernandezj 2
  • Geodesics computation

    Geodesics computation

    It would be great to have a function to compute geodesics, given a trained model and two points in the latent space.

    The goal would be to allow the exploration of the latent space via geodesics, as visualised in Figure 2 of (Chadebec et al., 2021):

    Screenshot 2021-09-28 at 10 06 34 enhancement 
    opened by Virgiliok 2
  • riemann_tools

    riemann_tools

    Hi,

    In on of your example notebooks (geodesic_computation_example), you import the function Geodesic_autodiff from the package riemann_tools. I cannot find any mention of this package however. Could you perhaps provide some documentation on how to install/import the riemann_tools? Thank you in advance!

    Edit: removing the import solved the problem

    opened by VivienvV 0
Releases(v0.0.6)
Gated-Shape CNN for Semantic Segmentation (ICCV 2019)

GSCNN This is the official code for: Gated-SCNN: Gated Shape CNNs for Semantic Segmentation Towaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler

859 Dec 26, 2022
Storage-optimizer - Identify potintial optimizations on the cloud storage accounts

Storage Optimizer Identify potintial optimizations on the cloud storage accounts

Zaher Mousa 1 Feb 13, 2022
Reporting and Visualization for Hazardous Events

Reporting and Visualization for Hazardous Events

Jv Kyle Eclarin 2 Oct 03, 2021
Codebase for Image Classification Research, written in PyTorch.

pycls pycls is an image classification codebase, written in PyTorch. It was originally developed for the On Network Design Spaces for Visual Recogniti

Facebook Research 2k Jan 01, 2023
Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Weakly Supervised Text-to-SQL Parsing through Question Decomposition The official repository for the paper "Weakly Supervised Text-to-SQL Parsing thro

14 Dec 19, 2022
Official PyTorch Implementation of Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

UnRigidFlow This is the official PyTorch implementation of UnRigidFlow (IJCAI2019). Here are two sample results (~10MB gif for each) of our unsupervis

Liang Liu 28 Nov 16, 2022
Official repository for the NeurIPS 2021 paper Get Fooled for the Right Reason: Improving Adversarial Robustness through a Teacher-guided curriculum Learning Approach

Get Fooled for the Right Reason Official repository for the NeurIPS 2021 paper Get Fooled for the Right Reason: Improving Adversarial Robustness throu

Sowrya Gali 1 Apr 25, 2022
This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

79 Dec 27, 2022
Learning to Initialize Neural Networks for Stable and Efficient Training

GradInit This repository hosts the code for experiments in the paper, GradInit: Learning to Initialize Neural Networks for Stable and Efficient Traini

Chen Zhu 124 Dec 30, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

287 Dec 21, 2022
Attack on Confidence Estimation algorithm from the paper "Disrupting Deep Uncertainty Estimation Without Harming Accuracy"

Attack on Confidence Estimation (ACE) This repository is the official implementation of "Disrupting Deep Uncertainty Estimation Without Harming Accura

3 Mar 30, 2022
Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021)

Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021) Tensorflow implementation of Bridging the Gap between Label- and Reference-ba

huangqiusheng 8 Jul 13, 2022
Instance-based label smoothing for improving deep neural networks generalization and calibration

Instance-based Label Smoothing for Neural Networks Pytorch Implementation of the algorithm. This repository includes a new proposed method for instanc

Mohamed Maher 1 Aug 13, 2022
Pyramid Scene Parsing Network, CVPR2017.

Pyramid Scene Parsing Network by Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia, details are in project page. Introduction This

Hengshuang Zhao 1.5k Jan 05, 2023
Dynamic Head: Unifying Object Detection Heads with Attentions

Dynamic Head: Unifying Object Detection Heads with Attentions dyhead_video.mp4 This is the official implementation of CVPR 2021 paper "Dynamic Head: U

Microsoft 550 Dec 21, 2022
Read number plates with https://platerecognizer.com/

HASS-plate-recognizer Read vehicle license plates with https://platerecognizer.com/ which offers free processing of 2500 images per month. You will ne

Robin 69 Dec 30, 2022
Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

GAN stability This repository contains the experiments in the supplementary material for the paper Which Training Methods for GANs do actually Converg

Lars Mescheder 885 Jan 01, 2023
Repo público onde postarei meus estudos de Python, buscando aprender por meio do compartilhamento do aprendizado!

Seja bem vindo à minha repo de Estudos em Python 3! Este é um repositório criado por um programador amador que estuda tópicos de finanças, estatística

32 Dec 24, 2022
GradAttack is a Python library for easy evaluation of privacy risks in public gradients in Federated Learning

GradAttack is a Python library for easy evaluation of privacy risks in public gradients in Federated Learning, as well as corresponding mitigation strategies.

129 Dec 30, 2022
Source code for Task-Aware Variational Adversarial Active Learning

Contrastive Coding for Active Learning under Class Distribution Mismatch Official PyTorch implementation of ["Contrastive Coding for Active Learning u

27 Nov 23, 2022