Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences

Overview

Overview

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences. Copula and functional Principle Component Analysis (fPCA) are statistical models that allow these properties to be simulated (Joe 2014). As such, copula generated data have shown potential to improve the generalization of machine learning (ML) emulators (Meyer et al. 2021) or anonymize real-data datasets (Patki et al. 2016).

Synthia is an open source Python package to model univariate and multivariate data, parameterize data using empirical and parametric methods, and manipulate marginal distributions. It is designed to enable scientists and practitioners to handle labelled multivariate data typical of computational sciences. For example, given some vertical profiles of atmospheric temperature, we can use Synthia to generate new but statistically similar profiles in just three lines of code (Table 1).

Synthia supports three methods of multivariate data generation through: (i) fPCA, (ii) parametric (Gaussian) copula, and (iii) vine copula models for continuous (all), discrete (vine), and categorical (vine) variables. It has a simple and succinct API to natively handle xarray's labelled arrays and datasets. It uses a pure Python implementation for fPCA and Gaussian copula, and relies on the fast and well tested C++ library vinecopulib through pyvinecopulib's bindings for fast and efficient computation of vines. For more information, please see the website at https://dmey.github.io/synthia.

Table 1. Example application of Gaussian and fPCA classes in Synthia. These are used to generate random profiles of atmospheric temperature similar to those included in the source data. The xarray dataset structure is maintained and returned by Synthia.

Source Synthetic with Gaussian Copula Synthetic with fPCA
ds = syn.util.load_dataset() g = syn.CopulaDataGenerator() g = syn.fPCADataGenerator()
g.fit(ds, syn.GaussianCopula()) g.fit(ds)
g.generate(n_samples=500) g.generate(n_samples=500)
Source Gaussian fPCA

Documentation

For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the website.

How to cite

If you are using Synthia, please cite the following two papers using their respective Digital Object Identifiers (DOIs). Citations may be generated automatically using Crosscite's DOI Citation Formatter or from the BibTeX entries below.

Synthia Software Software Application
DOI: 10.21105/joss.02863 DOI: 10.5194/gmd-14-5205-2021
@article{Meyer_and_Nagler_2021,
  doi = {10.21105/joss.02863},
  url = {https://doi.org/10.21105/joss.02863},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {65},
  pages = {2863},
  author = {David Meyer and Thomas Nagler},
  title = {Synthia: multidimensional synthetic data generation in Python},
  journal = {Journal of Open Source Software}
}

@article{Meyer_and_Nagler_and_Hogan_2021,
  doi = {10.5194/gmd-14-5205-2021},
  url = {https://doi.org/10.5194/gmd-14-5205-2021},
  year = {2021},
  publisher = {Copernicus {GmbH}},
  volume = {14},
  number = {8},
  pages = {5205--5215},
  author = {David Meyer and Thomas Nagler and Robin J. Hogan},
  title = {Copula-based synthetic data augmentation for machine-learning emulators},
  journal = {Geoscientific Model Development}
}

If needed, you may also cite the specific software version with its corresponding Zendo DOI.

Contributing

If you are looking to contribute, please read our Contributors' guide for details.

Development notes

If you would like to know more about specific development guidelines, testing and deployment, please refer to our development notes.

Copyright and license

Copyright 2020 D. Meyer and T. Nagler. Licensed under MIT.

Acknowledgements

Special thanks to @letmaik for his suggestions and contributions to the project.

Comments
  • Explain how to run the test suite

    Explain how to run the test suite

    Describe the bug There is a test suite, but the documentation does not explain how to run it.

    Here is what works for me:

    1. Install pytest.
    2. Clone the source repository.
    3. Run pytest in the root directory of the repository.
    opened by khinsen 7
  • Review: Copula distribution usage and examples

    Review: Copula distribution usage and examples

    Your package offers support for simulating vine copulas. However, I don't see examples demonstrating how to simulate data from a vine copula given desired conditional dependency requirements.

    Is this possible with the current API? If not, how would I use the vine copula generator to achieve this?

    Otherwise, can examples show the difference between simulating Gaussian and vine copulas? I only see examples for the Gaussian copula.

    opened by mnarayan 5
  • fPCA documentation

    fPCA documentation

    Describe the bug

    The documentation page on fPCA says:

    PCA can be used to generate synthetic data for the high-dimensional vector $X$. For every instance $X_i$ in the data set, we compute the principal component scores $a_{i, 1}, \dots, a_{i, K}$. Because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated and we may treat them as independent.
    

    The claim that "because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated" looks wrong to me. These scores are projections of the $X_i$ onto the elements of an orthonormal basis. That doesn't make them uncorrelated. There are lots of orthonormal bases one can project on, and for most of them the projections are not uncorrelated. You need some property of the distribution of $X$ to derive a zero correlation, for example a Gaussian distribution, for which the PCA basis yields approximately uncorrelated projections.

    opened by khinsen 3
  • Review: Clarify API

    Review: Clarify API

    It would be helpful to add/explain what the different classes do Data Generators, Parametrizer, Transformers somewhere in the introduction or usage component of the documentation. Explain the different classes and what each is supposed to do. If it is similar to or inspired by well-known API of a different package, please point to it.

    I think generators and transformers are obvious but I only sort of understand Parametrizers. It is also confusing in the sense that people might think this has something to do with parametric distributions when you mean it to be something different.

    Is this API for Parametrizers inspired by some convention elsewhere? If so it would be helpful to point to that. For instance, the generators are very similar to statsmodel generators.

    opened by mnarayan 2
  • Small error in docs

    Small error in docs

    Hi, just letting you know I noticed a small error in the documentation.

    At the bottom of this page https://dmey.github.io/synthia/examples/fpca.html

    The error is in line [6] of the code, under "Plot the results".

    You have: plot_profiles(ds_true, 'temperature_fl')

    But I believe it should be: plot_profiles(ds_synth, 'temperature_fl')

    you want to plot results, not the original here.

    Cheers & thanks for the cool project!

    opened by BigTuna08 1
  • Review: Comparisons to other common packages

    Review: Comparisons to other common packages

    What are other packages people might use to simulate data (e.g. statsmodels comes to mind) and how is this package different? Your package supports generating data for multivariate copula distributions and via fPCA. I understand what this entails but I think this could use further elaboration.

    This package supports nonparametric distributions much more than the typical parametric data generators found in common packages and it would be useful to highlight these explicitly.

    opened by mnarayan 1
  • Support categorical data for pyvinecopulib

    Support categorical data for pyvinecopulib

    During fitting, category values are reindexed as integers starting from 0 and transformed to one-hot vectors. The opposite during generation. Any data type works for categories, including strings.

    opened by letmaik 0
  • Add support for categorical data

    Add support for categorical data

    We can treat categorical data as discrete but first we need to pre-process categorical values by one hot encoding to remove the order. Re API we can change the current version from

    # Assuming  an xarray datasets ds with X1 discrete and and X2 categorical 
    generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})
    

    to something like

    with X3 continuous 
    g.fit(ds, copula=syn.VineCopula(controls=ctrl), types={'X1': 'disc', 'X2': 'cat', 'X3': 'cont'})
    
    opened by dmey 0
  • Add support for handling discrete quantities

    Add support for handling discrete quantities

    Introduces the option to specify and model discrete quantities as follows:

    # Assuming  an xarray datasets ds with X1 discrete and and X2 continuous 
    generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})
    

    This option is only supported for vine copulas

    opened by dmey 0
Releases(1.1.0)
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
DaCe is a parallel programming framework that takes code in Python/NumPy and other programming languages

aCe - Data-Centric Parallel Programming Decoupling domain science from performance optimization. DaCe is a parallel programming framework that takes c

SPCL 330 Dec 30, 2022
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 05, 2023
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

PyUpBit CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont

Hyeong Kyun (Daniel) Park 1 Jun 28, 2022
nrgpy is the Python package for processing NRG Data Files

nrgpy nrgpy is the Python package for processing NRG Data Files Website and source: https://github.com/nrgpy/nrgpy Documentation: https://nrgpy.github

NRG Tech Services 23 Dec 08, 2022
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 7.2k Dec 30, 2022
Provide a market analysis (R)

market-study Provide a market analysis (R) - FRENCH Produisez une étude de marché Prérequis Pour effectuer ce projet, vous devrez maîtriser la manipul

1 Feb 13, 2022
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

cleanX CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological

Candace Makeda Moore, MD 20 Jan 05, 2023
X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

Nguyễn Quang Huy 5 Sep 28, 2022
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
The Master's in Data Science Program run by the Faculty of Mathematics and Information Science

The Master's in Data Science Program run by the Faculty of Mathematics and Information Science is among the first European programs in Data Science and is fully focused on data engineering and data a

Amir Ali 2 Jun 17, 2022
Geospatial data-science analysis on reasons behind delay in Grab ride-share services

Grab x Pulis Detailed analysis done to investigate possible reasons for delay in Grab services for NUS Data Analytics Competition 2022, to be found in

Keng Hwee 6 Jun 07, 2022
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

Nikolas Petrou 1 Feb 13, 2022
DefAP is a program developed to facilitate the exploration of a material's defect chemistry

DefAP is a program developed to facilitate the exploration of a material's defect chemistry. A large number of features are provided and rapid exploration is supported through the use of autoplotting

6 Oct 25, 2022
This is an analysis and prediction project for house prices in King County, USA based on certain features of the house

This is a project for analysis and estimation of House Prices in King County USA The .csv file contains the data of the house and the .ipynb file con

Amit Prakash 1 Jan 21, 2022
This creates a ohlc timeseries from downloaded CSV files from NSE India website and makes a SQLite database for your research.

NSE-timeseries-form-CSV-file-creator-and-SQL-appender- This creates a ohlc timeseries from downloaded CSV files from National Stock Exchange India (NS

PILLAI, Amal 1 Oct 02, 2022
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022