Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences

Overview

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences. Copulas and functional Principal Component Analysis (fPCA) are statistical models that allow these properties to be simulated (Joe 2014). As such, copula-generated data have shown potential to improve the generalization of machine learning (ML) emulators (Meyer et al. 2021) or to anonymize real datasets (Patki et al. 2016).

Synthia is an open source Python package to model univariate and multivariate data, parameterize data using empirical and parametric methods, and manipulate marginal distributions. It is designed to enable scientists and practitioners to handle labelled multivariate data typical of computational sciences. For example, given some vertical profiles of atmospheric temperature, we can use Synthia to generate new but statistically similar profiles in just three lines of code (Table 1).

Synthia supports three methods of multivariate data generation: (i) fPCA, (ii) parametric (Gaussian) copula, and (iii) vine copula models for continuous (all), discrete (vine), and categorical (vine) variables. It has a simple and succinct API that natively handles xarray's labelled arrays and datasets. It uses a pure Python implementation for fPCA and the Gaussian copula, and relies on the well-tested C++ library vinecopulib, through pyvinecopulib's bindings, for fast and efficient computation of vines. For more information, please see the website at https://dmey.github.io/synthia.

Table 1. Example application of Gaussian and fPCA classes in Synthia. These are used to generate random profiles of atmospheric temperature similar to those included in the source data. The xarray dataset structure is maintained and returned by Synthia.

| Source | Synthetic with Gaussian Copula | Synthetic with fPCA |
| --- | --- | --- |
| ds = syn.util.load_dataset() | g = syn.CopulaDataGenerator() | g = syn.fPCADataGenerator() |
| | g.fit(ds, syn.GaussianCopula()) | g.fit(ds) |
| | g.generate(n_samples=500) | g.generate(n_samples=500) |
| [Figure: Source] | [Figure: Gaussian] | [Figure: fPCA] |
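
To give a feel for what a Gaussian copula generator does conceptually, the following is a minimal standard-library sketch for two variables. It is not Synthia's implementation and does not use its API: the function names (`pearson`, `normal_scores`, `empirical_quantile`, `fit_and_generate`) and the toy temperature data are hypothetical, and empirical marginals are assumed. The idea is to map each variable to normal scores through its ranks, estimate the correlation of those scores, sample correlated normals, and map them back through the empirical quantiles.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def normal_scores(values):
    """Map a sample to standard-normal scores through its ranks."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) lies strictly inside (0, 1), so inv_cdf is defined.
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the sample value at probability level u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_generate(x, y, n_samples, seed=0):
    """Fit a bivariate Gaussian copula to (x, y) and draw synthetic pairs."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))  # dependence parameter
    noise_scale = max(0.0, 1.0 - rho * rho) ** 0.5
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + noise_scale * rng.gauss(0.0, 1.0)  # correlated with z1
        # Normal CDF gives uniforms; empirical quantiles restore the marginals.
        samples.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                        empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return samples


# Toy "profiles" (hypothetical data): temperatures (K) at two levels.
data_rng = random.Random(42)
level_1 = [data_rng.gauss(280.0, 5.0) for _ in range(200)]
level_2 = [t - 6.5 + data_rng.gauss(0.0, 1.0) for t in level_1]
synthetic = fit_and_generate(level_1, level_2, n_samples=500)
```

Because the marginals are empirical, every synthetic value in this sketch is one of the observed values; a parametric marginal would be needed to generate values outside the observed range.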

Documentation

For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the website.

How to cite

If you are using Synthia, please cite the following two papers using their respective Digital Object Identifiers (DOIs). Citations may be generated automatically using Crosscite's DOI Citation Formatter or from the BibTeX entries below.

| Synthia Software | Software Application |
| --- | --- |
| DOI: 10.21105/joss.02863 | DOI: 10.5194/gmd-14-5205-2021 |
@article{Meyer_and_Nagler_2021,
  doi = {10.21105/joss.02863},
  url = {https://doi.org/10.21105/joss.02863},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {65},
  pages = {2863},
  author = {David Meyer and Thomas Nagler},
  title = {Synthia: multidimensional synthetic data generation in Python},
  journal = {Journal of Open Source Software}
}

@article{Meyer_and_Nagler_and_Hogan_2021,
  doi = {10.5194/gmd-14-5205-2021},
  url = {https://doi.org/10.5194/gmd-14-5205-2021},
  year = {2021},
  publisher = {Copernicus {GmbH}},
  volume = {14},
  number = {8},
  pages = {5205--5215},
  author = {David Meyer and Thomas Nagler and Robin J. Hogan},
  title = {Copula-based synthetic data augmentation for machine-learning emulators},
  journal = {Geoscientific Model Development}
}

If needed, you may also cite the specific software version with its corresponding Zenodo DOI.

Contributing

If you are looking to contribute, please read our Contributors' guide for details.

Development notes

If you would like to know more about specific development guidelines, testing and deployment, please refer to our development notes.

Copyright and license

Copyright 2020 D. Meyer and T. Nagler. Licensed under MIT.

Acknowledgements

Special thanks to @letmaik for his suggestions and contributions to the project.

Comments
  • Explain how to run the test suite

    Describe the bug

    There is a test suite, but the documentation does not explain how to run it.

    Here is what works for me:

    1. Install pytest.
    2. Clone the source repository.
    3. Run pytest in the root directory of the repository.
    opened by khinsen 7
  • Review: Copula distribution usage and examples

    Your package offers support for simulating vine copulas. However, I don't see examples demonstrating how to simulate data from a vine copula given desired conditional dependency requirements.

    Is this possible with the current API? If not, how would I use the vine copula generator to achieve this?

    Otherwise, can examples show the difference between simulating Gaussian and vine copulas? I only see examples for the Gaussian copula.

    opened by mnarayan 5
  • fPCA documentation

    Describe the bug

    The documentation page on fPCA says:

    PCA can be used to generate synthetic data for the high-dimensional vector $X$. For every instance $X_i$ in the data set, we compute the principal component scores $a_{i, 1}, \dots, a_{i, K}$. Because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated and we may treat them as independent.
    

    The claim that "because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated" looks wrong to me. These scores are projections of the $X_i$ onto the elements of an orthonormal basis. That doesn't make them uncorrelated. There are lots of orthonormal bases one can project on, and for most of them the projections are not uncorrelated. You need some property of the distribution of $X$ to derive a zero correlation, for example a Gaussian distribution, for which the PCA basis yields approximately uncorrelated projections.

    opened by khinsen 3
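
A short derivation may help readers of the fPCA documentation issue above. It assumes, which the quoted documentation does not confirm, that the principal components $v_k$ are unit-norm eigenvectors of the sample covariance matrix $S$ and that the scores are computed from centered data:

$$
S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)(X_i-\bar X)^\top, \qquad S v_k = \lambda_k v_k, \qquad a_{i,k} = v_k^\top (X_i-\bar X)
$$

$$
\frac{1}{n-1}\sum_{i=1}^{n} a_{i,j}\,a_{i,k} = v_j^\top S\, v_k = \lambda_k\, v_j^\top v_k = \lambda_k\,\delta_{jk}
$$

The scores have zero sample mean, so the left-hand side is their sample covariance. Under these assumptions the scores are uncorrelated in the sample because the $v_k$ diagonalize $S$, not merely because they are orthogonal: for an arbitrary orthonormal basis $u_1, \dots, u_K$, the term $u_j^\top S\, u_k$ is generally nonzero. Zero correlation still does not imply independence unless further conditions hold, for example joint Gaussianity.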
  • Review: Clarify API

    It would be helpful to explain what the different classes (Data Generators, Parametrizers, Transformers) do somewhere in the introduction or usage section of the documentation. If the API is similar to or inspired by a well-known API of a different package, please point to it.

    I think generators and transformers are obvious but I only sort of understand Parametrizers. It is also confusing in the sense that people might think this has something to do with parametric distributions when you mean it to be something different.

    Is this API for Parametrizers inspired by some convention elsewhere? If so, it would be helpful to point to that. For instance, the generators are very similar to statsmodels generators.

    opened by mnarayan 2
  • Small error in docs

    Hi, just letting you know I noticed a small error in the documentation.

    At the bottom of this page https://dmey.github.io/synthia/examples/fpca.html

    The error is in line [6] of the code, under "Plot the results".

    You have: plot_profiles(ds_true, 'temperature_fl')

    But I believe it should be: plot_profiles(ds_synth, 'temperature_fl')

    you want to plot results, not the original here.

    Cheers & thanks for the cool project!

    opened by BigTuna08 1
  • Review: Comparisons to other common packages

    What are other packages people might use to simulate data (e.g. statsmodels comes to mind) and how is this package different? Your package supports generating data for multivariate copula distributions and via fPCA. I understand what this entails but I think this could use further elaboration.

    This package supports nonparametric distributions much more than the typical parametric data generators found in common packages and it would be useful to highlight these explicitly.

    opened by mnarayan 1
  • Support categorical data for pyvinecopulib

    During fitting, category values are reindexed as integers starting from 0 and transformed to one-hot vectors. The opposite happens during generation. Any data type works for categories, including strings.

    opened by letmaik 0
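
The encoding described in the issue above can be illustrated with a minimal standard-library sketch. This is not the code from the pull request: the helper names (`fit_categories`, `one_hot_encode`, `one_hot_decode`) are hypothetical, and the sketch assumes categories are indexed in order of first appearance.

```python
def fit_categories(values):
    """Return the distinct category values in order of first appearance.

    The position of each value in the returned list is its integer index,
    starting from 0.
    """
    return list(dict.fromkeys(values))


def one_hot_encode(values, categories):
    """Turn each category value into a one-hot vector."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def one_hot_decode(rows, categories):
    """Invert the encoding: pick the category with the largest entry."""
    return [categories[row.index(max(row))] for row in rows]


# Any hashable type works as a category value, including strings.
weather = ["rain", "sun", "rain", "snow"]
categories = fit_categories(weather)
encoded = one_hot_encode(weather, categories)
decoded = one_hot_decode(encoded, categories)
```

Decoding with the largest entry rather than an exact 1 is a design choice in this sketch: it also works when generated vectors are not exactly one-hot.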
  • Add support for categorical data

    We can treat categorical data as discrete, but first we need to pre-process categorical values by one-hot encoding to remove the order. Regarding the API, we can change the current version from

    # Assuming an xarray dataset ds with X1 discrete and X2 categorical
    generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})
    

    to something like

    # with X3 continuous
    g.fit(ds, copula=syn.VineCopula(controls=ctrl), types={'X1': 'disc', 'X2': 'cat', 'X3': 'cont'})
    
    opened by dmey 0
  • Add support for handling discrete quantities

    Introduces the option to specify and model discrete quantities as follows:

    # Assuming an xarray dataset ds with X1 discrete and X2 continuous
    generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})
    

    This option is only supported for vine copulas

    opened by dmey 0
Releases(1.1.0)