Metrics to evaluate quality and efficacy of synthetic datasets.

DAI-Lab An Open Source Project from the Data to AI Lab, at MIT

Metrics for Synthetic Data Generation Projects

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

  • Single Columns: Compare 1 dimensional numpy arrays representing individual columns.
  • Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
  • Single Table: Compare an entire table, represented as a pandas.DataFrame.
  • Multi Table: Compare multi-table and relational datasets represented as a python dict with multiple tables passed as pandas.DataFrames.
  • Time Series: Compare tables representing ordered sequences of events.

It includes a variety of metrics such as:

  • Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
  • Detection metrics which use machine learning to try to distinguish between real and synthetic data.
  • Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
  • Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
  • Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
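
To make the statistical category more concrete, here is a minimal sketch of an "inverted Kolmogorov-Smirnov D statistic", the name used for KSTest in the example report further down. It is an illustration written for this document, not the SDMetrics implementation; the function name `inverted_ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def inverted_ks_statistic(real, synthetic):
    """Return 1 - D, where D is the two-sample KS statistic.

    D is the largest gap between the two empirical CDFs, so a result
    of 1.0 means the samples have identical empirical distributions.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # The gap between two step functions can only peak at an observed value.
    points = sorted(set(real_sorted) | set(synth_sorted))

    d_stat = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        d_stat = max(d_stat, abs(cdf_real - cdf_synth))
    return 1.0 - d_stat
```

Identical samples score 1.0 and samples with no overlap score 0.0, which matches the 0 to 1 range and MAXIMIZE goal listed for KSTest.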

Install

SDMetrics is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, SDMetrics can also be installed as a standalone library using the following commands:

Using pip:

pip install sdmetrics

Using conda:

conda install -c sdv-dev -c conda-forge -c pytorch sdmetrics

For more installation options, please visit the SDMetrics Installation Guide.

Usage

SDMetrics is included as part of the framework offered by SDV to evaluate the quality of your synthetic dataset. For more details about how to use it, please visit the corresponding User Guide.

Standalone usage

SDMetrics can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its synthetic replica by running all the compatible multi-table metrics on it:

import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)

The output will be a table with all the details about the executed metrics and their score:

metric                        name                                     score     min_value  max_value  goal
CSTest                        Chi-Squared                              0.76651   0          1          MAXIMIZE
KSTest                        Inverted Kolmogorov-Smirnov D statistic  0.75      0          1          MAXIMIZE
KSTestExtended                Inverted Kolmogorov-Smirnov D statistic  0.777778  0          1          MAXIMIZE
LogisticDetection             LogisticRegression Detection             0.882716  0          1          MAXIMIZE
SVCDetection                  SVC Detection                            0.833333  0          1          MAXIMIZE
BNLikelihood                  BayesianNetwork Likelihood               nan       0          1          MAXIMIZE
BNLogLikelihood               BayesianNetwork Log Likelihood           nan       -inf       0          MAXIMIZE
LogisticParentChildDetection  LogisticRegression Detection             0.619444  0          1          MAXIMIZE
SVCParentChildDetection       SVC Detection                            0.916667  0          1          MAXIMIZE
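
When summarizing a report like this one, the nan scores and the differing ranges need some care. The sketch below copies the rows from the example output into plain tuples and averages only the metrics that were computed and that share the 0 to 1 range. The helper name `mean_unit_range_score` is hypothetical and is not part of SDMetrics.

```python
import math

# (metric, score, min_value, max_value), copied from the example report above.
report_rows = [
    ("CSTest", 0.76651, 0, 1),
    ("KSTest", 0.75, 0, 1),
    ("KSTestExtended", 0.777778, 0, 1),
    ("LogisticDetection", 0.882716, 0, 1),
    ("SVCDetection", 0.833333, 0, 1),
    ("BNLikelihood", float("nan"), 0, 1),
    ("BNLogLikelihood", float("nan"), -math.inf, 0),
    ("LogisticParentChildDetection", 0.619444, 0, 1),
    ("SVCParentChildDetection", 0.916667, 0, 1),
]


def mean_unit_range_score(rows):
    """Average the scores that are not NaN and whose range is [0, 1]."""
    usable = [
        score
        for _, score, low, high in rows
        if not math.isnan(score) and (low, high) == (0, 1)
    ]
    return sum(usable) / len(usable) if usable else float("nan")


overall = mean_unit_range_score(report_rows)
not_computed = [name for name, score, _, _ in report_rows if math.isnan(score)]
```

Here `overall` comes out at roughly 0.792, and `not_computed` lists the two Bayesian Network metrics whose score is nan in the table.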

What's next?

If you want to read more about each individual metric, please visit the following folders:

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project

Comments
  • Gh stronger detection classifiers

    Add Random Forest and Gradient Boosting from sklearn to the single table detection tests. Being able to fool these classifiers would be a great improvement for generative models.

    opened by TanguyUrvoy 7
  • README.md example has a bug

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.7.0
    • Python version: 3.8
    • Operating System: Linux VM
    • pandas version: 1.2.4

    Error Description

    Following the README.md example of calculating the BoundaryAdherence:

    Running the code gives the following error: AttributeError: 'Series' object has no attribute 'columns'

    I would expect the code to compute the BoundaryAdherence score for the start_date column of the real_data and synthetic_data pandas DataFrames.

    Steps to reproduce

    Follow this snippet from the README.md that shows the usage of BoundaryAdherence

    # calculate whether the synthetic data respects the min/max bounds
    # set by the real data
    from sdmetrics.single_table import BoundaryAdherence
    
    BoundaryAdherence.compute(
        real_data['start_date'],
        synthetic_data['start_date']
    )
    

    Solution

    I will open a PR for this later today or tomorrow.

    type(real_data['start_date']) 
    

    Returns a pandas.core.series.Series

    type(real_data[['start_date']])
    

    Returns a pandas.core.frame.DataFrame

    BoundaryAdherence.compute() expects DataFrames for both the real_data and synthetic_data arguments.

    # calculate whether the synthetic data respects the min/max bounds
    # set by the real data
    from sdmetrics.single_table import BoundaryAdherence
    
    BoundaryAdherence.compute(
        real_data[['start_date']],
        synthetic_data[['start_date']]
    )
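
For readers unfamiliar with the metric, the idea in the code comment ("whether the synthetic data respects the min/max bounds set by the real data") can be sketched in a few lines of plain Python. This is an illustration only, not the SDMetrics implementation; the function name `boundary_adherence` is hypothetical, and missing values are assumed to be represented as None.

```python
from datetime import date


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside [min(real), max(real)]."""
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        return float("nan")

    low, high = min(real), max(real)
    inside = sum(low <= value <= high for value in synthetic)
    return inside / len(synthetic)


# Works for any comparable type, including dates such as a start_date column.
real_start = [date(2020, 1, 1), date(2020, 6, 30)]
synthetic_start = [date(2019, 12, 31), date(2020, 3, 1)]
date_score = boundary_adherence(real_start, synthetic_start)
```

In this toy case one of the two synthetic dates falls before the earliest real date, so `date_score` is 0.5.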
    
    documentation resolution:resolved 
    opened by Pverheijen 5
  • KeyError: 'fields' , --> reports.generate 'fields'

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version:'0.7.0'
    • Python version: 3.8.8
    • Operating System: Windows server 2016 dataserver

    Error Description

    Steps to reproduce

    I tried to use sdmetrics to compare the Synthetic data. Example of the data set

    Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size
    -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize
    QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsiz

    I loaded the Gaussian copula pkl model to generate the synthetic data and created metadata for the original dataset to use in the quality report. The metadata created for the data above is:

    {'fields': {'Customer': {'type': 'id', 'subtype': 'string'}, 'State': {'type': 'categorical'}, 'Customer Lifetime Value': {'type': 'numerical', 'subtype': 'float'}, 'Response': {'type': 'categorical'}, 'Coverage': {'type': 'categorical'}, 'Education': {'type': 'categorical'}, 'Effective To Date': {'type': 'categorical'}, 'EmploymentStatus': {'type': 'categorical'}, 'Gender': {'type': 'categorical'}, 'Income': {'type': 'numerical', 'subtype': 'integer'}, 'Location Code': {'type': 'categorical'}, 'Marital Status': {'type': 'categorical'}, 'Monthly Premium Auto': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Last Claim': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Policy Inception': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Open Complaints': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Policies': {'type': 'numerical', 'subtype': 'integer'}, 'Policy Type': {'type': 'categorical'}, 'Policy': {'type': 'categorical'}, 'Renew Offer Type': {'type': 'categorical'}, 'Sales Channel': {'type': 'categorical'}, 'Total Claim Amount': {'type': 'numerical', 'subtype': 'float'}, 'Vehicle Class': {'type': 'categorical'}, 'Vehicle Size': {'type': 'categorical'}}, 'primary_key': 'Customer'}

    #creating metadata
    metadata.add_table(
        name='INS',
        data=data2,
        primary_key='Customer'
         )
    
    
    report = QualityReport()
    report.generate(data2, synthetic_data,metadata)
    
    
    error:
    Creating report:   0%|          | 0/4 [00:00<?, ?it/s]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    Input In [54], in <cell line: 2>()
          1 report = QualityReport()
    ----> 2 report.generate(data2, synthetic_data,metadata)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\reports\single_table\quality_report.py:72, in QualityReport.generate(self, real_data, synthetic_data, metadata)
         70 for metric in tqdm.tqdm(metrics, desc='Creating report'):
         71     try:
    ---> 72         self._metric_results[metric.__name__] = metric.compute_breakdown(
         73             real_data, synthetic_data, metadata)
         74     except IncomputableMetricError:
         75         # Metric is not compatible with this dataset.
         76         self._metric_results[metric.__name__] = {}
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:147, in MultiSingleColumnMetric.compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
        123 @classmethod
        124 def compute_breakdown(cls, real_data, synthetic_data, metadata=None, **kwargs):
        125     """Compute this metric broken down by column.
        126 
        127     This is done by computing the underlying SingleColumn metric to all the
       (...)
        145             A mapping of column name to metric output.
        146     """
    --> 147     return cls._compute(
        148         cls, real_data, synthetic_data, metadata, store_errors=True, **kwargs)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:68, in MultiSingleColumnMetric._compute(self, real_data, synthetic_data, metadata, store_errors, **kwargs)
         43 def _compute(self, real_data, synthetic_data, metadata=None, store_errors=False, **kwargs):
         44     """Compute this metric for all columns.
         45 
         46     This is done by computing the underlying SingleColumn metric to all the
       (...)
         66             A mapping of column name to metric output.
         67     """
    ---> 68     real_data, synthetic_data, metadata = self._validate_inputs(
         69         real_data, synthetic_data, metadata)
         71     fields = self._select_fields(metadata, self.field_types)
         72     invalid_cols = set(metadata['fields'].keys()) - set(fields)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\base.py:119, in SingleTableMetric._validate_inputs(cls, real_data, synthetic_data, metadata)
        116 if not isinstance(metadata, dict):
        117     metadata = metadata.to_dict()
    --> 119 fields = metadata['fields']
        120 for column in real_data.columns:
        121     if column not in fields:
    
    KeyError: 'fields'
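
The traceback shows that the single-table code path reads metadata['fields'] and then checks every column of the real data against it. A small pre-flight check along the same lines can surface the problem before the report runs. This is a sketch written for this document: the helper name `check_single_table_metadata` is hypothetical, and the hint about a top-level 'tables' key assumes that multi-table metadata is laid out that way.

```python
def check_single_table_metadata(metadata, columns):
    """Return a list of problems; an empty list means the dict looks usable."""
    if not isinstance(metadata, dict):
        return ["metadata is not a dict; convert it first, e.g. with to_dict()"]

    fields = metadata.get("fields")
    if fields is None:
        hint = ""
        if "tables" in metadata:
            # Assumption: a 'tables' key indicates multi-table metadata.
            hint = " (found 'tables', which looks like multi-table metadata)"
        return ["missing top-level 'fields' key" + hint]

    return [
        f"column {column!r} is not described in 'fields'"
        for column in columns
        if column not in fields
    ]


good = {"fields": {"Customer": {"type": "id", "subtype": "string"}}}
problems = check_single_table_metadata(good, ["Customer", "State"])
```

With the toy metadata above, `problems` reports that the State column is missing from 'fields'.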
    
    
    bug resolution:WAI 
    opened by ketandaryanani 3
  • reports.generate 'fields' index issue

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: '0.7.0'
    • Python version: 3.9.13
    • Operating System: Windows under WSL2 (Ubuntu)

    Error Description

    Attempting to generate a report with report = QualityReport() followed by report.generate(x_y_df, synth_data, meta_data) raises KeyError: 'fields'.

    Steps to reproduce

    x_y_df.to_json(reports_path + '/' + 'xy_df.json')

    with open(reports_path + '/' + 'xy_df.json') as f:
        meta_data = json.load(f)

    # Initialize report
    report = QualityReport()
    report.generate(x_y_df, synth_data, meta_data)
    

    Traceback:

    File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/reports/single_table/quality_report.py", line 72, in generate
        self._metric_results[metric.__name__] = metric.compute_breakdown(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 147, in compute_breakdown
        return cls._compute(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 68, in _compute
        real_data, synthetic_data, metadata = self._validate_inputs(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/base.py", line 119, in _validate_inputs
        fields = metadata['fields']
    KeyError: 'fields'
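
The snippet above passes the table's own JSON dump as metadata, whereas the code in the traceback expects a dict with a top-level 'fields' key describing each column. As a rough illustration of that shape (the same one used in the other issues on this page), the sketch below derives such a dict from a few records in plain Python. The helper name `sketch_metadata` and its type-guessing rules are assumptions made for illustration and are not an SDMetrics or SDV API.

```python
def sketch_metadata(records, primary_key=None):
    """Guess a {'fields': ...} metadata dict from a list of row dicts."""
    fields = {}
    for column in records[0]:
        values = [row[column] for row in records if row.get(column) is not None]
        all_int = bool(values) and all(isinstance(v, int) for v in values)

        if column == primary_key:
            fields[column] = {"type": "id", "subtype": "integer" if all_int else "string"}
        elif values and all(isinstance(v, bool) for v in values):
            fields[column] = {"type": "boolean"}
        elif all_int:
            fields[column] = {"type": "numerical", "subtype": "integer"}
        elif values and all(isinstance(v, (int, float)) for v in values):
            fields[column] = {"type": "numerical", "subtype": "float"}
        else:
            # Anything else (strings, mixed values) is treated as categorical.
            fields[column] = {"type": "categorical"}

    metadata = {"fields": fields}
    if primary_key is not None:
        metadata["primary_key"] = primary_key
    return metadata


rows = [
    {"Customer": "BU79786", "State": "Washington", "Income": 56274, "Total Claim Amount": 384.811147},
    {"Customer": "QZ44356", "State": "Arizona", "Income": 0, "Total Claim Amount": 1131.464935},
]
guessed = sketch_metadata(rows, primary_key="Customer")
```

Guessed types should always be reviewed by hand, since a column of digits may be an ID rather than a number.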
    
    

    Thanks, Ben

    bug resolution:WAI 
    opened by bdeck8317 3
  • NewRowSynthesis: ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval

    Environment Details

    • SDV version: sdv==0.17.1
    • Python version: Python 3.9.13
    • Operating System: Linux

    Error Description

    pandas==1.4.3

    ValueError when running NewRowSynthesis

    Steps to reproduce

    from sdmetrics.single_table import NewRowSynthesis
    
    metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)
    
    model = GaussianCopula(
        primary_key="student_id"
    )
    model.fit(real_data)
    synthetic_data = model.sample(250)
    
    new_row_synthesis_score = NewRowSynthesis.compute(
        real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
    )
    
    ValueError                                Traceback (most recent call last)
    Cell In [43], line 11
          8 model.fit(real_data)
          9 synthetic_data = model.sample(250)
    ---> 11 new_row_synthesis_score = NewRowSynthesis.compute(
         12     real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
         13 )
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_table/new_row_synthesis.py:104, in NewRowSynthesis.compute(cls, real_data, synthetic_data, metadata, numerical_match_tolerance, synthetic_sample_size)
        101     row_filter.append(field_filter)
        103 try:
    --> 104     matches = real_data.query(' and '.join(row_filter))
        105 except TypeError:
        106     if len(real_data) > 10000:
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4111, in DataFrame.query(self, expr, inplace, **kwargs)
       4109 kwargs["level"] = kwargs.pop("level", 0) + 1
       4110 kwargs["target"] = None
    -> 4111 res = self.eval(expr, **kwargs)
       4113 try:
       4114     result = self.loc[res]
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4240, in DataFrame.eval(self, expr, inplace, **kwargs)
       4237     kwargs["target"] = self
    ...
        328     )
        329 engine = _check_engine(engine)
        330 _check_parser(parser)
    
    ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval
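
As background on what this metric is after, the sketch below counts how many synthetic rows do not appear in the real data, using exact matching on plain Python dicts rather than a query string. It is a simplified illustration, not the SDMetrics implementation: the real metric also takes a numerical_match_tolerance and a synthetic_sample_size (both visible in the traceback signature), which this sketch ignores. The function name `new_row_fraction` is hypothetical.

```python
def new_row_fraction(real_rows, synthetic_rows, ignore=()):
    """Fraction of synthetic rows with no exact match among the real rows."""

    def row_key(row):
        # Sort by column name so that key order in the dict does not matter.
        return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

    if not synthetic_rows:
        return float("nan")

    real_keys = {row_key(row) for row in real_rows}
    new_rows = sum(row_key(row) not in real_keys for row in synthetic_rows)
    return new_rows / len(synthetic_rows)


real = [
    {"student_id": 1, "gender": "M", "salary": 100},
    {"student_id": 2, "gender": "F", "salary": 200},
]
synthetic = [
    {"student_id": 10, "gender": "M", "salary": 100},
    {"student_id": 11, "gender": "F", "salary": 150},
    {"student_id": 12, "gender": "M", "salary": 300},
    {"student_id": 13, "gender": "F", "salary": 200},
]
score = new_row_fraction(real, synthetic, ignore={"student_id"})
```

Ignoring the primary key matters here: with student_id included every synthetic row would count as new, while ignoring it gives a score of 0.5.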
    
    
    bug feature:metrics resolution:resolved 
    opened by darenr 2
  • BoundaryAdherence: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

    Environment Details

    • SDV version: sdv==0.17.1
    • Python version: Python 3.9.13
    • Operating System: Linux

    Error Description

    pandas==1.4.3

    ValueError when running BoundaryAdherence

    Steps to reproduce

    from sdmetrics.single_column import BoundaryAdherence
    
    metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)
    
    model = GaussianCopula(
        primary_key="student_id"
    )
    model.fit(real_data)
    synthetic_data = model.sample(250)
    
    BoundaryAdherence.compute(
        real_data=real_data, synthetic_data=synthetic_data
    )
    
    ValueError                                Traceback (most recent call last)
    Cell In [42], line 11
          8 model.fit(real_data)
          9 synthetic_data = model.sample(250)
    ---> 11 BoundaryAdherence.compute(
         12     real_data=real_data, synthetic_data=synthetic_data
         13 )
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_column/statistical/boundary_adherence.py:46, in BoundaryAdherence.compute(cls, real_data, synthetic_data)
         32 @classmethod
         33 def compute(cls, real_data, synthetic_data):
         34     """Compute the boundary adherence of two continuous columns.
         35 
         36     Args:
       (...)
         44             The boundary adherence of the two columns.
         45     """
    ---> 46     real_data = pd.Series(real_data).dropna()
         47     synthetic_data = pd.Series(synthetic_data).dropna()
         49     if is_datetime(real_data):
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/series.py:367, in __init__(self, data, index, dtype, name, copy, fastpath)
        364         self.name = name
        365     return
    ...
       1528         f"The truth value of a {type(self).__name__} is ambiguous. "
       1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
       1530     )
    
    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    
    
    bug resolution:WAI feature:metrics 
    opened by darenr 2
  • Update README.md to fix a bug

    BoundaryAdherence expects DataFrames for the real and synthetic data arguments. Currently it receives Series, because date_range is accessed with single square brackets instead of double square brackets.

    opened by Pverheijen 2
  • Academic Paper to cite?

    Is there any paper to cite related to this library? I am evaluating some synthetic data for a publication so I wanted to include your contribution among the citations.

    Thanks Andrea Galloni

    question resolution:resolved 
    opened by andreagalloni92 2
  • SDMetrics 0.4.2 has incompatible copula version with SDV

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.4.2
    • SDV version: 0.14.1
    • Python version: 3.8.10
    • Operating System: ubuntu server lts 20.4.04

    Error Description

    The latest SDMetrics version (which is installed by default when installing SDV) has incompatible copula requirements with downstream SDV.

    Steps to reproduce

    On a fresh virtual environment, install pip-tools.

    Place the following on a file named requirements.in

    sdv
    #sdmetrics==0.4.1
    

    Type the following commands

    pip install -r requirements.in
    pip-compile requirements.in
    

    pip-compile reports:

    Could not find a version that matches copulas<0.7,<0.8,>=0.6.1,>=0.7.0 (from sdv==0.14.1->-r requirements.txt (line 1))
    Tried: 0.0.0, 0.0.0, 0.1.0, 0.1.0, 0.1.1, 0.1.1, 0.2.0, 0.2.0, 0.2.1, 0.2.1, 0.2.3, 0.2.3, 0.2.4, 0.2.4, 0.2.5, 0.2.5, 0.3.0, 0.3.0, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.6.0, 0.6.0, 0.6.1, 0.6.1, 0.7.0, 0.7.0
    Skipped pre-versions: 0.3.0.dev0, 0.3.0.dev0, 0.3.2.dev1, 0.3.2.dev1, 0.3.3.dev0, 0.3.3.dev0, 0.4.0.dev0, 0.4.0.dev0, 0.5.0.dev0, 0.5.0.dev0, 0.5.0.dev1, 0.5.0.dev1, 0.5.1.dev0, 0.5.1.dev0, 0.5.1.dev1, 0.5.1.dev1, 0.5.2.dev0, 0.5.2.dev0, 0.5.2.dev1, 0.5.2.dev1, 0.6.0.dev0, 0.6.0.dev0, 0.6.1.dev0, 0.6.1.dev0, 0.7.0.dev0, 0.7.0.dev0
    There are incompatible versions in the resolved dependencies:
      copulas<0.8,>=0.7.0 (from sdmetrics==0.4.2->sdv==0.14.1->-r requirements.txt (line 1))
      copulas<0.7,>=0.6.1 (from sdv==0.14.1->-r requirements.txt (line 1))
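
The conflict reported by pip-compile can be checked by hand: the two requirement ranges for copulas do not overlap. The sketch below does this with a deliberately simplified version parser that only handles plain dotted release numbers and the two operators that appear in the error message. It is an illustration of the reasoning, not a replacement for a real resolver.

```python
import operator

OPS = {">=": operator.ge, "<": operator.lt}


def parse(version):
    """Turn '0.7' or '0.7.0' into a comparable 3-tuple such as (0, 7, 0)."""
    parts = [int(piece) for piece in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(version, constraints):
    return all(OPS[op](parse(version), parse(bound)) for op, bound in constraints)


# Release versions listed in the pip-compile output above (duplicates removed).
releases = [
    "0.0.0", "0.1.0", "0.1.1", "0.2.0", "0.2.1", "0.2.3", "0.2.4", "0.2.5",
    "0.3.0", "0.3.2", "0.3.3", "0.4.0", "0.5.0", "0.5.1", "0.6.0", "0.6.1", "0.7.0",
]
sdv_needs = [(">=", "0.6.1"), ("<", "0.7")]        # from sdv==0.14.1
sdmetrics_needs = [(">=", "0.7.0"), ("<", "0.8")]  # from sdmetrics==0.4.2

ok_for_sdv = [v for v in releases if satisfies(v, sdv_needs)]
ok_for_sdmetrics = [v for v in releases if satisfies(v, sdmetrics_needs)]
ok_for_both = [v for v in releases if satisfies(v, sdv_needs + sdmetrics_needs)]
```

Each package accepts exactly one of the published releases (0.6.1 and 0.7.0 respectively), and `ok_for_both` is empty, which is why no version can be resolved.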
    

    From the setup.py of both projects, we can verify the above requirements.

    pip install works correctly, but we get the following (snippet):

    Collecting llvmlite<0.39,>=0.38.0rc1
      Using cached llvmlite-0.38.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
    Collecting charset-normalizer~=2.0.0; python_version >= "3"
      Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
    Collecting idna<4,>=2.5; python_version >= "3"
      Using cached idna-3.3-py3-none-any.whl (61 kB)
    Collecting certifi>=2017.4.17
      Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
    Collecting urllib3<1.27,>=1.21.1
      Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
    ERROR: numba 0.55.1 has requirement numpy<1.22,>=1.18, but you'll have numpy 1.22.3 which is incompatible.
    ERROR: rdt 0.6.4 has requirement scipy<1.8,>=1.5.4, but you'll have scipy 1.8.0 which is incompatible.
    ERROR: sdmetrics 0.4.2 has requirement copulas<0.8,>=0.7.0, but you'll have copulas 0.6.1 which is incompatible.
    Installing collected packages: tqdm, typing-extensions, torch, numpy, six, python-dateutil, pytz, pandas, deepecho, scipy, threadpoolctl, joblib, scikit-learn, llvmlite, numba, pyts, pyyaml, psutil, rdt, fonttools, cycler, pyparsing, packaging, pillow, kiwisolver, matplotlib, copulas, sdmetrics, charset-normalizer, idna, certifi, urllib3, requests, torchvision, ctgan, graphviz, text-unidecode, Faker, sdv
    

    When uncommenting sdmetrics from requirements.in, both commands run "correctly".

    Furthermore, when pip-compile and pip have cached sdmetrics==0.4.1, they both select that version instead and no error is shown.

    The following file never compiles:

    sdv==0.14.1
    sdmetrics==0.4.2
    

    I don't know what the appropriate solution to something like this would be. I'm not a library developer.

    bug 
    opened by antheas 2
  • README doesn't accurately describe the output of `compute_metrics`

    The current README doesn't print the latest output. More specifically, the command sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata) currently doesn't print the same as what the README prints (e.g. the current code produces a column named error containing None values which the README doesn't have, as well as other changes).

    documentation resolution:obsolete 
    opened by fealho 2
  • Relational `KSTest` crashes with `IncomputableMetricError` if a table doesn't have numerical columns

    Environment Details

    • SDMetrics version: 0.4.1
    • Python version: 3.7

    Error Description

    The relational KSTest is supposed to run the KSTest on all numerical columns in all tables and return the average score.

    However, this test crashes if it encounters a table that has no numerical columns. I expect this test to succeed as long as there is at least 1 numerical column in any of the tables.

    Steps to reproduce

    Use the relational demo dataset and pass it in with the metadata.

    from sdv.metrics.demos import load_multi_table_demo
    from sdv.metrics.relational import KSTest
    
    real_data, synthetic_data, metadata = load_multi_table_demo()
    KSTest.compute(real_data, synthetic_data, metadata)
    

    Output:

    /usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/base.py in _select_fields(cls, metadata, types)
         78 
         79         if len(fields) == 0:
    ---> 80             raise IncomputableMetricError(f'Cannot find fields of types {types}')
         81 
         82         return fields
    
    IncomputableMetricError: Cannot find fields of types ('numerical',)
    

    I believe this is happening because table sessions has no numerical columns. Interestingly, it does work if I exclude the metadata object -- because then it starts assuming that the id field is a numerical column.

    KSTest.compute(real_data, synthetic_data)
    
    0.8555555555555556
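
The expected behavior described in this issue can be sketched as follows: tables without numerical columns are skipped, and an error is raised only when no table has any. The function name `relational_average`, the table names other than sessions, and the per-column scores are all hypothetical; averaging per table first and then across tables is one possible reading of "the average score", not necessarily what the library does.

```python
def relational_average(column_scores_by_table):
    """Average per-table mean scores, skipping tables with nothing to score."""
    table_means = []
    for table_name, scores in column_scores_by_table.items():
        if not scores:
            # No numerical columns in this table: skip it instead of failing.
            continue
        table_means.append(sum(scores) / len(scores))

    if not table_means:
        raise ValueError("no table has a numerical column to score")
    return sum(table_means) / len(table_means)


scores = {
    "users": [0.8, 0.9],
    "sessions": [],  # no numerical columns, as in the demo dataset
    "transactions": [0.7],
}
average = relational_average(scores)
```

With these made-up numbers the result is about 0.775, and the empty sessions table has no effect on it.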
    
    bug 
    opened by npatki 2
  • Detection metrics should only use statistically modeled columns (filter out the rest)

    Problem Description

    The Detection metrics use machine learning to determine whether the real vs. synthetic data can be detected. For this to work, we should only be using columns that are statistically modeled.

    Expected behavior

    When running any of the detection metrics, the following columns should be ignored:

    • Primary keys
    • Foreign keys
    • Any other kinds of IDs
    • PII or sensitive data
    • Text data (or data created by RegEx)

    None of these columns provide any useful information for detection.

    The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), boolean
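
A possible shape for this filter, using the metadata dict layout seen elsewhere on this page, is sketched below. The helper name `modeled_columns`, the `foreign_keys` argument, and the boolean 'pii' flag on a field are assumptions made for illustration; the actual metadata keys may differ.

```python
MODELED_TYPES = {"numerical", "datetime", "categorical", "boolean"}


def modeled_columns(metadata, foreign_keys=()):
    """Return the column names that a detection metric should keep."""
    primary_key = metadata.get("primary_key")
    keep = []
    for name, field in metadata["fields"].items():
        if name == primary_key or name in foreign_keys:
            continue  # keys carry no statistical signal
        if field.get("pii"):
            continue  # assumed flag for PII or sensitive columns
        if field.get("type") in MODELED_TYPES:
            keep.append(name)  # ids and text fall through and are dropped
    return keep


metadata = {
    "primary_key": "user_id",
    "fields": {
        "user_id": {"type": "id", "subtype": "integer"},
        "session_id": {"type": "id", "subtype": "integer"},
        "email": {"type": "categorical", "pii": True},
        "age": {"type": "numerical", "subtype": "integer"},
        "country": {"type": "categorical"},
        "joined": {"type": "datetime"},
    },
}
kept = modeled_columns(metadata, foreign_keys={"session_id"})
```

For the toy metadata above, only age, country and joined survive the filter.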

    Additional context

    We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.

    feature request 
    opened by npatki 0
  • Does removing foreign keys in detection metrics for multi-tables make sense?

    Environment details

    If you are already running SDMetrics, please indicate the following details about the environment in which you are running it:

    • SDMetrics version: 0.8.2
    • Python version:
    • Operating System:

    Problem description

    Nice correction for DetectionMetric (Solving primary_key use for detection metrics #251, https://github.com/sdv-dev/SDMetrics/pull/251) Removing the primary key from the table makes more sense for the evaluation. But, what about foreign keys present in a table of a relational database. Does not the same problem also occur for the foreign keys too and they need to be deleted? Also, for the parent-child logistic detection metric, the foreign keys (referencing parent rows) in the child tables are no longer required when the denormalized tables are used.

    question under discussion 
    opened by mohamedgy 1
  • Visualize cardinality of foreign key columns

    I'm filing this issue on behalf of a user request on our Slack.

    Problem Description

    Currently, the single column visualization function (utils.get_column_plot) only supports columns that are numerical, categorical, boolean or datetime. It would be nice to support foreign keys as well.

    Expected behavior

    If I provide a foreign key name into the utils.get_column_plot function, then I expect to see a plot of real vs. synthetic data:

    1. Compute the cardinality (# of children with the same parent) for the synthetic and real data. This will form 2 distributions for real & synthetic data
    2. Plot those distributions similar to how we plot numerical data. The x-axis label should read "Cardinality".

    This can be done from get_column_plot.

    from sdmetrics.reports import utils
    
    fig = utils.get_column_plot(
        real_data=real_table,
        synthetic_data=synthetic_table,
        column_name='foreign_key_column',
        metadata=my_table_metadata_dict
    )
    
    fig.show()
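
Step 1 of the expected behavior, computing the cardinality distribution, can be sketched with the standard library alone. The function name `cardinality_distribution` and the optional `parent_ids` argument (used to count parents with zero children) are hypothetical and only illustrate the computation; plotting is left out.

```python
from collections import Counter


def cardinality_distribution(child_foreign_keys, parent_ids=None):
    """Return the sorted number of children per parent.

    If parent_ids is given, parents without any children count as 0.
    """
    counts = Counter(child_foreign_keys)
    if parent_ids is None:
        return sorted(counts.values())
    return sorted(counts.get(parent, 0) for parent in parent_ids)


real_fk = ["a", "a", "b", "c", "c", "c"]
synthetic_fk = ["a", "b", "b", "c", "c", "d"]
parents = ["a", "b", "c", "d"]

real_cardinality = cardinality_distribution(real_fk, parents)
synthetic_cardinality = cardinality_distribution(synthetic_fk, parents)
```

The two resulting lists are the distributions that step 2 would plot side by side with "Cardinality" on the x-axis.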
    
    feature request 
    opened by npatki 1
  • Quality Report crashes when numerical column has only `NaN` values

    Environment Details

    • SDMetrics version: 0.8.0
    • Python version: 3.7
    • Operating System: Linux

    Error Description

    A numerical column in the real data may contain missing values. Sometimes, the synthetic data may only produce these missing values and fail to create any numerical values. In such cases, the software crashes when I try to produce a quality report.

    Expected Behavior: Certain metrics may not be computable if there are only NaN values. But instead of crashing the report, the error should be noted in the detailed breakdowns, and the report should still produce a score while ignoring the values (along with details, visualizations, etc.)

    Steps to reproduce

    import numpy as np
    import pandas as pd
    from sdmetrics.reports.single_table import QualityReport
    
    real_data = pd.DataFrame(data={
        'col1': [1, 2, 1, 3, 4],
        'col2': [2, 4, 1, 7, 1]
    })
    
    # 'col2' contains only NaN values
    synthetic_data = pd.DataFrame(data={
        'col1': [1, 3, 2, 2, 1],
        'col2': [np.nan]*5
    })
    
    metadata = {
        'fields': {
            'col1': { 'type': 'numerical', 'subtype': 'integer' },
            'col2': { 'type': 'numerical', 'subtype': 'integer' }
        }
    }
    
    report = QualityReport()
    report.generate(real_data, synthetic_data, metadata)
    

    Output

    Creating report:  50%|█████     | 2/4 [00:00<00:00, 106.06it/s]
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-55-a6be2bd142ce> in <module>
         17 
         18 report = QualityReport()
    ---> 19 report.generate(real_data, synthetic_data, metadata)
    
    3 frames
    /usr/local/lib/python3.7/dist-packages/sdmetrics/reports/single_table/quality_report.py in generate(self, real_data, synthetic_data, metadata)
         71             try:
         72                 self._metric_results[metric.__name__] = metric.compute_breakdown(
    ---> 73                     real_data, synthetic_data, metadata)
         74             except IncomputableMetricError:
         75                 # Metric is not compatible with this dataset.
    
    /usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/multi_column_pairs.py in compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
        128             synthetic = synthetic_data[list(sorted_columns)]
        129             breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
    --> 130                 real, synthetic, **kwargs)
        131 
        132         return breakdown
    
    /usr/local/lib/python3.7/dist-packages/sdmetrics/column_pairs/statistical/correlation_similarity.py in compute_breakdown(cls, real_data, synthetic_data, coefficient)
         83 
         84         correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
    ---> 85         correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
         86 
         87         if np.isnan(correlation_real) or np.isnan(correlation_synthetic):
    
    /usr/local/lib/python3.7/dist-packages/scipy/stats/stats.py in pearsonr(x, y)
       4014 
       4015     if n < 2:
    -> 4016         raise ValueError('x and y must have length at least 2.')
       4017 
       4018     x = np.asarray(x)
    
    ValueError: x and y must have length at least 2.
    

    Note: It is OK for the correlation metric itself to fail here, since correlation is undefined when there are no values. The report, however, should not crash.
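
One way to get the behavior the note asks for is to catch the metric's failure per column pair and record it, rather than letting it propagate out of the report. The following is a minimal sketch under stated assumptions, not the SDMetrics implementation: `pearson` and `safe_pair_scores` are hypothetical helpers, and `pearson` only mimics the `ValueError` that `scipy.stats.pearsonr` raises in the traceback above when it receives fewer than two values.

```python
import math


def pearson(x, y):
    """Pearson correlation; raises ValueError for fewer than 2 points."""
    n = len(x)
    if n != len(y):
        raise ValueError('x and y must have the same length.')
    if n < 2:
        raise ValueError('x and y must have length at least 2.')
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def safe_pair_scores(table, pairs):
    """Hypothetical helper: score each column pair, recording errors instead of raising."""
    results = {}
    for col_a, col_b in pairs:
        try:
            results[(col_a, col_b)] = {'score': pearson(table[col_a], table[col_b])}
        except ValueError as error:
            # Keep going: one incomputable pair should not abort the whole report.
            results[(col_a, col_b)] = {'score': math.nan, 'error': str(error)}
    return results


# Columns 'c' and 'd' are empty, as when the synthetic data has no usable values.
synthetic = {'a': [1.0, 2.0, 3.0], 'b': [2.0, 4.0, 6.0], 'c': [], 'd': []}
scores = safe_pair_scores(synthetic, [('a', 'b'), ('c', 'd')])
```

With this approach the failing pair still shows up in the results, with a NaN score and the error message attached, so the problem stays visible without stopping the run.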

    bug feature:reports 
    opened by npatki 0
Releases(v0.8.1)
  • v0.8.1(Dec 10, 2022)

    This release fixes bugs in the existing metrics and reports. We also make the reports compatible with future SDV versions.

    New Features

    • Filter out additional sdtypes that will be available in future versions of SDV - Issue #265 by @katxiao
    • NewRowSynthesis should ignore PrimaryKey column - Issue #260 by @katxiao

    Bug Fixes

    • Visualization crashes if there are metric errors - Issue #272 by @katxiao
    • Score for TVComplement if synthetic data only has missing values - Issue #271 by @katxiao
    • Fix 'timestamp' column metadata in the multi table demo - Issue #267 by @katxiao
    • Fix 'duration' column in the single table demo - Issue #266 by @katxiao
    • README.md example has a bug - Issue #262 by @katxiao
    • Update README.md to fix a bug - Issue #263 by @katxiao
    • Visualization get_column_pair_plot: update parameter name to column_names - Issue #258 by @katxiao
    • "Column Shapes" and "Column Pair Trends" Calculation Inconsistency - Issue #254 by @katxiao
    • Diagnostic Report missing RangeCoverage for numerical columns - Issue #255 by @katxiao

  • v0.8.0(Nov 16, 2022)

    This release introduces the DiagnosticReport, which helps a user verify – at a quick glance – that their data is valid. We also fix an existing bug with detection metrics.

    New Features

    • Fixes for new metadata - Issue #253 by @katxiao
    • Add default synthetic sample size to DiagnosticReport - Issue #248 by @katxiao
    • Exclude pii columns from single table metrics - Issue #245 by @katxiao
    • Accept both old and new metadata - Issue #244 by @katxiao
    • Address Diagnostic Report and metric edge cases - Issue #243 by @katxiao
    • Update visualization average per table - Issue #242 by @katxiao
    • Add save and load functionality to multi-table DiagnosticReport - Issue #218 by @katxiao
    • Visualization methods for the multi-table DiagnosticReport - Issue #217 by @katxiao
    • Add getter methods to multi-table DiagnosticReport - Issue #216 by @katxiao
    • Create multi-table DiagnosticReport - Issue #215 by @katxiao
    • Visualization methods for the single-table DiagnosticReport - Issue #211 by @katxiao
    • Add getter methods to single-table DiagnosticReport - Issue #210 by @katxiao
    • Create single-table DiagnosticReport - Issue #209 by @katxiao
    • Add save and load functionality to single-table DiagnosticReport - Issue #212 by @katxiao
    • Add single table diagnostic report - Issue #237 by @katxiao

    Bug Fixes

    • Detection test doesn't look at metadata when determining which columns to use - Issue #119 by @R-Palazzo

    Internal Improvements

    • Remove torch dependency - Issue #233 by @katxiao
    • Update README - Issue #250 by @katxiao
  • v0.7.0(Sep 27, 2022)

    This release introduces the QualityReport, which evaluates how well synthetic data captures mathematical properties from the real data. The QualityReport incorporates the new metrics introduced in the previous release, and allows users to get detailed results, visualize the scores, and save the report for future viewing. We also add utility methods for visualizing columns and pairs of columns.

    New Features

    • Catch typeerror in new row synthesis query - Issue #234 by @katxiao
    • Add NewRowSynthesis Metric - Issue #207 by @katxiao
    • Update plot utilities API - Issue #228 by @katxiao
    • Fix column pairs visualization bug - Issue #230 by @katxiao
    • Save version - Issue #229 by @katxiao
    • Update efficacy metrics API - Issue #227 by @katxiao
    • Add RangeCoverage Metric - Issue #208 by @katxiao
    • Add get_column_pairs_plot utility method - Issue #223 by @katxiao
    • Parse date as datetime - Issue #222 by @katxiao
    • Update error handling for reports - Issue #221 by @katxiao
    • Visualization API update - Issue #220 by @katxiao
    • Bug fixes for QualityReport - Issue #219 by @katxiao
    • Update column pair metric calculation - Issue #214 by @katxiao
    • Add get score methods for multi table QualityReport - Issue #190 by @katxiao
    • Add multi table QualityReport visualization methods - Issue #192 by @katxiao
    • Add plot_column visualization utility method - Issue #193 by @katxiao
    • Add save and load behavior to multi table QualityReport - Issue #188 by @katxiao
    • Create multi-table QualityReport - Issue #186 by @katxiao
    • Add single table QualityReport visualization methods - Issue #191 by @katxiao
    • Add save and load behavior to single table QualityReport - Issue #187 by @katxiao
    • Add get score methods for single table Quality Report - Issue #189 by @katxiao
    • Create single-table QualityReport - Issue #185 by @katxiao

    Internal Improvements

    • Auto apply "new" label instead of "pending review" - Issue #164 by @katxiao
    • fix typo - Issue #195 by @fealho
  • v0.6.0(Aug 12, 2022)

    This release removes SDMetrics' dependency on the RDT library and introduces new quality and diagnostic metrics. It also adds a compute_breakdown method that returns a breakdown of metric results.
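
To illustrate what a breakdown adds over a single aggregate score, here is a minimal sketch under stated assumptions. The `tv_complement` and `compute_breakdown` functions below are hypothetical stand-ins written for this example, and the per-column dictionary layout is an assumption rather than the library's documented return format. The score used is one minus the total variation distance between category frequencies, which is the idea behind the TVComplement metric listed in this release.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between category frequencies.

    Assumes both columns are non-empty.
    """
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1 - distance


def compute_breakdown(real_table, synthetic_table):
    """Hypothetical helper: one score per column instead of a single aggregate."""
    return {
        column: {'score': tv_complement(real_table[column], synthetic_table[column])}
        for column in real_table
    }


real = {'color': ['red', 'red', 'blue', 'blue'], 'size': ['S', 'M', 'L', 'L']}
synthetic = {'color': ['red', 'blue', 'blue', 'blue'], 'size': ['S', 'M', 'L', 'L']}

breakdown = compute_breakdown(real, synthetic)
# The aggregate hides which column is off; the breakdown shows it is 'color'.
overall = sum(item['score'] for item in breakdown.values()) / len(breakdown)
```

The aggregate alone would only say the tables are mostly similar, while the breakdown points at the specific column whose distribution differs.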

    New Features

    • Handle null values correctly - Issue #194 by @katxiao
    • Add wrapper classes for new single and multi table metrics - Issue #169 by @katxiao
    • Add CorrelationSimilarity metric - Issue #143 by @katxiao
    • Add CardinalityShapeSimilarity metric - Issue #160 by @katxiao
    • Add CardinalityStatisticSimilarity metric - Issue #145 by @katxiao
    • Add ContingencySimilarity Metric - Issue #159 by @katxiao
    • Add TVComplement metric - Issue #142 by @katxiao
    • Add MissingValueSimilarity metric - Issue #139 by @katxiao
    • Add CategoryCoverage metric - Issue #140 by @katxiao
    • Add compute breakdown column for single column - Issue #152 by @katxiao
    • Add BoundaryAdherence metric - Issue #138 by @katxiao
    • Get KSComplement Score Breakdown - Issue #130 by @katxiao
    • Add StatisticSimilarity Metric - Issue #137 by @katxiao
    • New features for KSTest.compute - Issue #129 by @amontanez24

    Internal Improvements

    • Add integration tests and fixes - Issue #183 by @katxiao
    • Remove rdt hypertransformer dependency in timeseries metrics - Issue #176 by @katxiao
    • Replace rdt LabelEncoder with sklearn - Issue #178 by @katxiao
    • Remove rdt as a dependency - Issue #182 by @katxiao
    • Use sklearn's OneHotEncoder instead of rdt - Issue #170 by @katxiao
    • Remove KSTestExtended - Issue #180 by @katxiao
    • Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue #171 by @katxiao
    • Update the default tags for a feature request - Issue #172 by @katxiao
    • Bump github macos version - Issue #174 by @katxiao
    • Fix pydocstyle to check sdmetrics - Issue #153 by @pvk-developer
    • Update the RDT version to 1.0 - Issue #150 by @pvk-developer
    • Update slack invite link - Issue #132 by @pvk-developer
  • v0.5.0(May 10, 2022)

    This release fixes an error where the relational KSTest crashes if a table doesn't have numerical columns. It also includes some housekeeping, updating the pomegranate and copulas version requirements.

    Issues closed

    • Cap pomegranate to <0.14.7 - Issue #116 by @csala
    • Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue #109 by @katxiao
  • v0.4.1(Dec 9, 2021)

    This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.

    Issues closed

    • Report metric errors from compute_metrics - Issue #107 by @katxiao
    • Specify default categorical transformers - Issue #105 by @katxiao
  • v0.4.0(Nov 16, 2021)

    This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.

    Issues closed

    • Replace sktime for pyts - Issue #103 by @pvk-developer
    • Add support for Python 3.9 - Issue #102 by @pvk-developer
    • Increase code style lint - Issue #80 by @fealho
    • Add pip check to CI workflows - Issue #79 by @pvk-developer
    • Upgrade dependency ranges - Issue #69 by @katxiao
  • v0.3.2(Aug 17, 2021)

  • v0.3.1(Jul 12, 2021)

    This release fixes a bug to make the privacy metrics available in the API docs. It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.

    Issues closed

    • CategoricalSVM not being imported - Issue #65 by @csala
  • v0.3.0(Mar 31, 2021)

    This release includes privacy metrics to evaluate whether the real data could be obtained or deduced from the synthetic samples. Additionally, all the metrics have a normalize method, which takes the raw_score generated by the metric and returns a value between 0 and 1.
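
As an illustration of what mapping a raw score into the 0 to 1 range can involve, here is a minimal sketch under stated assumptions. The standalone `normalize` function and its `min_value`, `max_value` and `maximize` parameters are hypothetical and chosen for this example; the formulas are one reasonable choice and are not claimed to match the library's implementation.

```python
import math


def normalize(raw_score, min_value, max_value, maximize=True):
    """Map a raw metric score into [0, 1] so that higher always means better.

    Assumes min_value < max_value; either bound may be infinite.
    """
    if not min_value <= raw_score <= max_value:
        raise ValueError('raw_score is outside the metric bounds.')

    lower_finite = math.isfinite(min_value)
    upper_finite = math.isfinite(max_value)
    if lower_finite and upper_finite:
        # Bounded on both sides: plain min-max scaling.
        score = (raw_score - min_value) / (max_value - min_value)
    elif lower_finite:
        # Bounded below only: exponential saturation towards 1.
        score = 1.0 - math.exp(min_value - raw_score)
    elif upper_finite:
        # Bounded above only: exponential decay away from the maximum.
        score = math.exp(raw_score - max_value)
    elif raw_score >= 0:
        # Unbounded: logistic function, written to avoid overflow.
        score = 1.0 / (1.0 + math.exp(-raw_score))
    else:
        exp_score = math.exp(raw_score)
        score = exp_score / (1.0 + exp_score)

    # For metrics where lower raw values are better, flip the scale.
    return score if maximize else 1.0 - score


bounded = normalize(0.5, 0.0, 1.0)
flipped = normalize(1.0, 0.0, 4.0, maximize=False)
unbounded = normalize(0.0, -math.inf, math.inf)
at_lower_bound = normalize(0.0, 0.0, math.inf)
```

Putting every metric on the same scale and orientation is what makes scores from different metric families comparable and averageable.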

    Issues closed

    • Add normalize method to metrics - Issue #51 by @csala and @fealho
    • Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho
  • v0.2.0(Feb 24, 2021)

  • v0.1.3(Feb 15, 2021)

  • v0.1.2(Jan 27, 2021)

    Bug-fixing release that addresses several minor errors.

    Issues closed

    • More splits than classes - Issue #46 by @fealho
    • Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho
    • Time series metrics fail with variable-length time series - Issue #42 by @fealho
    • ParentChildDetection metrics KeyError - Issue #39 by @csala
  • v0.1.1(Dec 30, 2020)

    This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work well with binary targets which are not boolean.

    Issues closed

    • Timeseries efficacy metrics - Issue #35 by @csala
    • Timeseries detection metrics - Issue #34 by @csala
    • Ensure binary classification targets are bool - Issue #33 by @csala
  • v0.1.0(Dec 18, 2020)

    This release introduces a new project organization and API, with metrics grouped by data modality, with a common API:

    • Single Column
    • Column Pair
    • Single Table
    • Multi Table
    • Time Series

    Within each data modality, different families of metrics have been implemented:

    • Statistical
    • Detection
    • Bayesian Network and Gaussian Mixture Likelihood
    • Machine Learning Efficacy
  • v0.0.4(Nov 27, 2020)

  • v0.0.3(Nov 20, 2020)

    Fix error on detection metrics when input data contains infinity or NaN values.

    Issues closed

    • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
  • v0.0.2(Aug 8, 2020)

  • v0.0.1(Jun 26, 2020)

Owner
The Synthetic Data Vault Project