Monitor the stability of a pandas or Spark dataframe

Overview

Population Shift Monitoring



popmon is a package that allows one to check the stability of a dataset. It works with both pandas and Spark dataframes.

popmon creates histograms of features binned in time slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using monitoring business rules.

Traffic light overview (example report output)

Announcements

Spark 3.0

With Spark 3.0, which is based on Scala 2.12, make sure to pick up the correct histogrammar JAR files:

spark = SparkSession.builder.config(
    "spark.jars.packages",
    "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
).getOrCreate()

For Spark 2.x, compiled against Scala 2.11, simply replace 2.12 with 2.11 in the string above.

Examples

Documentation

The entire popmon documentation, including tutorials, can be found at read-the-docs.

Notebooks

| Tutorial | Colab link |
|---|---|
| Basic tutorial | Open in Colab |
| Detailed example (featuring configuration, Apache Spark and more) | Open in Colab |
| Incremental datasets (online analysis) | Open in Colab |
| Report interpretation (step-by-step guide) | Open in Colab |

Check it out

The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:

$ pip install popmon

or check out the code from our GitHub repository:

$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon

where in this example the code is installed in editable mode (option -e).

You can now use the package in Python with:

import popmon

Congratulations, you are now ready to use the popmon library!

Quick run

As a quick example, you can do:

import pandas as pd
import popmon
from popmon import resources

# open synthetic data
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()

# generate stability report using automatic binning of all encountered features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])

# to show the output of the report in a Jupyter notebook you can simply run:
report

# or save the report to file
report.to_file("monitoring_report.html")

To specify your own binning specifications and the features you want to report on:

# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(
    time_axis="date", time_width="1w", time_offset="2020-1-6"
)

# histogram selections. Here 'date' is the first axis of each histogram.
features = [
    "date:isActive",
    "date:age",
    "date:eyeColor",
    "date:gender",
    "date:latitude",
    "date:longitude",
    "date:isActive:age",
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs = {
    "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "age": {"bin_width": 10.0, "bin_offset": 0.0},
    "date": {
        "bin_width": pd.Timedelta("4w").value,
        "bin_offset": pd.Timestamp("2015-1-1").value,
    },
}

# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)

These examples also work with Spark dataframes; a minimal sketch follows below. For all available examples, including the rendered output of the example notebooks, please see the tutorials at read-the-docs.
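As a minimal sketch of Spark usage, assuming a local SparkSession (for Spark 3.0, include the histogrammar packages as shown in the Announcements section above). Importing popmon registers pm_stability_report on Spark dataframes just as it does for pandas:

import pandas as pd
import popmon  # registers pm_stability_report on pandas and Spark dataframes
from popmon import resources
from pyspark.sql import SparkSession

# a local session for illustration purposes
spark = SparkSession.builder.master("local[*]").getOrCreate()

# load the synthetic test data and hand it to Spark
pdf = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
sdf = spark.createDataFrame(pdf)

# the same call as for pandas dataframes
report = sdf.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])
report.to_file("monitoring_report.html")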

Pipelines for monitoring dataset shift

Advanced users can leverage popmon's modular data pipeline to customize their workflow. Visualizing the pipeline can be useful for debugging or for didactic purposes, and a script included with the package generates such visualizations. The plotting is configurable; depending on the options, you will obtain a result suited to understanding the data flow, the high-level components, and the (re)use of datasets.

Example pipeline visualization

Resources

Presentations

| Title | Host | Date | Speaker |
|---|---|---|---|
| Popmon - population monitoring made easy | Big Data Technology Warsaw Summit 2021 | February 25, 2021 | Simon Brugman |
| Popmon - population monitoring made easy | Data Lunch @ Eneco | October 29, 2020 | Max Baak, Simon Brugman |
| Popmon - population monitoring made easy | Data Science Summit 2020 | October 16, 2020 | Max Baak |
| Population Shift Monitoring Made Easy: the popmon package | Online Data Science Meetup @ ING WBAA | July 8, 2020 | Tomas Sostak |
| Popmon: Population Shift Monitoring Made Easy | PyData Fest Amsterdam 2020 | June 16, 2020 | Tomas Sostak |
| Popmon: Population Shift Monitoring Made Easy | Amundsen Community Meetup | June 4, 2020 | Max Baak |

Articles

| Title | Date | Author |
|---|---|---|
| Population Shift Analysis: Monitoring Data Quality with Popmon | May 21, 2021 | Vito Gentile |
| Popmon Open Source Package — Population Shift Monitoring Made Easy | May 20, 2020 | Nicole Mpozika |

Project contributors

This package was authored by ING Wholesale Banking Advanced Analytics. Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.

Contact and support

Please note that ING WBAA provides support only on a best-effort basis.

License

Copyright ING WBAA. popmon is completely free, open-source and licensed under the MIT license.

Comments
  • feat: hist_juxtaposition

    For now, last_n defaults to 2, so only two dates appear in the dropdown. For the airline dataset, if last_n is set to max, popmon runs into the issue (for the DEPARTURE feature) raised by Tomek: https://github.com/ing-bank/popmon/issues/244.

    closes ing-bank/popmon#230

    enhancement 
    opened by pradyot-09 7
  • Error when stitching histograms

    Discussed in https://github.com/ing-bank/popmon/discussions/142

    Originally posted by jeaninejuliettes, September 29, 2021: Hello,

    I'm receiving an error when using stitch_histograms and I'm not sure what I'm doing wrong; I hope someone can help me. The error I get is: ValueError: Request to insert delta hists but time_bin_idx not set. Please do.

    The steps I take:

    I start by creating a histogrammar object of the original dataframe

    hists = df.pm_make_histograms()
    bin_specs = popmon.get_bin_specs(hists)

    later on I receive a new batch of data, which I add to my existing histograms

    new_hists = [new_df.pm_make_histograms(bin_specs=bin_specs)]
    hists_2 = popmon.stitch_histograms(hists_basis=hists, hists_delta=new_hists, time_axis="batch")

    so far so good, but when I try to repeat these steps with yet another new batch of data, I receive the error

    new_hists_2 = [new_df_2.pm_make_histograms(bin_specs=bin_specs)]
    hists_3 = popmon.stitch_histograms(hists_basis=hists_2, hists_delta=new_hists_2, time_axis="batch")

    Is it not possible to stitch another histogram again? If not, I've found a bit of a cumbersome way to decide on a good value for my time_bin_idx. It works so far, but I expect it to fail with other data (or not work as expected). The way I define the time_bin_idx value is: int(np.ceil(max(hists_2[next(iter(hists_2))].bin_centers()) + 1))

    Hopefully you can point me in the right direction. Thanks!
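    A minimal sketch of a possible fix, assuming (as the error message suggests) that stitch_histograms accepts an explicit time_bin_idx argument telling it where to insert the delta histograms; the keyword is an assumption inferred from the error, not verified against the API:

    # hypothetical: pass time_bin_idx explicitly on the second stitch
    new_hists_2 = [new_df_2.pm_make_histograms(bin_specs=bin_specs)]
    hists_3 = popmon.stitch_histograms(
        hists_basis=hists_2,
        hists_delta=new_hists_2,
        time_axis="batch",
        time_bin_idx=2,  # assumed: index of the next batch along the "batch" axis
    )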

    opened by mbaak 6
  • Error: cannot import name 'Report' from 'popmon.config'

    Code:

    import popmon
    from popmon import resources
    from popmon.config import Report

    Got error:

    ImportError                               Traceback (most recent call last)
    /tmp/ipykernel_707/1841834346.py in <module>
          3 import popmon
          4 from popmon import resources
    ----> 5 from popmon.config import Report, Setting

    ImportError: cannot import name 'Report' from 'popmon.config' (/home/user/.local/lib/python3.7/site-packages/popmon/config.py)

    opened by lcheng61 4
  • Error with pydantic when using some custom settings in the report generation

    With version 1.0.0, when using custom settings in df.pm_stability_report(), like show_stats, I get an error stating that such an option is not allowed:

    ValidationError: 2 validation errors for Settings

    I couldn't reproduce it when using popmon==0.9.0.
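    For popmon >= 1.0, a minimal sketch of how such options are likely meant to be passed, assuming a pydantic-based Settings model in popmon.config and a settings argument on pm_stability_report (both names are assumptions based on this and the previous issue):

    from popmon.config import Settings

    settings = Settings()  # pydantic validation now happens on this model
    settings.report.show_stats = ["distinct*", "filled*"]  # hypothetical stat patterns

    report = df.pm_stability_report(time_axis="date", settings=settings)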

    opened by gus-morales 3
  • DataProfiler - A Scalable Data Profiling Library

    Howdy!

    I'm reaching out as a maintainer of the DataProfiler library.

    I think it might be useful to your project so I'm reaching out! Would love to collaborate and see how we can help popmon.

    We effectively wrote a library to improve upon the objectives of pandas-profiling with some neat added functionality:

    • Auto-detect & load CSV, AVRO, Parquet, JSON, text and URL data: data = Data("your_filepath_or_url.csv")
    • Profile data, calculating statistics and doing entity detection (for PII): profile = Profiler(data)
    • Merge profiles, enabling distributed profile generation: profile3 = profile1 + profile2
    • Compare profiles: profile_diff = profile1.diff(profile2)
    • Generate reports: readable_report = profile.report(report_options={"output_format": "compact"})
    import json
    from dataprofiler import Data, Profiler
    
    data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL
    
    print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame
    
    profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc
    
    readable_report = profile.report(report_options={"output_format": "compact"})
    
    print(json.dumps(readable_report, indent=4))
    
    opened by lettergram 3
  • Library doesn't run in Spark 3.0+: Replace the dependency of histogrammar with native Spark functionality

    Currently, the dependency on histogrammar, a library which hasn't been further developed since 2016, pulls in Scala 2.11, which prevents execution on Spark 3.0 (built only on Scala 2.12). I think I could replace the functionality with the Bucketizer functionality in native Spark.

    opened by kedemdor 3
  • Imports Optimized

    isort helps you sort your import lists; it tidies the scripts and increases readability.

    There is no conceptual change; the algorithms still work as before.

    I'm contributing for Hacktoberfest. I'd appreciate it if you add the "Hacktoberfest" label to this PR. :) Thanks.

    opened by lnxpy 3
  • missing tutorial datasets

    Hi, awesome tool!

    The advanced tutorial datasets are not in the test_data dir, but still in the notebooks dir, as far as I can see. Hence the advanced tutorial notebooks don't run out of the box, at least for me. I don't have permission to push to a develop branch.

    Changes to be committed: (use "git reset HEAD ..." to unstage)

    renamed:    popmon/notebooks/flight_delays.csv.gz -> popmon/test_data/flight_delays.csv.gz
    renamed:    popmon/notebooks/flight_delays_reference.csv.gz -> popmon/test_data/flight_delays_reference.csv.gz
    

    Cheers - Alex

    opened by AlexKoutsman 3
  • build(deps): update docutils requirement from <0.17 to <0.20

    Updates the requirements on docutils to permit the latest version.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
    dependencies 
    opened by dependabot[bot] 2
  • Feat/plotly express

    • The histograms, heatmaps and comparisons have been replaced with interactive Plotly graphs. Plotly.js is used to build the graphs on the fly from JSON.
    • Initial tests show that Plotly reports are smaller in size than the matplotlib ones and take far less time to generate.
    • Use the parameter 'online_report' to load plotly.js from a CDN and use the report online; otherwise plotly.js is embedded in the report, which can then also be used offline.

    closes ing-bank/popmon#164

    enhancement reporting 
    opened by pradyot-09 2
  • Feature/heatmap time series plotting

    Added heatmap feature to visualize features over time for EDA.

    The user can set the heatmap color map by passing the 'cmap' argument to pm_stability_report(), set the top_n argument to deal with high cardinality, and disable a specific heatmap by passing its name in the disable_heatmap argument of pm_stability_report().

    closes ing-bank/popmon#185 closes ing-bank/popmon#199

    opened by pradyot-09 2
  • Rolling reference comparisons

    A wide variety of references is provided by popmon out-of-the-box. A reference may be static (a fixed training set, or the current dataset itself for exploratory data analysis) or dynamic (sliding or growing as more data becomes available). The reference is compared against batches, and they can be sequential (batched) or sliding (rolling).

    Popmon should enable all combinations, and currently lacks external reference + rolling comparison.

    | | Reference | Compare to | Implemented |
    |---|---|---|---|
    | Self-reference | Static | Self (batched) | ✓ |
    | External reference | Static | Batched | ✓ |
    | Rolling reference | Rolling | Rolling/sliding | ✓ |
    | Expanding reference | Expanding | Rolling/sliding | ✓ |
    | External reference | Static | Rolling/sliding | ✗ |

    Thanks to @LorenaPoenaru!
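    As a rough sketch, the reference types above map onto pm_stability_report as follows; the reference_type values match the table, while the window and reference keywords are assumptions here:

    # self-reference: the dataset is compared against itself
    report = df.pm_stability_report(time_axis="date", reference_type="self")

    # rolling reference: compare against a sliding window of preceding batches
    report = df.pm_stability_report(time_axis="date", reference_type="rolling", window=10)

    # expanding reference: compare against all data seen so far
    report = df.pm_stability_report(time_axis="date", reference_type="expanding")

    # external (static) reference; the "reference" keyword is an assumption
    report = df.pm_stability_report(time_axis="date", reference_type="external", reference=df_ref)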

    enhancement 
    opened by sbrugman 0
  • code coverage of 100%

    The risk of breaking functionality when introducing new features could be reduced by ensuring that each line of code is covered by tests and that this is enforced at test time. Other repos enforce this as well.

    For that, we can add pytest-cov to the test dependencies and increase the test coverage until it passes (see the sketch below).
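    A minimal sketch of what that enforcement could look like with pytest-cov (these are standard pytest-cov flags; failing the run below 100% coverage is the enforcement):

    $ pip install pytest-cov
    $ pytest --cov=popmon --cov-report=term-missing --cov-fail-under=100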

    good first issue help wanted CI internal improvement 
    opened by sbrugman 0
  • Traffic light boundaries for count variables

    The traffic light bounds provided by the pull/Z-score calculation are symmetrical. For count variables this can lead to bounds outside the constraints (below zero).

    enhancement statistics 
    opened by sbrugman 0
  • Reject unsupported column types

    Running popmon on a DataFrame with columns containing mutable sequences, tuples, or sets generates cryptic errors. popmon should instead return a clear error message.

    enhancement good first issue API 
    opened by sbrugman 0