PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Last update: Dec 27, 2022

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR.

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available.
# ground_truth_filepath is a placeholder for the path to the clean CSV.
ground_truth = CSV.File(ground_truth_filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.
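
The column-alignment step in the script above relies on the first field of each @query row matching a header in the CSV. A minimal sketch of a pre-flight check follows. It uses only DataFrames; missing_columns is a hypothetical helper that is not part of PClean, and the small table is made-up data standing in for the loaded file.

```julia
using DataFrames: DataFrame

# Hypothetical helper (not part of PClean): list the expected column
# names that do not appear in the table's header.
missing_columns(data::DataFrame, expected::Vector{String}) =
    setdiff(expected, names(data))

# Made-up in-memory table standing in for `CSV.File(filepath) |> DataFrame`.
data = DataFrame(HospitalName = ["ALABAMA HOSP", "ALABMA HOSP"],
                 Condition = ["Heart Attack", "Heart Atack"])

# Column names copied from the first field of each @query row.
expected = ["HospitalName", "Condition"]

absent = missing_columns(data, expected)
isempty(absent) || error("CSV is missing columns: " * join(absent, ", "))
```

Running a check like this before building the query can surface a misspelled header early, instead of during inference.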

Differences from the paper

Because PClean is implemented as a DSL embedded in Julia, its surface syntax differs in a few ways from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, we write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed; they tell PClean that the model variable C.z represents a clean version of x, whose observed (dirty) version is modeled as C.y. PClean uses this information when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.
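
As a rough illustration of how these rules combine, the sketch below declares a tiny two-class model in the embedded syntax. CityModel, City, Obs, city_names, and the CityName column are hypothetical names invented for this example. AddTypos and ProportionsParameter are named above, but ChooseProportionally, the unparameterized use of ProportionsParameter, and the exact argument lists are assumptions that should be checked against the PClean source before use. Rule (4) is not illustrated.

```julia
using PClean

# Hypothetical list of candidate values; not part of PClean.
city_names = ["Boston", "Cambridge", "Somerville"]

PClean.@model CityModel begin
    # (1) `@class City begin ... end` stands in for `latent class City ... end`.
    @class City begin
        # (3) `@learned x :: D{...}` stands in for `parameter x ~ d(...)`.
        @learned city_weights :: ProportionsParameter
        # Assumption: ChooseProportionally(options, weights) is available.
        name ~ ChooseProportionally(city_names, city_weights)
    end

    @class Obs begin
        # (2) A plain `begin ... end` block stands in for `subproblem begin ... end`.
        begin
            city ~ City
            # Assumption: AddTypos accepts the clean string as its only argument.
            observed_city ~ AddTypos(city.name)
        end
    end
end

# (5) `@query` stands in for `observe ... from Obs`. The three-field row
# gives the CSV column, the clean variable, and the dirty variable.
query = @query CityModel.Obs [
    CityName city.name observed_city
]
```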
