PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Related tags

Deep LearningPClean
Overview

PClean

Build Status

PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite the our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. (pdf)

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
ground_truth = CSV.File(filepath) |> CSV.DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

As a DSL embedded into Julia, our implementation of the PClean language has some differences, in terms of surface syntax, from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed, and tell PClean that the model variable C.z represents a clean version of x, whose observed (dirty) version is modeled as C.y. This is used when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.

Owner
MIT Probabilistic Computing Project
MIT Probabilistic Computing Project
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

ASAPP Research 2.1k Jan 01, 2023
Optimize Trading Strategies Using Freqtrade

Optimize trading strategy using Freqtrade Short demo on building, testing and optimizing a trading strategy using Freqtrade. The DevBootstrap YouTube

DevBootstrap 139 Jan 01, 2023
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion Yinghao Aaron Li, Ali Zare, Nima Mesgarani We pres

Aaron (Yinghao) Li 282 Jan 01, 2023
PyGCL: A PyTorch Library for Graph Contrastive Learning

PyGCL is a PyTorch-based open-source Graph Contrastive Learning (GCL) library, which features modularized GCL components from published papers, standa

PyGCL 588 Dec 31, 2022
⚓ Eurybia monitor model drift over time and securize model deployment with data validation

View Demo · Documentation · Medium article 🔍 Overview Eurybia is a Python library which aims to help in : Detecting data drift and model drift Valida

MAIF 172 Dec 27, 2022
Implementation of paper "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal"

Patch-wise Adversarial Removal Implementation of paper "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal

4 Oct 12, 2022
BiSeNet based on pytorch

BiSeNet BiSeNet based on pytorch 0.4.1 and python 3.6 Dataset Download CamVid dataset from Google Drive or Baidu Yun(6xw4). Pretrained model Download

367 Dec 26, 2022
Matlab Python Heuristic Battery Opt - SMOP conversion and manual conversion

SMOP is Small Matlab and Octave to Python compiler. SMOP translates matlab to py

Tom Xu 1 Jan 12, 2022
Data Augmentation Using Keras and Python

Data-Augmentation-Using-Keras-and-Python Data augmentation is the process of increasing the number of training dataset. Keras library offers a simple

Happy N. Monday 3 Feb 15, 2022
Code for technical report "An Improved Baseline for Sentence-level Relation Extraction".

RE_improved_baseline Code for technical report "An Improved Baseline for Sentence-level Relation Extraction". Requirements torch = 1.8.1 transformers

Wenxuan Zhou 74 Nov 29, 2022
A repository with exploration into using transformers to predict DNA ↔ transcription factor binding

Transcription Factor binding predictions with Attention and Transformers A repository with exploration into using transformers to predict DNA ↔ transc

Phil Wang 62 Dec 20, 2022
An official repository for Paper "Uformer: A General U-Shaped Transformer for Image Restoration".

Uformer: A General U-Shaped Transformer for Image Restoration Zhendong Wang, Xiaodong Cun, Jianmin Bao and Jianzhuang Liu Paper: https://arxiv.org/abs

Zhendong Wang 497 Dec 22, 2022
This repo is duplication of jwyang/faster-rcnn.pytorch

Faster RCNN Pytorch This repo is duplication of jwyang/faster-rcnn.pytorch C/C++ code are removed and easier to study. Python 3.8.5 Ubuntu 20.04.1 LTS

Kim Jihwan 1 Jan 14, 2022
[CVPRW 21] "BNN - BN = ? Training Binary Neural Networks without Batch Normalization", Tianlong Chen, Zhenyu Zhang, Xu Ouyang, Zechun Liu, Zhiqiang Shen, Zhangyang Wang

BNN - BN = ? Training Binary Neural Networks without Batch Normalization Codes for this paper BNN - BN = ? Training Binary Neural Networks without Bat

VITA 40 Dec 30, 2022
An image base contains 490 images for learning (400 cars and 90 boats), and another 21 images for testingAn image base contains 490 images for learning (400 cars and 90 boats), and another 21 images for testing

SVM Données Une base d’images contient 490 images pour l’apprentissage (400 voitures et 90 bateaux), et encore 21 images pour fait des tests. Prétrait

Achraf Rahouti 3 Nov 30, 2021
MoViNets PyTorch implementation: Mobile Video Networks for Efficient Video Recognition;

MoViNet-pytorch Pytorch unofficial implementation of MoViNets: Mobile Video Networks for Efficient Video Recognition. Authors: Dan Kondratyuk, Liangzh

189 Dec 20, 2022
IJON is an annotation mechanism that analysts can use to guide fuzzers such as AFL.

IJON SPACE EXPLORER IJON is an annotation mechanism that analysts can use to guide fuzzers such as AFL. Using only a small (usually one line) annotati

Chair for Sys­tems Se­cu­ri­ty 146 Dec 16, 2022
automated systems to assist guarding corona Virus precautions for Closed Rooms (e.g. Halls, offices, etc..)

Automatic-precautionary-guard automated systems to assist guarding corona Virus precautions for Closed Rooms (e.g. Halls, offices, etc..) what is this

badra 0 Jan 06, 2022
Software for Multimodalty 2D+3D Facial Expression Recognition (FER) UI

EmotionUI Software for Multimodalty 2D+3D Facial Expression Recognition (FER) UI. demo screenshot (with RealSense) required packages Python = 3.6 num

Yang Jiao 2 Dec 23, 2021
This project is a loose implementation of paper "Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach"

Stock Market Buy/Sell/Hold prediction Using convolutional Neural Network This repo is an attempt to implement the research paper titled "Algorithmic F

Asutosh Nayak 136 Dec 28, 2022