A Python package for generating concise, high-quality summaries of a probability distribution

Overview

GoodPoints

A Python package for generating concise, high-quality summaries of a probability distribution

GoodPoints is a collection of tools for compressing a distribution more effectively than independent sampling:

  • Given an initial summary of n input points, kernel thinning returns s << n output points with comparable integration error across a reproducing kernel Hilbert space
  • Compress++ reduces the runtime of generic thinning algorithms with minimal loss in accuracy

Installation

To install the goodpoints package, use the following pip command:

pip install goodpoints

Getting started

The primary kernel thinning function is thin in the kt module:

from goodpoints import kt
coreset = kt.thin(X, m, split_kernel, swap_kernel, delta=0.5, seed=123, store_K=False)
    """Returns kernel thinning coreset of size floor(n/2^m) as row indices into X
    
    Args:
      X: Input sequence of sample points with shape (n, d)
      m: Number of halving rounds
      split_kernel: Kernel function used by KT-SPLIT (typically a square-root kernel, krt);
        split_kernel(y,X) returns array of kernel evaluations between y and each row of X
      swap_kernel: Kernel function used by KT-SWAP (typically the target kernel, k);
        swap_kernel(y,X) returns array of kernel evaluations between y and each row of X
      delta: Run KT-SPLIT with constant failure probabilities delta_i = delta/n
      seed: Random seed to set prior to generation; if None, no seed will be set
      store_K: If False, runs O(nd) space version which does not store kernel
        matrix; if True, stores n x n kernel matrix
    """

For example uses, please refer to the notebook examples/kt/run_kt_experiment.ipynb.

The primary Compress++ function is compresspp in the compress module:

from goodpoints import compress
coreset = compress.compresspp(X, halve, thin, g)
    """Returns Compress++(g) coreset of size sqrt(n) as row indices into X

    Args: 
        X: Input sequence of sample points with shape (n, d)
        halve: Function that takes in an (n', d) numpy array Y and returns 
          floor(n'/2) distinct row indices into Y, identifying a halved coreset
        thin: Function that takes in an (n', d) numpy array Y and returns
          2^g sqrt(n') row indices into Y, identifying a thinned coreset
        g: Oversampling factor
    """

For example uses, please refer to the code examples/compress/construct_compresspp_coresets.py.

Examples

Code in the examples directory uses the goodpoints package to recreate the experiments of the following research papers.


Kernel Thinning

@article{dwivedi2021kernel,
  title={Kernel Thinning},
  author={Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2105.05842},
  year={2021}
}
  1. The script examples/kt/submit_jobs_run_kt.py reproduces the vignette experiments of Kernel Thinning on a Slurm cluster by executing examples/kt/run_kt_experiment.ipynb with appropriate parameters. For the MCMC examples, it assumes that necessary data was downloaded and pre-processed following the steps listed in examples/kt/preprocess_mcmc_data.ipynb, where in the last code block we report the median heuristic based bandwidth parameteters (along with the code to compute it).
  2. After all results have been generated, the notebook plot_results.ipynb can be used to reproduce the figures of Kernel Thinning.

Generalized Kernel Thinning

@article{dwivedi2021generalized,
  title={Generalized Kernel Thinning},
  author={Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2110.01593},
  year={2021}
}
  1. The script examples/gkt/submit_gkt_jobs.py reproduces the vignette experiments of Generalized Kernel Thinning on a Slurm cluster by executing examples/gkt/run_generalized_kt_experiment.ipynb with appropriate parameters. For the MCMC examples, it assumes that necessary data was downloaded and pre-processed following the steps listed in examples/kt/preprocess_mcmc_data.ipynb.
  2. Once the coresets are generated, examples/gkt/compute_test_function_errors.ipynb can be used to generate integration errors for different test functions.
  3. After all results have been generated, the notebook examples/gkt/plot_gkt_results.ipynb can be used to reproduce the figures of Generalized Kernel Thinning.

Distribution Compression in Near-linear Time

@article{shetti2021distribution,
  title={Distribution Compression in Near-linear Time},
  author={Abhishek Shetty and Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2111.07941},
  year={2021}
}
  1. The notebook examples/compress/script_to_deploy_jobs.ipynb reproduces the experiments of Distribution Compression in Near-linear Time in the following manner: 1a. It generates various coresets and computes their mmds by executing examples/compress/construct_{THIN}_coresets.py for THIN in {compresspp, kt, st, herding} with appropriate parameters, where the flag kt stands for kernel thinning, st stands for standard thinning (choosing every t-th point), and herding refers to kernel herding. 1b. It compute the runtimes of different algorithms by executing examples/compress/run_time.py. 1c. For the MCMC examples, it assumes that necessary data was downloaded and pre-processed following the steps listed in examples/kt/preprocess_mcmc_data.ipynb. 1d. The notebook currently deploys these jobs on a slurm cluster, but setting deploy_slurm = False in examples/compress/script_to_deploy_jobs.ipynb will submit the jobs as independent python calls on terminal.
  2. After all results have been generated, the notebook examples/compress/plot_compress_results.ipynb can be used to reproduce the figures of Distribution Compression in Near-linear Time.
  3. The script examples/compress/construct_compresspp_coresets.py contains the function recursive_halving that converts a halving algorithm into a thinning algorithm by recursively halving.
  4. The script examples/compress/construct_herding_coresets.py contains the herding function that runs kernel herding algorithm introduced by Yutian Chen, Max Welling, and Alex Smola.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Multiwavelets-based operator model

Multiwavelet model for Operator maps Gaurav Gupta, Xiongye Xiao, and Paul Bogdan Multiwavelet-based Operator Learning for Differential Equations In Ne

Gaurav 33 Dec 04, 2022
🤗 Push your spaCy pipelines to the Hugging Face Hub

spacy-huggingface-hub: Push your spaCy pipelines to the Hugging Face Hub This package provides a CLI command for uploading any trained spaCy pipeline

Explosion 30 Oct 09, 2022
Delving into Localization Errors for Monocular 3D Object Detection, CVPR'2021

Delving into Localization Errors for Monocular 3D Detection By Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, Wanli Ouyang. Intr

XINZHU.MA 124 Jan 04, 2023
Code for PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Relighting and Material Editing

PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Relighting and Material Editing CVPR 2021. Project page: https://kai-46.github.io/

Kai Zhang 141 Dec 14, 2022
Code used to generate the results appearing in "Train longer, generalize better: closing the generalization gap in large batch training of neural networks"

Train longer, generalize better - Big batch training This is a code repository used to generate the results appearing in "Train longer, generalize bet

Elad Hoffer 145 Sep 16, 2022
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Little Ball of Fur is a graph sampling extension library for Python. Please look at the Documentation, relevant Paper, Promo video and External Resour

Benedek Rozemberczki 619 Dec 14, 2022
Augmented Traffic Control: A tool to simulate network conditions

Augmented Traffic Control Full documentation for the project is available at http://facebook.github.io/augmented-traffic-control/. Overview Augmented

Meta Archive 4.3k Jan 08, 2023
Python implementation of ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images, AAAI2022.

ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images Binh M. Le & Simon S. Woo, "ADD:

2 Oct 24, 2022
Learning to Reach Goals via Iterated Supervised Learning

Vanilla GCSL This repository contains a vanilla implementation of "Learning to Reach Goals via Iterated Supervised Learning" proposed by Dibya Gosh et

Christoph Heindl 4 Aug 10, 2022
My tensorflow implementation of "A neural conversational model", a Deep learning based chatbot

Deep Q&A Table of Contents Presentation Installation Running Chatbot Web interface Results Pretrained model Improvements Upgrade Presentation This wor

Conchylicultor 2.9k Dec 28, 2022
Automatic Differentiation Multipole Moment Molecular Forcefield

Automatic Differentiation Multipole Moment Molecular Forcefield Performance notes On a single gpu, using waterbox_31ang.pdb example from MPIDplugin wh

4 Jan 07, 2022
Classify music genre from a 10 second sound stream using a Neural Network.

MusicGenreClassification Academic research in the field of Deep Learning (Deep Neural Networks) and Sound Processing, Tel Aviv University. Featured in

Matan Lachmish 453 Dec 27, 2022
A tensorflow model that predicts if the image is of a cat or of a dog.

Quick intro Hello and thank you for your interest in my project! This is the backend part of a two-repo application. The other part can be found here

Tudor Matei 0 Mar 08, 2022
Official PyTorch implementation of "Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble" (NeurIPS'21)

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble This is the code for reproducing the results of the paper Uncertainty-Bas

43 Nov 23, 2022
Pytoydl: A toy deep learning framework built upon numpy.

Documents: https://pytoydl.readthedocs.io/zh/latest/ Pytoydl A toy deep learning framework built upon numpy. You can star this repository to keep trac

28 Dec 10, 2022
Code for paper "Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs"

This is the codebase for the paper: Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs Directory Structur

Peter Hase 19 Aug 21, 2022
modelvshuman is a Python library to benchmark the gap between human and machine vision

modelvshuman is a Python library to benchmark the gap between human and machine vision. Using this library, both PyTorch and TensorFlow models can be evaluated on 17 out-of-distribution datasets with

Bethge Lab 244 Jan 03, 2023
Code for Environment Dynamics Decomposition (ED2).

ED2 Code for Environment Dynamics Decomposition (ED2). Installation Follow the installation in MBPO and Dreamer. Usage First follow the SD2 method for

0 Aug 10, 2021
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

VisualGPT Our Paper VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning Main Architecture of Our VisualGPT Downloa

Vision CAIR Research Group, KAUST 140 Dec 28, 2022
Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

statsmodels 8.1k Jan 02, 2023