Implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

Overview

Selection via Proxy: Efficient Data Selection for Deep Learning

This repository contains a refactored implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

If you use this code in your research, please use the following BibTeX entry.

@inproceedings{
    coleman2020selection,
    title={Selection via Proxy: Efficient Data Selection for Deep Learning},
    author={Cody Coleman and Christopher Yeh and Stephen Mussmann and Baharan Mirzasoleiman and Peter Bailis and Percy Liang and Jure Leskovec and Matei Zaharia},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=HJg2b0VYDr}
}

The original code is also available as a zip file, but lacks documentation, uses outdated packages, and won't be maintained. Please use this repository instead and report issues here.

Setup

Prerequisites

Installation

git clone https://github.com/stanford-futuredata/selection-via-proxy.git
cd selection-via-proxy
pip install -e .

or simply

pip install git+https://github.com/stanford-futuredata/selection-via-proxy.git

Quickstart

Perform active learning on CIFAR10 from the command line:

python -m svp.cifar active

Or from the python interpreter:

from svp.cifar.active import active
active()

"Selection via proxy" happens when --proxy-arch doesn't match --arch:

# ResNet20 selecting data for a ResNet164
python -m svp.cifar active --proxy-arch preact20 --arch preact164

For help, see python -m svp.cifar active --help or active()'s docstrinng.

Example Usage

Below are more examples of the command line interface that cover different datasets (e.g., CIFAR100, ImageNet, Amazon Review Polarity) and commands (e.g., train, coreset).

Basic Training

CIFAR10 and CIFAR100

Preliminaries

None. The CIFAR10 and CIFAR100 datasets will download if they don't exist in ./data/cifar10 and ./data/cifar100 respectively.

Examples
# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR10.
python -m svp.cifar train --dataset cifar10 --arch preact164

Replace --dataset CIFAR10 with --dataset CIFAR100 to run on CIFAR100 rather than CIFAR10.

# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR100.
python -m svp.cifar train --dataset cifar100 --arch preact164

The same is true for all the python -m svp.cifar commands below

ImageNet

Preliminaries
  • Download the ImageNet dataset into a directory called imagenet.
  • Extract the images.
# Extract train data.
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
# Extract validation data.
cd ../ && mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
  • Replace /path/to/data in all the python -m svp.imagenet commands below with the path to the imagenet directory you created. Note, do not include imagenet in the path; the script will automatically do that.
Examples
# Train ResNet50 (https://arxiv.org/abs/1512.03385).
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20

For convenience, you can use larger batch sizes and scale learning rates according to "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" with --scale-learning-rates:

# Train ResNet50 with a batch size of 1048 and scaled learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates

Mixed precision training is also supported using apex. Apex isn't installed during the pip install instructions above, so please follow the installation instructions in the apex repository before running the command below.

# Use mixed precision training to train ResNet50 with a batch size of 1048 and scale learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates --fp16

Amazon Review Polarity and Full

Preliminaries
tar -xvzf amazon_review_full_csv.tar.gz
tar -xvzf amazon_review_polarity_csv.tar.gz
  • Replace /path/to/data in all the python -m svp.amazon commands below with the path to the root directory you created. Note, do not include amazon_review_full_csv or amazon_review_polarity_csv in the path; the script will automatically do that.
Examples
# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Polarity.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_polarity --arch vdcnn29-conv \
    --num-workers 4 --eval-num-workers 8

Replace --dataset amazon_review_polarity with --dataset amazon_review_full to run on Amazon Review Full rather than Amazon Review Polarity.

# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Full.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_full --arch vdcnn29-maxpool \
    --num-workers 4 --eval-num-workers 8

The same is true for all the python -m svp.amazon commands below

Active learning

Active learning selects points to label from a large pool of unlabeled data by repeatedly training a model on a small pool of labeled data and selecting additional examples to label based on the model’s uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics. The commands below demonstrate how to perform active learning on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.

CIFAR10 and CIFAR100

Baseline Approach
# Perform active learning with ResNet164 for both selection and the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet20 for selection and ResNet164 for the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet20 after only 50 epochs for selection.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000

ImageNet

Baseline Approach
# Perform active learning with ResNet50 for both selection and the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet18 for selection and ResNet50 for the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet18 after only 45 epochs for selection.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467 \
    --proxy-learning-rate 0.0167 --proxy-epochs 1 \
    --proxy-learning-rate 0.0333 --proxy-epochs 1 \
    --proxy-learning-rate 0.05 --proxy-epochs 1 \
    --proxy-learning-rate 0.0667 --proxy-epochs 1 \
    --proxy-learning-rate 0.0833 --proxy-epochs 1 \
    --proxy-learning-rate 0.1 --proxy-epochs 25 \
    --proxy-learning-rate 0.01 --proxy-epochs 15

Amazon Review Polarity and Full

Baseline Approach
# Perform active learning with VDCNN29 for both selection and the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity  --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model. You can evaluate a series of selections later using the precomputed_selection option.

# Perform active learning with VDCNN9 for selection and VDCNN29 for the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --proxy-arch vdcnn9-maxpool --eval-target-at 1440000

To use fastText as a proxy, Install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform active learning with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity --selection-method least_confidence \
    --size 72000 --size 360000 --size 720000 --size 1080000 --size 1440000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --precomputed-selection $fasttext_path --eval-target-at 1440000

Core-set Selection

Core-set selection techniques start with a large labeled or unlabeled dataset and aim to find a small subset that accurately approximates the full dataset by selecting representative examples. The commands below demonstrate how to perform core-set selection on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.

CIFAR10 and CIFAR100

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet164 for both selection and the final predictions.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet20 selecting for ResNet164.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform core-set selection with ResNet20 after only 50 epochs.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4

ImageNet

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet50 for both selection and the final predictions.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet18 selecting for ResNet50.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates

Amazon Review Polarity and Full

Baseline Approach
# Perform core-set selection with an oracle that uses VDCNN29 for both selection and the final predictions.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000  --selection-method entropy
Selection via Proxy
# Perform core-set selection with VDCNN9 selecting for VDCNN29.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000 --selection-method entropy \
    --proxy-arch vdcnn9-maxpool

To use fastText as a proxy, Install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform core-set selection with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity \
    --selection-method entropy --size 3600000 --size 2160000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --precomputed-selection $fasttext_path
Owner
Stanford Future Data Systems
We are a CS research group at Stanford building data-intensive systems
Stanford Future Data Systems
PyTorch implementation of our paper How robust are discriminatively trained zero-shot learning models?

How robust are discriminatively trained zero-shot learning models? This repository contains the PyTorch implementation of our paper How robust are dis

Mehmet Kerim Yucel 5 Feb 04, 2022
Matching python environment code for Lux AI 2021 Kaggle competition, and a gym interface for RL models.

Lux AI 2021 python game engine and gym This is a replica of the Lux AI 2021 game ported directly over to python. It also sets up a classic Reinforceme

Geoff McDonald 74 Nov 03, 2022
Deep Learning (with PyTorch)

Deep Learning (with PyTorch) This notebook repository now has a companion website, where all the course material can be found in video and textual for

Alfredo Canziani 6.2k Jan 07, 2023
ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"

PENet: Precise and Efficient Depth Completion This repo is the PyTorch implementation of our paper to appear in ICRA2021 on "Towards Precise and Effic

232 Dec 25, 2022
N-gram models- Unsmoothed, Laplace, Deleted Interpolation

N-gram models- Unsmoothed, Laplace, Deleted Interpolation

Ravika Nagpal 1 Jan 04, 2022
A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

195 Dec 07, 2022
Code base of object detection

rmdet code base of object detection. 环境安装: 1. 安装conda python环境 - `conda create -n xxx python=3.7/3.8` - `conda activate xxx` 2. 运行脚本,自动安装pytorch1

3 Mar 08, 2022
Sparse Physics-based and Interpretable Neural Networks

Sparse Physics-based and Interpretable Neural Networks for PDEs This repository contains the code and manuscript for research done on Sparse Physics-b

28 Jan 03, 2023
Reducing Information Bottleneck for Weakly Supervised Semantic Segmentation (NeurIPS 2021)

Reducing Information Bottleneck for Weakly Supervised Semantic Segmentation (NeurIPS 2021) The implementation of Reducing Infromation Bottleneck for W

Jungbeom Lee 81 Dec 16, 2022
The Deep Learning with Julia book, using Flux.jl.

Deep Learning with Julia DL with Julia is a book about how to do various deep learning tasks using the Julia programming language and specifically the

Logan Kilpatrick 67 Dec 25, 2022
Portfolio Optimization and Quantitative Strategic Asset Allocation in Python

Riskfolio-Lib Quantitative Strategic Asset Allocation, Easy for Everyone. Description Riskfolio-Lib is a library for making quantitative strategic ass

Riskfolio 1.7k Jan 07, 2023
Code base for "On-the-Fly Test-time Adaptation for Medical Image Segmentation"

On-the-Fly Adaptation Official Pytorch Code base for On-the-Fly Test-time Adaptation for Medical Image Segmentation Paper Introduction One major probl

Jeya Maria Jose 17 Nov 10, 2022
Pytorch Implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension)

DiffSinger - PyTorch Implementation PyTorch implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension). Status

Keon Lee 152 Jan 02, 2023
Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph

Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph Model Description Open-CyKG is a framework that is constructed using an attenti

Injy Sarhan 34 Jan 05, 2023
Code & Data for Enhancing Photorealism Enhancement

Code & Data for Enhancing Photorealism Enhancement

Intel ISL (Intel Intelligent Systems Lab) 1.1k Jan 08, 2023
The official repository for BaMBNet

BaMBNet-Pytorch Paper

Junjun Jiang 18 Dec 04, 2022
A GUI to automatically create a TOPAS-readable MLC simulation file

Python script to create a TOPAS-readable simulation file descriring a Multi-Leaf-Collimator. Builds the MLC using the data from a 3D .stl file.

Sebastian Schäfer 0 Jun 19, 2022
Text-to-Image generation

Generate vivid Images for Any (Chinese) text CogView is a pretrained (4B-param) transformer for text-to-image generation in general domain. Read our p

THUDM 1.3k Dec 29, 2022
Laplace Redux -- Effortless Bayesian Deep Learning

Laplace Redux - Effortless Bayesian Deep Learning This repository contains the code to run the experiments for the paper Laplace Redux - Effortless Ba

Runa Eschenhagen 28 Dec 07, 2022
[Open Source]. The improved version of AnimeGAN. Landscape photos/videos to anime

[Open Source]. The improved version of AnimeGAN. Landscape photos/videos to anime

CC 4.4k Dec 27, 2022