True Few-Shot Learning with Language Models

Overview

This codebase supports using language models (LMs) for true few-shot learning: learning to perform a task using only a limited number of examples from a single task distribution. We choose prompts and hyperparameters for few-shot learning methods using no additional held-out data, via criteria like cross-validation (CV) and minimum description length (MDL). The code reproduces the results in our paper and supports two forms of few-shot learning:

  1. "In-context" learning using LMs similar to GPT-3. Here, we format a few training examples as input to the LM using a natural language "prompt," and we use the LM to predict the next token. We include the code for in-context learning primarily in the top-level directory (largely in eval_lm.py).
  2. Finetuning via ADAPET, which learns from supervised examples using a modified classification loss alongside an auxiliary masked LM objective. We include the code for finetuning ADAPET in subdirectories (e.g., src/ for training/evaluation code).

You can run this codebase with GPT-2/DistilGPT-2 (using HuggingFace Transformers) or GPT-3 (if you have an API key from OpenAI). The underlying model is abstracted away behind a common API. Below, we describe how to reproduce our results, as well as how to download the precomputed results that we used to produce our paper's plots.
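
To make the in-context setup concrete, here is a minimal sketch (in Python, using HuggingFace Transformers) of how a few training examples can be formatted into a prompt and how the log-probability of a candidate answer can be scored with DistilGPT-2. The template and examples are illustrative only, not the exact ones used by eval_lm.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

train_examples = [("Paris", "France"), ("Rome", "Italy")]  # toy few-shot examples
test_input, candidate = "Berlin", "Germany"

# Format the training examples plus the test input as one natural-language prompt
prompt = "".join(f"{x} is the capital of {y}.\n" for x, y in train_examples)
prompt += f"{test_input} is the capital of"

# Score the log-probability of the candidate answer's tokens given the prompt
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
with torch.no_grad():
    log_probs = torch.log_softmax(model(input_ids).logits[0, :-1], dim=-1)
answer_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
answer_log_prob = sum(log_probs[pos, input_ids[0, pos + 1]] for pos in answer_positions)
print(f"log p({candidate} | prompt) = {answer_log_prob.item():.3f}")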

General Setup Instructions

Clone this repo into your current working directory. Then, follow the general setup instructions below:

cd true_few_shot    # Move into cloned directory
export BASE=$PWD    # Store top-level directory location
mkdir data exp_out  # Make directories for data and experiment results, or symlink to a location that can host large files

Continue below to reproduce our prompt selection experiments (with GPT models on LAMA or SuperGLUE). Skip to Hyperparameter Selection to reproduce our ADAPET results.

True Few-Shot Prompt Selection for GPT

Download data for LAMA experiments (LAMA-UHN data, LAMA and LPAQA prompts, and LAMA vocab):

cd data

# Download LAMA-UHN
wget https://www.cis.uni-muenchen.de/~poerner/blobs/e-bert/LAMA_UHN.zip
unzip LAMA_UHN.zip
mv data/* .
rmdir data
rm LAMA_UHN.zip

# Download LAMA vocab
wget https://dl.fbaipublicfiles.com/LAMA/common_vocab_cased.txt

# Download original LAMA (to get manual prompts)
wget https://dl.fbaipublicfiles.com/LAMA/data.zip
unzip data.zip
mv data/* .
rmdir data
rm data.zip

# Download LPAQA prompts from Google Drive (or download manually from https://drive.google.com/file/d/15ypcAYvQGYRtIQ-GH7qSLNDlJZ2Sqs0H/view?usp=sharing)
gdown --id 15ypcAYvQGYRtIQ-GH7qSLNDlJZ2Sqs0H
unzip lpaqa.zip
rm lpaqa.zip

Then, create a virtual Python 3.7+ environment. We installed and activated a Python 3.7 environment with Anaconda 3 (downloadable from docs.anaconda.com) like so:

conda create -y -n true_few_shot python=3.7
conda activate true_few_shot
# To deactivate the environment, use conda deactivate

Next, install the dependencies for this repo:

cd $BASE
pip install -r requirements_prompt.txt

To experiment with GPT2/DistilGPT2 models, you'll also need to install PyTorch to use transformers models (see pytorch.org for download instructions); PyTorch is not required for GPT3 models. We used PyTorch 1.7.1 (with CUDA 11.0.194 for GPU inference), installed with the command below:

torch_version_suffix="+cu110" # Use "+cu100" for CUDA 10.0, "+cu101" for CUDA 10.1, and "" for CUDA 10.2
pip install torch==1.7.1${torch_version_suffix} torchvision==0.8.2${torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html

To experiment with GPT3 models, if you have GPT3 access, set your API key as a bash variable (otherwise, you can still run this repo using GPT2 models):

export OPENAI_API_KEY="api-key-here"

At this point, you'll need to get the results from LM(s) on LAMA or SuperGLUE, in order to later evaluate MDL/CV/Test Accuracy. You can get these results by:

  1. Running our true few-shot prompt selection experiments yourself by continuing below, or
  2. Loading our pre-run experiments for GPT2 (instead of having to run them yourself), by skipping to Loading our pre-run experiments for prompt selection below

After either of the above, we'll describe how you can plot the results in our paper.

Running true few-shot prompt selection experiments yourself

First, move to the top-level directory in this repo (e.g., run cd $BASE). From there, the following command will run inference with DistilGPT2 on LAMA-UHN:

python eval_lm.py --engine distilgpt2 --num_train 5 --seeds 0

This command chooses 5 random examples from LAMA-UHN as training examples, orders them in all 5! = 120 possible ways, and appends an unseen (test) example from LAMA-UHN to the end. Then, the code evaluates the log-probability of the correct answer for each train/test example, which we'll use later to compute CV/MDL/test accuracy (a conceptual sketch of these criteria follows the flag descriptions below). Here is what each command line flag does:

  • --engine: specifies the model you'd like to evaluate with
  • --num_train: specifies the number of training examples you'd like to use (here, 5); using more than 5 will evaluate MDL/CV/test accuracy using only a subset of all possible training example permutations, as in our paper's experiments with >5 training examples
  • --seeds: takes a list of integers (e.g., 0 1 2) and uses each integer as a random seed for sampling a new LAMA training set (per relation); so if you run with --seeds 0 1 2, you'll run on LAMA 3 times in total (useful for calculating mean and std. error over several runs).
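
For reference, here is a rough sketch of how the two selection criteria can be computed from held-out negative log-likelihoods. The helper nll(context_examples, eval_examples) is hypothetical: it stands in for formatting the context examples into a prompt and scoring the eval examples with the LM, which is what the saved log-probabilities above let you do. The repo additionally averages over many orderings of the training examples; a single ordering and an arbitrary fold count are shown here for clarity:

def split_into_folds(examples, num_folds):
    return [examples[i::num_folds] for i in range(num_folds)]

def cross_validation(examples, nll, num_folds=4):
    # K-fold CV: average held-out loss of each fold, conditioning on the remaining folds
    folds = split_into_folds(examples, num_folds)
    losses = []
    for i, held_out in enumerate(folds):
        context = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        losses.append(nll(context, held_out))
    return sum(losses) / len(losses)

def minimum_description_length(examples, nll, num_folds=4):
    # Prequential MDL: code length of each fold given only the folds seen before it
    folds = split_into_folds(examples, num_folds)
    total, seen = 0.0, []
    for fold in folds:
        total += nll(list(seen), fold)  # earlier folds act as the in-context examples
        seen.extend(fold)
    return total

# Toy usage with a stand-in nll (here it just counts eval examples, for illustration);
# the prompt with the lowest CV or MDL score is the one that gets selected.
examples = list(range(8))
print(cross_validation(examples, lambda ctx, ev: float(len(ev)), num_folds=4))
print(minimum_description_length(examples, lambda ctx, ev: float(len(ev)), num_folds=4))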

We use a similar command to run on CB, RTE, and WiC, just adding a couple extra flags:

python eval_lm.py --engine distilgpt2 --num_train 5 --seeds 0 --data_name super_glue --rels cb rte wic

The --data_name flag specifies that you want to run on SuperGLUE datasets, and the --rels flag specifies which SuperGLUE datasets you'd like to run on. BoolQ is also supported (using --rels boolq), but be warned that its inputs are quite long, which can make running GPT2 models time-consuming and running GPT3 models costly. Other datasets require some extra modification to use their respective few-shot approach described in the GPT3 paper.

At this point, if you've run one of the above eval_lm.py commands, you can skip to Post-processing GPT Results. Below, we show the specific commands we used to reproduce different sets of results in our paper.

For reference, we include a full bash loop you can run to reproduce all of our LAMA experiments that use 5 training examples:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage' 'curie' 'davinci'; do  # Different models, in order: DistilGPT2, GPT2 (117M, 345M, 762M, 1.5B), and GPT3 (2.7B, 6.7B, 13B, 175B)
for NT in 5; do  # 5 training examples
for SEED in 0 1 2 3 4; do  # 5 random seeds to sample different training sets
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED
# Results will be saved to $BASE/data/rel2template2results.data_name-TREx_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

You'll probably want to parallelize the calls to eval_lm.py, since the above will take a while on a single GPU. Running with GPT3 models (ada, babbage, curie, davinci) will query the OpenAI API (so you'll be charged for these queries).
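
If it helps, here is one possible way (our own sketch, not part of the repo) to parallelize the GPT-2 runs across local GPUs from Python: each GPU gets its own worker process, pinned via CUDA_VISIBLE_DEVICES, and works through its share of the (engine, seed) jobs sequentially. Adjust NUM_GPUS and the job lists to your setup; GPT-3 engines go through the OpenAI API and don't need a GPU:

import itertools
import os
import subprocess
from multiprocessing import Process

NUM_GPUS = 4
ENGINES = ["distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]
SEEDS = [0, 1, 2, 3, 4]

def run_jobs_on_gpu(gpu_id, jobs):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # pin this worker to one GPU
    for engine, seed in jobs:
        subprocess.run(["python", "eval_lm.py", "--engine", engine,
                        "--num_train", "5", "--seeds", str(seed)],
                       env=env, check=True)

if __name__ == "__main__":
    all_jobs = list(itertools.product(ENGINES, SEEDS))
    chunks = [all_jobs[i::NUM_GPUS] for i in range(NUM_GPUS)]  # round-robin split across GPUs
    workers = [Process(target=run_jobs_on_gpu, args=(gpu, chunk))
               for gpu, chunk in enumerate(chunks)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()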

Similarly, we include the full bash loop you can run to reproduce our results for true few-shot prompt selection varying the number of training examples used for selection:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage'; do  # Here, we don't use curie (13B) and davinci (175B) models for cost reasons 
for NT in 5 10 15 20 30 40; do  # you won't need to run with NT=5 if you ran the above bash loop
for SEED in 0 1 2 3 4; do  # 5 random seeds to sample different training sets
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED
# Results will be saved to $BASE/data/rel2template2results.data_name-TREx_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

Lastly, we include the bash loop for reproducing our SuperGLUE experiments:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage' 'curie' 'davinci'; do 
for NT in 5; do
for SEED in 0 1 2 3 4; do
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED --data_name super_glue --rels cb rte wic 
# Results will be saved to $BASE/data/rel2template2results.data_name-super_glue_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

Post-processing GPT Results

The eval_lm.py commands above save many statistics about the predictions made by different GPT models, which makes it time-consuming to load all of the data every time we'd like to plot different results. Thus, we first extract the stats that we care about (e.g., the stats we need to compute CV/MDL) and save them to smaller files like so:

cd $BASE
DATA_NAME="TREx"     # Use the "TREx" split of LAMA-UHN
FIELDS="nlls ranks"  # names of stats we use to compute cv/mdl/acc ("nlls" for LM Negative Log-Likelihood, "ranks" to get the rank of the true answer, according to the LM -- we will convert this to accuracy later on) 
python extract_fields.py --data_name $DATA_NAME --keys $FIELDS  # see the command line flags in this script, if you'd like to just extract results for a subset of models, training seeds, etc.

For SuperGLUE, we extract the stats for computing cv/mdl/acc as follows:

cd $BASE
DATA_NAME="super_glue"                   # Use super_glue datasets (WiC, RTE, CB)
FIELDS="verbalizer_nlls verbalize_accs"  # names of stats we use to compute cv/mdl/acc ("verbalizer_nlls" to get the NLL of the true answer after eliminating tokens that aren't "verbalizer" tokens or class names; "verbalizer_accs" to get the accuracy when only consider the probabilities of classes with class names instead of all possible tokens) 
python extract_fields.py --data_name $DATA_NAME --keys $FIELDS  # see the command line flags in this script, if you'd like to just extract results for a subset of models, training seeds, etc.

Now you can move on to plotting the prompt selection results using the outputs of the above commands.

Loading our pre-run experiments for prompt selection

You can load the results from our GPT-2 evaluation runs, to avoid having to evaluate all of the models yourself:

cd $BASE/data
gdown --id 1LSvNS_M47a8QcaZ5-ebNadG_ceW4LzIU  # or download manually from https://drive.google.com/file/d/1LSvNS_M47a8QcaZ5-ebNadG_ceW4LzIU/view
tar -xzvf eval_lm_results.tar.gz
mv eval_lm_results/* .
rmdir eval_lm_results
rm eval_lm_results.tar.gz

We do not have permission from OpenAI to release our GPT-3 results (please send us an email if this is an issue for you). However, our pre-computed GPT-2 results will still allow you to move on to plotting our results for GPT-2 models below.

Plotting results from true few-shot prompt selection

You can compute and save our main paper plots (Figures 1, 2, and 5) for GPT2/DistilGPT2 models like so:

cd $BASE
python plot_results.py --exp 'TREx-vary_models'

The above command will save plots to $BASE/plots/TREx-vary_models. If you have also computed GPT-3 results, simply add the "--use_gpt3" flag to plot GPT-3 results as well.

To plot our other results, change the value for the experiment flag --exp:

--exp                  Plots results for...
TREx-vary_models       various model sizes (all sizes; Figures 1, 2, 5)
TREx-vary_num_train    various numbers of training examples (all sizes up to 6.7B; Figures 3, 4)
TREx-vary_criterion    various model selection criteria (all criteria described in the Appendix)
RTE                    RTE
CB                     CB
WiC                    WiC

True Few-Shot Hyperparameter Selection for ADAPET

Here, we describe how to choose hyperparameters for ADAPET, a few-shot learning method that finetunes a language model using a classification loss alongside an auxiliary masked language modeling objective. We use a modified version of the original ADAPET code, which we include in this repo. Below, we detail how to set up and run our version of the repo to reproduce our results.

ADAPET trains on a GPU, so you'll need to ensure that CUDA is installed (for NVIDIA GPUs). First, move to the main directory (cd $BASE) and deactivate any existing virtual environments (e.g., conda deactivate if using conda). Then run source bin/init.sh, which will automatically:

  • Download the FewGLUE and SuperGLUE datasets into data/fewglue/{task} and data/superglue/{task}, respectively.
  • Install the required dependencies and set up a virtual environment.

If you run into issues installing the virtual environment with source bin/init.sh, you can also use conda (as we did) to run experiments instead. We installed Python 3.7.10 using Anaconda 4.8.3, along with the required dependencies (including PyTorch 1.5), like so:

conda create -y -n adapet python=3.7.10
conda activate adapet
pip install -r requirements.txt

To train with an AMD GPU instead of NVIDIA, you'll need to install ROCm (AMD's CUDA) and then install a ROCm-compatible version of PyTorch 1.8+ instead of PyTorch 1.5 as done above (see pytorch.org for instructions).

We shuffle FewGLUE and generate the 3 additional random subsets of SuperGLUE used in our paper with:

python subsample_superglue.py

For ReCoRD, you'll also need to download separate files containing the few-shot training set labels (we had to format these in a special way for our evaluation scripts later on):

cd $BASE/data/fewglue/ReCoRD
gdown --id 1oTURVQ0Zvoq5cQr7PtqPMc775bp9o6fz
tar -xzvf eval_train_labels.tar.gz
rm eval_train_labels.tar.gz
mv eval_train_labels/* .
rmdir eval_train_labels

You can train an ADAPET model on the full WiC dataset like so:

TN="WiC"     # For other tasks, change to "CB" "COPA" "BoolQ" "RTE" "WSC" "MultiRC" "ReCoRD"
TSS=0        # Random seed for sampling training set. Use TSS=0 for FewGLUE. We used TSS in 0 1 2 3 for our 4 training sets
FMR="False"  # Whether or not to use a fixed masking ratio (as opposed to variable masking ratio -- see ADAPET Appendix Table 12. We always use "False" or variable masking, following ADAPET.)
MA=0.105     # Mask alpha or fraction to use (we sweep over 0.075 0.10 0.105 0.15 following ADAPET)
checkpoint_dirname="tss-$TSS.ma-$MA.fmr-$FMR"
checkpoint_dir="exp_out/fewglue/$TN/albert-xxlarge-v2/$checkpoint_dirname"
mkdir -p $checkpoint_dir  # Make directory for saving results
python src/train.py -c config/$TN.json -k "exp_name='$checkpoint_dirname'" "mask_alpha=$MA" "fixed_mask_ratio=$FMR" "train_set_seed=$TSS" "save_model=True"

Evaluate the K=8 fold cross-validation loss for the above dataset/hyperparameters by training 8 models as follows (the code refers to "folds" as "blocks", following terminology in MDL):

# Set same hyperparameters as before
TN="WiC"
TSS=0
FMR="False"
MA=0.105

SM="cv"      # selection method: "cv" for cross-validation or "mdl" for minimum description length
NB=8         # number of blocks/folds for mdl/cv evaluation 
for BN in $(seq 0 $((NB-1))); do  # iterate over all block (fold) numbers
    checkpoint_dirname="tss-$TSS.ma-$MA.fmr-$FMR.sm-$SM.nb-$NB.bn-$BN"
    checkpoint_dir="exp_out/fewglue/$TN/albert-xxlarge-v2/$checkpoint_dirname"
    mkdir -p $checkpoint_dir
    python src/train.py -c config/$TN.json -k "exp_name='$checkpoint_dirname'" "selection_method='$SM'" "num_blocks=$NB" "block_no=$BN" "mask_alpha=$MA" "fixed_mask_ratio=$FMR" "train_set_seed=$TSS"
done
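
Once these per-block held-out losses exist for each hyperparameter setting, true few-shot hyperparameter selection just picks the setting with the lowest CV (or MDL) score; adapet.py handles this for you, but a toy sketch of the idea (with hypothetical names and made-up loss values) looks like this:

# Map each (mask_alpha, fixed_mask_ratio) setting to its 8 held-out block losses
# (values below are made up purely for illustration)
block_losses = {
    (0.075, "False"): [0.71, 0.66, 0.73, 0.69, 0.70, 0.74, 0.68, 0.72],
    (0.105, "False"): [0.64, 0.69, 0.66, 0.67, 0.70, 0.65, 0.68, 0.66],
}
cv_score = {hp: sum(losses) / len(losses) for hp, losses in block_losses.items()}
best_ma, best_fmr = min(cv_score, key=cv_score.get)
print(f"CV-chosen hyperparameters: mask_alpha={best_ma}, fixed_mask_ratio={best_fmr}")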

For more details on how to train/evaluate/test ADAPET, please see the README of the original ADAPET repo, which we used for our code (with only minor modifications).

You can load the results from our training runs, to avoid having to train all of the models yourself:

# Download training run results
cd $BASE/exp_out
rm -r fewglue  # delete any existing results, so we can replace them with our pre-computed results
gdown --id 1Kz5E7v-ejLFeLGSUKPvy9aAPe9_NByZa  # or download manually from https://drive.google.com/file/d/1NAd8nhvpQTl3AYG3jT2ijBtm313JtSDY/view
tar -xzvf fewglue_adapet_results.tar.gz
rm fewglue_adapet_results.tar.gz

You can then print the results from CV/MDL hyperparameter selection with:

cd $BASE
python adapet.py

The above command will print a LaTeX table showing the results for the best/worst/mean/median hyperparameters, as well as for the CV/MDL-chosen hyperparameters. You can also show a subset of results (e.g., if you haven't trained on all SuperGLUE tasks) by using command line flags:

cd $BASE
TNS="MultiRC WiC BoolQ"  # Task Names to evaluate on, from {'BoolQ', 'CB', 'COPA', 'RTE', 'WiC', 'WSC', 'MultiRC', 'ReCoRD'}
TSSS="0 1"  # Train Set Seeds to show mean/std. dev. results for (we used 0 1 2 3 in our paper)
SMS="cv mdl"  # Selection Methods used to choose hyperparameters, in {'cv', 'mdl'}
python adapet.py --tns $TNS --tsss $TSSS --sms $SMS

Feel free to open an issue if you have any questions, and have fun true few-shot learning!

Bibtex Citation

@article{perez2021true,
  author = {Ethan Perez and Douwe Kiela and Kyunghyun Cho},
  title = {True Few-Shot Learning with Language Models},
  journal = {arXiv},
  year = {2021},
  url = {https://arxiv.org/abs/2105.11447}
}