The repository for the paper "When Do You Need Billions of Words of Pretraining Data?"

Last update: Nov 25, 2022

Overview

pretraining-learning-curves

This is the repository for the paper When Do You Need Billions of Words of Pretraining Data?

Edge Probing

We use jiant1 for our edge probing experiments. This tutorial can help you set up the environment and get started with jiant.

Below is an example of how to reproduce our dependency labelling experiment with roberta-base-1B-3, which is one of the MiniBERTas we probe.

Download and Preprocess the Data

The commands below download and tokenize the data for the dependency labelling task. Remember to change directory to the root of the jiant repository and activate your jiant environment first.

mkdir data

mkdir data/edges

probing/data/get_ud_data.sh data/edges/dep_ewt

python probing/get_edge_data_labels.py -o data/edges/dep_ewt/labels.txt -i data/edges/dep_ewt/*.json

python probing/retokenize_edge_data.py -t nyu-mll/roberta-base-1B-3 data/edges/dep_ewt/*.json
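As a quick sanity check on the downloaded data, the sketch below counts how often each label occurs in one edge probing file. It assumes each line of the file is a JSON object with a `targets` list whose entries carry a `label` field (a string or a list of strings); that layout is an assumption here, so inspect a line of your own files before relying on it. The helper name `count_labels` is hypothetical and not part of jiant.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count target labels across an iterable of JSON-lines records.

    Assumed record layout (not verified against jiant):
    {"text": "...", "targets": [{"label": "nsubj", ...}, ...]}
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for target in record.get("targets", []):
            label = target["label"]
            # A label may be a single string or a list of strings.
            labels = [label] if isinstance(label, str) else label
            counts.update(labels)
    return counts
```

To use it, open one of the JSON files under data/edges/dep_ewt, pass the file object to `count_labels`, and compare the resulting keys with the entries in data/edges/dep_ewt/labels.txt.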

Run the Experiment

If you have not used jiant before, you will probably need to set two critical environment variables:

$JIANT_PROJECT_PREFIX: the directory where logs and model checkpoints will be saved.

$JIANT_DATA_DIR: the data directory. Set it to PATH/TO/LOCAL/REPO/data
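Before launching a long run, it can help to confirm that both variables are actually set. The following is a minimal sketch of such a pre-flight check; the function name `missing_jiant_vars` is hypothetical and not part of jiant, and the check only looks at whether the variables are non-empty, not whether the paths exist.

```python
import os

# The two variables named above.
REQUIRED_VARS = ("JIANT_PROJECT_PREFIX", "JIANT_DATA_DIR")


def missing_jiant_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_jiant_vars()` with no arguments inspects the current process environment; an empty list means both variables are set.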

Now, you are ready to run the probing program:

python main.py --config_file jiant/config/edgeprobe/edgeprobe_miniberta.conf \
    --overrides "exp_name=DL_tutorial, target_tasks=edges-dep-ud-ewt, \
transformers_output_mode=mix, input_module=nyu-mll/roberta-base-1B-3, \
target_train_val_interval=1000, batch_size=32, target_train_max_vals=130, lr=0.0005"
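The overrides string is easy to mistype when it is edited by hand. As an optional convenience, the sketch below assembles the same command from a dictionary so that each override lives on its own line. The helper `build_probe_command` is hypothetical and not part of jiant; the flags, config path, and override values are the ones used in the command above.

```python
def build_probe_command(config_file, overrides):
    """Return an argv list for main.py with overrides joined as key=value pairs."""
    override_str = ", ".join(f"{key}={value}" for key, value in overrides.items())
    return ["python", "main.py", "--config_file", config_file, "--overrides", override_str]


command = build_probe_command(
    "jiant/config/edgeprobe/edgeprobe_miniberta.conf",
    {
        "exp_name": "DL_tutorial",
        "target_tasks": "edges-dep-ud-ewt",
        "transformers_output_mode": "mix",
        "input_module": "nyu-mll/roberta-base-1B-3",
        "target_train_val_interval": 1000,
        "batch_size": 32,
        "target_train_max_vals": 130,
        "lr": 0.0005,
    },
)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the root of the jiant repository. Because it is an argv list rather than a shell string, no shell quoting of the overrides is needed.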

A log message is printed after each validation. You should expect the validation F1 to exceed 90 within the first few validations.

The final validation result is printed when the experiment finishes, and can also be found in $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv. You should expect the final validation F1 to be around 95.

Minimum Description Length Probing with Edge Probing tasks

For this experiment, we use this fork of jiant1.

BLiMP

The code for our BLiMP experiments can be found here. You can already check results for our MiniBERTas.

If you want to rerun the experiments yourself, we have prepared the BLiMP data, so you only need to install the dependencies for the environment and run the scripts following the tutorial here. Note that your CUDA version may cause problems when installing mxnet.

SuperGLUE

We use jiant2 for our SuperGLUE experiments. Get started with jiant2 using this guide and examples.

Owner

ML² AT CILVR
