The repository for the paper "When Do You Need Billions of Words of Pretraining Data?"

Overview

pretraining-learning-curves

This is the repository for the paper When Do You Need Billions of Words of Pretraining Data?

Edge Probing

We use jiant1 for our edge probing experiments. This tutorial can help you set up the environment and get started with jiant.

Below is an example of how to reproduce our dependency labelling experiment with roberta-base-1B-3, which is one of the MiniBERTas we probe.

Download and Preprocess the Data

The commands below help you get and tokenize the data for the dependency labelling task. Remember to change directory to the root of the jiant and activate your jiant environment first.

mkdir data

mkdir data/edges

probing/data/get_ud_data.sh data/edges/dep_ewt

python probing/get_edge_data_labels.py -o data/edges/dep_ewt/labels.txt -i data/edges/dep_ewt/*.json

python probing/retokenize_edge_data.py -t nyu-mll/roberta-base-1B-3  data/edges/dep_ewt/*.json

Run the Experiment

If you have not used jiant before, you will probably need to set two critical environment variables:

$JIANT_PROJECT_PREFIX: the directory where logs and model checkpoints will be saved.

$JIANT_DATA_DIR: The data directory. Set it to PATH/TO/LOCAL/REPO/data

Now, you are ready to run the probing program:

python main.py –config_file jiant/config/edgeprobe/edgeprobe_miniberta.conf\ 
–overrides “exp_name=DL_tutorial, target_tasks=edges-dep-ud-ewt,\
transformers_output_mode=mix, input_module=nyu-mll/roberta-base-1B-3,\ 
target_train_val_interval=1000, batch_size=32, target_train_max_vals=130, lr=0.0005”

A logging message will be printed out after each validation. You should expect validation f1 to exceed 90 in only a few validations.

The final validation result will be printed after the experiment is finished, and can also be found in $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv. You should expect the final validation f1 to be around 95.

Minimum Description Length Probing with Edge Probing tasks

For this experiment, we use this fork of jiant1.

BLiMP

The code for our BLiMP experiments can be found here. You can already check results for our MiniBERTas.

If you want to rerun experiments on your own, we have prepared BLiMP data so you only need to include all dependencies for the environment and run scripts following the tutorial here. Note that when intalling dependencies CUDA version could be a problem when installing mxnet.

SuperGLUE

We use jiant2 for our SuperGLUE experiments. Get started with jiant2 using this guide and examples.

Owner
ML² AT CILVR
The Machine Learning for Language Group at NYU CILVR
ML² AT CILVR
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped

CSWin-Transformer This repo is the official implementation of "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows". Th

Microsoft 409 Jan 06, 2023
Implementation of MA-Trace - a general-purpose multi-agent RL algorithm for cooperative environments.

Off-Policy Correction For Multi-Agent Reinforcement Learning This repository is the official implementation of Off-Policy Correction For Multi-Agent R

4 Aug 18, 2022
Tooling for the Common Objects In 3D dataset.

CO3D: Common Objects In 3D This repository contains a set of tools for working with the Common Objects in 3D (CO3D) dataset. Download the dataset The

Facebook Research 724 Jan 06, 2023
Tutorial repo for an end-to-end Data Science project

End-to-end Data Science project This is the repo with the notebooks, code, and additional material used in the ITI's workshop. The goal of the session

Deena Gergis 127 Dec 30, 2022
Python utility to generate filesystem content for Obsidian.

Security Vault Generator Quickly parse, format, and output common frameworks/content for Obsidian.md. There is a strong focus on MITRE ATT&CK because

Justin Angel 73 Dec 02, 2022
NasirKhusraw - The TSP solved using genetic algorithm and show TSP path overlaid on a map of the Iran provinces & their capitals.

Nasir Khusraw : Travelling Salesman Problem The TSP solved using genetic algorithm. This project show TSP path overlaid on a map of the Iran provinces

J Brave 2 Sep 01, 2022
Axel - 3D printed robotic hands and they controll with Raspberry Pi and Arduino combo

Axel It's our graduation project about 3D printed robotic hands and they control

0 Feb 14, 2022
Build an Amazon SageMaker Pipeline to Transform Raw Texts to A Knowledge Graph

Build an Amazon SageMaker Pipeline to Transform Raw Texts to A Knowledge Graph This repository provides a pipeline to create a knowledge graph from ra

AWS Samples 3 Jan 01, 2022
Beyond Image to Depth: Improving Depth Prediction using Echoes (CVPR 2021)

Beyond Image to Depth: Improving Depth Prediction using Echoes (CVPR 2021) Kranti Kumar Parida, Siddharth Srivastava, Gaurav Sharma. We address the pr

Kranti Kumar Parida 33 Jun 27, 2022
Another pytorch implementation of FCN (Fully Convolutional Networks)

FCN-pytorch-easiest Trying to be the easiest FCN pytorch implementation and just in a get and use fashion Here I use a handbag semantic segmentation f

Y. Dong 158 Dec 21, 2022
OCRA (Object-Centric Recurrent Attention) source code

OCRA (Object-Centric Recurrent Attention) source code Hossein Adeli and Seoyoung Ahn Please cite this article if you find this repository useful: For

Hossein Adeli 2 Jun 18, 2022
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network Paddle-PANet 目录 结果对比 论文介绍 快速安装 结果对比 CTW1500 Method Backbone Fine

7 Aug 08, 2022
Official implementation of "SinIR: Efficient General Image Manipulation with Single Image Reconstruction" (ICML 2021)

SinIR (Official Implementation) Requirements To install requirements: pip install -r requirements.txt We used Python 3.7.4 and f-strings which are in

47 Oct 11, 2022
Source code for Acorn, the precision farming rover by Twisted Fields

Acorn precision farming rover This is the software repository for Acorn, the precision farming rover by Twisted Fields. For more information see twist

Twisted Fields 198 Jan 02, 2023
Project dự đoán giá cổ phiếu bằng thuật toán LSTM gồm: code train và code demo

Web predicts stock prices using Long - Short Term Memory algorithm Give me some start please!!! User interface image: Choose: DayBegin, DayEnd, Stock

Vo Thuong Truong Nhon 8 Nov 11, 2022
Computer Vision is an elective course of MSAI, SCSE, NTU, Singapore

[AI6122] Computer Vision is an elective course of MSAI, SCSE, NTU, Singapore. The repository corresponds to the AI6122 of Semester 1, AY2021-2022, starting from 08/2021. The instructor of this course

HT. Li 5 Sep 12, 2022
Trustworthy AI related projects

Trustworthy AI This repository aims to include trustworthy AI related projects from Huawei Noah's Ark Lab. Current projects include: Causal Structure

HUAWEI Noah's Ark Lab 589 Dec 30, 2022
Codebase for the Summary Loop paper at ACL2020

Summary Loop This repository contains the code for ACL2020 paper: The Summary Loop: Learning to Write Abstractive Summaries Without Examples. Training

Canny Lab @ The University of California, Berkeley 44 Nov 04, 2022
OBBDetection: an oriented object detection toolbox modified from MMdetection

OBBDetection note: If you have questions or good suggestions, feel free to propose issues and contact me. introduction OBBDetection is an oriented obj

MIXIAOXIN_HO 3 Nov 11, 2022
Making self-supervised learning work on molecules by using their 3D geometry to pre-train GNNs. Implemented in DGL and Pytorch Geometric.

3D Infomax improves GNNs for Molecular Property Prediction Video | Paper We pre-train GNNs to understand the geometry of molecules given only their 2D

Hannes Stärk 95 Dec 30, 2022