Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration ๐Ÿšƒ

Overview


This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers more flexibility when using our training scripts, while also making it easier to adapt our code contributions into other projects.

Why DinkyTrain?

The Dinky runs between Princeton Junction and Princeton and is the shortest scheduled commuter rail line in the United States. We also aim to make pre-training short and accessible to everyone.

Our Contributions

  • DeepSpeed transformer kernel integration
  • A training recipe for efficient MLM pre-training
  • An easy-to-follow guideline of using fairseq for MLM pre-training.

Other fairseq features:

See the fairseq repo and its documentation for more details on how to use and extend fairseq.

DinkyTrain for Efficient MLM Pre-training

Quick Links

Overview

You can reproduce the pre-training experiments of our recent paper Should You Mask 15% in Masked Language Modeling?, where we find that higher masking rates can lead to more efficient pre-training.

Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
  • For faster training (FP16) install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For faster training (DeepSpeed cuda kernel) install DeepSpeed library and compile the DeepSpeed kernel
DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install deepspeed
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run .

Trouble-shooting:

  • If using lower version of Python, you might encounter import problems with importlib.metadata. Try pip install importlib-metadata.
  • To install apex and deepspeed, you will need nvcc (CUDA compiler).
  • When installing apex, if you encounter the error Cuda extensions are bing compiled with a version of Cuda that does not match ..., go to setup.py and comment out the line that raised the error (at your own risk).
  • Both apex and deepspeed installation require a high gcc version to support c++14. If you encounter relevant errors, update your gcc.

Data Pre-processing

Tokenization: First, download the GPT2 BPE vocabulary:

wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

Then, tokenize your raw data:

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs ${SPLIT}.raw \
    --outputs ${SPLIT}.bpe \
    --keep-empty \
    --workers 8

Finally, index and binarize your data:

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref ${TRAIN_SPLIT}.bpe \
    --validpref ${VALID_SPLIT}.bpe \
    --testpref ${TEST_SPLIT}.bpe \
    --destdir output-bin \
    --workers 8

Alternatively: Use our pre-processed data: We preprocessed Wikipedia+BookCorpus and shared it on Huggingface dataset. It is ~22GB and contains two epochs of data, each epoch being sliced into 8 shards. You can download it using git:

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/wikibook_fairseq_format

Pre-training

Use our script for efficient pre-training

GPU={number of GPUs} DATA_DIR={data path} [DEEPSPEED=1] bash run_efficient_mlm_recipe.sh

Flags explained

  • GPU: number of GPUs.
  • DATA_DIR: directory to the processed pre-training data. If you are using our preprocessed dataset, DATA_DIR should be:
DATA_DIR=$(seq 0 15 | sed -e 's/^/wikibook_fairseq_format\/bin-shard/' | sed -e 's/$/-8/' | paste -sd ':')
  • DEEPSPEED (optional): if set to 1, the DeepSpeed CUDA kernel will be used.

Please refer to the script for more hyperparameter choices.

Fine-tuning on GLUE and SQuAD

All our checkpoints can be converted to HuggingFace transformers models (see next nextion) and use the transformers package for fine-tuning. Fairseq also supports fine-tuning on GLUE.

First, download the preprocessed GLUE data (you can also process by yourself following the preprocess section above):

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/glue_fairseq_format

Then use the following script for fine-tuning

DATA_DIR={path to the data directory} \
TASK={glue task name (mnli qnli qqp rte sst2 mrpc cola stsb)} \
LR={learning rate} \
BSZ={batch size} \
EPOCHS={number of epochs} \
SEED={random seed} \
CKPT_DIR={checkpoint's directory} \
CKPT_NAME={checkpoint's name} \
[DEEPSPEED=1] bash finetune_glue.sh

For fine-tuning on SQuAD, please convert the models to HuggingFace checkpoints following the next section and use HuggingFace's examples.

Convert to HuggingFace

We also provide conversion codes so that you can easily turn Fairseq checkpoints into HuggingFace checkpoints. Usage:

cd scripts
[PRELAYERNORM=1] [FROM_DS=1] python convert_fs_ckpt_to_hf_ckpt.py --fr {fairseq checkpoint} --to {huggingface checkpoint path} --hf_model_config {roberta-base/roberta-large}

Flags explained:

  • PRELAYERNORM=1: Using pre layer-norm (default is post layer-norm).
  • FROM_DS=1: The Fairseq checkpoint uses DeepSpeed's cuda kernel.
  • --fr: The path to the Fairseq checkpoint.
  • --to: The path you want to save the HuggingFace checkpoint to.
  • --hf_model_config: roberta-base or roberta-large.

IMPORTANT: all our models use pre layer norm, which is not supported by HuggingFace yet. To use it, import the model class from huggingface/modeling_roberta_prelayernorm.py. For example:

from huggingface.modeling_roberta_prelayernorm import RobertaForSequenceClassification

For more configuration, please refer to convert_fs_ckpt_to_hf_ckpt.py.

Model List

Here are the HuggingFace checkpoints of our models in the paper Should You Mask 15% in Masked Language Modeling. Results are development set performance.

Model MNLI QNLI QQP SST-2
princeton-nlp/efficient_mlm_m0.15 84.2 90.9 87.8 93.3
princeton-nlp/efficient_mlm_m0.20 84.1 91.3 87.9 92.7
princeton-nlp/efficient_mlm_m0.30 84.2 91.6 88.0 93.0
princeton-nlp/efficient_mlm_m0.40 84.5 91.6 88.1 92.8
princeton-nlp/efficient_mlm_m0.50 84.1 91.1 88.1 92.7
princeton-nlp/efficient_mlm_m0.60 83.2 90.7 87.8 92.6
princeton-nlp/efficient_mlm_m0.70 82.3 89.4 87.5 91.9
princeton-nlp/efficient_mlm_m0.80 80.8 87.9 87.1 90.5
princeton-nlp/efficient_mlm_m0.15-801010 83.7 90.4 87.8 93.2
princeton-nlp/efficient_mlm_m0.40-801010 84.3 91.2 87.9 93.0

We also offer the original (deepspeed) fairseq checkpoints here.

Bugs or Questions?

If you hav an questions, or encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

@article{wettig2022should,
   title={Should You Mask 15% in Masked Language Modeling?},
   author={Wettig, Alexander and Gao, Tianyu and Zhong, Zexuan and Chen, Danqi},
   boo={arXiv preprint arXiv:2202.08005},
   year={2022}
}

Acknowledgment

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48โ€“53.

  • Our efficient training recipe is based on the following paper:

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Empirical Methods in Natural Language Processing (EMNLP), pages 10644โ€“10652.

Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
TensorFlow code and pre-trained models for BERT

BERT ***** New March 11th, 2020: Smaller BERT Models ***** This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece

Google Research 32.9k Jan 08, 2023
Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

This repository contains the resources in our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published i

Yiming Cui 463 Dec 30, 2022
Train ๐Ÿค—transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
Application for shadowing Chinese.

chinese-shadowing Simple APP for shadowing chinese. With this application, it is very easy to record yourself, play the sound recorded and listen to s

Thomas Hirtz 5 Sep 06, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Code repository for "It's About Time: Analog clock Reading in the Wild"

it's about time Code repository for "It's About Time: Analog clock Reading in the Wild" Packages required: pytorch (used 1.9, any reasonable version s

52 Nov 10, 2022
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
InferSent sentence embeddings

InferSent InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language in

Facebook Research 2.2k Dec 27, 2022
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
Simple Annotated implementation of GPT-NeoX in PyTorch

Simple Annotated implementation of GPT-NeoX in PyTorch This is a simpler implementation of GPT-NeoX in PyTorch. We have taken out several optimization

labml.ai 101 Dec 03, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
Yet Another Neural Machine Translation Toolkit

YANMTT YANMTT is short for Yet Another Neural Machine Translation Toolkit. For a backstory how I ended up creating this toolkit scroll to the bottom o

Raj Dabre 121 Jan 05, 2023
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
IMDB film review sentiment classification based on BERT's supervised learning model.

IMDB film review sentiment classification based on BERT's supervised learning model. On the other hand, the model can be extended to other natural language multi-classification tasks.

Paris 1 Apr 17, 2022