EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Overview

This is the official implementation for "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling" (EMNLP 2021).

Requirements

  • torch
  • transformers
  • datasets
  • scikit-learn
  • tensorflow
  • spacy

How to pre-train

1. Clone this repository

git clone https://github.com/gucci-j/light-transformer-emnlp2021.git

2. Install required packages

cd ./light-transformer-emnlp2021
pip install -r requirements.txt

requirements.txt is located at the root of light-transformer-emnlp2021.

You also need spaCy's en_core_web_sm model for preprocessing. If it is not installed, run python -m spacy download en_core_web_sm.
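If you prefer to check for the model programmatically, a small snippet like the following works (this is our sketch, not part of the repository):

import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:  # model not installed yet
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")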

3. Preprocess datasets

cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/

You need to specify the following argument:

  • path: (str) where to save the processed data

4. Pre-training

You need to specify configs as command-line arguments. A sample config for MLM pre-training is shown below; python pretrainer.py --help will display a help message.

cd ../
python pretrainer.py \
--data_dir=/path/to/dataset/ \
--do_train \
--learning_rate=1e-4 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=12774 \
--save_steps=12774 \
--seed=42 \
--per_device_train_batch_size=16 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm=True \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM

  • pretrain_model should be selected from the following (a simplified sketch of the Shuffle objective follows this list):
    • RobertaForMaskedLM (MLM)
    • RobertaForShuffledWordClassification (Shuffle)
    • RobertaForRandomWordClassification (Random)
    • RobertaForShuffleRandomThreeWayClassification (Shuffle+Random)
    • RobertaForFourWayTokenTypeClassification (Token Type)
    • RobertaForFirstCharPrediction (First Char)
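
For intuition, here is a minimal sketch (ours, not the repository's implementation) of how a Shuffle-style objective could corrupt a batch and build per-token binary labels. It ignores special tokens and padding, and labels every selected position as shuffled even if the permutation happens to leave a token in place:

import torch

def build_shuffle_batch(input_ids: torch.Tensor, shuffle_prob: float = 0.15):
    """Shuffle a random subset of positions in each sequence and return
    (corrupted_ids, labels), where labels[i, j] = 1 iff position j was
    selected for shuffling."""
    corrupted = input_ids.clone()
    labels = torch.zeros_like(input_ids)
    for row in range(corrupted.size(0)):
        mask = torch.rand(corrupted.size(1)) < shuffle_prob
        positions = mask.nonzero(as_tuple=True)[0]
        if positions.numel() < 2:
            continue  # nothing to shuffle in this sequence
        perm = positions[torch.randperm(positions.numel())]
        corrupted[row, positions] = corrupted[row, perm]  # RHS is copied before assignment
        labels[row, positions] = 1
    return corrupted, labels

The per-token labels are then fed to a lightweight binary classification head instead of the full MLM vocabulary softmax.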

Check the pre-training process

You can monitor the progress of pre-training via TensorBoard. Simply run the following:

tensorboard --logdir=/path/to/log/dir/

Distributed training

pretrainer.py is compatible with distributed training. A sample config for MLM pre-training is as follows.

python -m torch.distributed.launch \
--nproc_per_node=8 \
pretrainer.py \
--data_dir=/path/to/dataset/ \
--model_path=None \
--do_train \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=24000 \
--save_steps=1000 \
--seed=42 \
--per_device_train_batch_size=8 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM 

For more details about launch.py, please refer to https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py.
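
Under the hood, the launcher spawns one process per GPU and hands each process its rank (as a --local_rank argument, or as the LOCAL_RANK environment variable on newer PyTorch versions with --use_env); the training script then initialises the process group. pretrainer.py delegates this to the Hugging Face Trainer, but conceptually it looks like this sketch:

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by the launcher (or passed as --local_rank)
torch.cuda.set_device(local_rank)                  # bind this process to its GPU
dist.init_process_group(backend="nccl")            # one process per GPU

Note that the effective batch size is nproc_per_node × per_device_train_batch_size (8 × 8 = 64 in the sample above).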

Mixed precision training

Installation

  • For PyTorch version >= 1.6, there is a native functionality to enable mixed precision training.
  • For older versions, NVIDIA apex must be installed.
    • You might encounter errors when installing apex due to permission problems. To fix these, set export TMPDIR='/path/to/your/favourite/dir/' and change the permissions of all files under apex/.git/ to 777.
    • You also need to choose an optimisation level from https://nvidia.github.io/apex/amp.html.

Usage
To use mixed precision during pre-training, just specify --fp16 as an input argument. For older PyTorch versions, also choose --fp16_opt_level from O0, O1, O2, or O3.
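
For reference, --fp16 with native AMP corresponds roughly to the following training-loop pattern (a sketch, with model, optimizer, and loader as placeholders; the Trainer does this internally):

import torch

scaler = torch.cuda.amp.GradScaler()
for batch in loader:  # `loader`, `model`, `optimizer` are hypothetical placeholders
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()    # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)           # unscale gradients, then step
    scaler.update()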

How to fine-tune

GLUE

  1. Download GLUE data

    git clone https://github.com/huggingface/transformers
    python transformers/utils/download_glue_data.py
    
  2. Create a json config file
    You need to create a .json file for configuration or use command line arguments.

    {
        "model_name_or_path": "/path/to/pretrained/weights/",
        "tokenizer_name": "roberta-base",
        "task_name": "MNLI",
        "do_train": true,
        "do_eval": true,
        "data_dir": "/path/to/MNLI/dataset/",
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3, 
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": true,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": true,
        "output_dir": "/path/to/save/models/",
        "overwrite_output_dir": true,
        "logging_dir": "/path/to/save/log/files/",
        "disable_tqdm": true
    }

    For task_name and data_dir, choose one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI.

  3. Fine-tune

    python run_glue.py /path/to/json/
    

    Instead of passing a JSON path, you can specify configs directly as command-line arguments; run_glue.py accepts either form, as sketched below.
    You can also monitor training via TensorBoard.
    The --help option will display a help message.
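
    run_glue.py follows the standard Hugging Face pattern of accepting either a single JSON file or command-line flags; conceptually (a sketch, not the script's exact code):

    import sys
    from transformers import HfArgumentParser, TrainingArguments

    parser = HfArgumentParser(TrainingArguments)  # the real script adds model/data argument classes
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        (training_args,) = parser.parse_json_file(json_file=sys.argv[1])
    else:
        (training_args,) = parser.parse_args_into_dataclasses()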

SQuAD

  1. Download SQuAD data

    cd ./utils
    python download_squad_data.py --save_dir=/path/to/squad/
    
  2. Fine-tune

    cd ..
    export SQUAD_DIR=/path/to/squad/
    python run_squad.py \
    --model_type roberta \
    --model_name_or_path=/path/to/pretrained/weights/ \
    --tokenizer_name roberta-base \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir=$SQUAD_DIR \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 3e-5 \
    --weight_decay=0.01 \
    --warmup_steps=3327 \
    --num_train_epochs 10.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --logging_steps=278 \
    --save_steps=50000 \
    --patience=5 \
    --objective_type=maximize \
    --metric_name=f1 \
    --overwrite_output_dir \
    --evaluate_during_training \
    --output_dir=/path/to/save/weights/ \
    --logging_dir=/path/to/save/logs/ \
    --seed=42 
    

    As with pre-training, you can monitor fine-tuning via TensorBoard; the early-stopping flags used above are sketched below.
    The --help option will display a help message.
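
    The --patience, --objective_type, and --metric_name flags control early stopping on the dev metric. Conceptually (a simplified sketch, not the script's exact logic):

    patience = 5                                           # e.g. --patience=5
    eval_f1_scores = [88.1, 88.9, 88.7, 88.8, 88.6, 88.5]  # hypothetical dev F1 per evaluation
    best, bad_evals = None, 0
    for f1 in eval_f1_scores:
        if best is None or f1 > best:   # --objective_type=maximize on --metric_name=f1
            best, bad_evals = f1, 0     # improvement: reset the counter
        else:
            bad_evals += 1              # no improvement this evaluation
            if bad_evals >= patience:
                break                   # stop fine-tuning early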

Citation

@inproceedings{yamaguchi-etal-2021-frustratingly,
    title = "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling",
    author = "Yamaguchi, Atsuki  and
      Chrysostomou, George  and
      Margatina, Katerina  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

License

MIT License

Owner
Atsuki Yamaguchi
NLP researcher