Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Overview

Training COMET using seq2seq setting

Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarization.py in the official example codes for transformers version 4.16.0.dev0.

The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed .

The training data of ATOMIC2020 can be downloaded at https://allenai.org/data/atomic-2020. You need to convert the .tsv file to .csv to be compatible with the dataloader in transformers.

Dependencies

python

torch==1.7.1
cudatoolkit=11.0
transformers==4.15.0
deepspeed==0.5.10

others

GCC/G++ 5.2.0 (to complie deepspeed ops)

Usage

1. Normal training without memory optimization:

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 

2. Train with gradient_checkpointing=True. Smaller memory usage, meanwhile lower training speed.

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing

3. Train with DeepSpeed (Either zero-stage2 or zero-stage3)

# google/t5-3B training, on 2080Ti (11GB)
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16

4. Comparison of memory usage of different memory optimization methods

Compare the memory usage on NVIDIA RTX A6000 (48685MB memory) and Nvidia GeForce 3090 (24268MB memory).

1. fp16

T5-3B: effects of fp16. A 20% reduce of memory size.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla A6000 False 8x4x1 47.5k M 1.5s/32ex
vanilla A6000 True 8x4x1 31k M 1.0s/32ex
vanilla 3090 False 1x32x1 -
vanilla 3090 True 1x32x1 -

2. gradient_checkpointing

T5-3B: Effects of gradient_checkpointing.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla A6000 False 8x4x1 47k M 1.5s/32ex
vanilla A6000 True 8x4x1 31k M 1.0s/32ex
grad-ckpt A6000 False 8x4x1 46.4k M 1.3s/32ex
grad-ckpt A6000 True 8x4x1 23.9k M 1.1/32ex
vanilla 3090 True 1x32x1 -
grad-ckpt 3090 True 1x32x1 23.8k M 15s/32ex

3. Deepspeed stage 2

T5-3B: Effects of deepspeed.

Device fp16 Batch Size x Grad-Accum x Num-GPU Memory Usage Time to Train a Batch
vanilla 3090 True 1x32x1 -
grad-ckpt 3090 True 1x32x1 23k M 13.5s/32ex
stage2 3090 True 32x1x1 20.3k M 7.5s/32ex
stage2 3090 True 16x1x2 20.3k M 6.36s/32ex
stage2 3090 True 32x1x2 20.3k M 3.75s/32ex

4. Deepspeed stage 3

stage3 will lead to smaller usage of memory but way smaller training speed.

5. Automatic Evaluation Result on ATOMIC2020 data

BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
T5-3B (no deepspeed), lr1e-5, epoch 3 0.346 0.184 0.12 0.084 0.19 0.422 0.646
T5-3B (no deepspeed), lr1e-5, epoch 2 0.348 0.185 0.121 0.085 0.19 0.424 0.651
T5-3B (no deepspeed), lr1e-5, epoch 1 0.343 0.177 0.113 0.079 0.186 0.416 0.629
T5-3B (ds_stage2, fp16) epoch 3 0.340 0.182 0.118 0.083 0.189 0.418 0.637
T5-3B (ds_stage2, fp16) epoch 2 0.337 0.177 0.114 0.078 0.189 0.419 0.633
T5-3B (ds_stage2, fp16) epoch 1 0.335 0.174 0.112 0.076 0.186 0.415 0.632

Useful discussions regarding environment setups

TODO

DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration

Owner
tqfang
Ph.D. at HKUST, interested in commonsense in NLP
tqfang
BERT, LDA, and TFIDF based keyword extraction in Python

BERT, LDA, and TFIDF based keyword extraction in Python kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichl

Andrew Tavis McAllister 41 Dec 27, 2022
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning nlp library inspired by the fast.ai library It follows the same api as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.

Ivan Didur 106 Jan 01, 2023
Google and Stanford University released a new pre-trained model called ELECTRA

Google and Stanford University released a new pre-trained model called ELECTRA, which has a much compact model size and relatively competitive performance compared to BERT and its variants. For furth

Yiming Cui 1.2k Dec 30, 2022
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
Learning Spatio-Temporal Transformer for Visual Tracking

STARK The official implementation of the paper Learning Spatio-Temporal Transformer for Visual Tracking Highlights The strongest performances Tracker

Multimedia Research 485 Jan 04, 2023
Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with build-in vectorization a

SeMI Technologies 191 Dec 26, 2022
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 A repository part of the MarIA project. Corpora 📃 Corpora Number of documents Number of tokens Size (GB) BNE 201,080,084

Plan de Tecnologías del Lenguaje - Gobierno de España 203 Dec 20, 2022
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Microsoft 105 Jan 08, 2022
A BERT-based reverse dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end / back-end 임용

94 Dec 08, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation In this repo you can find the code of the Supervised Hybrid Audio Segmentatio

Machine Translation @ UPC 21 Dec 20, 2022
iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform

iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform This repo try to implement iSTFTNet : Fast

Rishikesh (ऋषिकेश) 126 Jan 02, 2023
Segmenter - Transformer for Semantic Segmentation

Segmenter - Transformer for Semantic Segmentation

592 Dec 27, 2022
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

English|简体中文 ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,该框架将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。ERNIE在累积 40 余个典型 NLP 任务取得 SOTA 效果,并在 G

5.4k Jan 03, 2023
An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

IVR-Chatbot Achievements 🏆 Team Uhtred won the Maverick 2.0 Bot-a-thon 2021 organized by AbInbev India. ❓ Problem Statement As we all know that, lot

ARYAMAAN PANDEY 9 Dec 08, 2022