Transformer training code for sequential tasks

Overview

Sequential Transformer

This is a code for training Transformers on sequential tasks such as language modeling. Unlike the original Transformer architecture, it uses caching of previous representations and relative position embeddings to better adapt to sequential tasks. In addition, the code also implements the following projects as described below and in this blog post:

Requirements

You need PyTorch 0.4.1 or above and a cuda-enabled GPU to run the code. If there are multiple GPUs available, the code uses nn.DataParallel to utilize them. For better efficiency, enable distributed training by --distributed argument, which can run on multiple nodes.

Adaptive Attention Span

This code can be used for running experiments in Adaptive Attention Span for Transformers paper. The adaptive span allows a model to learn an optimal context size for each self-attention head from training data. As shown in the below figure, only few heads require long attention span, thus making it possible to increase the context size to 8k tokens without increasing computation time and memory footprint significantly.

An argument --adapt-span enables adaptive span. Otherwise a model will have a fixed attention span. The adaptive-span is implemented as a nn.Module to make it easier to plug it into other models.

Running experiments in the paper

Scripts for running experiments in the paper are located in ./experiments/ directory. For example, a smaller 8-layer version of our model can be trained on a single GPU by running:

bash experiments/enwik8_small.sh

It should reach about 1.3bpc on dev after 150k steps.

For training larger models, multiple GPUs are recommended. In the script files, you can configure the number of available GPUs. Increase the --batch-split argument if you run out of GPU memory (it splits batches into smaller pieces without changing the final result).

We obtained the following results in our experiments:

Experiment #params dev test
enwik8 38M 1.04 bpb 1.02 bpb
enwik8_large 209M 1.00 bpb 0.98 bpb
text8 39M 1.05 bpc 1.11 bpc
text8_large 209M 1.01 bpc 1.07 bpc

A large model training takes about 1.2sec/batch near the end (initially it's faster because the attention spans are smaller) on 8 V100 GPUs. So, for example, the whole enwik8_large training of 170k steps should take less than 2.4 days.

Pre-trained models

You can download pre-trained models by running the get_pretrained.sh script. Then the same scripts in ./experiments/ can be used to evaluate those models. Since the download script puts models in ./checkpoints/, make sure there is no file with the same name. Note that these pre-trained models are obtained by rerunning the training scripts after the code cleanup, so there are small differences from the above results due to the randomness of the training.

All-attention Network

The code also can be used for training All-attention Networks introduced in Augmenting Self-attention with Persistent Memory. If --pers-mem-size argument is set to N, all FF sublayers will be removed from the model and N persistent memory vectors will be added to every self-attention sublayer. The following experiments can be found in ./experiments/ directory.

Experiment #params dev test
enwik8_pers_small.sh 39M 1.03 bpb 1.01 bpb
enwik8_pers.sh 114M 1.00 bpb 0.98 bpb
wiki103_pers.sh 133M 18.8 ppl * 19.7 ppl *

(*This number is slightly better than the paper because it includes end-of-line as a token.)

License

The code is licensed under CC-BY-NC license. See the LICENSE file for more details.

Acknowledgement

We thank Xavier Martinet for helping with cleaning the code. The data preprocessing scripts are downloaded from awd-lstm and transformer-XL repos. The adagrad_with_grad_clip.py is mostly adapted from PyTorch.

Owner
Meta Research
Meta Research
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
Yet Another Compiler Visualizer

yacv: Yet Another Compiler Visualizer yacv is a tool for visualizing various aspects of typical LL(1) and LR parsers. Check out demo on YouTube to see

Ashutosh Sathe 129 Dec 17, 2022
Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Yase Yet Another Sequence Encoder - encode sequences to vector of vectors in python ! Why Yase ? Yase enable you to encode any sequence which can be r

Pierre PACI 12 Aug 19, 2021
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization ๐Ÿ“ฅ Download Datasets ๐Ÿ“ฅ Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization (ACL 2021)

Structured Super Lottery Tickets in BERT This repo contains our codes for the paper "Super Tickets in Pre-Trained Language Models: From Model Compress

Chen Liang 16 Dec 11, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
Source code of paper "BP-Transformer: Modelling Long-Range Context via Binary Partitioning"

BP-Transformer This repo contains the code for our paper BP-Transformer: Modeling Long-Range Context via Binary Partition Zihao Ye, Qipeng Guo, Quan G

Zihao Ye 119 Nov 14, 2022
topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

NLP Space News Topic Modeling Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com Table of Contents Project Idea Data acquisition Primary data sour

edesz 1 Jan 03, 2022
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
1 Jun 28, 2022
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

โ™—CฯƒNฯ™UาฝRฯƒR_MาฝSฦ™EฦšาฝาฝR 1 Oct 31, 2021
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
A Telegram bot to add notes to Flomo.

flomo bot ไฝฟ็”จ Telegram ๆœบๅ™จไบบๅ‘้€็ฌ”่ฎฐๅˆฐไฝ ็š„ Flomo. ไฝ ้œ€่ฆๆœ‰ไธ€ๅฐๅฏ่ฎฟ้—ฎ Telegram ็š„ๆœๅŠกๅ™จใ€‚ Steps @BotFather ๆ–ฐๅปบๆœบๅ™จไบบ๏ผŒ่Žทๅ– token Flomo ๅฎ˜็ฝ‘่Žทๅ– API๏ผŒ้“พๆŽฅ https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
kochat

Kochat ์ฑ—๋ด‡ ๋นŒ๋”๋Š” ์„ฑ์— ์•ˆ์ฐจ๊ณ , ์ž์‹ ๋งŒ์˜ ๋”ฅ๋Ÿฌ๋‹ ์ฑ—๋ด‡ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋งŒ๋“œ์‹œ๊ณ  ์‹ถ์œผ์‹ ๊ฐ€์š”? Kochat์„ ์ด์šฉํ•˜๋ฉด ์†์‰ฝ๊ฒŒ ์ž์‹ ๋งŒ์˜ ๋”ฅ๋Ÿฌ๋‹ ์ฑ—๋ด‡ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋นŒ๋“œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. # 1. ๋ฐ์ดํ„ฐ์…‹ ๊ฐ์ฒด ์ƒ์„ฑ dataset = Dataset(ood=True) #

1 Oct 25, 2021
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Wasi Ahmad 138 Dec 30, 2022
Beyond Paragraphs: NLP for Long Sequences

Beyond Paragraphs: NLP for Long Sequences

AI2 338 Dec 02, 2022
ChatterBot is a machine learning, conversational dialog engine for creating chat bots

ChatterBot ChatterBot is a machine-learning based conversational dialog engine build in Python which makes it possible to generate responses based on

Gunther Cox 12.8k Jan 03, 2023