Code for our ALiBi method for transformer language models.

Overview

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

This repository contains the code and models for our paper Train Short, Test Long. This file explains how to run our experiments on the WikiText-103 dataset. Read the paper here: https://arxiv.org/abs/2108.12409

Attention with Linear Biases (ALiBi) is very simple! Instead of adding position embeddings at the bottom of the transformer stack, we add a linear bias to each attention score, as depicted in the figure above. The slope m is a head-specific hyperparameter and is not learned; it is set at the beginning of training. We have a function that automatically generates these m values given the number of heads in the model.
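
For reference, here is a minimal sketch of such a generator. It follows the recipe in the paper: for n heads, the slopes form a geometric sequence starting at 2^(-8/n), and head counts that are not a power of two interleave the slopes of the two nearest powers of two. Treat this as an illustration, not necessarily the repository's exact code.

import math

def get_slopes(n):
    """Head-specific ALiBi slopes m for n attention heads."""
    def power_of_2_slopes(n):
        start = 2 ** -(2 ** -(math.log2(n) - 3))  # equals 2^(-8/n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n).is_integer():
        return power_of_2_slopes(n)
    # otherwise, interleave the slopes of the two nearest powers of two
    closest = 2 ** math.floor(math.log2(n))
    return (power_of_2_slopes(closest)
            + get_slopes(2 * closest)[0::2][: n - closest])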

ALiBi allows a model to be trained on, for example, 1024-token sequences and then run inference on 2048 (or many more) tokens without any fine-tuning. It also improves performance, even when not extrapolating, in lower-resource language modeling settings.

The implementation is very simple; a short sketch of the resulting bias construction follows the list below.

  1. Remove the position embeddings from the model: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L941
  2. Set up the relative bias matrix, here: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L742
  3. Add the bias matrix to the mask, which is then added in each attention score computation: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L1011
  4. (This might not be necessary in other frameworks.) Move the mask computation to before the layer loop, to make the transformer a tiny bit faster: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L949
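
Putting steps 2 and 3 together: below is a minimal, framework-agnostic sketch of how such a mask could be built. Note that build_alibi_attention_mask is a hypothetical helper name for illustration, not the repository's API; it reuses get_slopes from above. The returned tensor is added to the attention scores before the softmax.

import torch

def build_alibi_attention_mask(num_heads, seq_len):
    """Causal mask with the ALiBi linear biases folded in.
    Returns a (num_heads, seq_len, seq_len) tensor."""
    slopes = torch.tensor(get_slopes(num_heads)).view(num_heads, 1, 1)
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j: how far back query position i looks to key j
    distance = (pos.view(-1, 1) - pos.view(1, -1)).clamp(min=0)
    bias = -slopes * distance  # linear penalty that grows with distance
    # fold in the causal mask: future positions stay at -inf (step 3)
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    return bias + causal

# Usage sketch: compute once before the layer loop (step 4), then add to
# the scores q @ k.transpose(-2, -1) / sqrt(head_dim) in every layer.
mask = build_alibi_attention_mask(num_heads=8, seq_len=512)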

That's it!

Citation:

@misc{press2021train,
      title={Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation}, 
      author={Ofir Press and Noah A. Smith and Mike Lewis},
      year={2021},
      eprint={2108.12409},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

WikiText-103

Requirements and Installation

This repository is a fork of the Fairseq repository and so has the same requirements.

Once you've installed the dependencies, you can install this repository by running:

pip install --editable .

Preparing the data

To download and preprocess the data, run:

cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..


TEXT=examples/language_model/wikitext-103
python preprocess.py \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

Training and Inference

To train a language model with Attention with Linear Biases (ALiBi) on input sequences of 512 tokens, run:

python train.py --task language_modeling \
    data-bin/wikitext-103 \
    --save-dir wt103/ \
    --arch transformer_lm_wiki103 \
    --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 \
    --lr-scheduler cosine --lr-shrink 0.75 \
    --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 \
    --optimizer nag --lr 0.0001 --clip-norm 0.1 \
    --criterion adaptive_loss --seed 1 --fp16 \
    --sample-break-mode none --skip-invalid-size-inputs-valid-test \
    --ddp-backend=no_c10d --no-epoch-checkpoints \
    --tokens-per-sample 512 --max-tokens 9216 --update-freq 1

For input sequences longer than 512 tokens (and up to 2048), just change the --tokens-per-sample value.

To train the model with inputs of 3072 tokens, the --update-freq parameter must be changed to 3 and the --max-tokens parameter must be reduced to 3072.
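
For example, relative to the 512-token command above, only these three flags change for a 3072-token run (all other flags stay the same):

    --tokens-per-sample 3072 --max-tokens 3072 --update-freq 3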

Saved Checkpoints

If you'd like to download our trained models on WikiText-103, they are available here:

Input Length    Link
64              https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L64.pt
128             https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L128.pt
256             https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L256.pt
512             https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L512.pt
1024            https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L1024.pt
1536            https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L1536.pt
2048            https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L2048.pt
3072            https://dl.fbaipublicfiles.com/train_short_test_long/wt103/alibi_wt103_L3072.pt

Rename the file you downloaded to checkpoint_best.pt if you'd like to follow the directions below.
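
For example, if you downloaded the 1024-token checkpoint into the wt103/ directory:

mv wt103/alibi_wt103_L1024.pt wt103/checkpoint_best.pt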

Inference

For non-overlapping evaluation on the validation set, run:

l=1024
fairseq-eval-lm data-bin/wikitext-103/ \
    --path wt103/checkpoint_best.pt \
    --sample-break-mode none --gen-subset valid \
    --max-sentences 1 \
    --model-overrides "{'max_tokens':$l, 'tokens_per_sample':$l, 'max_target_positions':$l}" \
    --tokens-per-sample $l --max-tokens $l --max-target-positions $l \
    --context-window 0

where l is set to the length of input subsequences during validation (l=1024 in the above example).
