PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Last update: Jul 27, 2022

Overview

ALiBi

Quickstart

Clone this repository.

git clone https://github.com/jaketae/alibi.git

Navigate to the cloned directory. You can then use the bare-bones ALiBi decoder as follows:

>>> import torch
>>> from alibi import ALiBiConfig, ALiBiTransformer
>>> config = ALiBiConfig()
>>> model = ALiBiTransformer(config)
>>> x = torch.randn(8, 100, 256)
>>> model(x).shape
torch.Size([8, 100, 256])

By default, the model comes with the following parameters:

ALiBiConfig(
    num_layers=6, 
    d_model=256, 
    num_heads=8, 
    max_len=256, 
    dropout=0.1, 
    causal=True, 
    expansion_factor=1
)

To use an encoder instead of a decoder, set causal=False in the config.

Abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
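The bias described in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from this repository and the code here may compute the bias differently. The helper names alibi_slopes and alibi_bias are hypothetical. The sketch assumes the number of heads is a power of two and uses the geometric slope sequence starting at 2^(-8/num_heads) from the paper.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumption: num_heads is a power of two.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch assumes num_heads is a power of two")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias for one head: -slope * (i - j) for keys j <= query i.

    Future positions (j > i) are set to -inf so that softmax gives them
    zero weight; the matrix is added to the query-key scores before softmax.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # one slope per attention head
bias = alibi_bias(4, slopes[0])   # 4 x 4 bias matrix for the first head
```

Because the penalty depends only on the distance i - j, the same function can produce a bias matrix for any seq_len, including lengths longer than those seen during training.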

Citation

@misc{press2021train,
	title        = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
	author       = {Ofir Press and Noah A. Smith and Mike Lewis},
	year         = 2021,
	eprint       = {2108.12409},
	archiveprefix = {arXiv},
	primaryclass = {cs.CL}
}

Owner

Jake Tae
