Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Last update: Dec 28, 2022

Related tags

Overview

Status: Archive (code is provided as-is, no updates expected)

Update August 2020: For an example repository that achieves state-of-the-art modeling performance on CIFAR-10 using Sparse Transformers, please see https://github.com/openai/distribution_augmentation

Sparse Attention

This repository contains the sparse attention primitives used in Sparse Transformers (see blog and paper). Specifically, it includes the following:

A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).
An implementation of "strided" and "fixed" attention, as in the Sparse Transformers paper.
A simple recompute decorator, which can be adapted for usage with attention.

We hope this code can further accelerate research into sparse attention.

An example Transformer implementation which is close to the version we use internally can be found at https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py.

Overview of kernels

The repository contains fused implementations of the attention operation, which takes in Q, K, V matrices (all of dimensionality batch, time, dim) representing the queries, keys, and values for a sequence. For every query element, a weighted sum of the values is returned, where the weightings are determined by the scaled matrix product of Q and K^T.

The kernels allow specification of block sparsity in the QK^T matrix. This means you define a pattern of 0/1s on a [time/blocksize, time/blocksize] matrix of blocks, and the values where it is 0 will not be computed, and not be included in the softmax calculation. Additionally, one can define "callbacks" on the computed blocks, which will further mask out values in any given block from the softmax (though the matrix product will still be computed for those elements).

Block sizes of {8, 16, 32, 64} are supported, and slight advantages in speed may be seen from using larger blocks.

Prerequisites

For fp32 and blocksize 32, any NVIDIA GPU past Kepler can be used (i.e. compute capability beyond 3.5).

For fp16 and blocksize 8, 16, 32, 64, a GPU with Tensor Cores (e.g. the V100 GPU, compute capability >= 7.0) is required.

The primary dependency is the OpenAI blocksparse package.

With CUDA 10 and tensorflow-gpu, you can install blocksparse with pip install blocksparse.

For other setups, you must install blocksparse from source, and directions can be found in the root of the repository.

Examples

Run the following on a non-V100 GPU:

python attention.py

On a V100 GPU:

python attention.py fp16

General usage

An example can be found at the bottom of attention.py.

full_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)
full_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)

# first step of strided attention
local_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)
local_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)

# second step of strided attention
strided_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)
strided_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)

# # the 'fixed' attention pattern
fixed = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="fixed", local_attn_ctx=128, num_verts=4, vertsize=1, recompute=True)

Referencing this work

If you find this helpful in your work, you can consider citing the following:

@article{child2019sparsetransformer,
  title={Generating Long Sequences with Sparse Transformers},
  author={Child, Rewon and Gray, Scott and Radford, Alec and Sutskever, Ilya},
  journal={URL https://openai.com/blog/sparse-transformers},
  year={2019}
}

Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Related tags

Overview

Sparse Attention

Overview of kernels

Prerequisites

Examples

General usage

Referencing this work

Owner

OpenAI

Automated question generation and question answering from Turkish texts using text-to-text transformers

BookNLP, a natural language processing pipeline for books

EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Global Rhythm Style Transfer Without Text Transcriptions

Transformation spoken text to written text

The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

text to speech toolkit. 好用的中文语音合成工具箱，包含语音编码器、语音合成器、声码器和可视化模块。

CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

Textpipe: clean and extract metadata from text

Prompt tuning toolkit for GPT-2 and GPT-Neo

Nateve compiler developed with python.

A Python script which randomly chooses and prints a file from a directory.

Black for Python docstrings and reStructuredText (rst).

A simple word search made in python

A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.