Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Last update: Dec 23, 2022

Overview

Linear Multihead Attention (Linformer)

PyTorch Implementation of reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity), which demonstrates that the self-attention mechanism can be approximated by a low-rank matrix and reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space.

Implementation

This is an efficient implementation followed with the PyTorch official torch.nn.MultiheadAttention class and F.multi_head_attention_forward function.

Three additional argments defined in LinearMultiheadAttention: sequence length, the projected dimention k and the parameter sharing.

seq_len: the sequence length. Default: 100.
proj_k: the projected dimention `k` in Linformer paper. Default: 128.
param_sharing: parameter sharing mode: layerwise, none. headwise is not implemented. Default: none.

Usage

Examples of using torch.nn.MultiheadAttention:

>>> import torch
>>> multihead_attn = torch.nn.MultiheadAttention(embed_dim, num_heads)
>>> attn_output, attn_output_weights = multihead_attn(query, key, value)

Examples of using LinearMultiheadAttention:

>>> from linear_multihead_attention import LinearMultiheadAttention
>>> multihead_attn = LinearMultiheadAttention(embed_dim, num_heads) 
>>> attn_output, attn_output_weights = multihead_attn(query, key, value)

Examples of using LinearMultiheadAttention with the sequence length of 512 and :

>>> from linear_multihead_attention import LinearMultiheadAttention
>>> multihead_attn = LinearMultiheadAttention(embed_dim, num_heads, seq_len=512, proj_k=256, param_sharing='layerwise') 
>>> attn_output, attn_output_weights = multihead_attn(query, key, value)

Linear-DETR: Replace torch.nn.MultiheadAttention in DETR with LinearMultiheadAttention in three lines in models/transformer.py, it saved much more memory and space, hope to have a comparable performance:

from linear_multihead_attention import LinearMultiheadAttention

# TransformerEncoderLayer
# self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, seq_len=w*h, proj_k=64) # where w, h are from `bs, c, h, w = src.shape`


# TransformerDecoderLayer
# self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
# self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

self.self_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=num_queries, proj_k=64) # where num_queries = args.num_queries
self.multihead_attn = LinearMultiheadAttention(d_model, nhead, dropout=dropout, seq_len=w*h, proj_k=64) # where w, h are from `bs, c, h, w = src.shape`

Results on DETR

TODO

Citation

@misc{wang2020linformer,
    title={Linformer: Self-Attention with Linear Complexity},
    author={Sinong Wang and Belinda Z. Li and Madian Khabsa and Han Fang and Hao Ma},
    year={2020},
    eprint={2006.04768},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Related tags

Overview

Implementation

Usage

Results on DETR

Citation

Owner

Kui Xu

Fully featured implementation of Routing Transformer

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

Simple translation demo showcasing our headliner package.

An extensive UI tool built using new data scraped from BBC News

VampiresVsWerewolves - Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition

2021海华AI挑战赛·中文阅读理解·技术组·第三名

:P Some basic stuff I'm gonna use for my upcoming Agile Software Development and Devops

Graph Coloring - Weighted Vertex Coloring Problem

The (extremely) naive sentiment classification function based on NBSVM trained on wisesight_sentiment

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

AMUSE - financial summarization

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

FactSumm: Factual Consistency Scorer for Abstractive Summarization

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

🎐 a python library for doing approximate and phonetic matching of strings.

Transformer - A TensorFlow Implementation of the Transformer: Attention Is All You Need