Implementation of Nyström Self-attention, from the paper Nyströmformer

Overview

Nyström Attention

Implementation of Nyström Self-attention, from the paper Nyströmformer.

Yannic Kilcher video

Install

$ pip install nystrom-attention

Usage

import torch
from nystrom_attention import NystromAttention

attn = NystromAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    num_landmarks = 256,    # number of landmarks
    pinv_iterations = 6,    # number of moore-penrose iterations for approximating pinverse. 6 was recommended by the paper
    residual = True         # whether to do an extra residual with the value or not. supposedly faster convergence if turned on
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

attn(x, mask = mask) # (1, 16384, 512)

Nyströmformer, layers of Nyström attention

import torch
from nystrom_attention import Nystromformer

model = Nystromformer(
    dim = 512,
    dim_head = 64,
    heads = 8,
    depth = 6,
    num_landmarks = 256,
    pinv_iterations = 6
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

model(x, mask = mask) # (1, 16384, 512)

You can also import it as Nyströmer if you wish

from nystrom_attention import Nystromer

Citations

@misc{xiong2021nystromformer,
    title   = {Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention},
    author  = {Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Fung and Yin Li and Vikas Singh},
    year    = {2021},
    eprint  = {2102.03902},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Comments
  • Clarification on masking

    Clarification on masking

    Given the dimensionality of the mask argument, (N, T), I'm assuming this is a boolean mask for masking out padding tokens. I created the following function to generate such a mask given an input tensor:

    def _create_pad_mask(self, x: torch.LongTensor) -> torch.BoolTensor:
        mask = torch.ones_like(x).to(torch.bool)
        mask[x==0] = False
        return mask
    

    where 0 is the padding token, setting positions to False so not to attend to them.

    However, I am unsure how to apply a causal mask to the attention layers so to prevent my decoder from accessing future elements. I couldn't see an example of this in the full Nystromformer module. How can I achieve this?

    For context, I am trying to apply the causal mask generated by the following function:

    def _create_causal_mask(self, x: torch.LongTensor) -> torch.FloatTensor:
        size = x.shape[1]
        mask = (torch.triu(torch.ones(size, size)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill_(mask == 0, float('-inf')).masked_fill_(mask==1, 0.0)
        return mask
    

    One way I can think of is to set return_attn to True, apply the mask on the returned attention weights then matmul with the value tensor. But this has a few issues:

    • Having to return v
    • Computing the full attention matrix (I think), defeating the entire point of linear attention
    • Needlessly calculating out only to discard it.

    Is this just a limitation of Nystrom attention? Or am I overlooking something obvious?

    Thanks

    opened by vvvm23 3
  • Possible bug with padding

    Possible bug with padding

    Hey there,

    I was going through the code and I noticed the following, which I found curious.

    In Line 75, you pad the input tensor to a multiple of num_landmarks from the front:

    x = F.pad(x, (0, 0, padding, 0), value = 0)
    

    In Line 144 you trim the extra padding elements you inserted in the output tensor from the end.

    out = out[:, :n]
    

    Am I not getting something, or should we be removing the front elements of out?

    out = out[:, out.size(1) - n:]
    
    opened by georgepar 2
  • Nystrom for Image processing

    Nystrom for Image processing

    thank you for sharing the wondeful code. I am working on image processing and wanted to try your code for the same. I have 2 doubts:

    1. How to select residual_conv_kernel? I could not find any details for the same. also, it is enabled by a flag. When should we enable it and when to disable it?
    2. Is there any guideline for deciding num_landmarks for image processing task?

    Thanks

    opened by paragon1234 1
  • Error when mask is of the same size as that of the input X

    Error when mask is of the same size as that of the input X

    Hi,

    First of all, thank you for putting such an easy to use implementation on GitHub. I'm trying to incorporate the nystrom attention into a legacy codebase, it previously used to provide the input X and the mask (off the same dimensions as X) to a Multi headed Attention Layer.

    When I'm trying to integrate nystrom attention with it, it runs alright without the mask. But, when I pass the mask alongside it, it throws einops rearrange error.

    Sorry, if this is a very basic question, but how would you recommend I deal with handling 3D mask (same dimensions as the size of input) in the codebase.

    Best, VB

    opened by Vaibhavs10 1
  • ViewBackward inplace deprecation warning

    ViewBackward inplace deprecation warning

    Hello again,

    The following code results in a UserWarning in PyTorch 1.8.1.

    In [1]: from nystrom_attention.nystrom_attention import NystromAttention
    
    In [2]: import torch
    
    In [3]: attn = NystromAttention(256)
    
    In [4]: x = torch.randn(1, 8192, 256)
    
    In [5]: attn(x)
    /home/alex/.tmp/nystrom-attention/nystrom_attention/nystrom_attention.py:91: UserWarning: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views are being deprecated and will be forbidden starting from version 1.8. Consider using `unsafe_` version of the function that produced this view or don't modify this view inplace. (Triggered internally at  ../torch/csrc/autograd/variable.cpp:547.)
      q *= self.scale
    Out[5]:
    tensor([[[-0.0449, -0.1726,  0.1409,  ...,  0.0127,  0.2287, -0.2437],
             [-0.1132,  0.3229, -0.1279,  ...,  0.0084, -0.3307, -0.2351],
             [ 0.0361,  0.1013,  0.0828,  ...,  0.1045, -0.1627,  0.0736],
             ...,
             [ 0.0018,  0.1385, -0.1716,  ..., -0.0366, -0.0682,  0.0241],
             [ 0.1497,  0.0149, -0.0020,  ..., -0.0352, -0.1126,  0.0193],
             [ 0.1341,  0.0077,  0.1627,  ..., -0.0363,  0.1057, -0.2071]]],
           grad_fn=<SliceBackward>)
    

    Not a huge issue, but worth mentioning

    opened by vvvm23 1
  • Relative position encoding

    Relative position encoding

    Similar to the question raised for the performer architecture , is it possible to implement a relative position encoding given the methodology in which attention is calculated?

    opened by jdcla 1
  • How can we implement

    How can we implement "batch_first" in Nystrom attention?

    Hi,

    Thanks a lot for implementing the nystromformer attention algorithm! Very nice job!

    I am wondering whether it is feasible to add the "batch_first" option in the nystrom attention algorithm? This allow the algorithm to be integrated in the existing pytorch transformer encoder architecture.

    opened by mark0935git 0
  • x-transformers

    x-transformers

    Hi @lucidrains - just wondering if we can plug in Nystrom Attention with x-transformers?

    I've been plugging in Vision Transformers with X-transformers but am wondering if its possible to have a Nystrom transformer with x-transformer improvements to plug into a ViT?

    opened by robbohua 0
Owner
Phil Wang
Working with Attention. It's all we need.
Phil Wang
TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations

TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations Requirements python 3.6 torch 1.9 numpy 1.19 Quick Start The experimen

DMIRLAB 4 Oct 16, 2022
(JMLR'19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

Python Outlier Detection (PyOD) Deployment & Documentation & Stats Build Status & Coverage & Maintainability & License PyOD is a comprehensive and sca

Yue Zhao 6.6k Jan 03, 2023
Download from Onlyfans.com.

OnlySave: Onlyfans downloader Getting Started: Download the setup executable from the latest release. Install and run. Only works on Windows currently

4 May 30, 2022
Ultra-lightweight human body posture key point CNN model. ModelSize:2.3MB HUAWEI P40 NCNN benchmark: 6ms/img,

Ultralight-SimplePose Support NCNN mobile terminal deployment Based on MXNET(=1.5.1) GLUON(=0.7.0) framework Top-down strategy: The input image is t

223 Dec 27, 2022
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Daft-Exprt - PyTorch Implementation PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis The

Keon Lee 47 Dec 18, 2022
Homepage of paper: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, ICCV 2021.

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [PaddlePaddle Implementation] Homepage of paper: Paint Transformer: Fee

442 Dec 16, 2022
PyTorch implementation of paper "IBRNet: Learning Multi-View Image-Based Rendering", CVPR 2021.

IBRNet: Learning Multi-View Image-Based Rendering PyTorch implementation of paper "IBRNet: Learning Multi-View Image-Based Rendering", CVPR 2021. IBRN

Google Interns 371 Jan 03, 2023
Official Repsoitory for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Mish: Self Regularized Non-Monotonic Activation Function BMVC 2020 (Official Paper) Notes: (Click to expand) A considerably faster version based on CU

Xa9aX ツ 1.2k Dec 29, 2022
Simple node deletion tool for onnx.

snd4onnx Simple node deletion tool for onnx. I only test very miscellaneous and limited patterns as a hobby. There are probably a large number of bugs

Katsuya Hyodo 6 May 15, 2022
sktime companion package for deep learning based on TensorFlow

NOTE: sktime-dl is currently being updated to work correctly with sktime 0.6, and wwill be fully relaunched over the summer. The plan is Refactor and

sktime 573 Jan 05, 2023
High performance distributed framework for training deep learning recommendation models based on PyTorch.

High performance distributed framework for training deep learning recommendation models based on PyTorch.

340 Dec 30, 2022
A package, and script, to perform imaging transcriptomics on a neuroimaging scan.

Imaging Transcriptomics Imaging transcriptomics is a methodology that allows to identify patterns of correlation between gene expression and some prop

Alessio Giacomel 10 Dec 27, 2022
An implementation of the efficient attention module.

Efficient Attention An implementation of the efficient attention module. Description Efficient attention is an attention mechanism that substantially

Shen Zhuoran 194 Dec 15, 2022
A different spin on dataclasses.

dataklasses Dataklasses is a library that allows you to quickly define data classes using Python type hints. Here's an example of how you use it: from

David Beazley 752 Nov 18, 2022
Stochastic gradient descent with model building

Stochastic Model Building (SMB) This repository includes a new fast and robust stochastic optimization algorithm for training deep learning models. Th

S. Ilker Birbil 22 Jan 19, 2022
This repository contains the code for the paper in EMNLP 2021: "HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression".

HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression This repository contains the code for the paper in EM

Chenhe Dong 2 Mar 24, 2022
Extracts data from the database for a graph-node and stores it in parquet files

subgraph-extractor Extracts data from the database for a graph-node and stores it in parquet files Installation For developing, it's recommended to us

Cardstack 0 Jan 10, 2022
Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

NeuralTextures This is repository with inference code for paper "StylePeople: A Generative Model of Fullbody Human Avatars" (CVPR21). This code is for

Visual Understanding Lab @ Samsung AI Center Moscow 18 Oct 06, 2022
Generate Contextual Directory Wordlist For Target Org

PathPermutor Generate Contextual Directory Wordlist For Target Org This script generates contextual wordlist for any target org based on the set of UR

8 Jun 23, 2021
Template repository for managing machine learning research projects built with PyTorch-Lightning

Tutorial Repository with a minimal example for showing how to deploy training across various compute infrastructure.

Sidd Karamcheti 3 Feb 11, 2022