Implementation of Nyström Self-attention, from the paper Nyströmformer

Last update: Jan 02, 2023

Overview

Nyström Attention

Implementation of Nyström Self-attention, from the paper Nyströmformer.

Install

$ pip install nystrom-attention

Usage

import torch
from nystrom_attention import NystromAttention

attn = NystromAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    num_landmarks = 256,    # number of landmarks
    pinv_iterations = 6,    # number of moore-penrose iterations for approximating pinverse. 6 was recommended by the paper
    residual = True         # whether to do an extra residual with the value or not. supposedly faster convergence if turned on
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

attn(x, mask = mask) # (1, 16384, 512)
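The `pinv_iterations` argument above refers to an iterative approximation of the Moore-Penrose pseudoinverse, as described in the Nyströmformer paper. The following is a minimal sketch of that iteration for a single square matrix. The name `iterative_pinv` is illustrative, and the library's own implementation may differ in details such as batching.

```python
import torch

def iterative_pinv(a: torch.Tensor, iters: int = 6) -> torch.Tensor:
    """Approximate the pseudoinverse of a square matrix `a` (illustrative helper)."""
    # Initial guess Z_0 = A^T / (||A||_1 * ||A||_inf), as in the paper.
    abs_a = a.abs()
    norm_1 = abs_a.sum(dim=-2).max()    # largest column sum
    norm_inf = abs_a.sum(dim=-1).max()  # largest row sum
    z = a.transpose(-1, -2) / (norm_1 * norm_inf)

    eye = torch.eye(a.shape[-1], dtype=a.dtype, device=a.device)
    for _ in range(iters):
        az = a @ z
        # Z_{j+1} = 1/4 * Z_j (13 I - A Z_j (15 I - A Z_j (7 I - A Z_j)))
        z = 0.25 * z @ (13 * eye - az @ (15 * eye - az @ (7 * eye - az)))
    return z

# A small, well-conditioned row-stochastic matrix, similar in spirit to a softmax kernel.
a = torch.softmax(4 * torch.eye(8), dim=-1)
z = iterative_pinv(a, iters=6)
residual = (a @ z - torch.eye(8)).abs().max()  # how far A @ Z is from the identity
```

Each step shrinks the error `I - A Z` roughly cubically once it is below 1, which is why a handful of iterations can suffice for well-conditioned inputs.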

Nyströmformer, layers of Nyström attention

import torch
from nystrom_attention import Nystromformer

model = Nystromformer(
    dim = 512,
    dim_head = 64,
    heads = 8,
    depth = 6,
    num_landmarks = 256,
    pinv_iterations = 6
)

x = torch.randn(1, 16384, 512)
mask = torch.ones(1, 16384).bool()

model(x, mask = mask) # (1, 16384, 512)

You can also import it as Nyströmer if you wish

from nystrom_attention import Nystromer

Citations

@misc{xiong2021nystromformer,
    title   = {Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention},
    author  = {Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Fung and Yin Li and Vikas Singh},
    year    = {2021},
    eprint  = {2102.03902},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Comments

Clarification on masking
Given the dimensionality of the mask argument, (N, T), I'm assuming this is a boolean mask for masking out padding tokens. I created the following function to generate such a mask given an input tensor:

def _create_pad_mask(self, x: torch.LongTensor) -> torch.BoolTensor:
    mask = torch.ones_like(x).to(torch.bool)
    mask[x == 0] = False
    return mask

where 0 is the padding token; padding positions are set to False so that they are not attended to.
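As a small standalone illustration of the padding mask described above, the sketch below builds the same kind of boolean mask with a comparison. The function name `create_pad_mask`, the constant `PAD_ID`, and the token values are made up for the example. It assumes, as the comment does, that 0 is the padding token.

```python
import torch

PAD_ID = 0  # assumed padding token id, as in the comment above

def create_pad_mask(x: torch.Tensor, pad_id: int = PAD_ID) -> torch.Tensor:
    # True where a real token is present, False at padding positions.
    return x != pad_id

tokens = torch.tensor([[5, 3, 9, 0, 0],
                       [7, 2, 0, 0, 0]])
mask = create_pad_mask(tokens)
# mask has shape (N, T) = (2, 5), the same layout as the `mask` argument in the usage example.
```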

However, I am unsure how to apply a causal mask to the attention layers so as to prevent my decoder from accessing future elements. I couldn't find an example of this in the full Nystromformer module. How can I achieve this?

For context, I am trying to apply the causal mask generated by the following function:

def _create_causal_mask(self, x: torch.LongTensor) -> torch.FloatTensor:
    size = x.shape[1]
    mask = (torch.triu(torch.ones(size, size)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill_(mask == 0, float('-inf')).masked_fill_(mask == 1, 0.0)
    return mask
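To make the shape difference concrete, here is a compact standalone sketch that produces the same values as the function above for a given sequence length. The name `create_causal_mask` is illustrative. Note that this is a (T, T) additive float mask, whereas the `mask` argument shown in the usage section is a (N, T) boolean mask.

```python
import torch

def create_causal_mask(size: int) -> torch.Tensor:
    # -inf strictly above the diagonal (future positions), 0.0 elsewhere.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

causal = create_causal_mask(4)
# Row i has 0.0 in columns 0..i and -inf in the columns after i.
```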

One way I can think of is to set return_attn to True, apply the mask on the returned attention weights then matmul with the value tensor. But this has a few issues:

Having to return v

Computing the full attention matrix (I think), defeating the entire point of linear attention

Needlessly calculating out only to discard it.

Is this just a limitation of Nystrom attention? Or am I overlooking something obvious?

Thanks
opened by vvvm23 3

Possible bug with padding
Hey there,

I was going through the code and I noticed the following, which I found curious.

In Line 75, you pad the input tensor to a multiple of num_landmarks from the front:

x = F.pad(x, (0, 0, padding, 0), value = 0)

In Line 144 you trim the extra padding elements you inserted in the output tensor from the end.

out = out[:, :n]

Am I not getting something, or should we be removing the front elements of out?

out = out[:, out.size(1) - n:]
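The effect described in this report can be reproduced on a toy tensor. The sketch below only illustrates the two tensor operations in isolation, with made-up sizes. It does not check what the library does between padding and trimming.

```python
import torch
import torch.nn.functional as F

n, num_landmarks = 5, 4
x = torch.arange(1, n + 1, dtype=torch.float32).view(1, n, 1)  # tokens 1..5

# Pad the sequence dimension up to a multiple of num_landmarks, at the front.
padding = (num_landmarks - n % num_landmarks) % num_landmarks  # 3 extra positions
padded = F.pad(x, (0, 0, padding, 0), value=0)                 # shape (1, 8, 1)

front_trim = padded[:, :n]   # keeps the padding and drops the last real tokens
back_trim = padded[:, -n:]   # recovers the original sequence
```

Here `padded[:, -n:]` selects the same positions as `out[:, out.size(1) - n:]` in the report.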
opened by georgepar 2

Nystrom for Image processing

Thank you for sharing the wonderful code. I am working on image processing and wanted to try your code for it. I have 2 questions:

How should residual_conv_kernel be chosen? I could not find any details on it. Also, it is enabled by a flag: when should it be enabled and when disabled?

Is there any guideline for choosing num_landmarks for an image processing task?

Thanks
opened by paragon1234 1

Error when mask is of the same size as that of the input X

Hi,

First of all, thank you for putting such an easy-to-use implementation on GitHub. I'm trying to incorporate Nyström attention into a legacy codebase, which previously passed the input X and a mask (of the same dimensions as X) to a multi-headed attention layer.

When I integrate Nyström attention, it runs fine without the mask. But when I pass the mask as well, it throws an einops rearrange error.

Sorry if this is a very basic question, but how would you recommend handling a 3D mask (same dimensions as the input) in the codebase?
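One possible way to bridge the two layouts is to collapse the feature dimension of the 3D mask into the (N, T) boolean shape shown in the usage section. This is a sketch under a stated assumption: a position counts as valid if any of its features is unmasked. The tensor names and sizes are hypothetical, and whether this matches the legacy mask's meaning depends on that codebase.

```python
import torch

n, t, d = 2, 4, 3
x = torch.randn(n, t, d)

# Hypothetical legacy mask with the same shape as x; the last sequence
# position of the second sample is fully masked (padding).
mask_3d = torch.ones(n, t, d, dtype=torch.bool)
mask_3d[1, -1, :] = False

# Assumption: a position is valid if any of its features is unmasked.
mask_2d = mask_3d.any(dim=-1)  # shape (N, T)
```

If the legacy mask instead marks a position as padding when any feature is masked, `mask_3d.all(dim=-1)` would be the corresponding reduction.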

Best, VB

opened by Vaibhavs10 1

ViewBackward inplace deprecation warning

Hello again,

The following code results in a UserWarning in PyTorch 1.8.1.

In [1]: from nystrom_attention.nystrom_attention import NystromAttention

In [2]: import torch

In [3]: attn = NystromAttention(256)

In [4]: x = torch.randn(1, 8192, 256)

In [5]: attn(x)
/home/alex/.tmp/nystrom-attention/nystrom_attention/nystrom_attention.py:91: UserWarning: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views are being deprecated and will be forbidden starting from version 1.8. Consider using `unsafe_` version of the function that produced this view or don't modify this view inplace. (Triggered internally at  ../torch/csrc/autograd/variable.cpp:547.)
  q *= self.scale
Out[5]:
tensor([[[-0.0449, -0.1726,  0.1409,  ...,  0.0127,  0.2287, -0.2437],
         [-0.1132,  0.3229, -0.1279,  ...,  0.0084, -0.3307, -0.2351],
         [ 0.0361,  0.1013,  0.0828,  ...,  0.1045, -0.1627,  0.0736],
         ...,
         [ 0.0018,  0.1385, -0.1716,  ..., -0.0366, -0.0682,  0.0241],
         [ 0.1497,  0.0149, -0.0020,  ..., -0.0352, -0.1126,  0.0193],
         [ 0.1341,  0.0077,  0.1627,  ..., -0.0363,  0.1057, -0.2071]]],
       grad_fn=<SliceBackward>)

Not a huge issue, but worth mentioning
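The warning text points at an in-place operation on a view returned by a function that produces multiple views. A common way to avoid this class of warning is to scale out of place. The sketch below assumes that q, k and v come from chunking a single projection. The layer name `to_qkv` and the sizes are illustrative, not taken from the library.

```python
import torch
from torch import nn

dim, heads, dim_head = 256, 8, 64
to_qkv = nn.Linear(dim, heads * dim_head * 3, bias=False)  # assumed projection layout
scale = dim_head ** -0.5

x = torch.randn(1, 32, dim)
q, k, v = to_qkv(x).chunk(3, dim=-1)  # chunk returns several views of one tensor

# Out-of-place multiply creates a new tensor instead of modifying the view.
q = q * scale
```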

opened by vvvm23 1

Relative position encoding

Similar to the question raised for the Performer architecture, is it possible to implement relative position encoding, given the way attention is calculated here?

opened by jdcla 1

How can we implement "batch_first" in Nystrom attention?

Hi,

Thanks a lot for implementing the nystromformer attention algorithm! Very nice job!

I am wondering whether it is feasible to add a "batch_first" option to the Nyström attention module? This would allow it to be integrated into the existing PyTorch transformer encoder architecture.
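Until such an option exists, one workaround is a thin adapter that transposes the input and output. The sketch below assumes the surrounding code uses sequence-first tensors of shape (T, N, D) and that the wrapped module expects (N, T, D), as in the usage examples. `SequenceFirstAdapter` is a hypothetical name, and `nn.Linear` merely stands in for an attention layer.

```python
import torch
from torch import nn

class SequenceFirstAdapter(nn.Module):
    """Hypothetical adapter: accepts (T, N, D) input, calls a batch-first module."""

    def __init__(self, batch_first_module: nn.Module):
        super().__init__()
        self.inner = batch_first_module

    def forward(self, x: torch.Tensor, **kwargs) -> torch.Tensor:
        out = self.inner(x.transpose(0, 1), **kwargs)  # (T, N, D) -> (N, T, D)
        return out.transpose(0, 1)                     # back to (T, N, D)

# nn.Linear stands in for a batch-first attention layer here.
adapter = SequenceFirstAdapter(nn.Linear(16, 16))
y = adapter(torch.randn(10, 2, 16))  # T=10, N=2, D=16
```

Keyword arguments such as a (N, T) padding mask are passed through unchanged, since that mask is already batch-first.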

opened by mark0935git 0

x-transformers

Hi @lucidrains - just wondering if we can plug in Nystrom Attention with x-transformers?

I've been plugging Vision Transformers into x-transformers, but am wondering if it's possible to have a Nyström transformer with the x-transformers improvements to plug into a ViT?

opened by robbohua 0

Releases(0.0.11)

0.0.11 (Apr 6, 2021)
0.0.10 (Mar 18, 2021)
0.0.9 (Feb 24, 2021)
0.0.8 (Feb 18, 2021)
0.0.7 (Feb 14, 2021)
0.0.6 (Feb 12, 2021)
0.0.5 (Feb 12, 2021)
0.0.4 (Feb 12, 2021)
0.0.3 (Feb 12, 2021)
0.0.2 (Feb 12, 2021)
0.0.1 (Feb 11, 2021)

Owner

Phil Wang

Working with Attention. It's all we need.
