Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Last update: Jan 01, 2023

Overview

Less is More: Pay Less Attention in Vision Transformers

By Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu and Jianfei Cai.

In our paper, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences. LIT uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages, while applying self-attention modules to capture longer dependencies in deeper layers. We further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner.
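The stage layout described above can be sketched in a few lines of PyTorch. This is a toy illustration, not the repository's model: the names `MLPBlock`, `AttentionBlock`, `build_toy_lit`, and the `attention_from_stage` parameter are hypothetical, the choice of which stages use attention is an assumption, and the deformable token merging module is left out entirely.

```python
import torch
from torch import nn


class MLPBlock(nn.Module):
    """Pre-norm residual block with a token-wise MLP and no attention."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class AttentionBlock(nn.Module):
    """Pre-norm residual self-attention followed by an MLPBlock."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads)
        self.mlp = MLPBlock(dim, mlp_ratio)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.MultiheadAttention expects (tokens, batch, dim) by default.
        h = self.norm(x).transpose(0, 1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return self.mlp(x + attn_out.transpose(0, 1))


def build_toy_lit(dim=64, depths=(2, 2, 2, 2), attention_from_stage=2):
    """Stack MLP-only blocks in early stages, attention blocks afterwards."""
    blocks = []
    for stage, depth in enumerate(depths):
        block_cls = AttentionBlock if stage >= attention_from_stage else MLPBlock
        blocks += [block_cls(dim) for _ in range(depth)]
    return nn.Sequential(*blocks)


tokens = torch.randn(2, 196, 64)  # (batch, patches, channels)
model = build_toy_lit()
out = model(tokens)
print(out.shape)  # torch.Size([2, 196, 64])
```

The sketch keeps the token count and channel width fixed across stages to stay short; a hierarchical model would also reduce the number of tokens between stages.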

If you use this code for a paper, please cite:

@article{pan2021less,
  title={Less is More: Pay Less Attention in Vision Transformers},
  author={Pan, Zizheng and Zhuang, Bohan and He, Haoyu and Liu, Jing and Cai, Jianfei},
  journal={arXiv preprint arXiv:2105.14217},
  year={2021}
}

Usage

First, clone this repository.

git clone https://github.com/MonashAI/LIT

Next, create a conda virtual environment.

# Make sure you have an NVIDIA GPU.
cd LIT/
bash setup_env.sh [conda_install_path] [env_name]

# For example
bash setup_env.sh /home/anaconda3 lit

Note: We use PyTorch 1.7.1 with CUDA 10.1 for all experiments. setup_env.sh lists all dependencies used in our experiments. You may want to edit this file to install a different version of PyTorch or any other packages.
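After the environment is created, it can help to confirm which PyTorch and CUDA versions were actually installed and compare them with the versions named in the note. A minimal sketch follows; the helper name `describe_torch_env` is hypothetical and not part of this repository.

```python
import importlib


def describe_torch_env() -> dict:
    """Report the installed PyTorch version and the CUDA version it was built with."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch": None, "cuda": None, "gpu_available": False}
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # None for CPU-only builds
        "gpu_available": torch.cuda.is_available(),
    }


env = describe_torch_env()
print(env)
```

If `gpu_available` is false, the driver or the CUDA build of PyTorch likely does not match the machine.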

Data Preparation

Download the ImageNet 2012 dataset from here, and prepare the dataset based on this script. The file structure should look like:

imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
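Before training, the layout above can be checked with a short script. This is a sketch under stated assumptions: the function names and the accepted image suffixes are hypothetical, and it assumes train and val should contain the same class folders, as in the tree shown.

```python
from pathlib import Path

# Assumption: images use one of these suffixes.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_split(split_dir: Path) -> dict:
    """Return {class_name: image_count} for one split directory."""
    return {
        class_dir.name: sum(
            1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


def check_imagenet_layout(root: str) -> dict:
    """Check that root/train and root/val exist and hold the same class folders."""
    root_path = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root_path / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        summary[split] = summarize_split(split_dir)
    if set(summary["train"]) != set(summary["val"]):
        raise ValueError("train and val contain different class folders")
    return summary


# Example call, assuming the dataset lives in ./imagenet:
# summary = check_imagenet_layout("imagenet")
# print(len(summary["train"]), "classes found")
```

This one-folder-per-class structure is also the layout that torchvision's `ImageFolder` dataset reads.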

Model Zoo

We provide baseline LIT models pretrained on ImageNet 2012.

Name	Params (M)	FLOPs (G)	Top-1 Acc. (%)	Model	Log
LIT-Ti	19	3.6	81.1	google drive/github	log
LIT-S	27	4.1	81.5	google drive/github	log
LIT-M	48	8.6	83.0	google drive/github	log
LIT-B	86	15.0	83.4	google drive/github	log
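To pick a checkpoint for a given compute budget, the table can be treated as data. The sketch below copies the numbers from the table; the names `ZooEntry`, `MODEL_ZOO`, and `best_under_flops` are hypothetical helpers, not part of this repository.

```python
from typing import NamedTuple, Optional


class ZooEntry(NamedTuple):
    name: str
    params_m: float  # parameters, in millions
    flops_g: float   # FLOPs, in billions
    top1: float      # ImageNet top-1 accuracy, in percent


MODEL_ZOO = [
    ZooEntry("LIT-Ti", 19, 3.6, 81.1),
    ZooEntry("LIT-S", 27, 4.1, 81.5),
    ZooEntry("LIT-M", 48, 8.6, 83.0),
    ZooEntry("LIT-B", 86, 15.0, 83.4),
]


def best_under_flops(budget_g: float) -> Optional[ZooEntry]:
    """Return the most accurate model whose FLOPs fit the budget, if any."""
    candidates = [m for m in MODEL_ZOO if m.flops_g <= budget_g]
    return max(candidates, key=lambda m: m.top1, default=None)


choice = best_under_flops(5.0)
print(choice.name)  # LIT-S
```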

Training and Evaluation

In our implementation, we have different training strategies for LIT-Ti and other LIT models. Therefore, we provide two codebases.

For LIT-Ti, please refer to code_for_lit_ti.

For LIT-S, LIT-M, and LIT-B, please refer to code_for_lit_s_m_b.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

This repository adopts code from DeiT, PVT, and Swin. We thank the authors for open-sourcing their code.

Comments

Problem about DCN

I am having trouble compiling DCN, so I cannot use the DCN.deform_conv2d_forward and DCN.deform_conv2d_backward functions. Can 'deform_conv2d_naive' be used instead? Or is there another way to accomplish the DCN.deform_conv2d part?

opened by jarygrace

How to use LITNet as a backbone for object detection

Could you please release the code for using LITNet as a backbone of RetinaNet, as mentioned in your paper? I wonder how to use it as a backbone for object detection...

opened by sunhuisunhui

Releases (v2.1)

v2.1 (Mar 10, 2022)

v2.0 (Jun 26, 2021)

v1.0 (Jun 8, 2021)
