DeepViT

This repo is the official implementation of "DeepViT: Towards Deeper Vision Transformer". It is based on the timm library (https://github.com/rwightman/pytorch-image-models) by Ross Wightman.

Introduction

DeepViT was first described in an arXiv preprint (arXiv:2103.11886), which reports an attention collapse phenomenon when training deep vision transformers. From the abstract:

In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper. More specifically, we empirically observe that this scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar, and even nearly identical after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This shows that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and prevents the model from obtaining the expected performance gain.

Based on this observation, we propose a simple yet effective method, named Re-attention, that re-generates the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements through minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy on ImageNet can be improved by 1.6%.
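To make the idea of re-generating attention maps concrete, here is a minimal NumPy sketch of the head-mixing step. It assumes that Re-attention mixes the per-head softmax attention maps with a learnable heads-by-heads matrix (called `theta` here) before they are applied to the values. The function names, the tensor shapes, and the omission of the normalization step that follows the mixing are simplifications for illustration, not the repo's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Sketch of head-mixed attention (assumed form, normalization omitted).

    q, k, v: arrays of shape (heads, tokens, head_dim)
    theta:   array of shape (heads, heads), learnable in a real model
    Returns (output, mixed_attention_maps).
    """
    head_dim = q.shape[-1]
    # Standard per-head attention maps, shape (heads, tokens, tokens).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    # Mix maps across heads: mixed[g] = sum_h theta[h, g] * attn[h].
    mixed = np.einsum("hg,hnm->gnm", theta, attn)
    return mixed @ v, mixed

# With theta set to the identity, the sketch reduces to ordinary
# multi-head self-attention.
rng = np.random.default_rng(0)
H, N, D = 4, 5, 8  # hypothetical sizes: heads, tokens, head dimension
q, k, v = (rng.standard_normal((H, N, D)) for _ in range(3))
out, maps = re_attention(q, k, v, np.eye(H))
print(out.shape, maps.shape)
```

Because the only extra work is one small heads-by-heads mixing per layer, the added cost is small compared with the attention computation itself, which is consistent with the "negligible computation and memory cost" claim above.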

DeepViT Models

| Model | Re-attention | Top-1 Acc (%) | #Params | #Similar Blocks | Checkpoint |
|---|---|---|---|---|---|
| ViT-16 | NA | 78.88 | 24.5M | 5 | coming soon |
| DeepViT-16 | FC | 79.10 | 24.5M | 0 | coming soon |
| ViT-24 | NA | 79.35 | 36.3M | 11 | coming soon |
| DeepViT-24 | FC | 79.99 | 36.3M | 0 | coming soon |
| ViT-32 | NA | 79.27 | 48.1M | 15 | coming soon |
| DeepViT_t-32 | FC | 80.90 | 48.1M | 0 | coming soon |
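This README does not define how the "#Similar Blocks" column is computed. As a rough illustration only, the sketch below assumes that similarity between two blocks is measured as the mean cosine similarity between corresponding rows of their attention maps. The function name is hypothetical, and any threshold for calling two blocks "similar" would be a further assumption.

```python
import numpy as np

def attention_similarity(a, b):
    """Mean cosine similarity between corresponding rows of two
    attention maps, each of shape (tokens, tokens).

    This is an assumed metric for illustration, not the repo's code.
    """
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())

# Toy example: a map compared with itself, and with a map that
# attends to entirely different tokens.
same = np.eye(3)
shifted = np.roll(np.eye(3), 1, axis=1)
print(attention_similarity(same, same))
print(attention_similarity(same, shifted))
```

Under this assumed metric, a value near 1 between consecutive blocks would correspond to the attention collapse described in the introduction.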

Citing DeepViT

@article{zhou2021deepvit,
  title={DeepViT: Towards Deeper Vision Transformer},
  author={Zhou, Daquan and Kang, Bingyi and Jin, Xiaojie and Yang, Linjie and Lian, Xiaochen and Hou, Qibin and Feng, Jiashi},
  journal={arXiv preprint arXiv:2103.11886},
  year={2021}
}
