[CVPR 2022] "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy" by Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang

Last update: Nov 26, 2022

Overview

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Code for the paper: [CVPR 2022] The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy.

Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang.

Vision transformers (ViTs) have gained increasing popularity because they are commonly believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether this potential has been fully unleashed in practice, as learned ViTs often suffer from over-smoothing, which likely yields redundant models.

Recent works have made preliminary attempts to identify and alleviate such redundancy, e.g., by regularizing embedding similarity or re-injecting convolution-like structures. However, a “head-to-toe assessment” of the extent of redundancy in ViTs, and of how much could be gained by thoroughly mitigating it, has been absent from the field.

This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of this, we advocate a principle of diversity for training ViTs and present corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the model to capture more discriminative information.
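To make the notion of redundancy at the patch-embedding level concrete, the sketch below computes the mean absolute pairwise cosine similarity among a set of token embeddings. This is only one plausible way to quantify similarity and is not taken from this repository: the function names `cosine` and `mean_pairwise_similarity` are hypothetical, the toy vectors are made up for illustration, and plain Python lists stand in for the tensors a real training loop would use. A score near 1 means the tokens point in almost the same direction (redundant); a score near 0 means they are close to orthogonal (diverse).

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def mean_pairwise_similarity(tokens):
    """Average |cosine similarity| over all distinct pairs of tokens.

    Hypothetical helper: an illustrative redundancy score, not the
    regularizer implemented in this repository.
    """
    pairs = list(combinations(tokens, 2))
    if not pairs:
        return 0.0
    return sum(abs(cosine(u, v)) for u, v in pairs) / len(pairs)


# Toy "patch embeddings" (made-up values, 3 tokens of dimension 3).
redundant = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 1.0, 1.5]]  # all parallel
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # orthogonal

print("redundant:", mean_pairwise_similarity(redundant))
print("diverse:  ", mean_pairwise_similarity(diverse))
```

A diversity regularizer in this spirit would add a term like this score to the training loss so that lower similarity is rewarded; the same idea could be applied to attention maps or weight rows instead of token embeddings.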

Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting model generalization. For example, our diversified DeiT obtains accuracy boosts of 0.70% to 1.76% on ImageNet with highly reduced similarity.

Prerequisites

Install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models (timm) 0.3.2:

conda install -c pytorch pytorch torchvision
pip install timm==0.3.2
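After installing, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch using only the Python standard library (Python 3.8+); the helper names `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical and not part of this repository. It assumes the packages are registered under the distribution names `torch`, `torchvision`, and `timm`, which is the case for pip installs but may not hold for every conda setup.

```python
import re
from importlib import metadata

# Minimum versions from the prerequisites; timm is pinned exactly.
MINIMUMS = {"torch": "1.7.0", "torchvision": "0.8.1"}
EXACT = {"timm": "0.3.2"}


def version_tuple(text):
    """Turn '1.7.1+cu110' into (1, 7, 1), ignoring local build tags."""
    core = text.split("+")[0]
    numbers = [int(n) for n in re.findall(r"\d+", core)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return {package: (installed_version_or_None, requirement_met)}."""
    report = {}
    for name, required in {**MINIMUMS, **EXACT}.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        if name in EXACT:
            ok = version_tuple(installed) == version_tuple(required)
        else:
            ok = meets_minimum(installed, required)
        report[name] = (installed, ok)
    return report


for package, (installed, ok) in check_environment().items():
    status = "OK" if ok else "check version"
    print(f"{package}: {installed or 'not installed'} ({status})")
```

The check only compares the first three numeric components of each version, so pre-release or nightly builds may need a closer look by hand.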

Training on ImageNet

./script/run_deit_small_diverse.sh [data/imagenet] (Deit-Small-12layers)
./script/run_deit_small_24layer_diverse.sh [data/imagenet] (Deit-Small-24layers)

Citation

TBD

Acknowledgement

https://github.com/facebookresearch/deit

Owner

VITA
