So-ViT: Mind Visual Tokens for Vision Transformer

Related tags

Deep LearningSo-ViT
Overview

So-ViT: Mind Visual Tokens for Vision Transformer

      

Introduction

This repository contains the source code under PyTorch framework and models trained on ImageNet-1K dataset for the following paper:

@articles{So-ViT,
    author = {Jiangtao Xie, Ruiren Zeng, Qilong Wang, Ziqi Zhou, Peihua Li},
    title = {So-ViT: Mind Visual Tokens for Vision Transformer},
    booktitle = {arXiv:2104.10935},
    year = {2021}
}

The Vision Transformer (ViT) heavily depends on pretraining using ultra large-scale datasets (e.g. ImageNet-21K or JFT-300M) to achieve high performance, while significantly underperforming on ImageNet-1K if trained from scratch. We propose a novel So-ViT model toward addressing this problem, by carefully considering the role of visual tokens.

Above all, for classification head, the ViT only exploits class token while entirely neglecting rich semantic information inherent in high-level visual tokens. Therefore, we propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification. Meanwhile, a fast singular value power normalization is proposed for improving the second-order pooling.

Second, the ViT employs the naïve method of one linear projection of fixed-size image patches for visual token embedding, lacking the ability to model translation equivariance and locality. To alleviate this problem, we develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.

Classification results

Classification results (single crop 224x224, %) on ImageNet-1K validation set

Network Top-1 Accuracy Pre-trained models
Paper reported Upgrade GoogleDrive BaiduCloud
So-ViT-7 76.2 76.8 Coming soon Coming soon
So-ViT-10 77.9 78.7 Coming soon Coming soon
So-ViT-14 81.8 82.3 Coming soon Coming soon
So-ViT-19 82.4 82.8 Coming soon Coming soon

Installation and Usage

  1. Install PyTorch (>=1.6.0)
  2. Install timm (==0.3.4)
  3. pip install thop
  4. type git clone https://github.com/jiangtaoxie/So-ViT
  5. prepare the dataset as follows
.
├── train
│   ├── class1
│   │   ├── class1_001.jpg
│   │   ├── class1_002.jpg
|   |   └── ...
│   ├── class2
│   ├── class3
│   ├── ...
│   ├── ...
│   └── classN
└── val
    ├── class1
    │   ├── class1_001.jpg
    │   ├── class1_002.jpg
    |   └── ...
    ├── class2
    ├── class3
    ├── ...
    ├── ...
    └── classN

for training from scracth

sh model_name.sh  # model_name = {So_vit_7/10/14/19}

Acknowledgment

pytorch: https://github.com/pytorch/pytorch

timm: https://github.com/rwightman/pytorch-image-models

T2T-ViT: https://github.com/yitu-opensource/T2T-ViT

Contact

If you have any questions or suggestions, please contact me

[email protected]

Owner
Jiangtao Xie
Jiangtao Xie
LAnguage Model Analysis

LAMA: LAnguage Model Analysis LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models. The dataset

Meta Research 960 Jan 08, 2023
Lightweight Python library for adding real-time object tracking to any detector.

Norfair is a customizable lightweight Python library for real-time 2D object tracking. Using Norfair, you can add tracking capabilities to any detecto

Tryolabs 1.7k Jan 05, 2023
View model summaries in PyTorch!

torchinfo (formerly torch-summary) Torchinfo provides information complementary to what is provided by print(your_model) in PyTorch, similar to Tensor

Tyler Yep 1.5k Jan 05, 2023
This code is for our paper "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers"

ICCV Workshop 2021 VTGAN This code is for our paper "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers"

Sharif Amit Kamran 25 Dec 08, 2022
Weakly Supervised End-to-End Learning (NeurIPS 2021)

WeaSEL: Weakly Supervised End-to-end Learning This is a PyTorch-Lightning-based framework, based on our End-to-End Weak Supervision paper (NeurIPS 202

Auton Lab, Carnegie Mellon University 131 Jan 06, 2023
Python scripts form performing stereo depth estimation using the HITNET model in ONNX.

ONNX-HITNET-Stereo-Depth-estimation Python scripts form performing stereo depth estimation using the HITNET model in ONNX. Stereo depth estimation on

Ibai Gorordo 30 Nov 08, 2022
Semantically Contrastive Learning for Low-light Image Enhancement

Semantically Contrastive Learning for Low-light Image Enhancement Here, we propose an effective semantically contrastive learning paradigm for Low-lig

48 Dec 16, 2022
Website which uses Deep Learning to generate horror stories.

Creepypasta - Text Generator Website which uses Deep Learning to generate horror stories. View Demo · View Website Repo · Report Bug · Request Feature

Dhairya Sharma 5 Oct 14, 2022
A voice recognition assistant similar to amazon alexa, siri and google assistant.

kenyan-Siri Build an Artificial Assistant Full tutorial (video) To watch the tutorial, click on the image below Installation For windows users (run th

Alison Parker 3 Aug 19, 2022
Image Deblurring using Generative Adversarial Networks

DeblurGAN arXiv Paper Version Pytorch implementation of the paper DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. Our netwo

Orest Kupyn 2.2k Jan 01, 2023
Perspective: Julia for Biologists

Perspective: Julia for Biologists 1. Examples Speed: Example 1 - Single cell data and network inference Domain: Single cell data Methodology: Network

Elisabeth Roesch 55 Dec 02, 2022
[CVPR-2021] UnrealPerson: An adaptive pipeline for costless person re-identification

UnrealPerson: An Adaptive Pipeline for Costless Person Re-identification In our paper (arxiv), we propose a novel pipeline, UnrealPerson, that decreas

ZhangTianyu 70 Oct 10, 2022
Consensus Learning from Heterogeneous Objectives for One-Class Collaborative Filtering

Consensus Learning from Heterogeneous Objectives for One-Class Collaborative Filtering This repository provides the source code of "Consensus Learning

SeongKu-Kang 6 Apr 29, 2022
Machine learning library for fast and efficient Gaussian mixture models

This repository contains code which implements the Stochastic Gaussian Mixture Model (S-GMM) for event-based datasets Dependencies CMake Premake4 Blaz

Omar Oubari 1 Dec 19, 2022
Algorithmic encoding of protected characteristics and its implications on disparities across subgroups

Algorithmic encoding of protected characteristics and its implications on disparities across subgroups This repository contains the code for the paper

Team MIRA - BioMedIA 15 Oct 24, 2022
Cave Generation using metaballs in Blender. Originally created by sdfgeoff, Edited by Myself (Archie Jaskowicz).

Blender-Cave-Generation Cave Generation using metaballs in Blender. Originally created by sdfgeoff, Edited by Myself (Archie Jaskowicz). Installation

2 Dec 28, 2022
MLPs for Vision and Langauge Modeling (Coming Soon)

MLP Architectures for Vision-and-Language Modeling: An Empirical Study MLP Architectures for Vision-and-Language Modeling: An Empirical Study (Code wi

Yixin Nie 27 May 09, 2022
Deep Video Matting via Spatio-Temporal Alignment and Aggregation [CVPR2021]

Deep Video Matting via Spatio-Temporal Alignment and Aggregation [CVPR2021] Paper: https://arxiv.org/abs/2104.11208 Introduction Despite the significa

76 Dec 07, 2022
The MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset This is the repository for Measuring Mathematical Problem Solving With the MATH Dataset b

Dan Hendrycks 267 Dec 26, 2022
Convert Apple NeuralHash model for CSAM Detection to ONNX.

Apple NeuralHash is a perceptual hashing method for images based on neural networks. It can tolerate image resize and compression.

Asuhariet Ygvar 1.5k Dec 31, 2022