Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)

Overview

Vision Transformer

Pytorch reimplementation of Google's repository for the ViT model that was released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

This paper show that Transformers applied directly to image patches and pre-trained on large datasets work really well on image recognition task.

fig1

Vision Transformer achieve State-of-the-Art in image recognition task with standard Transformer encoder and fixed-size patches. In order to perform classification, author use the standard approach of adding an extra learnable "classification token" to the sequence.

fig2

Usage

1. Download Pre-trained model (Google's Official Checkpoint)

  • Available models: ViT-B_16(85.8M), R50+ViT-B_16(97.96M), ViT-B_32(87.5M), ViT-L_16(303.4M), ViT-L_32(305.5M), ViT-H_14(630.8M)
    • imagenet21k pre-train models
      • ViT-B_16, ViT-B_32, ViT-L_16, ViT-L_32, ViT-H_14
    • imagenet21k pre-train + imagenet2012 fine-tuned models
      • ViT-B_16-224, ViT-B_16, ViT-B_32, ViT-L_16-224, ViT-L_16, ViT-L_32
    • Hybrid Model(Resnet50 + Transformer)
      • R50-ViT-B_16
# imagenet21k pre-train
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz

# imagenet21k pre-train + imagenet2012 fine-tuning
wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/{MODEL_NAME}.npz

2. Train Model

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

CIFAR-10 and CIFAR-100 are automatically download and train. In order to use a different dataset you need to customize data_utils.py.

The default batch size is 512. When GPU memory is insufficient, you can proceed with training by adjusting the value of --gradient_accumulation_steps.

Also can use Automatic Mixed Precision(Amp) to reduce memory usage and train faster

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2

Results

To verify that the converted model weight is correct, we simply compare it with the author's experimental results. We trained using mixed precision, and --fp16_opt_level was set to O2.

imagenet-21k

model dataset resolution acc(official) acc(this repo) time
ViT-B_16 CIFAR-10 224x224 - 0.9908 3h 13m
ViT-B_16 CIFAR-10 384x384 0.9903 0.9906 12h 25m
ViT_B_16 CIFAR-100 224x224 - 0.923 3h 9m
ViT_B_16 CIFAR-100 384x384 0.9264 0.9228 12h 31m
R50-ViT-B_16 CIFAR-10 224x224 - 0.9892 4h 23m
R50-ViT-B_16 CIFAR-10 384x384 0.99 0.9904 15h 40m
R50-ViT-B_16 CIFAR-100 224x224 - 0.9231 4h 18m
R50-ViT-B_16 CIFAR-100 384x384 0.9231 0.9197 15h 53m
ViT_L_32 CIFAR-10 224x224 - 0.9903 2h 11m
ViT_L_32 CIFAR-100 224x224 - 0.9276 2h 9m
ViT_H_14 CIFAR-100 224x224 - WIP

imagenet-21k + imagenet2012

model dataset resolution acc
ViT-B_16-224 CIFAR-10 224x224 0.99
ViT_B_16-224 CIFAR-100 224x224 0.9245
ViT-L_32 CIFAR-10 224x224 0.9903
ViT-L_32 CIFAR-100 224x224 0.9285

shorter train

  • In the experiment below, we used a resolution size (224x224).
  • tensorboard
upstream model dataset total_steps /warmup_steps acc(official) acc(this repo)
imagenet21k ViT-B_16 CIFAR-10 500/100 0.9859 0.9859
imagenet21k ViT-B_16 CIFAR-10 1000/100 0.9886 0.9878
imagenet21k ViT-B_16 CIFAR-100 500/100 0.8917 0.9072
imagenet21k ViT-B_16 CIFAR-100 1000/100 0.9115 0.9216

Visualization

The ViT consists of a Standard Transformer Encoder, and the encoder consists of Self-Attention and MLP module. The attention map for the input image can be visualized through the attention score of self-attention.

Visualization code can be found at visualize_attention_map.

fig3

Reference

Citations

@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}
Owner
Eunkwang Jeon
Eunkwang Jeon
Clockwork Variational Autoencoder

Clockwork Variational Autoencoders (CW-VAE) Vaibhav Saxena, Jimmy Ba, Danijar Hafner If you find this code useful, please reference in your paper: @ar

Vaibhav Saxena 35 Nov 06, 2022
A PyTorch-based library for semi-supervised learning

News If you want to join TorchSSL team, please e-mail Yidong Wang ([email protected]<

1k Jan 06, 2023
Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision (ICCV 2021)

Learning RAW-to-sRGB Mappings with Inaccurately Aligned Supervision (ICCV 2021) PyTorch implementation of Learning RAW-to-sRGB Mappings with Inaccurat

Zhilu Zhang 53 Dec 20, 2022
Official implementation of SynthTIGER (Synthetic Text Image GEneratoR) ICDAR 2021

🐯 SynthTIGER: Synthetic Text Image GEneratoR Official implementation of SynthTIGER | Paper | Datasets Moonbin Yim1, Yoonsik Kim1, Han-cheol Cho1, Sun

Clova AI Research 256 Jan 05, 2023
[ICCV 2021] Target Adaptive Context Aggregation for Video Scene Graph Generation

Target Adaptive Context Aggregation for Video Scene Graph Generation This is a PyTorch implementation for Target Adaptive Context Aggregation for Vide

Multimedia Computing Group, Nanjing University 44 Dec 14, 2022
Code and hyperparameters for the paper "Generative Adversarial Networks"

Generative Adversarial Networks This repository contains the code and hyperparameters for the paper: "Generative Adversarial Networks." Ian J. Goodfel

Ian Goodfellow 3.5k Jan 08, 2023
Research Artifact of USENIX Security 2022 Paper: Automated Side Channel Analysis of Media Software with Manifold Learning

Automated Side Channel Analysis of Media Software with Manifold Learning Official implementation of USENIX Security 2022 paper: Automated Side Channel

Yuanyuan Yuan 175 Jan 07, 2023
Self-supervised learning on Graph Representation Learning (node-level task)

graph_SSL Self-supervised learning on Graph Representation Learning (node-level task) How to run the code To run GRACE, sh run_GRACE.sh To run GCA, sh

Namkyeong Lee 3 Dec 31, 2021
Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation

Context Decoupling Augmentation for Weakly Supervised Semantic Segmentation The code of: Context Decoupling Augmentation for Weakly Supervised Semanti

54 Dec 12, 2022
This is a model made out of Neural Network specifically a Convolutional Neural Network model

This is a model made out of Neural Network specifically a Convolutional Neural Network model. This was done with a pre-built dataset from the tensorflow and keras packages. There are other alternativ

9 Oct 18, 2022
A framework for using LSTMs to detect anomalies in multivariate time series data. Includes spacecraft anomaly data and experiments from the Mars Science Laboratory and SMAP missions.

Telemanom (v2.0) v2.0 updates: Vectorized operations via numpy Object-oriented restructure, improved organization Merge branches into single branch fo

Kyle Hundman 844 Dec 28, 2022
Code to compute permutation and drop-column importances in Python scikit-learn models

Feature importances for scikit-learn machine learning models By Terence Parr and Kerem Turgutlu. See Explained.ai for more stuff. The scikit-learn Ran

Terence Parr 537 Dec 31, 2022
68 keypoint annotations for COFW test data

68 keypoint annotations for COFW test data This repository contains manually annotated 68 keypoints for COFW test data (original annotation of CFOW da

31 Dec 06, 2022
This repo contains research materials released by members of the Google Brain team in Tokyo.

Brain Tokyo Workshop 🧠 🗼 This repo contains research materials released by members of the Google Brain team in Tokyo. Past Projects Weight Agnostic

Google 1.2k Jan 02, 2023
Repo for CVPR2021 paper "QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information"

QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information by Masato Tamura, Hiroki Ohashi, and Tomoaki Yosh

105 Dec 23, 2022
[IROS2021] NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences

NYU-VPR This repository provides the experiment code for the paper Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymiza

Automation and Intelligence for Civil Engineering (AI4CE) Lab @ NYU 22 Sep 28, 2022
ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

ALFRED A Benchmark for Interpreting Grounded Instructions for Everyday Tasks Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han,

ALFRED 204 Dec 15, 2022
Classifies galaxy morphology with Bayesian CNN

Zoobot Zoobot classifies galaxy morphology with deep learning. This code will let you: Reproduce and improve the Galaxy Zoo DECaLS automated classific

Mike Walmsley 39 Dec 20, 2022
Comp445 project - Data Communications & Computer Networks

COMP-445 Data Communications & Computer Networks Change Python version in Conda

Peng Zhao 2 Oct 03, 2022
A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review).

A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review).

Yinqiong Cai 189 Dec 28, 2022