《Rethinking Spatial Dimensions of Vision Transformers》(2021)

Last update: Dec 27, 2022

Related tags

Deep Learning pit

Overview

Rethinking Spatial Dimensions of Vision Transformers

Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh | Paper

NAVER AI LAB

Abstract

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of the spatial dimension conversion and its effectiveness on the transformer-based architecture. We particularly attend the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model. We show that PiT achieves the improved model capability and generalization performance against ViT. Throughout the extensive experiments, we further show PiT outperforms the baseline on several tasks such as image classification, object detection and robustness evaluation.
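The dimension reduction principle in the abstract (fewer spatial positions and more channels as depth increases) can be illustrated with a small shape calculation. This is only a sketch, not the authors' implementation: the patch size, patch stride, starting channel width, pooling kernel, and the rule of doubling channels at each pooling step are illustrative assumptions, and `conv_out` and `stage_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a strided convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(image_size=224, patch=16, patch_stride=8,
                 channels=144, stages=3, pool_stride=2):
    """Return (side, tokens, channels) per stage of a pooling-based ViT.

    All default values are assumptions chosen for illustration.
    """
    # Patch embedding modeled as a strided convolution without padding.
    side = conv_out(image_size, patch, patch_stride, 0)
    shapes = []
    for stage in range(stages):
        shapes.append((side, side * side, channels))
        if stage < stages - 1:
            # Pooling between stages: spatial size shrinks, channels grow.
            side = conv_out(side, pool_stride + 1, pool_stride, pool_stride // 2)
            channels *= 2
    return shapes


for side, tokens, width in stage_shapes():
    print(f"{side}x{side} = {tokens} tokens, {width} channels")
```

Under these assumptions the token count drops from 729 to 196 to 49 while the channel width grows from 144 to 288 to 576. A plain ViT would instead keep both constant through all layers.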

Model performance

We compared the performance of PiT with DeiT models in various training settings. Throughput (imgs/sec) is measured on a machine with a single V100 GPU and a batch size of 128.

Network	FLOPs	# params	imgs/sec	Vanilla	+CutMix	+DeiT	+Distill
DeiT-Ti	1.3 G	5.7 M	2564	68.7	68.5	72.2	74.5
PiT-Ti	0.71 G	4.9 M	3030	71.3	72.6	73.0	74.6
PiT-XS	1.4 G	10.6 M	2128	72.4	76.8	78.1	79.1

DeiT-S	4.6 G	22.1 M	980	68.7	76.5	79.8	81.2
PiT-S	2.9 G	23.5 M	1266	73.3	79.0	80.9	81.9

DeiT-B	17.6 G	86.6 M	303	69.3	75.3	81.8	83.4
PiT-B	12.5 G	73.8 M	348	76.1	79.9	82.0	84.0
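To read the table as relative cost, the following sketch computes FLOPs and throughput ratios for each PiT model against the DeiT model in the same group. The numbers are copied from the table above; the `compare` helper is a hypothetical name introduced for illustration. PiT-XS is left out because the table has no DeiT row paired with it.

```python
# (FLOPs in G, throughput in imgs/sec), copied from the table above.
ROWS = {
    "DeiT-Ti": (1.3, 2564), "PiT-Ti": (0.71, 3030),
    "DeiT-S": (4.6, 980), "PiT-S": (2.9, 1266),
    "DeiT-B": (17.6, 303), "PiT-B": (12.5, 348),
}


def compare(pit_name, deit_name, rows=ROWS):
    """Return (FLOPs ratio, throughput ratio) of a PiT model over a DeiT model."""
    pit_flops, pit_speed = rows[pit_name]
    deit_flops, deit_speed = rows[deit_name]
    return pit_flops / deit_flops, pit_speed / deit_speed


for size in ("Ti", "S", "B"):
    flops_ratio, speed_ratio = compare(f"PiT-{size}", f"DeiT-{size}")
    print(f"PiT-{size}: {flops_ratio:.2f}x FLOPs, {speed_ratio:.2f}x throughput")
```

A FLOPs ratio below 1 and a throughput ratio above 1 mean the PiT model is both cheaper and faster than its DeiT counterpart in this measurement.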

Pretrained weights

Model name	FLOPs	accuracy	weights
`pit_ti`	0.71 G	73.0	link
`pit_xs`	1.4 G	78.1	link
`pit_s`	2.9 G	80.9	link
`pit_b`	12.5 G	82.0	link

`pit_ti_distilled`	0.71 G	74.6	link
`pit_xs_distilled`	1.4 G	79.1	link
`pit_s_distilled`	2.9 G	81.9	link
`pit_b_distilled`	12.5 G	84.0	link

Dependencies

Our implementations are tested on the following libraries with Python 3.6.9 and CUDA 10.1.

torch: 1.7.1
torchvision: 0.8.2
timm: 0.3.4
einops: 0.3.0

Install other dependencies using the following command.

pip install -r requirements.txt
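To see whether an environment matches the tested versions listed above, a small check like the following may help. It is a sketch that assumes Python 3.8 or newer (for `importlib.metadata`), which is newer than the Python 3.6.9 used for testing; `check_versions` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Versions the implementation was tested with, as listed above.
TESTED = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "timm": "0.3.4",
    "einops": "0.3.0",
}


def check_versions(tested=TESTED, get_version=metadata.version):
    """Map each package to (tested version, installed version, matches)."""
    report = {}
    for name, wanted in tested.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        # Ignore local build tags such as "+cu101" when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report

# Usage: for name, row in check_versions().items(): print(name, row)
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the one the authors tested.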

How to use models

You can build PiT models directly:

import torch
import pit

model = pit.pit_s(pretrained=False)
model.load_state_dict(torch.load('./weights/pit_s_809.pth'))
print(model(torch.randn(1, 3, 224, 224)))

Or use the timm `create_model` function:

import torch
import timm
import pit  # needed even though unused: importing it makes 'pit_s' known to timm

model = timm.create_model('pit_s', pretrained=False)
model.load_state_dict(torch.load('./weights/pit_s_809.pth'))
print(model(torch.randn(1, 3, 224, 224)))

To use models trained with distillation, use the corresponding `_distilled` model and weights.

import torch
import pit

model = pit.pit_s_distilled(pretrained=False)
model.load_state_dict(torch.load('./weights/pit_s_distill_819.pth'))
print(model(torch.randn(1, 3, 224, 224)))
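The snippets above print the raw model output. Assuming that output is a row of per-class logits (the usual convention for an image classifier, though not stated explicitly here), the following sketch turns such a row into top-k class probabilities. It is written in plain Python so it runs without PyTorch, and `top_k` is a hypothetical helper name.

```python
import math


def top_k(logits, k=5):
    """Return the k most likely (class index, probability) pairs.

    `logits` is a flat list of floats, e.g. `model(x)[0].tolist()`
    under the assumption described above.
    """
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Sort class indices by descending probability and keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


print(top_k([1.0, 3.0, 2.0], k=2))
```

For real predictions the model would normally be put in evaluation mode with `model.eval()` and fed a properly normalized image instead of random noise.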

License

Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Citation

@article{heo2021pit,
    title={Rethinking Spatial Dimensions of Vision Transformers},
    author={Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh},
    journal={arXiv: 2103.16302},
    year={2021},
}

Owner

NAVER AI
