This is the official implementation of "Video Swin Transformer".

Last update: Jan 03, 2023

Overview

Video Swin Transformer

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer was first described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers. This leads to a better speed-accuracy trade-off than previous approaches, which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
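
To make the locality argument concrete, here is a back-of-the-envelope sketch that compares how many query-key pairs global self-attention and windowed self-attention evaluate at the first stage. The numbers are assumptions for illustration only: a 32-frame 224x224 clip, and the 2x4x4 patch size and 8x7x7 window size suggested by the patch244_window877 config names used in the Training section. The function names are hypothetical, and the count ignores heads, channels, and shifted-window masking.

```python
from math import prod


def token_grid(clip_shape, patch_size):
    """Tokens along (time, height, width) after non-overlapping patch partition."""
    return tuple(c // p for c, p in zip(clip_shape, patch_size))


def attention_pairs_global(grid):
    """Every token attends to every token."""
    n = prod(grid)
    return n * n


def attention_pairs_windowed(grid, window):
    """Every token attends only to the tokens inside its own 3D window.

    Assumes each grid dimension is divisible by the window dimension.
    """
    n = prod(grid)
    return n * prod(window)


# Assumed input: 32 frames of 224x224, patch size (2, 4, 4), window (8, 7, 7).
grid = token_grid((32, 224, 224), (2, 4, 4))
global_pairs = attention_pairs_global(grid)
window_pairs = attention_pairs_windowed(grid, (8, 7, 7))

print("token grid:", grid)
print("global / windowed pairs:", global_pairs // window_pairs)
```

Under these assumptions the token grid is 16x56x56, and windowed attention evaluates 128 times fewer query-key pairs than global attention at this resolution.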

Results and Models

Kinetics 400

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-T	ImageNet-1K	30ep	224	78.8	93.6	28M	87.9G	config	github/baidu
Swin-S	ImageNet-1K	30ep	224	80.6	94.5	50M	165.9G	config	github/baidu
Swin-B	ImageNet-1K	30ep	224	80.6	94.6	88M	281.6G	config	github/baidu
Swin-B	ImageNet-22K	30ep	224	82.7	95.5	88M	281.6G	config	github/baidu

Kinetics 600

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	ImageNet-22K	30ep	224	84.0	96.5	88M	281.6G	config	github/baidu

Something-Something V2

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	Kinetics 400	60ep	224	69.6	92.7	89M	320.6G	config	github/baidu

Notes:

Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
The pre-trained model for SSv2 can be downloaded from github/baidu.
The access code for baidu is swin.

Usage

Installation

Please refer to install.md for installation.

We also provide Dockerfiles for cuda10.1 (image url) and cuda11.0 (image url) for convenience.

Data Preparation

Please refer to data_preparation.md for a general overview of data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
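
The top_k_accuracy metric requested above corresponds to the acc@1 and acc@5 columns in the results tables. As a minimal sketch of what the metric measures (an illustration in plain Python, not the mmaction2 implementation; the function name and toy data are made up for this example):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 4 clips, 3 classes (hypothetical scores).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
]
labels = [1, 1, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

On this toy data only the first clip is classified correctly at top-1, while three of the four true labels fall within the top two predictions.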

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>

To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint enables gradient checkpointing, which trades extra computation for lower GPU memory usage. Please refer to this page for more details.
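
The two training recipes above differ only in which option carries the pre-trained weights: model.backbone.pretrained for image models and load_from for video models. The helper below is a hypothetical convenience sketch, not part of this repo, that assembles the same commands as an argument list. The function name, its parameters, and the checkpoint path are assumptions for illustration.

```python
def build_train_command(config, pretrained, gpus=1,
                        video_pretrained=False, use_checkpoint=False):
    """Assemble one of the training commands shown above as an argument list."""
    # Image-pretrained weights are passed to the backbone;
    # video-pretrained checkpoints are passed through load_from.
    key = "load_from" if video_pretrained else "model.backbone.pretrained"
    options = [f"{key}={pretrained}"]
    if use_checkpoint:
        options.append("model.backbone.use_checkpoint=True")

    if gpus == 1:
        launcher = ["python", "tools/train.py", config]
    else:
        launcher = ["bash", "tools/dist_train.sh", config, str(gpus)]
    return launcher + ["--cfg-options"] + options


cmd = build_train_command(
    "configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py",
    pretrained="path/to/pretrained.pth",  # hypothetical path
    gpus=8,
    use_checkpoint=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run, or joined with spaces and pasted into a shell.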

Apex (optional):

We use apex for mixed precision training by default. To install apex, use the provided Docker images or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
