Local-Global Stratified Transformer for Efficient Video Recognition

Overview

DualFormer

This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model is built on a popular video package called mmaction2. This repo also refers to the code templates provided by PVT, Twins and Swin. This repo is released under the Apache 2.0 license.

Introduction

DualFormer is a Transformer architecture that can effectively and efficiently perform space-time attention for video recognition. Specifically, our DualFormer stratifies the full space-time attention into dual cascaded levels, i.e., to first learn fine-grained local space-time interactions among nearby 3D tokens, followed by the capture of coarse-grained global dependencies between the query token and the coarse-grained global pyramid contexts. Experimental results show the superiority of DualFormer on five video benchmarks against existing methods. In particular, DualFormer sets new state-of-the-art 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with ∼1000G inference FLOPs which is at least 3.2× fewer than existing methods with similar performances.

Installation & Requirement

Please refer to install.md for installation. The docker files are also provided for convenient usage - cuda10.1 and cuda11.0.

All models are trained on 8 Nvidia A100 GPUs. For example, training a DualFormer-T on Kinetics-400 takes ∼31 hours on 8 A100 GPUs, while training a larger model DualFormer-B on Kinetics-400 requires ∼3 days on 8 A100 GPUs.

Data Preparation

Please first see data_preparation.md for a general knowledge of data preparation.

  • For Kinetics-400/600, as these are dynamic datasets (videos may be removed from YouTube), we employ this repo to download the original files and the annotatoins. Only a few number of corrupted videos are removed (around 50).
  • For other datasets, i.e., HMDB-51, UCF-101 and Diving-48, we use the data downloader provided by mmaction2 as aforementioned.

The full supported datasets are listed below (more details in supported_datasets.md):

HMDB51 (Homepage) (ICCV'2011) UCF101 (Homepage) (CRCV-IR-12-01) ActivityNet (Homepage) (CVPR'2015) Kinetics-[400/600/700] (Homepage) (CVPR'2017)
SthV1 (Homepage) (ICCV'2017) SthV2 (Homepage) (ICCV'2017) Diving48 (Homepage) (ECCV'2018) Jester (Homepage) (ICCV'2019)
Moments in Time (Homepage) (TPAMI'2019) Multi-Moments in Time (Homepage) (ArXiv'2019) HVU (Homepage) (ECCV'2020) OmniSource (Homepage) (ECCV'2020)

Models

We present a major part of the model results, the configuration files, and downloading links in the following table. The FLOPs is computed by fvcore, where we omit the classification head since it has low impact to the FLOPs.

Dataset Version Pretrain GFLOPs Param (M) Top-1 Config Download
K400 Tiny IN-1K 240 21.8 79.5 link link
K400 Small IN-1K 636 48.9 80.6 link link
K400 Base IN-1K 1072 86.8 81.1 link link
K600 Base IN-22K 1072 86.8 85.2 link link
Diving-48 Small K400 1908 48.9 81.8 link link
HMDB-51 Small K400 1908 48.9 76.4 link link
UCF-101 Small K400 1908 48.9 97.5 link link

Visualization

We visualize the attention maps at the last layer of our model generated by Grad-CAM on Kinetics-400. As shown in the following three gifs, our model successfully learns to focus on the relevant parts in the video clip. Left: flying kites. Middle: counting money. Right: walking dogs.

You can use the following commend to visualize the attention weights:

python demo/demo_gradcam.py 
    
     
     
       --target-layer-name 
      
        --out-filename 
        
       
      
     
    
   

For example, to visualize the last layer of DualFormer-S on a K400 video (-cii-Z0dW2E_000020_000030.mp4), please run:

python demo/demo_gradcam.py \
    configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py \
    checkpoints/k400/dualformer_small_patch244_window877.pth \
    /dataset/kinetics-400/train_files/-cii-Z0dW2E_000020_000030.mp4 \
    --target-layer-name backbone/blocks/3/3 --fps 10 \
    --out-filename output/-cii-Z0dW2E_000020_000030.gif

User Guide

Folder Structure

As our implementation is based on mmaction2, we specify our contributions as follows:

Testing

# single-gpu testing
python tools/test.py 
    
    
      --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh 
      
       
       
         --eval top_k_accuracy 
       
      
     
    
   

Example 1: to validate a DualFormer-T model on Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py checkpoints/k400/dualformer_tiny_patch244_window877.pth 8 --eval top_k_accuracy

You will obtain the result as follows:

Example 2: to validate a DualFormer-S model on Diving-48 dataset with 4 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_small_patch244_window877_diving48.py checkpoints/diving48/dualformer_small_patch244_window877.pth 4 --eval top_k_accuracy 

The output will be as follows:

Training from scratch

To train a video recognition model from scratch for Kinetics-400, please run:

# single-gpu training
python tools/train.py 
   
     [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh 
     
     
       [other optional arguments]

     
    
   

For example, to train a DualFormer-T model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8 

Training a DualFormer-S model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py 8 

Training with pre-trained 2D models

To train a video recognition model with pre-trained image models, please run:

# single-gpu training
python tools/train.py 
   
     --cfg-options model.backbone.pretrained=
    
      [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh 
      
      
        --cfg-options model.backbone.pretrained=
       
         [model.backbone.use_checkpoint=True] [other optional arguments] 
       
      
     
    
   

For example, to train a DualFormer-T model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=
    

   

Training a DualFormer-B model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_base_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=
    

   

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.

Training with Token Labelling

We also present the first attempt to improve the video recognition model by generalizing Token Labelling to videos as additional augmentations, in which MixToken is turned off as it does not work on our video datasets. For instance, to train a small version of DualFormer using DualFormer-B as the annotation model on the fly, please run:

bash tools/dist_train.sh configs/recognition/dualformer/dualformer_tiny_tokenlabel_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained='checkpoints/pretrained_2d/dualformer_tiny.pth' --validate 

Notice that we place the checkpoint of the annotation model at 'checkpoints/k400/dualformer_base_patch244_window877.pth'. You can change it to anywhere you want, or modify the path variable in this file.

We present two examples of visualization of token labelling on video data. For simiplicity, we omit several frames and thus each example only shows 5 frames with uniform sampling rate. For each frame, each value p(i,j) on the left hand side means the pseudo label (index) at each patch of the last stage provided by the annotation model.

  • Visualization example 1 (Correct label: pushing cart, index: 262).
  • Visualization example 2 (Correct label: dribbling basketball, index: 99).

              

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liang2021dualformer,
         title={DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition}, 
         author={Yuxuan Liang and Pan Zhou and Roger Zimmermann and Shuicheng Yan},
         year={2021},
         journal={arXiv preprint arXiv:2112.04674},
}

Acknowledgement

We would like to thank the authors of the following helpful codebases:

Please kindly consider star these related packages as well. Thank you much for your attention.

Owner
Sea AI Lab
Sea AI Lab
Awesome Remote Sensing Toolkit based on PaddlePaddle.

基于飞桨框架开发的高性能遥感图像处理开发套件,端到端地完成从训练到部署的全流程遥感深度学习应用。 最新动态 PaddleRS 即将发布alpha版本!欢迎大家试用 简介 PaddleRS是遥感科研院所、相关高校共同基于飞桨开发的遥感处理平台,支持遥感图像分类,目标检测,图像分割,以及变化检测等常用遥

146 Dec 11, 2022
List some popular DeepFake models e.g. DeepFake, FaceSwap-MarekKowal, IPGAN, FaceShifter, FaceSwap-Nirkin, FSGAN, SimSwap, CihaNet, etc.

deepfake-models List some popular DeepFake models e.g. DeepFake, CihaNet, SimSwap, FaceSwap-MarekKowal, IPGAN, FaceShifter, FaceSwap-Nirkin, FSGAN, Si

Mingcan Xiang 100 Dec 17, 2022
HistoSeg : Quick attention with multi-loss function for multi-structure segmentation in digital histology images

HistoSeg : Quick attention with multi-loss function for multi-structure segmentation in digital histology images Histological Image Segmentation This

Saad Wazir 11 Dec 16, 2022
For auto aligning, cropping, and scaling HR and LR images for training image based neural networks

ImgAlign For auto aligning, cropping, and scaling HR and LR images for training image based neural networks Usage Make sure OpenCV is installed, 'pip

15 Dec 04, 2022
Implementation of the pix2pix model on satellite images

This repo shows how to implement and use the pix2pix GAN model for image to image translation. The model is demonstrated on satellite images, and the

3 May 24, 2022
This is the official PyTorch implementation of our paper: "Artistic Style Transfer with Internal-external Learning and Contrastive Learning".

Artistic Style Transfer with Internal-external Learning and Contrastive Learning This is the official PyTorch implementation of our paper: "Artistic S

51 Dec 20, 2022
RuleBERT: Teaching Soft Rules to Pre-Trained Language Models

RuleBERT: Teaching Soft Rules to Pre-Trained Language Models (Paper) (Slides) (Video) RuleBERT is a pre-trained language model that has been fine-tune

16 Aug 24, 2022
Adaptive, interpretable wavelets across domains (NeurIPS 2021)

Adaptive wavelets Wavelets which adapt given data (and optionally a pre-trained model). This yields models which are faster, more compressible, and mo

Yu Group 50 Dec 16, 2022
(ImageNet pretrained models) The official pytorch implemention of the TPAMI paper "Res2Net: A New Multi-scale Backbone Architecture"

Res2Net The official pytorch implemention of the paper "Res2Net: A New Multi-scale Backbone Architecture" Our paper is accepted by IEEE Transactions o

Res2Net Applications 928 Dec 29, 2022
An Unpaired Sketch-to-Photo Translation Model

Unpaired-Sketch-to-Photo-Translation We have released our code at https://github.com/rt219/Unsupervised-Sketch-to-Photo-Synthesis This project is the

38 Oct 28, 2022
A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

A method that utilized Generative Adversarial Network (GAN) to interpret the black-box deep image classifier models by PyTorch.

Yunxia Zhao 3 Dec 29, 2022
A python library for self-supervised learning on images.

Lightly is a computer vision framework for self-supervised learning. We, at Lightly, are passionate engineers who want to make deep learning more effi

Lightly 2k Jan 08, 2023
PyTorch code for Composing Partial Differential Equations with Physics-Aware Neural Networks

FInite volume Neural Network (FINN) This repository contains the PyTorch code for models, training, and testing, and Python code for data generation t

Cognitive Modeling 20 Dec 18, 2022
A repository for generating stylized talking 3D and 3D face

style_avatar A repository for generating stylized talking 3D faces and 2D videos. This is the repository for paper Imitating Arbitrary Talking Style f

Haozhe Wu 191 Dec 22, 2022
Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation.

Distant Supervision for Scene Graph Generation Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation. Introduction The pape

THUNLP 23 Dec 31, 2022
PyTorch implementation of SCAFFOLD (Stochastic Controlled Averaging for Federated Learning, ICML 2020).

Scaffold-Federated-Learning PyTorch implementation of SCAFFOLD (Stochastic Controlled Averaging for Federated Learning, ICML 2020). Environment numpy=

KI 30 Dec 29, 2022
Time Series Forecasting with Temporal Fusion Transformer in Pytorch

Forecasting with the Temporal Fusion Transformer Multi-horizon forecasting often contains a complex mix of inputs – including static (i.e. time-invari

Nicolás Fornasari 6 Jan 24, 2022
SSD: Single Shot MultiBox Detector pytorch implementation focusing on simplicity

SSD: Single Shot MultiBox Detector Introduction Here is my pytorch implementation of 2 models: SSD-Resnet50 and SSDLite-MobilenetV2.

Viet Nguyen 149 Jan 07, 2023
A Deep learning based streamlit web app which can tell with which bollywood celebrity your face resembles.

Project Name: Which Bollywood Celebrity You look like A Deep learning based streamlit web app which can tell with which bollywood celebrity your face

BAPPY AHMED 20 Dec 28, 2021
This repository is a series of notebooks that show solutions for the projects at Dataquest.io.

Dataquest Project Solutions This repository is a series of notebooks that show solutions for the projects at Dataquest.io. Of course, there are always

Dataquest 1.1k Dec 30, 2022