Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Last update: Dec 31, 2022

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang¹,³, Xinggang Wang¹ 📧, Yu Li⁴, Yuxin Fang¹, Jiemin Fang¹,², Wenyu Liu¹, Xun Zhao³, Ying Shan³.

¹ School of EIC, HUST, ² AIA, HUST, ³ ARC Lab, Tencent PCG, ⁴ IDEA.

(📧) corresponding author.

This repo provides code, models, and training/inference recipes for TeViT (Temporally Efficient Vision Transformer for Video Instance Segmentation).
TeViT is a transformer-based, end-to-end video instance segmentation framework. We build it upon query-based instance segmentation methods, i.e., QueryInst.
We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. Together, these two designs fully utilize both frame-level and instance-level temporal context and obtain strong temporal modeling capacity at negligible extra computational cost.
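As a rough illustration of the frame-level idea, the sketch below shifts per-frame "messenger" tokens along the time axis so that each frame ends up holding tokens that originated in its neighbouring frames. Everything specific here is an assumption made for illustration and is not taken from this repository: the function name `shift_messengers`, the split of each frame's tokens into two halves, the one-frame offset, and the cyclic wrap-around at the clip boundaries.

```python
from typing import List, Sequence


def shift_messengers(messengers: Sequence[Sequence[str]], offset: int = 1) -> List[List[str]]:
    """Cyclically shift messenger tokens across frames (illustrative only).

    messengers[t] holds the messenger tokens of frame t; every frame is
    assumed to hold the same number of tokens. The first half of the slots
    is filled from an earlier frame, the second half from a later frame.
    """
    num_frames = len(messengers)
    shifted = []
    for t in range(num_frames):
        half = len(messengers[t]) // 2
        # First-half slots receive tokens from frame t - offset (wrapping around).
        from_past = list(messengers[(t - offset) % num_frames][:half])
        # Second-half slots receive tokens from frame t + offset (wrapping around).
        from_future = list(messengers[(t + offset) % num_frames][half:])
        shifted.append(from_past + from_future)
    return shifted


# Three frames with two messenger tokens each; names encode the source frame.
frames = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
print(shift_messengers(frames))
```

In this toy version the shift only re-indexes tokens and adds no arithmetic, which matches the claim that temporal context is exchanged at negligible extra computational cost.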

Models and Main Results

We provide both checkpoints and CodaLab server submissions on the YouTube-VIS-2019 dataset.

Name	AP	AP@50	AP@75	AR@1	AR@10	model	submission
TeViT_MsgShifT	46.3	70.6	50.9	45.2	54.3	link	link
TeViT_MsgShifT_MST	46.9	70.1	52.9	45.0	53.4	link	link

Because training is unstable, we conducted multiple runs; each checkpoint above is the best of its runs. The average performance is reported in our paper.
Besides the basic models, we also provide TeViT with ResNet-50 and Swin-L backbones. These models are also trained on the YouTube-VIS-2019 dataset.
MST denotes multi-scale training.

Name	AP	AP@50	AP@75	AR@1	AR@10	model	submission
TeViT_R50	42.1	67.8	44.8	41.3	49.9	link	link
TeViT_Swin-L_MST	56.8	80.6	63.1	52.0	63.3	link	link

Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbones use the STQI Head only (i.e., without our proposed messenger shift mechanism).
With Swin-L as the backbone network, we use more instance queries (300 instead of 100) and stronger data augmentation strategies. Both further boost the final performance.

Installation

Prerequisites

Linux
Python 3.7+
CUDA 10.2+
GCC 5+

Prepare

Clone the repository locally:

git clone https://github.com/hustvl/TeViT.git

Create a conda virtual environment and activate it:

conda create --name tevit python=3.7.7
conda activate tevit

Install the YTVOS version of the COCO API from youtubevos/cocoapi:

pip install "git+https://github.com/youtubevos/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"

Install Python requirements (torch==1.9.0, torchvision==0.10.0, mmcv==1.4.8):

pip install -r requirements.txt
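To confirm that the environment matches the versions listed above, the installed packages can be compared against them. The sketch below is a hypothetical helper, not part of this repo: `installed_version` and `check_versions` are illustrative names, and it assumes each package exposes a `__version__` attribute on import. Local build tags such as the `+cu102` suffix on some torch builds are ignored.

```python
import importlib
from typing import Callable, Dict, Optional

# Versions listed in the installation step above.
EXPECTED = {"torch": "1.9.0", "torchvision": "0.10.0", "mmcv": "1.4.8"}


def installed_version(module_name: str) -> Optional[str]:
    """Return a module's __version__ without a local build tag, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    # Builds such as "1.9.0+cu102" carry a local tag after "+"; drop it.
    return version.split("+")[0] if isinstance(version, str) else None


def check_versions(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Map each missing or mismatching module to a short description of the problem."""
    problems = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            problems[name] = f"not installed (want {wanted})"
        elif found != wanted:
            problems[name] = f"found {found}, want {wanted}"
    return problems
```

Calling `check_versions(EXPECTED)` inside the activated environment returns an empty dict when all three packages match. The `lookup` parameter exists only so the comparison can be exercised without importing the real packages.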

Follow the MMDetection docs to install MMDetection:

python setup.py develop

Download the YouTube-VIS 2019 dataset from here and organize it as follows:

TeViT
├── data
│   ├── youtubevis
│   │   ├── train
│   │   │   ├── 003234408d
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── ...
│   │   ├── annotations
│   │   │   ├── train.json
│   │   │   ├── valid.json
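A small check can catch a misplaced folder before a long run starts. The sketch below tests for the entries shown in the tree above; `missing_entries` is a hypothetical helper, not part of this repo. It assumes the script is run from the TeViT root, so the dataset lives at `data/youtubevis`.

```python
from pathlib import Path
from typing import List

# Entries taken from the layout above, relative to the dataset root.
REQUIRED = [
    "train",
    "val",
    "annotations/train.json",
    "annotations/valid.json",
]


def missing_entries(root: str) -> List[str]:
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data/youtubevis")
    print("layout OK" if not missing else f"missing: {missing}")
```

The check only looks at names; it does not verify that the video folders or annotation files are complete.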

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference, the predicted results are stored in results.json. Submit this file to the evaluation server to get the final performance.
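The sketch below packs results.json for upload after a basic sanity check. It rests on two assumptions that should be checked against the evaluation server's own instructions: that the server expects a zip archive with results.json at its top level, and that the file holds a JSON list of predictions. `pack_submission` and the archive name `submission.zip` are hypothetical.

```python
import json
import zipfile
from pathlib import Path


def pack_submission(results_path: str = "results.json",
                    archive_path: str = "submission.zip") -> int:
    """Check that results_path holds a JSON list, zip it, and return the number of entries."""
    source = Path(results_path)
    with source.open() as handle:
        predictions = json.load(handle)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the file at the archive root under its own name.
        archive.write(str(source), arcname=source.name)
    return len(predictions)
```

Run from the directory that contains results.json, `pack_submission()` writes `submission.zip` next to it and returns the number of predictions found.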

Training

Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
Train TeViT with 8 GPUs:

./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT

Train TeViT with multi-scale data augmentation:

./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
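The two training commands above differ only in the config file, so a script that launches several runs can build them from one function. This is a minimal sketch: `train_command` is a hypothetical helper, and `/path/to/pretrained.pth` is a placeholder for the downloaded weights.

```python
import shlex
from typing import List


def train_command(config: str, pretrained: str, gpus: int = 8) -> List[str]:
    """Build the argument list for tools/dist_train.sh as used above."""
    return [
        "./tools/dist_train.sh", config, str(gpus),
        "--no-validate",
        "--cfg-options", f"load_from={pretrained}",
    ]


cmd = train_command("configs/tevit/tevit_msgshift_mstrain.py", "/path/to/pretrained.pth")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True) from the repository root.
```

Passing the arguments as a list avoids shell quoting problems when the checkpoint path contains spaces.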

The whole training process takes about three hours on 8 Tesla V100 GPUs.
To train TeViT with a ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❤️

This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation},
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
