AOT (Associating Objects with Transformers) in PyTorch

Last update: Dec 14, 2022

A modular reference PyTorch implementation of Associating Objects with Transformers for Video Object Segmentation (NeurIPS 2021). [paper]

Highlights

High performance: up to 85.5% (R50-AOTL) on YouTube-VOS 2018 and 82.1% (SwinB-AOTL) on DAVIS-2017 Test-dev under standard settings.
High efficiency: up to 51fps (AOTT) on DAVIS-2017 (480p) even with 10 objects, and 41fps on YouTube-VOS (1.3x480p). AOT can process multiple objects (up to a pre-defined number, 10 by default) as efficiently as a single object. The project also supports inference on any number of objects in a video through automatic separation and aggregation.
Multi-GPU training and inference
Mixed precision training and inference
Test-time augmentation: multi-scale and flipping augmentations are supported.
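
The "automatic separation" mentioned above can be pictured as splitting the object IDs of a video into groups that each fit within the per-pass limit. The following is only a minimal sketch of that idea, not the project's actual code: the function name `split_object_ids` and the parameter `max_objects` are hypothetical, and the aggregation step (merging per-group predictions back into one mask) is not shown.

```python
from typing import List, Sequence


def split_object_ids(object_ids: Sequence[int], max_objects: int = 10) -> List[List[int]]:
    """Split object IDs into consecutive groups of at most `max_objects` IDs.

    Hypothetical helper: each group could then be segmented in a separate pass.
    """
    if max_objects < 1:
        raise ValueError("max_objects must be at least 1")
    ids = list(object_ids)
    return [ids[i:i + max_objects] for i in range(0, len(ids), max_objects)]


# A video with 23 annotated objects is split into groups of 10, 10 and 3.
groups = split_object_ids(range(1, 24))
print([len(group) for group in groups])  # [10, 10, 3]
```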

TODO

Code documentation
Demo tool
Adding your own dataset

Requirements

Python3
PyTorch >= 1.7.0 and torchvision
opencv-python
Pillow

Optional (for better efficiency):

Pytorch Correlation (installing from source is recommended over pip)
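
Before training, it can help to confirm that the required packages are importable. The sketch below is an illustration rather than part of the project; the helper name `missing_packages` is hypothetical, and it only checks that each module can be found, not that the installed PyTorch version is at least 1.7.0.

```python
import importlib.util

# Import module -> package name as listed above.
# opencv-python is imported as cv2, and Pillow as PIL.
REQUIRED = {
    "torch": "pytorch",
    "torchvision": "torchvision",
    "cv2": "opencv-python",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```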

Demo

Coming

Model Zoo and Results

Pre-trained models and corresponding results reproduced by this project can be found in MODEL_ZOO.md.

Getting Started

Prepare datasets:

Please follow the instructions below to prepare the datasets in their corresponding folders.
- Static
  
  datasets/Static: pre-training dataset with static images. Guidance can be found in AFB-URR.
- YouTube-VOS
  
  A commonly-used large-scale VOS dataset.
  
  datasets/YTB/2019: version 2019, download link. train is required for training. valid (6fps) and valid_all_frames (30fps, optional) are used for evaluation.
  
  datasets/YTB/2018: version 2018, download link. Only valid (6fps) and valid_all_frames (30fps, optional) are required for this project and used for evaluation.
- DAVIS
  
  A commonly-used small-scale VOS dataset.
  
  datasets/DAVIS: TrainVal (480p) contains both the training and validation splits. Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but is not required.
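
A quick way to verify the folder layout described above is to check that each top-level dataset folder exists. This is a minimal sketch under the assumption that the script is run from the project root; the helper name `missing_dataset_dirs` is hypothetical, and the contents of each folder (splits, frames, annotations) are deliberately not checked.

```python
from pathlib import Path

# Folder names taken from the list above; nothing below these levels is checked.
EXPECTED_DIRS = [
    "datasets/Static",
    "datasets/YTB/2019",
    "datasets/YTB/2018",
    "datasets/DAVIS",
]


def missing_dataset_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected dataset folders that do not exist under `root`."""
    root = Path(root)
    return [folder for folder in expected if not (root / folder).is_dir()]


if __name__ == "__main__":
    for folder in missing_dataset_dirs():
        print(f"Not found: {folder}")
```

Datasets that are not needed for a given run, such as Static when the pre-training stage is skipped, can be left out of the `expected` list.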

Prepare ImageNet pre-trained encoders:

Select and download the checkpoints below into pretrain_models:
- MobileNet-V2 (default encoder)
- MobileNet-V3
- ResNet-50
- ResNet-101
- ResNeSt-101
- Swin-Base

The current default training configs are not optimized for encoders larger than ResNet-50. If you want to use larger encoders, we recommend stopping the main-training stage early at 80,000 iterations (100,000 by default) to avoid over-fitting on the seen classes of YouTube-VOS.

Training and Evaluation

The example script trains AOTT in 2 stages using 4 GPUs and automatic mixed precision (--amp). The first stage is a pre-training stage using the Static dataset, and the second is the main-training stage, which uses both YouTube-VOS 2019 train and DAVIS-2017 train, resulting in a model that can generalize to different domains (YouTube-VOS and DAVIS) and different frame rates (6fps, 24fps, and 30fps).

Notably, you can use only the YouTube-VOS 2019 train split in the second stage by changing pre_ytb_dav to pre_ytb, which leads to better YouTube-VOS performance on unseen classes. In addition, if you want to skip the first stage, you can start training from stage ytb, but performance will drop by about 1-2 percentage points.

After training finishes, the example script evaluates the model on YouTube-VOS and DAVIS and packs the results into Zip files. To calculate scores, please use the official YouTube-VOS servers (2018 server and 2019 server) and the official DAVIS toolkit.
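
The example script already produces the Zip files, but if a results folder needs to be re-packed by hand, something along the following lines may work. This is a sketch using only the Python standard library, not the project's own packing code; the function name `pack_results` is hypothetical, and the archive layout expected by each evaluation server should be checked against that server's instructions.

```python
import zipfile
from pathlib import Path


def pack_results(result_dir, zip_path):
    """Pack every file under `result_dir` into `zip_path`, keeping relative paths.

    `zip_path` should be located outside `result_dir`, otherwise the archive
    being written could be picked up while the folder is traversed.
    """
    result_dir = Path(result_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(result_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the results folder, with forward slashes.
                archive.write(path, path.relative_to(result_dir).as_posix())
    return Path(zip_path)
```

For example, `pack_results("my_results", "my_results.zip")` would archive a hypothetical `my_results` folder next to it.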

Adding your own dataset

Coming

Troubleshooting

Waiting

Citations

Please consider citing the related paper(s) in your publications if it helps your research.

@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

License

This project is released under the BSD-3-Clause license. See LICENSE for additional details.
