UniFormer - official implementation of UniFormer

Overview

UniFormer

This repo is the official implementation of "Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning". It currently includes code and models for the following tasks:

Updates

01/13/2022

[Initial commits]:

  1. Pretrained models on ImageNet-1K, Kinetics-400, Kinetics-600, Something-Something V1&V2

  2. The supported code and models for image classification and video classification are provided.

Introduction

UniFormer (Unified transFormer) is introduce in arxiv, which effectively unifies 3D convolution and spatiotemporal self-attention in a concise transformer format. We adopt local MHRA in shallow layers to largely reduce computation burden and global MHRA in deep layers to learn global token relation.

UniFormer achieves strong performance on video classification. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other comparable methods (e.g., 16.7x fewer GFLOPs than ViViT with JFT-300M pre-training). For Something-Something V1 and V2, our UniFormer achieves 60.9% and 71.2% top-1 accuracy respectively, which are new state-of-the-art performances.

teaser

Main results on ImageNet-1K

Please see image_classification for more details.

More models with large resolution and token labeling will be released soon.

Model Pretrain Resolution Top-1 #Param. FLOPs
UniFormer-S ImageNet-1K 224x224 82.9 22M 3.6G
UniFormer-S† ImageNet-1K 224x224 83.4 24M 4.2G
UniFormer-B ImageNet-1K 224x224 83.9 50M 8.3G

Main results on Kinetics-400

Please see video_classification for more details.

Model Pretrain #Frame Sampling Method FLOPs K400 Top-1 K600 Top-1
UniFormer-S ImageNet-1K 16x1x4 16x4 167G 80.8 82.8
UniFormer-S ImageNet-1K 16x1x4 16x8 167G 80.8 82.7
UniFormer-S ImageNet-1K 32x1x4 32x4 438G 82.0 -
UniFormer-B ImageNet-1K 16x1x4 16x4 387G 82.0 84.0
UniFormer-B ImageNet-1K 16x1x4 16x8 387G 81.7 83.4
UniFormer-B ImageNet-1K 32x1x4 32x4 1036G 82.9 84.5*

* Since Kinetics-600 is too large to train (>1 month in single node with 8 A100 GPUs), we provide model trained in multi node (around 2 weeks with 32 V100 GPUs), but the result is lower due to the lack of tuning hyperparameters.

Main results on Something-Something

Please see video_classification for more details.

Model Pretrain #Frame FLOPs SSV1 Top-1 SSV2 Top-1
UniFormer-S K400 16x3x1 125G 57.2 67.7
UniFormer-S K600 16x3x1 125G 57.6 69.4
UniFormer-S K400 32x3x1 329G 58.8 69.0
UniFormer-S K600 32x3x1 329G 59.9 70.4
UniFormer-B K400 16x3x1 290G 59.1 70.4
UniFormer-B K600 16x3x1 290G 58.8 70.2
UniFormer-B K400 32x3x1 777G 60.9 71.1
UniFormer-B K600 32x3x1 777G 61.0 71.2

Main results on downstream tasks

We have conducted extensive experiments on downstream tasks and achieved comparable results with SOTA models.

Code and models will be released in two weeks.

Cite Uniformer

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{li2022uniformer,
      title={Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning}, 
      author={Kunchang Li and Yali Wang and Peng Gao and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
      year={2022},
      eprint={2201.04676},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Contributors and Contact Information

UniFormer is maintained by Kunchang Li.

For help or issues using UniFormer, please submit a GitHub issue.

For other communications related to UniFormer, please contact Kunchang Li ([email protected]).

Owner
SenseTime X-Lab
Powered by X-Lab, SenseTime Research
SenseTime X-Lab
We will see a basic program that is basically a hint to brute force attack to crack passwords. In other words, we will make a program to Crack Any Password Using Python. Show some ❤️ by starring this repository!

Crack Any Password Using Python We will see a basic program that is basically a hint to brute force attack to crack passwords. In other words, we will

Ananya Chatterjee 11 Dec 03, 2022
Scalable Optical Flow-based Image Montaging and Alignment

SOFIMA SOFIMA (Scalable Optical Flow-based Image Montaging and Alignment) is a tool for stitching, aligning and warping large 2d, 3d and 4d microscopy

Google Research 16 Dec 21, 2022
Lowest memory consumption and second shortest runtime in NTIRE 2022 challenge on Efficient Super-Resolution

FMEN Lowest memory consumption and second shortest runtime in NTIRE 2022 on Efficient Super-Resolution. Our paper: Fast and Memory-Efficient Network T

33 Dec 01, 2022
DRIFT is a tool for Diachronic Analysis of Scientific Literature.

About DRIFT is a tool for Diachronic Analysis of Scientific Literature. The application offers user-friendly and customizable utilities for two modes:

Rajaswa Patil 108 Dec 12, 2022
Code for the Paper "Diffusion Models for Handwriting Generation"

Code for the Paper "Diffusion Models for Handwriting Generation"

62 Dec 21, 2022
Invert and perturb GAN images for test-time ensembling

GAN Ensembling Project Page | Paper | Bibtex Ensembling with Deep Generative Views. Lucy Chai, Jun-Yan Zhu, Eli Shechtman, Phillip Isola, Richard Zhan

Lucy Chai 93 Dec 08, 2022
[NeurIPS 2021] "Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks" by Yonggan Fu, Qixuan Yu, Yang Zhang, Shang Wu, Xu Ouyang, David Cox, Yingyan Lin

Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks Yonggan Fu, Qixuan Yu, Yang Zhang, S

12 Dec 11, 2022
Python package facilitating the use of Bayesian Deep Learning methods with Variational Inference for PyTorch

PyVarInf PyVarInf provides facilities to easily train your PyTorch neural network models using variational inference. Bayesian Deep Learning with Vari

342 Dec 02, 2022
A pytorch implementation of Paper "Improved Training of Wasserstein GANs"

WGAN-GP An pytorch implementation of Paper "Improved Training of Wasserstein GANs". Prerequisites Python, NumPy, SciPy, Matplotlib A recent NVIDIA GPU

Marvin Cao 1.4k Dec 14, 2022
Arch-Net: Model Distillation for Architecture Agnostic Model Deployment

Arch-Net: Model Distillation for Architecture Agnostic Model Deployment The official implementation of Arch-Net: Model Distillation for Architecture A

MEGVII Research 22 Jan 05, 2023
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

Phil Wang 180 Jan 05, 2023
Graph neural network message passing reframed as a Transformer with local attention

Adjacent Attention Network An implementation of a simple transformer that is equivalent to graph neural network where the message passing is done with

Phil Wang 49 Dec 28, 2022
Exemplo de implementação do padrão circuit breaker em python

fast-circuit-breaker Circuit breakers existem para permitir que uma parte do seu sistema falhe sem destruir todo seu ecossistema de serviços. Michael

James G Silva 17 Nov 10, 2022
Transformer in Vision

Transformer-in-Vision Recent Transformer-based CV and related works. Welcome to comment/contribute! Keep updated. Resource SCENIC: A JAX Library for C

Yong-Lu Li 1.1k Dec 30, 2022
Convert Table data to approximate values with GUI

Table_Editor Convert Table data to approximate values with GUIs... usage - Import methods for extension Tables. Imported method supposed to have only

CLJ 1 Jan 10, 2022
Source code for our EMNLP'21 paper 《Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning》

Child-Tuning Source code for EMNLP 2021 Long paper: Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning. 1. Environ

46 Dec 12, 2022
Visualizing lattice vibration information from phonon dispersion to atoms (For GPUMD)

Phonon-Vibration-Viewer (For GPUMD) Visualizing lattice vibration information from phonon dispersion for primitive atoms. In this tutorial, we will in

Liangting 6 Dec 10, 2022
Text completion with Hugging Face and TensorFlow.js running on Node.js

Katana ML Text Completion 🤗 Description Runs with with Hugging Face DistilBERT and TensorFlow.js on Node.js distilbert-model - converter from Hugging

Katana ML 2 Nov 04, 2022
E-RAFT: Dense Optical Flow from Event Cameras

E-RAFT: Dense Optical Flow from Event Cameras This is the code for the paper E-RAFT: Dense Optical Flow from Event Cameras by Mathias Gehrig, Mario Mi

Robotics and Perception Group 71 Dec 12, 2022
A list of multi-task learning papers and projects.

This page contains a list of papers on multi-task learning for computer vision. Please create a pull request if you wish to add anything. If you are interested, consider reading our recent survey pap

svandenh 297 Dec 17, 2022