AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Last update: Dec 26, 2022

Overview

AdaFocusV2

This repo contains the official code and pre-trained models for AdaFocusV2.

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Introduction

Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) achieves a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, of input diversity, and of training stability. Moreover, a conditional-exit technique is proposed to perform temporal adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train.
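To make the idea of interpolation-based patch selection more concrete, here is a minimal sketch in plain Python. It is not the code from this repo: the function names (bilinear_sample, crop_patch, patch_mean) and the toy 6x6 frame are assumptions made for illustration only. The point it tries to show is that when a patch is read out with bilinear interpolation at a real-valued center, the patch values change smoothly as the center moves, which is what allows a patch-location predictor to be trained by gradients instead of reinforcement learning.

```python
import math


def bilinear_sample(frame, y, x):
    """Read a 2D frame (list of rows) at a real-valued (y, x) location."""
    h, w = len(frame), len(frame[0])
    # Clamp the coordinates so all four neighbours lie inside the frame.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0  # fractional parts act as blending weights
    top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
    bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
    return (1 - wy) * top + wy * bottom


def crop_patch(frame, center_y, center_x, size):
    """Crop a size x size patch around a continuous (hypothetical) center."""
    offset = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, center_y - offset + i, center_x - offset + j)
         for j in range(size)]
        for i in range(size)
    ]


def patch_mean(patch):
    """Average pixel value, used here as a stand-in for a downstream loss."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Toy frame: a linear ramp, so pixel (r, c) has value 10 * r + c.
frame = [[10.0 * r + c for c in range(6)] for r in range(6)]

# Finite-difference estimate of how the patch responds to moving the center.
eps = 1e-3
grad_y = (patch_mean(crop_patch(frame, 2.5 + eps, 2.5, 2))
          - patch_mean(crop_patch(frame, 2.5 - eps, 2.5, 2))) / (2 * eps)
print(grad_y)
```

On this ramp the printed value should come out close to 10, the vertical slope of the frame, so the patch content carries a usable signal about where the center should move. A hard integer crop would give no such signal between pixel positions. In a deep learning framework the same effect is usually obtained with a built-in differentiable sampling operator and automatic differentiation rather than a hand-written loop like this one.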

Results

Compared with AdaFocusV1

ActivityNet, FCVID and Mini-Kinetics

Something-Something V1&V2 and Jester

Visualization

Get Started

Please see the folders "Experiments on ActivityNet, FCVID and Mini-Kinetics" and "Experiments on Sth-Sth and Jester" for dataset-specific documentation.

Contact

If you have any questions, feel free to contact the authors or raise an issue. Yulin Wang: [email protected].
