AdaFocus (ICCV 2021) Adaptive Focus for Efficient Video Recognition

Last update: Dec 21, 2022

This repo contains the official code and pre-trained models for AdaFocus, introduced in the paper "Adaptive Focus for Efficient Video Recognition".

Reference

If you find our code or paper useful for your research, please cite:

@InProceedings{Wang_2021_ICCV,
author = {Wang, Yulin and Chen, Zhaoxi and Jiang, Haojun and Song, Shiji and Han, Yizeng and Huang, Gao},
title = {Adaptive Focus for Efficient Video Recognition},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}

Introduction

In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency. It is observed that the most informative region in each frame of a video is usually a small image patch, which shifts smoothly across frames. Therefore, we model the patch localization problem as a sequential decision task, and propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus). Specifically, a lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions. The selected patches are then processed by a high-capacity network for the final prediction. During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices. In addition, we demonstrate that the proposed method can be easily extended by further considering the temporal redundancy, e.g., dynamically skipping less valuable frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, demonstrate that our method is significantly more efficient than the competitive baselines.
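To make the patch localization step more concrete, the sketch below maps a discrete action index to a crop box inside a frame. It is only an illustration under stated assumptions: the 7x7 grid of candidate locations is inferred from the action_dim=49 argument in the training commands, the frame and patch sizes come from glance_size=224 and patch_size=128, and the function name action_to_patch_box is hypothetical. The official code may define the mapping differently.

```python
def action_to_patch_box(action, frame_size=224, patch_size=128, grid=7):
    """Map a discrete action index to a (top, left, bottom, right) crop box.

    Assumption: actions index a grid x grid set of candidate patch
    locations spread evenly over the frame (7 x 7 = 49 by default).
    """
    if not 0 <= action < grid * grid:
        raise ValueError(f"action must be in [0, {grid * grid}), got {action}")

    # Row-major layout: action 0 is the top-left candidate.
    row, col = divmod(action, grid)

    # Spacing between neighbouring candidates, chosen so that the last
    # candidate ends exactly at the frame border.
    stride = (frame_size - patch_size) / (grid - 1)

    top = round(row * stride)
    left = round(col * stride)
    return top, left, top + patch_size, left + patch_size


if __name__ == "__main__":
    # Print the boxes for the first, centre, and last candidate locations.
    for action in (0, 24, 48):
        print(action, action_to_patch_box(action))
```

With these defaults the candidates are 16 pixels apart, so neighbouring actions select heavily overlapping patches, which fits the observation that the informative region shifts smoothly across frames.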

Result

ActivityNet

Something-Something V1&V2

Visualization

Requirements

python 3.8
pytorch 1.7.0
torchvision 0.8.0
hydra 1.1.0

Datasets

Please download the train/test split files for each dataset from Google Drive and put them in PATH_TO_DATASET.
Download the videos from the following links, or contact the corresponding authors for access. Save them to PATH_TO_DATASET/videos.

ActivityNet-v1.3
FCVID
Mini-Kinetics. Please download Kinetics 400; to obtain the Mini-Kinetics subset used in our paper, use the provided train/val split files.

Extract frames using ops/video_jpg.py; the frames will be saved to PATH_TO_DATASET/frames. Minor modifications to the file paths are needed when extracting frames from a different dataset.
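For readers who want to see what frame extraction involves, here is a minimal sketch of turning one video into a folder of JPEG frames with ffmpeg. It is not the contents of ops/video_jpg.py: the helper names, the one-folder-per-video layout, the image_%05d.jpg naming pattern, and the -q:v 2 quality setting are all assumptions made for illustration, and ffmpeg must be installed for extract_frames to run.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_root):
    """Return (output_dir, command) for extracting one video's frames.

    Assumption: frames for "name.mp4" go to "<frames_root>/name/".
    """
    video_path = Path(video_path)
    out_dir = Path(frames_root) / video_path.stem
    pattern = out_dir / "image_%05d.jpg"  # hypothetical naming scheme
    command = [
        "ffmpeg",
        "-i", str(video_path),  # input video
        "-q:v", "2",            # high JPEG quality (assumed setting)
        str(pattern),           # numbered output frames
    ]
    return out_dir, command


def extract_frames(video_path, frames_root):
    """Create the output folder and run ffmpeg for a single video."""
    out_dir, command = build_ffmpeg_command(video_path, frames_root)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)
    return out_dir


if __name__ == "__main__":
    # Show the command that would be run, without executing ffmpeg.
    _, cmd = build_ffmpeg_command(
        "PATH_TO_DATASET/videos/example.mp4", "PATH_TO_DATASET/frames"
    )
    print(" ".join(cmd))
```

Looping extract_frames over every file in PATH_TO_DATASET/videos would produce the frames directory the training commands expect, provided the naming matches what the data loader looks for.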

Pre-trained Models

Please download pretrained weights and checkpoints from Google Drive.

globalcnn.pth.tar: pretrained weights for global CNN (MobileNet-v2).
localcnn.pth.tar: pretrained weights for local CNN (ResNet-50).
128checkpoint.pth.tar: checkpoint of stage 1 for patch size 128x128.
160checkpoint.pth.tar: checkpoint of stage 1 for patch size 160x160.
192checkpoint.pth.tar: checkpoint of stage 1 for patch size 192x192.

Training

Here we take training a model with patch size 128x128 on the ActivityNet dataset as an example.
All logs and checkpoints will be saved in the directory: ./outputs/YYYY-MM-DD/HH-MM-SS
Note that we store a set of default parameters in conf/default.yaml, which can be overridden through the command line. You can also use your own config files.
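The idea behind command-line overrides is that each key=value argument replaces the matching default. The sketch below illustrates that merge in plain Python. It is not Hydra's parser, and the defaults dictionary is hypothetical rather than the real contents of conf/default.yaml.

```python
def coerce(raw):
    """Convert a raw string to bool, int, or float when it looks like one."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(defaults, overrides):
    """Return a copy of defaults with key=value overrides applied."""
    config = dict(defaults)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        config[key] = coerce(raw)
    return config


if __name__ == "__main__":
    # Hypothetical defaults, for illustration only.
    defaults = {"dataset": "fcvid", "patch_size": 96, "random_patch": False}
    args = ["dataset=actnet", "patch_size=128", "random_patch=true"]
    print(apply_overrides(defaults, args))
```

Any key not mentioned on the command line keeps its default, which is why the training commands only need to list the settings that matter for a given stage.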
Before training, please initialize Global CNN and Local CNN by fine-tuning the ImageNet pre-trained models in PyTorch using the following commands:

For Global CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=true

For Local CNN:

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=0 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.01 epochs=15 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrain_glancer=false

Training stage 1 (pretrained weights for Global CNN and Local CNN are required):

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=1 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 dist_url=tcp://127.0.0.1:8857 random_patch=true patch_size=128 glance_size=224 eval_freq=5 consensus=gru hidden_dim=1024 pretrained_glancer=PATH_TO_CHECKPOINTS pretrained_focuser=PATH_TO_CHECKPOINTS

Training stage 2 (a stage-1 checkpoint is required):

CUDA_VISIBLE_DEVICES=0 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=2 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.05 epochs=50 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false

Training stage 3 (a stage-2 checkpoint is required):

CUDA_VISIBLE_DEVICES=0,1 python main_dist.py dataset=actnet data_dir=PATH_TO_DATASET train_stage=3 batch_size=64 workers=8 dropout=0.8 lr_type=cos backbone_lr=0.0005 fc_lr=0.005 epochs=10 random_patch=false patch_size=128 glance_size=224 action_dim=49 eval_freq=5 consensus=gru hidden_dim=1024 resume=PATH_TO_CHECKPOINTS multiprocessing_distributed=false distributed=false
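The commands above share most of their arguments and differ in only a few. As a reading aid, the sketch below rebuilds them from one shared dictionary plus per-stage differences. The argument values are copied from the commands in this section; the helper itself (the names COMMON, STAGES, and build_command) is hypothetical and not part of the repository, and the argument order differs from the originals, which should not matter for key=value overrides.

```python
# Arguments that appear unchanged in every command of this section.
COMMON = {
    "dataset": "actnet",
    "data_dir": "PATH_TO_DATASET",
    "batch_size": "64",
    "workers": "8",
    "dropout": "0.8",
    "lr_type": "cos",
    "patch_size": "128",
    "glance_size": "224",
    "eval_freq": "5",
    "consensus": "gru",
    "hidden_dim": "1024",
}

DIST = {"dist_url": "tcp://127.0.0.1:8857"}
NO_DIST = {"multiprocessing_distributed": "false", "distributed": "false"}
CKPT = "PATH_TO_CHECKPOINTS"

# Per-stage differences, keyed by a short stage label.
STAGES = {
    "0-global": {"train_stage": "0", "backbone_lr": "0.01", "epochs": "15",
                 **DIST, "random_patch": "true", "pretrain_glancer": "true"},
    "0-local": {"train_stage": "0", "backbone_lr": "0.01", "epochs": "15",
                **DIST, "random_patch": "true", "pretrain_glancer": "false"},
    "1": {"train_stage": "1", "backbone_lr": "0.0005", "fc_lr": "0.05",
          "epochs": "50", **DIST, "random_patch": "true",
          "pretrained_glancer": CKPT, "pretrained_focuser": CKPT},
    "2": {"train_stage": "2", "backbone_lr": "0.0005", "fc_lr": "0.05",
          "epochs": "50", "random_patch": "false", "action_dim": "49",
          "resume": CKPT, **NO_DIST},
    "3": {"train_stage": "3", "backbone_lr": "0.0005", "fc_lr": "0.005",
          "epochs": "10", "random_patch": "false", "action_dim": "49",
          "resume": CKPT, **NO_DIST},
}


def build_command(stage, gpus="0,1"):
    """Assemble the shell command for one training stage."""
    args = {**COMMON, **STAGES[stage]}
    overrides = " ".join(f"{key}={value}" for key, value in args.items())
    return f"CUDA_VISIBLE_DEVICES={gpus} python main_dist.py {overrides}"


if __name__ == "__main__":
    # Stage 2 is the only command above that uses a single GPU.
    print(build_command("2", gpus="0"))
```

Laid out this way, the progression is easier to see: stages 0 and 1 train with random patches, while stages 2 and 3 switch to random_patch=false, add action_dim=49, and resume from the previous stage's checkpoint.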

Contact

If you have any questions, feel free to contact the authors or raise an issue. Yulin Wang: [email protected].

Acknowledgement

We use the implementations of MobileNet-v2 and ResNet from the PyTorch source code. We also borrow some code for dataset preparation from AR-Net and for PPO from here.

Owner

Rainforest Wang
