TCPNet - Temporal-attentive-Covariance-Pooling-Networks-for-Video-Recognition

Overview

This is an implementation of TCPNet.

[Figure: TCPNet architecture]

Introduction

For video recognition, a global representation that summarizes the whole content of a video snippet plays an important role in final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos. For image recognition, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, the plain covariance pooling used in image recognition is an orderless representation, which cannot model the spatio-temporal structure inherent in videos. Therefore, this paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, TCP first develops a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximately producing attentive covariance representations. Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit the geometry of covariance representations. Note that TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition. Extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show that TCPNet is clearly superior to its counterparts, while having strong generalization ability.
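To make the pooling idea concrete, here is a minimal NumPy sketch of covariance pooling followed by matrix square-root normalization computed with coupled Newton-Schulz iterations. It is an illustration under simplifying assumptions, not the code in this repository: the temporal attention module is omitted, a plain average over per-frame covariances stands in for the temporal covariance pooling, and all function names, shapes and the `eps` regularizer are hypothetical.

```python
import numpy as np


def covariance_pool(features):
    """Covariance of one frame; features has shape (N, C): N positions, C channels."""
    centered = features - features.mean(axis=0, keepdims=True)
    return centered.T @ centered / features.shape[0]


def newton_schulz_sqrt(cov, num_iters=5):
    """Approximate the square root of a symmetric positive definite matrix."""
    identity = np.eye(cov.shape[0])
    trace = np.trace(cov)
    y = cov / trace  # pre-normalize by the trace so the iteration converges
    z = identity.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t
        z = t @ z
    return y * np.sqrt(trace)  # undo the pre-normalization


def video_covariance_representation(video_features, num_iters=5, eps=1e-5):
    """video_features has shape (T, N, C): T frames, N positions, C channels."""
    frame_covs = np.stack([covariance_pool(frame) for frame in video_features])
    # Assumption: a plain temporal average replaces the attentive temporal pooling.
    pooled = frame_covs.mean(axis=0)
    pooled += eps * np.eye(pooled.shape[0])  # keep the matrix strictly positive definite
    return newton_schulz_sqrt(pooled, num_iters)


# Toy input: 8 frames, 7x7 spatial positions, 16 channels.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 49, 16))
# 20 iterations so the approximation has converged on this toy input.
rep = video_covariance_representation(video, num_iters=20)
print(rep.shape)
```

The result is a C x C matrix rather than the C-dimensional vector that GAP would produce, which is where the extra second-order information comes from.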

Citation

@InProceedings{Gao_2021_TCP,
  author = {Gao, Zilin and Wang, Qilong and Zhang, Bingbing and Hu, Qinghua and Li, Peihua},
  title = {Temporal-attentive Covariance Pooling Networks for Video Recognition},
  booktitle = {arXiv preprint arXiv:2021.06xxx},
  year = {2021}
}

Model Zoo

Kinetics-400

| Method | Backbone | Frames | 1 crop Acc (%) | 30 views Acc (%) | Model | Pretrained Model | Test log |
|--------|----------|--------|----------------|------------------|-------|------------------|----------|
| TCPNet | TSN R50 | 8f | 72.4/90.4 | 75.3/91.8 | K400_TCP_TSN_R50_8f | Img1K_R50_GCP | log |
| TCPNet | TEA R50 | 8f | 73.9/91.6 | 76.8/92.9 | K400_TCP_TEA_R50_8f | Img1K_Res2Net50_GCP | log |
| TCPNet | TSN R152 | 8f | 75.7/92.2 | 78.3/93.7 | K400_TCP_TSN_R152_8f | Img11K_1K_R152_GCP | log |
| TCPNet | TSN R50 | 16f | 73.9/91.2 | 75.8/92.1 | K400_TCP_TSN_R50_16f | Img1K_R50_GCP | log |
| TCPNet | TEA R50 | 16f | 75.3/92.2 | 77.2/93.1 | K400_TCP_TEA_R50_16f | Img1K_Res2Net50_GCP | log |
| TCPNet | TSN R152 | 16f | 77.2/93.1 | 79.3/94.0 | K400_TCP_TSN_R152_16f | Img11K_1K_R152_GCP | TODO |

Mini-Kinetics-200

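| Method | Backbone | Frames | 1 crop Acc (%) | 30 views Acc (%) | Model | Pretrained Model |
|--------|----------|--------|----------------|------------------|-------|------------------|
| TCPNet | TSN R50 | 8f | 78.7 | 80.7 | K200_TCP_TSN_8f | K400_TCP_TSN_R50_8f |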

Environments

  • PyTorch v1.0+ (for TCP_TSN); v1.0~1.4 (for TCP+TEA)
  • ffmpeg
  • graphviz: pip install graphviz
  • tensorboard: pip install tensorboardX
  • tqdm: pip install tqdm
  • scikit-learn: conda install scikit-learn
  • matplotlib: conda install -c conda-forge matplotlib
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
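The PyTorch version constraint above can be checked before training. The following is a small sketch, assuming that "v1.0~1.4" means every release from 1.0 through 1.4.x inclusive; the helper names `parse_version` and `torch_version_ok` are hypothetical and not part of this repository.

```python
def parse_version(version):
    """Turn a version string such as '1.4.0+cu101' into a (major, minor) tuple."""
    core = version.split("+")[0]  # drop a local build tag like '+cu101'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def torch_version_ok(version, backbone):
    """Check a PyTorch version string against the range listed for a backbone."""
    parsed = parse_version(version)
    if backbone == "TSN":
        return parsed >= (1, 0)  # v1.0+
    if backbone == "TEA":
        # Assumption: the upper bound 1.4 is inclusive.
        return (1, 0) <= parsed <= (1, 4)
    raise ValueError(f"unknown backbone: {backbone!r}")


print(torch_version_ok("1.4.0+cu101", "TEA"))
print(torch_version_ok("1.7.1", "TEA"))
```

In a real environment the first argument would typically be `torch.__version__`.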

Dataset Preparation

We provide a detailed dataset preparation guideline for Kinetics-400 and Mini-Kinetics-200. See Dataset preparation.

StartUp

  1. Download the pretrained model and put it in pretrained_models/.
  2. Run the training script, e.g.: sh script/K400/train_TCP_TSN_8f_R50.sh
  3. Run the inference script, e.g.: sh script/K400/test_TCP_TSN_R50_8f.sh
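The three steps above can be wrapped in a small pre-flight check so that a missing checkpoint or script is caught before a long run starts. This is a sketch only: `build_commands` is a hypothetical helper, and it assumes the scripts are launched with `sh` from the repository root as in the examples.

```python
from pathlib import Path


def build_commands(repo_root, train_script, test_script):
    """Verify the expected layout and return the shell commands to run, in order."""
    root = Path(repo_root)
    weights_dir = root / "pretrained_models"
    # Step 1: a pretrained model must already be in pretrained_models/.
    if not weights_dir.is_dir() or not any(weights_dir.iterdir()):
        raise FileNotFoundError(f"no pretrained model found in {weights_dir}")
    commands = []
    # Steps 2 and 3: training first, then inference.
    for script in (train_script, test_script):
        if not (root / script).is_file():
            raise FileNotFoundError(f"missing script: {root / script}")
        commands.append(["sh", script])
    return commands
```

Each returned command can then be passed to `subprocess.run(command, cwd=repo_root, check=True)`.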

TCP Code


├── ops
|    ├── TCP
|    |   ├── TCP_module.py
|    |   ├── TCP_att_module.py
|    |   ├── TSA.py
|    |   └── TCA.py
|    ├ ...
├ ...

Acknowledgement

  • We thank TSM for providing a well-designed 2D action recognition toolbox.
  • We also refer to some functions from iSQRT, TEA and Non-local.
  • The Mini-K200 dataset sampling strategy follows Mini_K200.
  • We would like to thank Facebook for developing the PyTorch toolbox.

Thanks for their work!

Owner
Zilin Gao