Implementation of "With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition" (BMVC 2021) in PyTorch

Overview

Multimodal Temporal Context Network (MTCN)

This repository implements the model proposed in the paper:

Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen, With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, BMVC, 2021

Project webpage

arXiv paper

Citing

When using this code, kindly reference:

@INPROCEEDINGS{kazakos2021MTCN,
  author={Kazakos, Evangelos and Huh, Jaesung and Nagrani, Arsha and Zisserman, Andrew and Damen, Dima},
  booktitle={British Machine Vision Conference (BMVC)},
  title={With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition},
  year={2021}}

NOTE

Although we train MTCN using visual SlowFast features extracted from a model trained with 2s video clips, in Table 3 of our paper and Table 1 of the Appendix (Table 6 in the arXiv version), where we compare MTCN with the SOTA, the SlowFast results are from [1], where the model is trained with 1s video clips. In the following table, we provide the results of SlowFast trained with 2s clips, for a direct comparison, as we use this model to extract the visual features.

[Table image: results of SlowFast trained with 2s clips, for direct comparison with [1]]

Requirements

The project's requirements can be installed in a separate conda environment by running the following command in your terminal: $ conda env create -f environment.yml.

Features

The extracted features for each dataset can be downloaded using the following links:

EPIC-KITCHENS-100:

EGTEA:
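
The training and test scripts below expect these features as HDF5 files (see the --train_hdf5_path / --test_hdf5_path arguments). As a quick sanity check after downloading, you can list the datasets contained in a feature file with h5py; this is only an illustrative sketch, and the key names and shapes printed are simply whatever the downloaded file contains:

import h5py

# Print every dataset stored in a downloaded feature file, with its shape and dtype.
with h5py.File('/path/to/epic-kitchens-100/features/audiovisual_slowfast_features_train.hdf5', 'r') as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)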

Pretrained models

We provide pretrained models for EPIC-KITCHENS-100:

  • Audio-visual transformer link
  • Language model link

Ground-truth

Train

EPIC-KITCHENS-100

To train the audio-visual transformer on EPIC-KITCHENS-100, run:

python train_av.py --dataset epic-100 --train_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_train.hdf5 
--val_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_val.hdf5 
--train_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_train.pkl 
--val_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl 
--batch-size 32 --lr 0.005 --optimizer sgd --epochs 100 --lr_steps 50 75 --output_dir /path/to/output_dir 
--num_layers 4 -j 8 --classification_mode all --seq_len 9

To train the language model on EPIC-KITCHENS-100, run:

python train_lm.py --dataset epic-100 --train_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_train.pkl 
--val_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl 
--verb_csv /path/to/epic-kitchens-100-annotations/EPIC_100_verb_classes.csv
--noun_csv /path/to/epic-kitchens-100-annotations/EPIC_100_noun_classes.csv
--batch-size 64 --lr 0.001 --optimizer adam --epochs 100 --lr_steps 50 75 --output_dir /path/to/output_dir 
--num_layers 4 -j 8 --num_gram 9 --dropout 0.1

EGTEA

To train the visual-only transformer on EGTEA (EGTEA does not have audio), run:

python train_av.py --dataset egtea --train_hdf5_path /path/to/egtea/features/visual_slowfast_features_train_split1.hdf5
--val_hdf5_path /path/to/egtea/features/visual_slowfast_features_test_split1.hdf5
--train_pickle /path/to/EGTEA_annotations/train_split1.pkl --val_pickle /path/to/EGTEA_annotations/test_split1.pkl 
--batch-size 32 --lr 0.001 --optimizer sgd --epochs 50 --lr_steps 25 38 --output_dir /path/to/output_dir 
--num_layers 4 -j 8 --classification_mode all --seq_len 9

To train the language model on EGTEA, run:

python train_lm.py --dataset egtea --train_pickle /path/to/EGTEA_annotations/train_split1.pkl
--val_pickle /path/to/EGTEA_annotations/test_split1.pkl 
--action_csv /path/to/EGTEA_annotations/actions_egtea.csv
--batch-size 64 --lr 0.001 --optimizer adam --epochs 50 --lr_steps 25 38 --output_dir /path/to/output_dir 
--num_layers 4 -j 8 --num_gram 9 --dropout 0.1

Test

EPIC-KITCHENS-100

To test the audio-visual transformer on EPIC-KITCHENS-100, run:

python test_av.py --dataset epic-100 --test_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_val.hdf5
--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl
--checkpoint /path/to/av_model/av_checkpoint.pyth --seq_len 9 --num_layers 4 --output_dir /path/to/output_dir
--split validation

To obtain scores of the model on the test set, use --test_hdf5_path /path/to/epic-kitchens-100/features/audiovisual_slowfast_features_test.hdf5, --test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_test_timestamps.pkl and --split test instead. Since the labels for the test set are not available, the script will simply save the scores without computing the accuracy of the model.

To evaluate your model on the validation set, follow the instructions in this link. In the same link, you can find instructions for preparing the scores of the model for submission to the evaluation server and obtaining results on the test set.

Finally, to filter out improbable sequences using LM, run:

python test_av_lm.py --dataset epic-100
--test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_validation.pkl 
--test_scores /path/to/audio-visual-results.pkl
--checkpoint /path/to/lm_model/lm_checkpoint.pyth
--num_gram 9 --split validation

Note that --test_scores /path/to/audio-visual-results.pkl points to the scores predicted by the audio-visual transformer. To obtain scores on the test set, use --test_pickle /path/to/epic-kitchens-100-annotations/EPIC_100_test_timestamps.pkl and --split test instead.
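
If you want to inspect the saved scores before running the language model, the results file is a standard Python pickle written by the test script. A minimal, illustrative sketch (no particular internal structure is assumed; the snippet only prints what the file actually contains):

import pickle

# Load the audio-visual scores and print their top-level structure.
with open('/path/to/audio-visual-results.pkl', 'rb') as f:
    results = pickle.load(f)

print(type(results))
if isinstance(results, dict):
    for key in results:
        print(key)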

Since we provide the trained models for EPIC-KITCHENS-100, av_checkpoint.pyth and lm_checkpoint.pyth in the test scripts above can be either the provided pretrained models or model_best.pyth, i.e. your own trained model.

EGTEA

To test the visual-only transformer on EGTEA, run:

python test_av.py --dataset egtea --test_hdf5_path /path/to/egtea/features/visual_slowfast_features_test_split1.hdf5
--test_pickle /path/to/EGTEA_annotations/test_split1.pkl
--checkpoint /path/to/v_model/model_best.pyth --seq_len 9 --num_layers 4 --output_dir /path/to/output_dir
--split test_split1

To filter out improbable sequences using LM, run:

python test_av_lm.py --dataset egtea
--test_pickle /path/to/EGTEA_annotations/test_split1.pkl 
--test_scores /path/to/visual-results.pkl
--checkpoint /path/to/lm_model/model_best.pyth
--num_gram 9 --split test_split1

In each case, you can extract attention weights by including --extract_attn_weights in the input arguments of the test script.
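
Once extracted, attention weights can be visualised as a heatmap over the temporal context window. The sketch below is purely illustrative: it uses a random placeholder array attn of shape (seq_len, seq_len) for a single example, since the exact file name and layout written by --extract_attn_weights are not described here; replace the placeholder with the weights the script actually saves.

import numpy as np
import matplotlib.pyplot as plt

seq_len = 9  # matches --seq_len used above
attn = np.random.rand(seq_len, seq_len)          # placeholder; load the extracted weights here
attn = attn / attn.sum(axis=-1, keepdims=True)   # normalise rows like softmax attention

plt.imshow(attn, cmap='viridis')
plt.colorbar(label='attention weight')
plt.xlabel('key position in temporal context')
plt.ylabel('query position in temporal context')
plt.title('Attention over the action sequence')
plt.savefig('attention_map.png', bbox_inches='tight')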

References

[1] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray, Rescaling Egocentric Vision: Collection Pipeline and Challenges for EPIC-KITCHENS-100, IJCV, 2021

License

The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.
