We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and StiCa

This is the official PyTorch implementation of the papers Multi-modal Self-Supervision from Generalized Data Transformations and Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning. In this repository, we provide PyTorch code for pretraining and testing our proposed GDT and StiCa models.

If you find GDT and StiCa useful in your research, please cite the papers using the following BibTeX entries.

@inproceedings{patrick2020multimodal,
      title={Multi-modal Self-Supervision from Generalized Data Transformations},
      author={Mandela Patrick and Yuki M. Asano and Polina Kuznetsova and Ruth Fong and João F. Henriques and Geoffrey Zweig and Andrea Vedaldi},
      booktitle={International Conference on Computer Vision (ICCV)},
      year={2021},
}

@inproceedings{m2021spacetime,
    title={Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning},
    author={Mandela Patrick and Yuki M. Asano and Bernie Huang and Ishan Misra and Florian Metze and Joao Henriques and Andrea Vedaldi},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2021},
}

Highlights

(1) GDT: Formulating and generalizing most pretext tasks under a single NCE objective.

Using this formulation, we test several previously unexplored pretext tasks and achieve state-of-the-art (SOTA) downstream performance. A schematic of such a cross-modal NCE loss is sketched below.
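For intuition only, here is a minimal sketch of a cross-modal InfoNCE-style objective of the kind GDT builds on. It is an illustrative simplification, not the repository's actual loss implementation; the function name, tensor shapes, and temperature value are assumptions.

import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    # video_emb, audio_emb: (N, D) embeddings of the same N clips from two modalities
    video_emb = F.normalize(video_emb, dim=1)
    audio_emb = F.normalize(audio_emb, dim=1)
    logits = video_emb @ audio_emb.t() / temperature  # (N, N) pairwise similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # The i-th video should match the i-th audio; all other pairs act as negatives.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)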

(2) STiCA: Importance of incorporating within-modal invariance in cross-modal learning

We show how to efficiently incorporate within-modal invariance learning using feature crops and achieve SOTA downstream performance; a minimal sketch of such a within-modal term follows.
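The sketch below illustrates the idea of enforcing agreement between two random crops taken in feature space for the same clip. It is an assumption-laden illustration (function names, crop size, and the InfoNCE-style agreement term are ours), not the repository's implementation.

import torch
import torch.nn.functional as F

def random_feature_crop(feat, size=4):
    # feat: (N, C, T, H, W) video feature map; take a random spatial crop in feature space
    n, c, t, h, w = feat.shape
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    crop = feat[:, :, :, y:y + size, x:x + size]
    return F.adaptive_avg_pool3d(crop, 1).flatten(1)  # (N, C) pooled embedding

def within_modal_loss(feat, temperature=0.07):
    # Two crops of the same feature map should agree (same-clip positives, in-batch negatives).
    z1 = F.normalize(random_feature_crop(feat), dim=1)
    z2 = F.normalize(random_feature_crop(feat), dim=1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(feat.size(0), device=feat.device)
    return F.cross_entropy(logits, targets)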

Model Zoo

We provide GDT models pretrained on Kinetics-400 (K400), HowTo100M (HT100M), and Instagram-65M (IG65M) datasets, and StiCa models pretrained on Kinetics-400 (K400).

GDT models:

name  | dataset | # of frames | spatial crop | HMDB51 Top1 | UCF101 Top1 | url
GDT   | K400    | 30          | 112          | 62.3        | 90.9        | model
GDT   | HT100M  | 30          | 112          | 67.4        | 94.1        | model
GDT   | IG65M   | 30          | 112          | 72.8        | 95.2        | model

StiCa models:

name  | dataset | # of frames | spatial crop | HMDB51 Top1 | UCF101 Top1 | url
StiCa | K400    | 60          | 112          | 67.0        | 93.1        | Coming Soon
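Once downloaded, a checkpoint can be inspected with standard PyTorch utilities; the file name below is a placeholder for whichever model you downloaded, and the stored keys depend on the checkpoint.

import torch

state = torch.load("gdt_K400.pth", map_location="cpu")  # placeholder file name
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:5])  # e.g. model / optimizer / epoch entries, depending on the file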

Installation

This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0.

Step 1

  • Clone this repo to your local machine

Step 2

  • Install required packages using conda env create -f environment.yml

Step 3

  • Activate conda environment using conda activate GDT

Step 4

  • Install the kornia library: pip install kornia==0.1.4

Step 5

  • See below for how to pretrain GDT / StiCa or benchmark pretrained models

Data Preparation

For Kinetics-400/600, HMDB-51 and UCF-101 datasets:

  1. Ensure all datasets follow the layout below (see the example tree after this list):
     $ROOT_DIR/$SPLIT/$CLASS/*
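For example, a Kinetics-400 layout could look like the following (class and file names are purely illustrative):

$ROOT_DIR/
  train/
    abseiling/
      video_0001.mp4
      video_0002.mp4
    zumba/
      video_0003.mp4
  val/
    abseiling/
      video_0004.mp4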
    

To prepare How-To-100M dataset, do the following:

  1. Download the word2vec matrix and dictionary, unzip the file, and place it in the datasets/data folder:
     wget https://www.rocq.inria.fr/cluster-willow/amiech/word2vec.zip
     unzip word2vec.zip
     mv word2vec.pth datasets/data/word2vec.pth
  2. Download the csv files of captions:
     wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/howto100m_captions.zip
     unzip howto100m_captions.zip
  3. Download the preprocessed HowTo100M videos (12 TB in total) by filling in this Google form: https://forms.gle/hztrfnFQUJWBtiki8.
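As an optional sanity check for steps 1 and 2 above, you can verify that the downloaded files are readable. The caption directory name is an assumption about where the zip was extracted; adjust it to your setup.

import glob
import torch

emb = torch.load("datasets/data/word2vec.pth", map_location="cpu")  # word2vec matrix/dictionary
print(type(emb))
print(len(glob.glob("howto100m_captions/*.csv")))  # assumed extraction directory for the caption csvs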

Usage

GDT pretraining

To pretrain audio-visual GDT on K-400

Multi-node distributed training with SLURM cluster:

sbatch pretraining_scripts/pretrain_gdt_k400.sh ${HYPOTHESIS_DESC} ${HYPOTHESIS} 

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_gdt.py --batch_size $BS --lr $LR --hypothesis {1,2,3,4,5,6,7,8,9}
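For example, a single-node run on two GPUs with the cross-modal baseline (hypothesis 1) might look like the line below; the batch size and learning rate are illustrative values, not recommended settings, and should be tuned for your hardware.

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_gdt.py --batch_size 16 --lr 0.01 --hypothesis 1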

To pretrain video-text GDT on HT100M

Multi-node training with SLURM cluster:

sbatch pretraining_scripts/pretrain_gdt_ht100m.sh ${HYPOTHESIS_DESC} ${HYPOTHESIS} 

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_gdt.py --batch_size $BS --lr $LR --hypothesis {1,2,3,4,5,6,7,8,9} --dataset ht100m --decode_audio False --model vid_text_gdt --sample_rate 2

$HYPOTHESIS refers to the hypotheses explored in GDT. We experiment with the following:

1 - cross-modal baseline (cross_modal_baseline)
2 - variant to time reversal (v_reversal)
3 - invariant to time reversal (i_reversal)
4 - variant to time shift (v_shift)
5 - invariant to time shift (i_shift)
6 - variant to time reversal, variant to time shift (v_reversal_v_shift)
7 - invariant to time reversal, variant to time shift (i_reversal_v_shift)
8 - variant to time reversal, invariant to time shift (v_reversal_i_shift)
9 - invariant to time reversal, invariant to time shift (i_reversal_i_shift)

Please modify the following in SLURM script:

  • SBATCH directives (e.g. partition, nodes, constraint)
  • SAV_FOLDER
  • --root_dir (path of K-400 / HT100M train directory)

All experiments were run on 8 nodes (64 GPUs, volta32). Please scale the batch size and learning rate appropriately for your setup; a common scaling heuristic is sketched below.
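One widely used convention, offered here as an assumption rather than a rule stated by the authors, is linear scaling of the learning rate with the global batch size:

# Linear scaling heuristic: keep lr / global_batch_size roughly constant.
base_lr, base_global_batch = 0.01, 512          # reference values (assumed, not the paper's)
ngpus, per_gpu_batch = 8, 16                    # your setup
lr = base_lr * (ngpus * per_gpu_batch) / base_global_batch
print(f"scaled lr: {lr:.4f}")                   # 0.0025 for 8 GPUs x 16 clips per GPU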

STiCA pretraining

To pretrain audio-visual STiCA on K-400

Multi-node training with SLURM cluster:

sbatch scripts/pretrain_stica.sh $NUM_FRAMES $AUD_NUM_SEC $NUM_LARGE_CROPS $NUM_SMALL_CROPS $NUM_SMALL_TCROPS $NUM_LARGE_TCROPS $NUM_LAYER

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_stica.py --batch_size $BS --base_lr $LR

Hyper-parameters:

NUM_FRAMES - number of video frames (e.g. 30)
AUD_NUM_SEC - number of seconds of audio (30 frames: 1 s, 60 frames: 2 s)
NUM_LARGE_CROPS - number of large feature spatial crops (e.g. 2)
NUM_SMALL_CROPS - number of small feature spatial crops (e.g. 4)
NUM_SMALL_TCROPS - number of small feature temporal crops (e.g. 1)
NUM_LARGE_TCROPS - number of large feature temporal crops (e.g. 2)
NUM_LAYER - number of transformer pooling layers (0 = global average pooling, >= 1 = number of transformer layers)
e.g. sbatch scripts/pretrain_stica.sh 30 1 2 4 1 2 0

Please modify the following in SLURM script:

  • SBATCH directives (e.g. partition, nodes, constraint)
  • SAV_FOLDER
  • --root_dir (path of K-400 / HT100M train directory)

All experiments were run on 8 nodes (64 GPUs, volta32). Please scale the batch size and learning rate appropriately, as noted in the GDT pretraining section above.

Benchmarking

To evaluate the pretrained models on video action recognition using the UCF-101 and HMDB-51 datasets:

Locally:

python3 eval_video.py --dataset {ucf101, hmdb51} --fold {1,2,3} --weights-path {WEIGHTS_PATH} --model ${vid_text_gdt, stica, av_gdt}

On SLURM:

bash scripts/eval.sh ${WEIGHTS_PATH} ${OUTPUT_DIR} ${CKPT_NUM} ${CLIP_LEN} ${vid_text_gdt, stica, av_gdt} ${1, 2, 3}

Modify --root_dir, --ucf101-annotation-path, and --hmdb51-annotation-path in eval_video.py.
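For example, to evaluate a downloaded GDT checkpoint locally on split 1 of UCF-101 (the weights path is a placeholder for wherever you saved the checkpoint):

python3 eval_video.py --dataset ucf101 --fold 1 --weights-path checkpoints/gdt_K400.pth --model av_gdt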

License

The majority of this work is licensed under the CC BY-NC 4.0 International license.

Contributing

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.
