[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,
  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
  journal = {arXiv:2203.07111},
  title   = {Disentangled Representation Learning for Text-Video Retrieval},
  year    = {2022},
}

Catalog

  • Setup
  • Fine-tuning code
  • Visualization demo

Setup

Set up the code environment

git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
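
After installation it can be worth confirming that the pinned builds are the ones that actually ended up in the `drl` environment. The following is a minimal sketch, not part of the repository: `installed_version` and `check_pins` are hypothetical helper names, and the expected versions are simply copied from the `pip install` line above.

```python
from importlib import metadata

# Pins copied from the pip command above.
EXPECTED = {"torch": "1.8.1+cu102", "torchvision": "0.9.1+cu102"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED, lookup=installed_version):
    """Map each package whose version differs from its pin to (found, wanted)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (found, wanted)
    return mismatches
```

Calling `check_pins()` inside the activated environment returns an empty dict when both packages match their pins; any other result lists what was found next to what was expected.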

Download CLIP Model (as pretraining)

cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
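
Each checkpoint URL above contains a 64-character hexadecimal path component. The sketch below assumes, as the OpenAI CLIP loader does, that this component is the SHA-256 digest of the file, and uses it to detect a truncated or corrupted download. The function names are hypothetical and not part of this repository.

```python
import hashlib


def expected_sha256_from_url(url):
    """Return the second-to-last path component, assumed to be the SHA-256 digest."""
    return url.rstrip("/").split("/")[-2]


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large checkpoint is never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, url):
    """True when the file's digest equals the digest embedded in its download URL."""
    return sha256_of_file(path) == expected_sha256_from_url(url)
```

For example, `checkpoint_is_intact("ViT-B-32.pt", url)` with the `ViT-B-32.pt` URL above should return `True` for a complete download, provided the assumption about the URL holds.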

Download Datasets

# Run from the repository root (the previous step leaves you in tvr/models)
cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json
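
The `mv` commands above leave the videos under `data/MSR-VTT/videos` and the annotations at `data/MSR-VTT/anns/MSRVTT_data.json`, which are the locations the training command later passes as `--video_path` and `--anno_path`. A small sketch for checking that layout before launching a run; `missing_msrvtt_items` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path


def missing_msrvtt_items(root="data/MSR-VTT"):
    """List the expected dataset items that are absent under `root`."""
    root = Path(root)
    checks = [
        ("videos directory", (root / "videos").is_dir()),
        ("annotation file", (root / "anns" / "MSRVTT_data.json").is_file()),
    ]
    return [name for name, present in checks if not present]
```

Run from the repository root, `missing_msrvtt_items()` returns an empty list when both items are in place.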

Fine-tuning code

  • Train on MSR-VTT 1k.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experiments (scripts)

| configs | feature | gpus | Text-Video R@1 | Text-Video R@5 | Text-Video R@10 | Text-Video MdR | Text-Video MnR | Video-Text R@1 | Video-Text R@5 | Video-Text R@10 | Video-Text MdR | Video-Text MnR | train time (h) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip | ViT/B-32 | 4 | 42.8 | 72.1 | 81.4 | 2.0 | 16.3 | 44.1 | 70.5 | 80.5 | 2.0 | 11.8 | 10.5 |
| zero-shot | ViT/B-32 | 4 | 31.1 | 53.7 | 63.4 | 4.0 | 41.6 | 26.5 | 50.1 | 61.7 | 5.0 | 39.9 | - |
| Interaction | | | | | | | | | | | | | |
| DP+None | ViT/B-32 | 4 | 42.9 | 70.6 | 81.4 | 2.0 | 15.4 | 43.0 | 71.1 | 81.1 | 2.0 | 11.8 | 2.5 |
| DP+seqTransf | ViT/B-32 | 4 | 42.8 | 71.1 | 81.1 | 2.0 | 15.6 | 44.1 | 70.9 | 80.9 | 2.0 | 11.7 | 2.6 |
| XTI+None | ViT/B-32 | 4 | 40.5 | 71.1 | 82.6 | 2.0 | 13.6 | 42.7 | 70.8 | 80.2 | 2.0 | 12.5 | 14.3 |
| XTI+seqTransf | ViT/B-32 | 4 | 42.4 | 71.3 | 80.9 | 2.0 | 15.2 | 40.1 | 69.2 | 79.6 | 2.0 | 15.8 | 16.8 |
| TI+seqTransf | ViT/B-32 | 4 | 44.8 | 73.0 | 82.2 | 2.0 | 13.4 | 42.6 | 72.7 | 82.8 | 2.0 | 9.1 | 2.6 |
| WTI+seqTransf | ViT/B-32 | 4 | 46.6 | 73.4 | 83.5 | 2.0 | 13.0 | 45.4 | 73.4 | 81.9 | 2.0 | 9.2 | 2.6 |
| Channel DeCorrelation Regularization | | | | | | | | | | | | | |
| DP+seqTransf+CDCR | ViT/B-32 | 4 | 43.9 | 71.1 | 81.2 | 2.0 | 15.3 | 42.3 | 70.3 | 81.1 | 2.0 | 11.4 | 2.6 |
| TI+seqTransf+CDCR | ViT/B-32 | 4 | 45.8 | 73.0 | 81.9 | 2.0 | 12.8 | 43.3 | 71.8 | 82.7 | 2.0 | 8.9 | 2.6 |
| WTI+seqTransf+CDCR | ViT/B-32 | 4 | 47.6 | 73.4 | 83.3 | 2.0 | 12.8 | 45.1 | 72.9 | 83.5 | 2.0 | 9.2 | 2.6 |

Note: the results are slightly improved owing to new hyperparameters.
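
For readers unfamiliar with the retrieval metrics reported in the ablation results, the sketch below computes recall at K (R@K, in percent, higher is better), median rank (MdR) and mean rank (MnR, both lower is better) from a query-by-candidate similarity matrix. It uses the standard definitions and assumes one ground-truth candidate per query on the diagonal, with ties resolved in favour of the true match; it is an illustration only, not the evaluation code of this repository.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """sim[i][j] scores query i against candidate j; candidate i is the true match."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank 1 means no candidate scores strictly higher than the true match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Return R@K (percent of queries ranked within the top K), MdR and MnR."""
    ranks = ranks_from_similarity(sim)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = median(ranks)
    metrics["MnR"] = mean(ranks)
    return metrics


def transpose(sim):
    """Swap the roles of queries and candidates (text-video versus video-text)."""
    return [list(column) for column in zip(*sim)]
```

If the rows of `sim` are captions and the columns are videos, `retrieval_metrics(sim)` gives the Text-Video numbers and `retrieval_metrics(transpose(sim))` gives the Video-Text numbers.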

Visualization demo

Run our visualization demo using matplotlib (no GPU needed).

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.
