A Joint Video and Image Encoder for End-to-End Retrieval

Overview

Frozen️ in Time ❄️ ️️️️

A Joint Video and Image Encoder for End-to-End Retrieval

project page | arXiv | webvid-data alt text Repository containing the code, models, data for end-to-end retrieval. WebVid data can be found here


📝 Preparation

  1. Create conda env conda env create -f requirements/frozen.yml

  2. Create data / experiment folders mkdir data; mkdir exps, note this can just be a symlink to where you want to store big data.

🔧 Finetuning (benchmarks: MSR-VTT)

  1. wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data

  2. Change num_gpus in the config file accordingly.

  3. Train python train.py --config configs/msrvtt_4f_i21k.json

  4. Test python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth

For finetuning a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.

🏋 ️‍️ Pretraining

  1. Download WebVid-2M (see https://github.com/m-bain/webvid)

  2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)

  3. Train. python train.py --config CONFIG_PATH. Here are the different options:

    a. Dataset combinations

     i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
     ii. WebVid2M : configs/webvid2m-pt-i2k.json
    

    You can add in an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file dataloader list as your heart desires. Adding more datasets will likely to higher downstream performance.

    b. Number of frames

    For image datasets, this should always be set to video_params": {"num_frames": 1, ...}.

    For video datasets, set this to what you want. N.B. More frames requires = more gpu memory.

    If, like us, you are not a big company and have limited compute, then you will benefit by training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See curriculum learning section

    c. Finetuning

    Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames in pretraining achieves similar performance with significant reduction in compute (both memory and training time). This is because model has higher throughput for fewer frames, as well as allowing a bigger batch size for the same gpu memory.

Our best model was trained on 1-frame then finetuned on 4-frames on CC+WebVid2M.

Train on 1-frame until the training loss converges, then finetune on 4-frames with the same config, from the 1-frame checkpoint via setting load_checkpoint in config file. 4-frame finetuning needs much less iterations (~10% of 1-frame setting is sufficient) since most of the knowledge is learned in the 1-frame setting.

📈 Experiment Logging and Visualising

This repository uses a sacred backbone for logging and tracking experiments, with a neptune front end. It makes life a lot easier. If you want to activate this:

  1. Create a neptune.ai account.
  2. Create a project, copy in your credentials in train.py and remove the ValueError
  3. Set neptune: true in your config files.

🎓 Cite

If you use this code in your research, please cite:

@misc{bain2021frozen,
      title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval}, 
      author={Max Bain and Arsha Nagrani and Gül Varol and Andrew Zisserman},
      year={2021},
      eprint={2104.00650},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

This code is based off the pytorch-template https://github.com/victoresque/pytorch-template

As well as many good practices adopted from Samuel Albanie's https://github.com/albanie/collaborative-experts

Owner
PhD Student, VGG, Oxford
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 36 Oct 31, 2022
HugsVision is a easy to use huggingface wrapper for state-of-the-art computer vision

HugsVision is an open-source and easy to use all-in-one huggingface wrapper for computer vision. The goal is to create a fast, flexible and user-frien

Labrak Yanis 166 Nov 27, 2022
Easy-to-use library to boost AI inference leveraging state-of-the-art optimization techniques.

NEW RELEASE How Nebullvm Works • Tutorials • Benchmarks • Installation • Get Started • Optimization Examples Discord | Website | LinkedIn | Twitter Ne

Nebuly 1.7k Dec 31, 2022
High-resolution networks and Segmentation Transformer for Semantic Segmentation

High-resolution networks and Segmentation Transformer for Semantic Segmentation Branches This is the implementation for HRNet + OCR. The PyTroch 1.1 v

HRNet 2.8k Jan 07, 2023
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction.

TalkNet 2 [WIP] TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Predictio

Rishikesh (ऋषिकेश) 69 Dec 17, 2022
[Nature Machine Intelligence' 21] "Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence"

[UCADI] COVID-19 Diagnosis With Federated Learning Intro We developed a Federated Learning (FL) Framework for global researchers to collaboratively tr

HUST EIC AI-LAB 30 Dec 12, 2022
Network Enhancement implementation in pytorch

network_enahncement_pytorch Network Enhancement implementation in pytorch Research paper Network Enhancement: a general method to denoise weighted bio

Yen 1 Nov 12, 2021
GazeScroller - Using Facial Movements to perform Hands-free Gesture on the system

GazeScroller Using Facial Movements to perform Hands-free Gesture on the system

2 Jan 05, 2022
Object detection GUI based on PaddleDetection

PP-Tracking GUI界面测试版 本项目是基于飞桨开源的实时跟踪系统PP-Tracking开发的可视化界面 在PaddlePaddle中加入pyqt进行GUI页面研发,可使得整个训练过程可视化,并通过GUI界面进行调参,模型预测,视频输出等,通过多种类型的识别,简化整体预测流程。 GUI界面

杨毓栋 68 Jan 02, 2023
Real-time analysis of intracranial neurophysiology recordings.

py_neuromodulation Click this button to run the "Tutorial ML with py_neuro" notebooks: The py_neuromodulation toolbox allows for real time capable pro

Interventional Cognitive Neuromodulation - Neumann Lab Berlin 15 Nov 03, 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code

Learning the Beauty in Songs: Neural Singing Voice Beautifier Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao Zhejiang University ACL 2022 Mai

Jinglin Liu 257 Dec 30, 2022
A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking

PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking PoseRBPF Paper Self-supervision Paper Pose Estimation Video Robot Manipulati

NVIDIA Research Projects 107 Dec 25, 2022
SIMULEVAL A General Evaluation Toolkit for Simultaneous Translation

SimulEval SimulEval is a general evaluation framework for simultaneous translation on text and speech. Requirement python = 3.7.0 Installation git cl

Facebook Research 48 Dec 28, 2022
PyG (PyTorch Geometric) - A library built upon PyTorch to easily write and train Graph Neural Networks (GNNs)

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

PyG 16.5k Jan 08, 2023
The 2nd Version Of Slothybot

SlothyBot Go to this website: "https://bitly.com/SlothyBot" The 2nd Version Of Slothybot. The Bot Has Many Features, Such As: Moderation Commands; Kic

Slothy 0 Jun 01, 2022
Efficient Two-Step Networks for Temporal Action Segmentation (Neurocomputing 2021)

Efficient Two-Step Networks for Temporal Action Segmentation This repository provides a PyTorch implementation of the paper Efficient Two-Step Network

8 Apr 16, 2022
code for paper"A High-precision Semantic Segmentation Method Combining Adversarial Learning and Attention Mechanism"

PyTorch implementation of UAGAN(U-net Attention Generative Adversarial Networks) This repository contains the source code for the paper "A High-precis

Tong 8 Apr 25, 2022
Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation

OSCAR Project Page | Paper This repository contains the codebase used in OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Ma

NVIDIA Research Projects 74 Dec 22, 2022
Automatic caption evaluation metric based on typicality analysis.

SeMantic and linguistic UndeRstanding Fusion (SMURF) Automatic caption evaluation metric described in the paper "SMURF: SeMantic and linguistic UndeRs

Joshua Feinglass 6 Jan 09, 2022
Starter Code for VALUE benchmark

StarterCode for VALUE Benchmark This is the starter code for VALUE Benchmark [website], [paper]. This repository currently supports all baseline model

VALUE Benchmark 73 Dec 09, 2022