This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Last update: Dec 29, 2022

Overview

This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Usage example

python dynamic_inverted_softmax.py --sims_train_test_path msrvtt/tt-ce-train-captions-test-videos-seed0.pkl --sims_test_path msrvtt/tt-ce-test-captions-test-videos-seed0.pkl --test_query_masks_path msrvtt/tt-ce-test-query_masks.pkl

To test QB-Norm on your own data you need to:

Extract the similarity matrix between the captions from the training split and the videos from the testing split (path/to/sims/train/test).
Extract the testing split similarity matrix, i.e. the similarities between the testing captions and the testing videos (path/to/sims/test).
Run QB-Norm:

python dynamic_inverted_softmax.py --sims_train_test_path path/to/sims/train/test --sims_test_path path/to/sims/test
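The two matrices above are all the normalisation needs. The following is a minimal sketch of the idea behind a dynamic inverted softmax, not the code in dynamic_inverted_softmax.py: the function name, the plain-list inputs, the queries-by-gallery orientation of both matrices and the value of beta are assumptions made for illustration. Gallery items that are the top-1 match of some querybank (training) caption are treated as potential hubs, and only test queries whose top-1 match is such an item get their similarities re-normalised.

```python
import math


def dynamic_inverted_softmax(sims_bank, sims_test, beta=20.0):
    """Hypothetical helper illustrating querybank normalisation.

    sims_bank: rows are querybank (training) captions, columns are gallery items.
    sims_test: rows are test queries, columns are the same gallery items.
    beta: inverse temperature; the default here is an assumption.
    """
    num_gallery = len(sims_test[0])

    # Gallery items retrieved first by at least one querybank caption.
    activation_set = {
        max(range(num_gallery), key=row.__getitem__) for row in sims_bank
    }

    # Per-gallery-item normaliser: summed exponentiated querybank similarities.
    normaliser = [
        sum(math.exp(beta * row[j]) for row in sims_bank)
        for j in range(num_gallery)
    ]

    normalised = []
    for row in sims_test:
        top1 = max(range(num_gallery), key=row.__getitem__)
        if top1 in activation_set:
            # Top match is a potential hub: apply the inverted softmax.
            normalised.append(
                [math.exp(beta * s) / z for s, z in zip(row, normaliser)]
            )
        else:
            # Otherwise leave the similarities untouched.
            normalised.append(list(row))
    return normalised


# Toy data: gallery item 0 is a hub that every querybank caption retrieves first.
bank = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.9, 0.3, 0.2]]
test_sims = [[0.8, 0.7, 0.1], [0.1, 0.2, 0.9]]
normalised = dynamic_inverted_softmax(bank, test_sims)
best_for_query_0 = max(range(3), key=normalised[0].__getitem__)
print(best_for_query_0)  # prints 1: the hub no longer wins for query 0
```

In this toy example the second test query keeps its original similarities because its top match (item 2) is never retrieved first by the querybank.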

Data

The similarity matrices for each method were extracted using the official repositories: CE+, TT-CE+, CLIP2Video, CLIP4Clip, CLIP, MMT, Audio-Retrieval. For CLIP4Clip we used the official repo to train new models from scratch, since pre-trained weights are not provided.

You can download the extracted similarity matrices for training and testing here: MSRVTT, MSVD, DiDeMo, LSMDC.

Text-Video retrieval results

QB-Norm Results on MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	14.4 (0.1)	37.4 (0.1)	50.2 (0.1)	10.0 (0.0)	30.0 (0.1)
CE+ (+QB-Norm)	Full	t2v	16.4 (0.0)	40.3 (0.1)	52.9 (0.1)	9.0 (0.0)	32.7 (0.1)
TT-CE+	Full	t2v	14.9 (0.1)	38.3 (0.1)	51.5 (0.1)	10.0 (0.0)	30.9 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	17.3 (0.0)	42.1 (0.2)	54.9 (0.1)	8.0 (0.0)	34.2 (0.1)
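The metric columns can be reproduced from a similarity matrix along the following lines. This is a simplified sketch rather than the evaluation code used for these tables: the function name is hypothetical, each query is assumed to have exactly one correct gallery item, and ties are resolved in the query's favour. Geom is taken to be the geometric mean of the three recall values, which is consistent with the reported rows (for CE+, the cube root of 14.4 x 37.4 x 50.2 is about 30.0).

```python
import statistics


def retrieval_metrics(sims, gt_indices):
    """Hypothetical helper: recall at rank K, median rank and geometric mean.

    sims: rows are queries, columns are gallery items.
    gt_indices: index of the single correct gallery item for each query.
    """
    ranks = []
    for row, gt in zip(sims, gt_indices):
        # Rank of the correct item: 1 + number of items scoring strictly higher.
        ranks.append(1 + sum(score > row[gt] for score in row))

    num_queries = len(ranks)
    recall = {
        k: 100.0 * sum(rank <= k for rank in ranks) / num_queries
        for k in (1, 5, 10)
    }
    return {
        "R@1": recall[1],
        "R@5": recall[5],
        "R@10": recall[10],
        "MdR": statistics.median(ranks),
        "Geom": (recall[1] * recall[5] * recall[10]) ** (1.0 / 3.0),
    }


# Toy example: three queries, three gallery items, correct item i for query i.
toy_sims = [[0.9, 0.1, 0.2], [0.3, 0.2, 0.8], [0.1, 0.5, 0.4]]
metrics = retrieval_metrics(toy_sims, [0, 1, 2])
print(metrics)
```

Higher is better for R@K and Geom; lower is better for MdR.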

QB-Norm Results on MSVD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	25.4 (0.3)	56.9 (0.4)	71.3 (0.2)	4.0 (0.0)	46.9 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	26.6 (1.0)	58.6 (1.3)	71.8 (1.1)	4.0 (0.0)	48.2 (1.2)
CLIP2Video	Full	t2v	47.0	76.8	85.9	2.0	67.7
CLIP2Video (+QB-Norm)	Full	t2v	48.0	77.9	86.2	2.0	68.5

QB-Norm Results on DiDeMo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	21.6 (0.7)	48.6 (0.4)	62.9 (0.6)	6.0 (0.0)	40.4 (0.4)
TT-CE+ (+QB-Norm)	Full	t2v	24.2 (0.7)	50.8 (0.7)	64.4 (0.1)	5.3 (0.5)	43.0 (0.2)
CLIP4Clip	Full	t2v	43.0	70.5	80.0	2.0	62.4
CLIP4Clip (+QB-Norm)	Full	t2v	43.5	71.4	80.9	2.0	63.1

QB-Norm Results on LSMDC Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	17.2 (0.4)	36.5 (0.6)	46.3 (0.3)	13.7 (0.5)	30.7 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	17.8 (0.4)	37.7 (0.5)	47.6 (0.6)	12.7 (0.5)	31.7 (0.3)
CLIP4Clip	Full	t2v	21.3	40.0	49.5	11.0	34.8
CLIP4Clip (+QB-Norm)	Full	t2v	22.4	40.1	49.5	11.0	35.4

QB-Norm Results on VaTeX Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	53.2 (0.2)	87.4 (0.1)	93.3 (0.0)	1.0 (0.0)	75.7 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	54.8 (0.1)	88.2 (0.1)	93.8 (0.1)	1.0 (0.0)	76.8 (0.0)
CLIP2Video	Full	t2v	57.4	87.9	93.6	1.0	77.9
CLIP2Video (+QB-Norm)	Full	t2v	58.8	88.3	93.8	1.0	78.7

QB-Norm Results on QuerYD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	13.2 (2.0)	37.1 (2.9)	50.5 (1.9)	10.3 (1.2)	29.1 (2.2)
CE+ (+QB-Norm)	Full	t2v	14.1 (1.8)	38.6 (1.3)	51.1 (1.6)	10.0 (0.8)	30.2 (1.7)
TT-CE+	Full	t2v	14.4 (0.5)	37.7 (1.7)	50.9 (1.6)	9.8 (1.0)	30.3 (0.9)
TT-CE+ (+QB-Norm)	Full	t2v	15.1 (1.6)	38.3 (2.4)	51.2 (2.8)	10.3 (1.7)	30.9 (2.3)

Text-Image retrieval results

QB-Norm Results on MSCoCo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CLIP	5k	t2i	30.3	56.1	67.1	4.0	48.5
CLIP (+QB-Norm)	5k	t2i	34.8	59.9	70.4	3.0	52.8
MMT-Oscar	5k	t2i	52.2	80.2	88.0	1.0	71.7
MMT-Oscar (+QB-Norm)	5k	t2i	53.9	80.5	88.1	1.0	72.6

Text-Audio retrieval results

QB-Norm Results on AudioCaps Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
AR-CE	Full	t2a	23.1 (0.6)	55.1 (0.7)	70.7 (0.6)	4.7 (0.5)	44.8 (0.7)
AR-CE (+QB-Norm)	Full	t2a	23.9 (0.2)	57.1 (0.3)	71.6 (0.4)	4.0 (0.0)	46.0 (0.3)

References

If you find this code useful or use the extracted similarity matrices, please consider citing:

@misc{bogolin2021cross,
      title={Cross Modal Retrieval with Querybank Normalisation}, 
      author={Simion-Vlad Bogolin and Ioana Croitoru and Hailin Jin and Yang Liu and Samuel Albanie},
      year={2021},
      eprint={2112.12777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
