HCQ: Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval

1. Introduction

This repository provides the code for our paper at TheWebConf 2022:

Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval. Jinpeng Wang, Bin Chen, Dongliang Liao, Ziyun Zeng, Gongfu Li, Shu-Tao Xia, Jin Xu. [arXiv].

Our proposed Hybrid Contrastive Quantization (HCQ) is the first quantization learning method for cross-view (e.g., text-to-video) retrieval. It learns both coarse-grained and fine-grained quantizations with transformers. Experiments on the MSRVTT, LSMDC and ActivityNet Captions datasets demonstrate that it achieves performance competitive with state-of-the-art non-compressed retrieval methods while being highly efficient in storage and computation.

In the following, we will guide you through using this repository step by step. 🤗

2. Preparation

git clone https://github.com/gimpong/WWW22-HCQ.git

2.1 Requirements

  • python 3.7.4
  • gensim 4.1.2
  • h5py 3.6.0
  • numpy 1.17.3
  • pandas 1.2.3
  • pytorch-warmup 0.0.4
  • scikit-learn 0.23.0
  • scipy 1.6.1
  • tensorboardX 2.4.1
  • torch 1.6.0+cu101
  • transformers 3.1.0
cd WWW22-HCQ
# Install the requirements
pip install -r requirements.txt
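
After installation, it can help to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch, not part of the repository: the helper names (`parse_freeze`, `find_mismatches`, `installed_versions`) are hypothetical, and it assumes `pip freeze` reports packages in the usual `name==version` form.

```python
import subprocess
import sys

# Pinned versions copied from the requirements list above.
PINNED = {
    "gensim": "4.1.2",
    "h5py": "3.6.0",
    "numpy": "1.17.3",
    "pandas": "1.2.3",
    "pytorch-warmup": "0.0.4",
    "scikit-learn": "0.23.0",
    "scipy": "1.6.1",
    "tensorboardX": "2.4.1",
    "torch": "1.6.0+cu101",
    "transformers": "3.1.0",
}


def normalize(name):
    """Compare package names case-insensitively, treating '_' and '-' alike."""
    return name.lower().replace("_", "-")


def parse_freeze(text):
    """Turn `pip freeze` output into a {normalized name: version} dict."""
    installed = {}
    for line in text.splitlines():
        if "==" in line:
            name, version = line.strip().split("==", 1)
            installed[normalize(name)] = version
    return installed


def find_mismatches(pinned, installed):
    """Return {name: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(normalize(name))
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_versions():
    """Query the current interpreter's packages through `pip freeze`."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_freeze(result.stdout)
```

Calling `find_mismatches(PINNED, installed_versions())` returns an empty dict when every pinned package is present at the listed version.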

We conduct each training experiment on a single NVIDIA® Tesla® V100 GPU (32 GB).

2.2 Download the features

Before running the code, we need to download the datasets and arrange them in the "data" directory properly. We use the video features provided by the authors of MMT. These features can be downloaded from this page by running the following commands:

# Create and move to WWW22-HCQ/data directory
mkdir -p data && cd data
# Download the video features
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
# Extract the video features
tar -xvf MSRVTT.tar.gz
tar -xvf activity-net.tar.gz
tar -xvf LSMDC.tar.gz
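
On systems without `wget` or `tar`, the same download and extraction steps can be sketched with the Python standard library. This is an illustrative alternative rather than a script shipped with the repository: the function names are hypothetical, and the URLs are the ones used in the commands above.

```python
import os
import tarfile
import urllib.request

# Base URL and archive names taken from the wget commands above.
BASE_URL = "http://pascal.inrialpes.fr/data2/vgabeur/video-features"
ARCHIVES = ["MSRVTT.tar.gz", "activity-net.tar.gz", "LSMDC.tar.gz"]


def archive_url(name):
    """Build the download URL for one feature archive."""
    return "{}/{}".format(BASE_URL, name)


def download_and_extract(name, data_dir="data"):
    """Download one archive into data_dir (if missing) and unpack it there."""
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(archive_url(name), target)
    # "r:*" lets tarfile detect the compression, like `tar -xvf` does.
    with tarfile.open(target, "r:*") as archive:
        archive.extractall(data_dir)
    return target


# Usage (run from WWW22-HCQ/; the archives are large, so this takes a while):
# for name in ARCHIVES:
#     download_and_extract(name)
```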

3. Training and Evaluation

3.1 Training from scratch

Let us take "training HCQ on MSRVTT dataset ('1k-A' split)" as an example:

# working directory: WWW22-HCQ/
python -m train --config configs/HCQ_MSRVTT_1kA.json

Expected results:

MSRVTT_jsfusion_test:
 t2v_metrics/R1/final_eval: 25.9
 t2v_metrics/R5/final_eval: 54.8
 t2v_metrics/R10/final_eval: 69.0
 t2v_metrics/R50/final_eval: 88.8
 t2v_metrics/MedR/final_eval: 5.0
 t2v_metrics/MeanR/final_eval: 28.062
 t2v_metrics/geometric_mean_R1-R5-R10/final_eval: 46.09386629981193
 v2t_metrics/R1/final_eval: 26.3
 v2t_metrics/R5/final_eval: 57.0
 v2t_metrics/R10/final_eval: 70.1
 v2t_metrics/R50/final_eval: 90.0
 v2t_metrics/MedR/final_eval: 4.0
 v2t_metrics/MeanR/final_eval: 25.1535
 v2t_metrics/geometric_mean_R1-R5-R10/final_eval: 47.18995255588879
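
For readers unfamiliar with the logged metrics, the sketch below shows how such values are commonly derived from the rank of the ground-truth item for each query. It is an illustration under stated assumptions, not the repository's evaluation code: the function names are hypothetical, ranks are assumed to be 1-based, and recall is expressed as a percentage.

```python
import statistics


def geometric_mean(r1, r5, r10):
    """Geometric mean of R@1, R@5 and R@10 (cube root of their product)."""
    return (r1 * r5 * r10) ** (1.0 / 3.0)


def retrieval_metrics(ranks):
    """Compute recall@K, median rank and mean rank from 1-based ranks.

    ranks[i] is the position of the correct item in the sorted result
    list of query i (1 means it was retrieved first).
    """
    n = len(ranks)
    metrics = {}
    for k in (1, 5, 10, 50):
        # Percentage of queries whose correct item appears in the top k.
        metrics["R%d" % k] = 100.0 * sum(r <= k for r in ranks) / n
    metrics["MedR"] = float(statistics.median(ranks))
    metrics["MeanR"] = sum(ranks) / n
    metrics["geometric_mean_R1-R5-R10"] = geometric_mean(
        metrics["R1"], metrics["R5"], metrics["R10"]
    )
    return metrics


# Toy example with five queries.
print(retrieval_metrics([1, 3, 7, 20, 60]))
# Cube root of the logged text-to-video recalls above.
print(geometric_mean(25.9, 54.8, 69.0))
```

Applying `geometric_mean` to the logged text-to-video recalls (25.9, 54.8, 69.0) reproduces the logged geometric mean of about 46.09, which suggests the summary value is the cube root of the product of the three recalls.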

After training, a folder with the same name as the configuration json file (e.g., "HCQ_MSRVTT_1kA") will be generated under WWW22-HCQ/exps/, which contains the model checkpoints, logs, tensorboard files, and so on.

For reproducing other experiments, please see the following tables. You can just replace the config json path with another in the training command.
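
To run several experiments in a row, the training command can be generated per config. The following is a small sketch, assuming it is executed from the working directory WWW22-HCQ/; the helper names are hypothetical, and the config names are the main HCQ configs listed in the tables of this section.

```python
import subprocess

# Main HCQ configs, named as in the tables of this section.
CONFIGS = [
    "HCQ_MSRVTT_1kA",
    "HCQ_MSRVTT_1kB",
    "HCQ_MSRVTT_full",
    "HCQ_LSMDC",
    "HCQ_ActivityNet",
]


def build_train_command(config_name, python="python"):
    """Return the training command for one config as an argument list."""
    return [python, "-m", "train", "--config", "configs/{}.json".format(config_name)]


def run_all(config_names, dry_run=True):
    """Print each training command; launch it as well when dry_run is False."""
    commands = [build_train_command(name) for name in config_names]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Runs one experiment after another; stops on the first failure.
            subprocess.run(cmd, check=True)
    return commands


# Preview the commands without starting any training.
run_all(CONFIGS, dry_run=True)
```

Keeping `dry_run=True` by default makes it easy to check the generated commands before committing GPU time.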

3.1.1 Main results of HCQ (reported in Table 1-3 in our paper)

Model Dataset (+split) Config json Log Text-to-Video Retrieval Video-to-Text Retrieval
R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10} R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10}
HCQ MSRVTT (1k-A) HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt  25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
MSRVTT (1k-B) HCQ_MSRVTT_1kB.json HCQ_MSRVTT_1kB.txt  22.50 51.50 65.90 86.10 5 33.65 42.43 23.70 52.20 66.90 88.10 5 29.30 43.58
MSRVTT (Full) HCQ_MSRVTT_full.json HCQ_MSRVTT_full.txt  15.15 38.53 51.00 81.34 10 46.22 30.99 18.26 44.88 59.06 87.16 7 30.96 36.45
LSMDC HCQ_LSMDC.json HCQ_LSMDC.txt  14.50 33.60 43.10 68.20 18.5 75.95 27.59 13.70 33.20 42.80 66.10 17 74.28 26.90
ActivityNet Captions HCQ_ActivityNet.json HCQ_ActivityNet.txt  22.19 53.69 70.12 91.21 5 30.71 43.72 23.00 54.85 70.14 91.38 5 29.08 44.56

3.1.2 Results of Hybrid Contrastive Transformer (HCT), Dual Transformer (DT) + DCMH, and DT + JPQ (reported in Table 4 in our paper)

Model Dataset (+split) Config json Log Text-to-Video Retrieval Video-to-Text Retrieval
R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10} R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10}
HCT MSRVTT (1k-A) HCT_MSRVTT_1kA.json HCT_MSRVTT_1kA.txt 27.80 58.00 70.00 89.50 4 26.79 48.33 27.30 57.80 72.10 90.60 4 24.38 48.46
MSRVTT (1k-B) HCT_MSRVTT_1kB.json HCT_MSRVTT_1kB.txt 25.70 53.70 67.30 88.30 5 31.09 45.29 24.70 55.50 68.70 88.80 4 25.54 45.50
MSRVTT (Full) HCT_MSRVTT_full.json HCT_MSRVTT_full.txt 16.76 41.87 55.79 82.44 8 44.33 33.95 21.64 50.57 63.88 87.66 5 29.56 41.19
LSMDC HCT_LSMDC.json HCT_LSMDC.txt 16.40 34.10 43.10 69.10 17 72.39 28.89 14.10 33.70 41.40 67.40 18 73.54 26.99
ActivityNet Captions HCT_ActivityNet.json HCT_ActivityNet.txt 23.12 54.95 71.14 92.64 5 24.82 44.88 22.94 55.81 70.84 92.29 4 25.35 44.93
DT+DCMH MSRVTT (1k-A) DCMH_MSRVTT_1kA.json DCMH_MSRVTT_1kA.txt 19.00 48.40 62.20 85.30 6 32.40 38.53 20.00 50.20 63.30 84.90 5.5 31.69 39.91
MSRVTT (1k-B) DCMH_MSRVTT_1kB.json DCMH_MSRVTT_1kB.txt 15.80 41.30 57.70 83.30 8 40.42 33.52 16.60 44.10 58.10 84.10 7 37.17 34.91
MSRVTT (Full) DCMH_MSRVTT_full.json DCMH_MSRVTT_full.txt 8.46 28.16 41.51 73.48 15.75 67.90 21.46 9.57 31.30 46.62 78.13 12 55.30 24.08
LSMDC DCMH_LSMDC.json DCMH_LSMDC.txt 10.00 25.80 36.00 66.30 22 75.84 21.02 9.60 25.80 36.40 65.40 22.75 78.37 20.81
ActivityNet Captions DCMH_ActivityNet.json DCMH_ActivityNet.txt 12.34 38.40 55.62 84.62 8.5 63.41 29.76 12.45 39.19 55.52 84.58 8.5 65.43 30.03
DT+JPQ MSRVTT (1k-A) JPQ_MSRVTT_1kA.json JPQ_MSRVTT_1kA.txt 18.90 46.80 60.80 87.90 6 29.12 37.75 18.20 47.40 63.20 87.80 6 26.63 37.92
MSRVTT (1k-B) JPQ_MSRVTT_1kB.json JPQ_MSRVTT_1kB.txt 14.90 42.50 57.70 86.90 7 33.05 33.18 15.30 43.50 59.10 88.30 7 27.79 34.01
MSRVTT (Full) JPQ_MSRVTT_full.json JPQ_MSRVTT_full.txt 9.30 30.00 43.44 77.49 14 50.00 22.97 11.44 36.29 51.30 82.84 10 37.00 27.72
LSMDC JPQ_LSMDC.json JPQ_LSMDC.txt 9.50 23.40 34.30 63.10 25 80.27 19.68 7.80 22.80 32.80 62.50 27 79.98 18.00
ActivityNet Captions JPQ_ActivityNet.json JPQ_ActivityNet.txt 17.10 46.43 62.38 90.05 6 28.09 36.73 17.67 46.88 62.94 90.14 6 28.21 37.36

3.1.3 Results of HCQ under different hyper-parameters (reported in Figure 6 in our paper)

Experimental subject Dataset (+split) Setting Config json Log Text-to-Video Retrieval Video-to-Text Retrieval
R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10} R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10}
L: the number of active cluster(s) in GhostVLAD MSRVTT (1k-A) 1 HCQ_MSRVTT_1kA_L1.json HCQ_MSRVTT_1kA_L1.txt 25.10 54.10 67.30 89.10 5 28.21 45.04 22.70 55.10 67.90 89.90 4 25.35 43.96
3 HCQ_MSRVTT_1kA_L3.json HCQ_MSRVTT_1kA_L3.txt 25.70 52.90 66.90 89.30 5 28.39 44.97 26.70 55.00 68.50 90.50 4 24.20 46.51
7 (default) HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt 25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
15 HCQ_MSRVTT_1kA_L15.json HCQ_MSRVTT_1kA_L15.txt 24.20 54.40 68.10 88.70 5 27.15 44.76 23.60 55.00 69.40 90.60 4 22.79 44.83
31 HCQ_MSRVTT_1kA_L31.json HCQ_MSRVTT_1kA_L31.txt 26.20 54.50 67.90 88.00 5 27.57 45.94 25.00 55.60 69.10 90.00 4 24.38 45.80
MSRVTT (1k-B) 1 HCQ_MSRVTT_1kB_L1.json HCQ_MSRVTT_1kB_L1.txt 22.40 51.70 64.10 87.50 5 30.79 42.03 21.90 52.50 65.90 88.10 5 27.49 42.32
3 HCQ_MSRVTT_1kB_L3.json HCQ_MSRVTT_1kB_L3.txt 23.10 50.60 65.40 87.90 5 31.43 42.44 22.90 51.70 66.50 88.30 5 26.82 42.86
7 (default) HCQ_MSRVTT_1kB.json HCQ_MSRVTT_1kB.txt 22.50 51.50 65.90 86.10 5 33.65 42.43 23.70 52.20 66.90 88.10 5 29.30 43.58
15 HCQ_MSRVTT_1kB_L15.json HCQ_MSRVTT_1kB_L15.txt 22.20 51.50 64.30 87.20 5 30.98 41.89 22.00 52.40 65.50 87.90 5 26.35 42.27
31 HCQ_MSRVTT_1kB_L31.json HCQ_MSRVTT_1kB_L31.txt 23.30 50.40 64.30 86.80 5 34.97 42.27 22.70 53.50 65.20 88.10 5 29.55 42.94
MSRVTT (Full) 1 HCQ_MSRVTT_full_L1.json HCQ_MSRVTT_full_L1.txt 14.31 38.63 52.24 80.94 10 44.35 30.68 17.32 44.98 59.60 86.89 7 31.44 35.95
3 HCQ_MSRVTT_full_L3.json HCQ_MSRVTT_full_L3.txt 14.45 39.16 51.84 80.80 10 45.37 30.84 17.56 46.19 60.37 86.82 6 31.24 36.58
7 (default) HCQ_MSRVTT_full.json HCQ_MSRVTT_full.txt 15.15 38.53 51.00 81.34 10 46.22 30.99 18.26 44.88 59.06 87.16 7 30.96 36.45
15 HCQ_MSRVTT_full_L15.json HCQ_MSRVTT_full_L15.txt 14.01 37.53 51.47 81.74 10 41.04 30.02 16.19 44.08 59.80 86.99 7 29.87 34.94
31 HCQ_MSRVTT_full_L31.json HCQ_MSRVTT_full_L31.txt 14.48 38.56 52.64 81.61 9 43.41 30.86 18.09 45.99 59.67 87.22 7 30.54 36.75
LSMDC 1 HCQ_LSMDC_L1.json HCQ_LSMDC_L1.txt 14.40 31.50 42.50 68.50 17 73.09 26.81 13.00 30.60 40.50 68.10 19 71.16 25.26
3 HCQ_LSMDC_L3.json HCQ_LSMDC_L3.txt 14.00 33.80 44.10 68.30 17 73.91 27.53 12.90 32.80 42.80 68.50 17 71.74 26.26
7 (default) HCQ_LSMDC.json HCQ_LSMDC.txt 14.50 33.60 43.10 68.20 18.5 75.95 27.59 13.70 33.20 42.80 66.10 17 74.28 26.90
15 HCQ_LSMDC_L15.json HCQ_LSMDC_L15.txt 14.10 32.60 41.90 69.80 17 71.28 26.81 13.10 31.40 40.70 68.30 18 71.21 25.58
31 HCQ_LSMDC_L31.json HCQ_LSMDC_L31.txt 12.80 31.90 41.90 68.30 17 72.03 25.77 12.50 32.20 42.00 67.20 17 72.26 25.66
ActivityNet Captions 1 HCQ_ActivityNet_L1.json HCQ_ActivityNet_L1.txt 19.77 50.54 65.77 89.06 5 33.26 40.35 20.03 51.33 66.36 89.40 5 32.14 40.86
3 HCQ_ActivityNet_L3.json HCQ_ActivityNet_L3.txt 20.95 52.21 68.35 90.54 5 30.22 42.13 20.72 53.10 68.70 90.50 5 29.18 42.28
7 (default) HCQ_ActivityNet.json HCQ_ActivityNet.txt 22.19 53.69 70.12 91.21 5 30.71 43.72 23.00 54.85 70.14 91.38 5 29.08 44.56
15 HCQ_ActivityNet_L15.json HCQ_ActivityNet_L15.txt 21.33 52.15 68.07 90.16 5 30.00 42.31 22.07 52.92 68.31 90.46 5 29.26 43.05
31 HCQ_ActivityNet_L31.json HCQ_ActivityNet_L31.txt 20.56 52.45 69.07 89.91 5 31.39 42.07 21.66 52.96 68.60 90.81 5 29.67 42.85
M: the number of sub-codebooks in each quantization module MSRVTT (1k-A) 8 HCQ_MSRVTT_1kA_M8.json HCQ_MSRVTT_1kA_M8.txt 23.00 52.00 65.00 87.00 5 32.93 42.68 21.40 52.40 65.50 88.20 5 30.19 41.88
16 HCQ_MSRVTT_1kA_M16.json HCQ_MSRVTT_1kA_M16.txt 23.40 53.40 68.10 88.00 5 30.89 43.98 23.00 55.30 68.60 89.60 4 26.62 44.35
32 (default) HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt 25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
64 HCQ_MSRVTT_1kA_M64.json HCQ_MSRVTT_1kA_M64.txt 27.20 56.80 69.10 89.30 4 26.93 47.44 26.10 58.10 71.40 90.70 4 23.82 47.66
MSRVTT (1k-B) 8 HCQ_MSRVTT_1kB_M8.json HCQ_MSRVTT_1kB_M8.txt 20.10 47.00 60.60 84.10 6.75 37.97 38.54 18.90 47.90 63.10 86.40 6 36.00 38.51
16 HCQ_MSRVTT_1kB_M16.json HCQ_MSRVTT_1kB_M16.txt 22.50 49.50 62.70 85.90 6 33.82 41.18 21.10 52.10 65.60 87.10 5 32.43 41.62
32 (default) HCQ_MSRVTT_1kB.json HCQ_MSRVTT_1kB.txt 22.50 51.50 65.90 86.10 5 33.65 42.43 23.70 52.20 66.90 88.10 5 29.30 43.58
64 HCQ_MSRVTT_1kB_M64.json HCQ_MSRVTT_1kB_M64.txt 24.50 51.60 66.20 87.70 5 31.31 43.74 23.60 54.30 67.40 88.80 4.75 27.56 44.20
MSRVTT (Full) 8 HCQ_MSRVTT_full_M8.json HCQ_MSRVTT_full_M8.txt 11.61 33.44 46.86 75.82 12 62.06 26.30 11.91 36.99 51.77 82.31 10 44.63 28.36
16 HCQ_MSRVTT_full_M16.json HCQ_MSRVTT_full_M16.txt 12.81 36.45 50.17 79.06 10 52.58 28.61 14.55 41.07 55.85 84.75 8 37.39 32.20
32 (default) HCQ_MSRVTT_full.json HCQ_MSRVTT_full.txt 15.15 38.53 51.00 81.34 10 46.22 30.99 18.26 44.88 59.06 87.16 7 30.96 36.45
64 HCQ_MSRVTT_full_M64.json HCQ_MSRVTT_full_M64.txt 16.02 40.97 54.25 83.01 8 40.48 32.90 19.16 48.26 62.94 88.70 6 26.65 38.76
LSMDC 8 HCQ_LSMDC_M8.json HCQ_LSMDC_M8.txt 12.60 29.00 38.60 64.30 22 84.53 24.16 10.40 29.20 39.10 64.20 21 78.32 22.81
16 HCQ_LSMDC_M16.json HCQ_LSMDC_M16.txt 13.20 31.10 39.40 66.50 19 79.15 25.29 12.70 31.60 39.90 65.30 21 77.42 25.21
32 (default) HCQ_LSMDC.json HCQ_LSMDC.txt 14.50 33.60 43.10 68.20 18.5 75.95 27.59 13.70 33.20 42.80 66.10 17 74.28 26.90
64 HCQ_LSMDC_M64.json HCQ_LSMDC_M64.txt 14.80 33.00 43.60 69.10 16 72.80 27.72 14.10 32.30 40.80 67.40 19 72.64 26.49
ActivityNet Captions 8 HCQ_ActivityNet_M8.json HCQ_ActivityNet_M8.txt 18.77 48.44 65.08 88.75 6 39.86 38.97 18.63 48.69 65.24 89.30 6 38.20 38.97
16 HCQ_ActivityNet_M16.json HCQ_ActivityNet_M16.txt 20.56 51.86 67.93 89.89 5 35.07 41.68 20.68 52.10 68.09 90.44 5 32.72 41.87
32 (default) HCQ_ActivityNet.json HCQ_ActivityNet.txt 22.19 53.69 70.12 91.21 5 30.71 43.72 23.00 54.85 70.14 91.38 5 29.08 44.56
64 HCQ_ActivityNet_M64.json HCQ_ActivityNet_M64.txt 22.96 54.59 70.80 91.80 5 26.29 44.60 23.61 55.28 70.80 92.03 4 25.74 45.21
Batch size MSRVTT (1k-A) 16 HCQ_MSRVTT_1kA_bs16.json HCQ_MSRVTT_1kA_bs16.txt 24.20 53.40 67.40 89.90 5 25.86 44.33 23.60 54.10 67.60 89.60 4 22.96 44.19
32 HCQ_MSRVTT_1kA_bs32.json HCQ_MSRVTT_1kA_bs32.txt 24.20 54.00 67.20 89.90 5 27.50 44.45 24.00 54.30 66.90 90.10 4 25.09 44.34
64 HCQ_MSRVTT_1kA_bs64.json HCQ_MSRVTT_1kA_bs64.txt 26.20 55.90 67.90 88.70 4 26.67 46.33 25.50 55.80 69.00 89.90 4 23.37 46.13
128 (default) HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt 25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
256 HCQ_MSRVTT_1kA_bs256.json HCQ_MSRVTT_1kA_bs256.txt 25.50 55.30 67.50 89.20 4 26.80 45.66 26.00 55.80 68.70 90.50 4 23.47 46.36
MSRVTT (1k-B) 16 HCQ_MSRVTT_1kB_bs16.json HCQ_MSRVTT_1kB_bs16.txt 22.00 49.40 64.50 87.60 6 31.45 41.23 18.50 51.80 66.20 89.60 5 26.30 39.88
32 HCQ_MSRVTT_1kB_bs32.json HCQ_MSRVTT_1kB_bs32.txt 22.60 49.20 65.10 87.10 6 32.03 41.68 21.40 52.30 65.90 88.20 5 28.20 41.94
64 HCQ_MSRVTT_1kB_bs64.json HCQ_MSRVTT_1kB_bs64.txt 23.60 50.70 64.60 86.60 5 33.26 42.60 21.10 51.60 64.60 89.00 5 28.00 41.28
128 (default) HCQ_MSRVTT_1kB.json HCQ_MSRVTT_1kB.txt 22.50 51.50 65.90 86.10 5 33.65 42.43 23.70 52.20 66.90 88.10 5 29.30 43.58
256 HCQ_MSRVTT_1kB_bs256.json HCQ_MSRVTT_1kB_bs256.txt 22.50 50.20 63.80 87.00 5 30.96 41.61 21.30 52.40 65.90 88.30 5 27.50 41.90
MSRVTT (Full) 16 HCQ_MSRVTT_full_bs16.json HCQ_MSRVTT_full_bs16.txt 13.08 37.96 52.91 82.04 9 41.76 29.72 15.95 42.44 57.59 86.09 8 31.76 33.91
32 HCQ_MSRVTT_full_bs32.json HCQ_MSRVTT_full_bs32.txt 13.75 38.39 52.37 80.80 10 45.51 30.24 16.39 44.58 58.86 86.29 7 32.54 35.04
64 HCQ_MSRVTT_full_bs64.json HCQ_MSRVTT_full_bs64.txt 14.65 39.20 52.98 82.27 9 44.13 31.22 17.69 46.59 61.10 87.83 6 31.56 36.93
128 (default) HCQ_MSRVTT_full.json HCQ_MSRVTT_full.txt 15.15 38.53 51.00 81.34 10 46.22 30.99 18.26 44.88 59.06 87.16 7 30.96 36.45
256 HCQ_MSRVTT_full_bs256.json HCQ_MSRVTT_full_bs256.txt 14.21 39.06 52.47 82.81 9 40.74 30.77 16.92 46.15 59.70 87.63 7 28.24 35.99
LSMDC 16 HCQ_LSMDC_bs16.json HCQ_LSMDC_bs16.txt 12.30 29.70 39.40 65.30 21 82.64 24.32 10.70 28.30 38.90 65.60 23 80.80 22.75
32 HCQ_LSMDC_bs32.json HCQ_LSMDC_bs32.txt 12.30 30.00 38.70 66.30 20 79.95 24.26 12.10 28.70 39.10 63.50 23 80.79 23.86
64 HCQ_LSMDC_bs64.json HCQ_LSMDC_bs64.txt 13.40 31.90 41.00 66.20 17 75.98 25.98 13.40 31.50 40.00 66.20 20 73.14 25.65
128 (default) HCQ_LSMDC.json HCQ_LSMDC.txt 14.50 33.60 43.10 68.20 18.5 75.95 27.59 13.70 33.20 42.80 66.10 17 74.28 26.90
256 HCQ_LSMDC_bs256.json HCQ_LSMDC_bs256.txt 14.30 34.80 43.60 69.30 16 74.04 27.89 14.30 33.50 42.50 67.70 16 71.84 27.31
ActivityNet Captions 16 HCQ_ActivityNet_bs16.json HCQ_ActivityNet_bs16.txt 21.31 52.55 70.59 92.19 5 27.31 42.92 22.25 53.18 70.41 92.33 5 26.57 43.68
32 (default) HCQ_ActivityNet.json HCQ_ActivityNet.txt 22.19 53.69 70.12 91.21 5 30.71 43.72 23.00 54.85 70.14 91.38 5 29.08 44.56
64 HCQ_ActivityNet_bs64.json HCQ_ActivityNet_bs64.txt 20.62 51.60 66.91 88.94 5 33.61 41.45 20.58 51.64 67.76 89.40 5 31.52 41.61
128 HCQ_ActivityNet_bs128.json HCQ_ActivityNet_bs128.txt 19.36 48.61 64.86 88.41 6 35.38 39.37 19.22 49.68 66.04 89.12 6 33.15 39.80
τ: the temperature factor in contrastive learning loss (Eq.(13)) MSRVTT (1k-A) 0.03 HCQ_MSRVTT_1kA_t0.03.json HCQ_MSRVTT_1kA_t0.03.txt 24.90 56.50 68.80 88.80 4 26.95 45.91 25.10 53.90 69.10 89.70 4 24.91 45.39
0.05 HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt 25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
0.07 HCQ_MSRVTT_1kA_t0.07.json HCQ_MSRVTT_1kA_t0.07.txt 25.40 52.80 67.50 88.60 5 30.40 44.90 25.90 57.00 68.00 90.00 4 27.78 46.48
0.1 HCQ_MSRVTT_1kA_t0.1.json HCQ_MSRVTT_1kA_t0.1.txt 23.90 52.10 66.20 87.10 5 32.74 43.52 22.50 54.00 67.10 87.70 5 31.09 43.36
0.12 HCQ_MSRVTT_1kA_t0.12.json HCQ_MSRVTT_1kA_t0.12.txt 22.60 49.60 65.00 87.90 6 34.53 41.77 21.20 50.80 65.10 87.30 5 33.46 41.23
0.15 HCQ_MSRVTT_1kA_t0.15.json HCQ_MSRVTT_1kA_t0.15.txt 18.20 44.50 60.20 86.80 7 36.74 36.53 16.50 46.80 61.40 85.80 6 35.20 36.19
MSRVTT (1k-B) 0.03 HCQ_MSRVTT_1kB_t0.03.json HCQ_MSRVTT_1kB_t0.03.txt 23.10 51.90 63.40 88.20 5 30.89 42.36 22.90 51.70 65.60 88.10 5 25.72 42.67
0.05 HCQ_MSRVTT_1kB.json HCQ_MSRVTT_1kB.txt 22.50 51.50 65.90 86.10 5 33.65 42.43 23.70 52.20 66.90 88.10 5 29.30 43.58
0.07 HCQ_MSRVTT_1kB_t0.07.json HCQ_MSRVTT_1kB_t0.07.txt 23.90 49.90 63.50 86.70 6 34.78 42.31 22.70 52.10 65.30 87.40 5 32.91 42.59
0.1 HCQ_MSRVTT_1kB_t0.1.json HCQ_MSRVTT_1kB_t0.1.txt 19.90 50.70 63.80 86.80 5 35.51 40.08 19.90 50.70 65.00 87.20 5 34.81 40.33
0.12 HCQ_MSRVTT_1kB_t0.12.json HCQ_MSRVTT_1kB_t0.12.txt 19.00 46.30 61.00 86.40 7 35.89 37.72 18.30 48.20 61.30 86.60 6 35.56 37.81
0.15 HCQ_MSRVTT_1kB_t0.15.json HCQ_MSRVTT_1kB_t0.15.txt 15.60 43.20 56.70 84.50 8 40.02 33.68 14.70 44.20 57.90 85.80 7 39.38 33.51
MSRVTT (Full) 0.03 HCQ_MSRVTT_full_t0.03.json HCQ_MSRVTT_full_t0.03.txt 14.11 38.29 50.77 80.00 10 45.90 30.16 16.32 45.45 59.80 86.86 7 31.64 35.40
0.05 HCQ_MSRVTT_full.json HCQ_MSRVTT_full.txt 15.15 38.53 51.00 81.34 10 46.22 30.99 18.26 44.88 59.06 87.16 7 30.96 36.45
0.07 HCQ_MSRVTT_full_t0.07.json HCQ_MSRVTT_full_t0.07.txt 14.15 37.89 51.17 81.30 10 46.22 30.16 16.72 43.18 58.09 85.95 8 33.70 34.75
0.1 HCQ_MSRVTT_full_t0.1.json HCQ_MSRVTT_full_t0.1.txt 13.58 36.56 49.06 80.43 11 49.80 28.99 14.35 39.13 53.65 84.15 9 39.70 31.11
0.12 HCQ_MSRVTT_full_t0.12.json HCQ_MSRVTT_full_t0.12.txt 12.31 34.25 49.13 79.50 11 50.45 27.46 12.24 35.65 50.64 82.98 10 44.35 28.06
0.15 HCQ_MSRVTT_full_t0.15.json HCQ_MSRVTT_full_t0.15.txt 10.10 30.64 43.88 76.79 14 55.40 23.86 9.16 29.90 45.69 79.00 13 53.01 23.22
LSMDC 0.03 HCQ_LSMDC_t0.03.json HCQ_LSMDC_t0.03.txt 14.90 32.00 42.50 66.20 18 76.14 27.26 12.90 31.80 40.80 66.80 20 72.31 25.58
0.05 HCQ_LSMDC.json HCQ_LSMDC.txt 14.50 33.60 43.10 68.20 18.5 75.95 27.59 13.70 33.20 42.80 66.10 17 74.28 26.90
0.07 HCQ_LSMDC_t0.07.json HCQ_LSMDC_t0.07.txt 12.80 32.30 43.40 67.70 17 75.92 26.18 12.80 32.70 42.90 67.30 17 76.30 26.19
0.1 HCQ_LSMDC_t0.1.json HCQ_LSMDC_t0.1.txt 12.50 30.10 40.80 66.90 18 81.02 24.85 11.80 29.00 40.30 64.20 19 82.29 23.98
0.12 HCQ_LSMDC_t0.12.json HCQ_LSMDC_t0.12.txt 12.00 28.10 38.80 66.40 20 81.93 23.56 11.90 27.60 39.60 64.80 20 84.15 23.52
0.15 HCQ_LSMDC_t0.15.json HCQ_LSMDC_t0.15.txt 10.70 26.10 36.00 64.90 23 82.81 21.58 9.10 24.00 35.10 62.80 25 88.27 19.72
ActivityNet Captions 0.03 HCQ_ActivityNet_t0.03.json HCQ_ActivityNet_t0.03.txt 22.15 52.78 68.58 91.38 5 26.42 43.12 21.74 52.47 68.70 91.38 5 26.65 42.79
0.05 HCQ_ActivityNet.json HCQ_ActivityNet.txt 21.96 53.30 68.99 90.89 5 29.67 43.23 21.94 52.94 69.21 90.69 5 29.12 43.16
0.07 HCQ_ActivityNet_t0.07.json HCQ_ActivityNet_t0.07.txt 22.19 53.69 70.12 91.21 5 30.71 43.72 23.00 54.85 70.14 91.38 5 29.08 44.56
0.1 HCQ_ActivityNet_t0.1.json HCQ_ActivityNet_t0.1.txt 22.11 52.08 68.23 91.34 5 28.34 42.83 21.72 53.33 69.60 91.60 5 27.19 43.20
0.12 HCQ_ActivityNet_t0.12.json HCQ_ActivityNet_t0.12.txt 19.20 50.52 67.99 91.95 5 30.12 40.40 20.09 51.66 68.23 91.89 5 29.16 41.37
0.15 HCQ_ActivityNet_t0.15.json HCQ_ActivityNet_t0.15.txt 17.00 47.14 65.49 91.42 6 31.43 37.44 18.59 48.81 65.30 91.84 6 32.65 38.99

3.1.4 Results of HCQ with different kinds of text encoders ("1k-A" split) (reported in Table 5 in our paper)

Model Text Encoder Config json Log Text-to-Video Retrieval Video-to-Text Retrieval
R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10} R@1 R@5 R@10 R@50 Median rank Mean rank Geometric mean of recall@{1,5,10}
HCQ BERT-base (default) HCQ_MSRVTT_1kA.json HCQ_MSRVTT_1kA.txt 25.90 54.80 69.00 88.80 5 28.06 46.09 26.30 57.00 70.10 90.00 4 25.15 47.19
BERT-large HCQ_MSRVTT_1kA_bert-large.json HCQ_MSRVTT_1kA_bert-large.txt 27.40 57.70 70.70 89.60 4 27.09 48.17 26.20 59.00 71.80 89.50 4 25.47 48.06
DistilBERT-base HCQ_MSRVTT_1kA_distilbert-base.json HCQ_MSRVTT_1kA_distilbert-base.txt 25.40 54.20 67.30 89.80 4 27.00 45.25 26.30 56.40 69.00 90.10 4 24.22 46.78
RoBERTa-base HCQ_MSRVTT_1kA_roberta-base.json HCQ_MSRVTT_1kA_roberta-base.txt 25.50 54.70 67.80 89.20 5 27.04 45.56 24.50 55.00 69.00 90.20 4 23.80 45.30
RoBERTa-large HCQ_MSRVTT_1kA_roberta-large.json HCQ_MSRVTT_1kA_roberta-large.txt 28.00 55.40 68.50 88.10 4 30.67 47.36 27.00 59.00 68.40 88.50 4 27.41 47.76
XLNet-base HCQ_MSRVTT_1kA_xlnet-base.json HCQ_MSRVTT_1kA_xlnet-base.txt 25.80 56.20 68.70 87.50 5 28.35 46.36 24.60 55.50 69.00 88.40 4 25.59 45.50
XLNet-large HCQ_MSRVTT_1kA_xlnet-large.json HCQ_MSRVTT_1kA_xlnet-large.txt 25.00 53.00 66.60 88.20 5 27.59 44.52 25.30 54.50 68.00 89.10 4 23.69 45.43

If you are doing experiments on a platform with enough RAM and want to accelerate the training, you can load the whole dataset in RAM by the following modification:

# WWW22-HCQ/base/base_dataset.py:L170
               load_in_ram=True, # change from 'False' to 'True'

3.2 Evaluation from checkpoint

We can evaluate the model from the checkpoint without re-training. The evaluation command:

python -m train --config configs/HCQ_MSRVTT_1kA.json --only_eval --load_checkpoint HCQ_MSRVTT_1kA.pth

We provide the checkpoint of HCQ_MSRVTT_1kA.json as an example. You can download this file (~1.6G) from Google Drive and put it in the working directory (WWW22-HCQ/).

3.3 Evaluation for post-compression methods

Take the evaluation on MSRVTT dataset ("1k-A" split) as an example. First, we need to train an HCT.

# working directory: WWW22-HCQ/
python -m train --config configs/HCT_MSRVTT_1kA.json

Then, run get_embed.py and pass the path of the HCT checkpoint to the script:

python -m get_embed configs/HCT_MSRVTT_1kA.json --only_eval --load_checkpoint HCT_MSRVTT_1kA/trained_model.pth

After that, we will get the embedding file embeddings.h5 under WWW22-HCQ/exps/HCT_MSRVTT_1kA/. Run compress_embed.py to compress the embeddings and get the results:

# compress embeddings with LSH
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type LSH
# compress embeddings with PQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type PQ
# compress embeddings with OPQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type OPQ
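
To give an intuition for what post-compression does, the sketch below implements sign-random-projection LSH in pure Python: each embedding is projected onto random hyperplanes and only the sign of each projection is kept, so similarity search reduces to Hamming distance on short binary codes. This is a generic illustration, not the logic of compress_embed.py; all function names and the toy vectors are assumptions made for the example.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector, hyperplanes):
    """Binary code: one bit per hyperplane, set when the projection is positive."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) > 0 else 0
        for plane in hyperplanes
    ]


def hamming(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Toy example: a query, a rescaled copy of it, and its opposite.
planes = make_hyperplanes(dim=4, n_bits=16, seed=0)
query = [0.3, -1.2, 0.5, 2.0]
same_direction = [2 * x for x in query]
opposite = [-x for x in query]

print(hamming(lsh_code(query, planes), lsh_code(same_direction, planes)))
print(hamming(lsh_code(query, planes), lsh_code(opposite, planes)))
```

Vectors pointing in the same direction share all bits, while opposite vectors differ in every bit; vectors at intermediate angles differ in a fraction of bits that grows with the angle. PQ and OPQ instead replace sub-vectors with learned codebook indices, which usually preserves distances better at the same code length.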

4. References

If you find this code useful or use the toolkit in your work, please consider citing:

@inproceedings{wang22hcq,
  author={Wang, Jinpeng and Chen, Bin and Liao, Dongliang and Zeng, Ziyun and Li, Gongfu and Xia, Shu-Tao and Xu, Jin},
  title={Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval},
  booktitle={Proceedings of the Web Conference 2022},
  year={2022},
  doi={10.1145/3485447.3512022}
}

5. Acknowledgements

Our code is based on the implementations of nanopq, Multi-Modal Transformer, Collaborative Experts, Transformers, and Mixture of Embedding Experts.

6. Contact

If you have any questions, you can raise an issue or email Jinpeng Wang ([email protected]). We will reply to you soon.
