HCQ: Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
1. Introduction
This repository provides the code for our paper at TheWebConf 2022:
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval. Jinpeng Wang, Bin Chen, Dongliang Liao, Ziyun Zeng, Gongfu Li, Shu-Tao Xia, Jin Xu. [arXiv].
Our proposed Hybrid Contrastive Quantization (HCQ) is the first quantization learning method for cross-view (e.g., text-to-video) retrieval, which learns both coarse-grained and fine-grained quantizations with transformers. Experiments on MSRVTT, LSMDC and ActivityNet Captions datasets demonstrate that it can achieve competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation.
The following sections explain how to use this repository step by step.
2. Preparation
git clone https://github.com/gimpong/WWW22-HCQ.git
2.1 Requirements
- python 3.7.4
- gensim 4.1.2
- h5py 3.6.0
- numpy 1.17.3
- pandas 1.2.3
- pytorch-warmup 0.0.4
- scikit-learn 0.23.0
- scipy 1.6.1
- tensorboardX 2.4.1
- torch 1.6.0+cu101
- transformers 3.1.0
cd WWW22-HCQ
# Install the requirements
pip install -r requirements.txt
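After installation, it can help to confirm that the installed packages match the pinned versions. The sketch below is not part of the repository: PINNED copies a few entries from the list above, and installed_version and find_mismatches are hypothetical helpers. It assumes Python 3.8+ for importlib.metadata and falls back to the importlib_metadata backport on Python 3.7.

```python
from typing import Dict, Optional

try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: needs `pip install importlib_metadata`
    import importlib_metadata as metadata

# A few pins copied from the requirements list above (not the full list).
PINNED = {"numpy": "1.17.3", "h5py": "3.6.0", "transformers": "3.1.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Dict[str, str], installed: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Describe every package whose installed version differs from its pin."""
    report = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            report[name] = f"expected {wanted}, found {found or 'nothing'}"
    return report


if __name__ == "__main__":
    current = {name: installed_version(name) for name in PINNED}
    for name, message in find_mismatches(PINNED, current).items():
        print(f"{name}: {message}")
```

An empty output means that the checked packages match their pins. A mismatch does not necessarily break training, but the results reported below were obtained with the pinned versions.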
We conduct each training experiment on a single NVIDIA® Tesla® V100 GPU (32 GB).
2.2 Download the features
Before running the code, download the video features and place them in the "data" directory. We use the video features provided by the authors of MMT, which can be downloaded and extracted by running the following commands:
# Create (if needed) and move to the WWW22-HCQ/data directory
mkdir -p data && cd data
# Download the video features
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
# Extract the video features
tar -xvf MSRVTT.tar.gz
tar -xvf activity-net.tar.gz
tar -xvf LSMDC.tar.gz
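If you prefer to script this step, the following sketch checks that the three archives are present and extracts them with the standard tarfile module. It is an illustration, not a script from the repository: missing_archives and extract_all are hypothetical helpers, and only the archive names are taken from the commands above.

```python
import tarfile
from pathlib import Path
from typing import List

# Archive names taken from the wget commands above.
ARCHIVES = ["MSRVTT.tar.gz", "activity-net.tar.gz", "LSMDC.tar.gz"]


def missing_archives(data_dir: Path, names: List[str]) -> List[str]:
    """Return the archives from `names` that are absent from `data_dir`."""
    return [name for name in names if not (data_dir / name).is_file()]


def extract_all(data_dir: Path, names: List[str]) -> None:
    """Extract each archive into `data_dir`, like `tar -xvf` run inside it."""
    for name in names:
        with tarfile.open(str(data_dir / name)) as archive:
            archive.extractall(path=str(data_dir))


if __name__ == "__main__":
    data_dir = Path("data")  # assumes the working directory is WWW22-HCQ/
    absent = missing_archives(data_dir, ARCHIVES)
    if absent:
        print("Missing archives:", ", ".join(absent))
    else:
        extract_all(data_dir, ARCHIVES)
```

Checking for missing files first avoids a partial extraction when one of the downloads has failed.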
3. Training and Evaluation
3.1 Training from scratch
Take training HCQ on the MSRVTT dataset ("1k-A" split) as an example:
# working directory: WWW22-HCQ/
python -m train --config configs/HCQ_MSRVTT_1kA.json
Expected results:
MSRVTT_jsfusion_test:
t2v_metrics/R1/final_eval: 25.9
t2v_metrics/R5/final_eval: 54.8
t2v_metrics/R10/final_eval: 69.0
t2v_metrics/R50/final_eval: 88.8
t2v_metrics/MedR/final_eval: 5.0
t2v_metrics/MeanR/final_eval: 28.062
t2v_metrics/geometric_mean_R1-R5-R10/final_eval: 46.09386629981193
v2t_metrics/R1/final_eval: 26.3
v2t_metrics/R5/final_eval: 57.0
v2t_metrics/R10/final_eval: 70.1
v2t_metrics/R50/final_eval: 90.0
v2t_metrics/MedR/final_eval: 4.0
v2t_metrics/MeanR/final_eval: 25.1535
v2t_metrics/geometric_mean_R1-R5-R10/final_eval: 47.18995255588879
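For readers unfamiliar with these metric names, the sketch below shows how such numbers are conventionally derived from the rank of the ground-truth item for each query. It assumes the usual definitions (R@K as the percentage of queries whose ground truth is ranked in the top K, MedR and MeanR as the median and mean rank); retrieval_metrics is a hypothetical helper and not the evaluation code of this repository.

```python
import statistics
from typing import Dict, Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive numbers."""
    product = 1.0
    for value in values:
        product *= value
    return product ** (1.0 / len(values))


def retrieval_metrics(
    ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10, 50)
) -> Dict[str, float]:
    """Summarize 1-based ranks of the ground-truth item, one rank per query."""
    total = len(ranks)
    metrics = {
        f"R{k}": 100.0 * sum(rank <= k for rank in ranks) / total for k in ks
    }
    metrics["MedR"] = float(statistics.median(ranks))
    metrics["MeanR"] = float(statistics.mean(ranks))
    metrics["geometric_mean_R1-R5-R10"] = geometric_mean(
        [metrics["R1"], metrics["R5"], metrics["R10"]]
    )
    return metrics


if __name__ == "__main__":
    # Text-to-video R1, R5, R10 from the expected results above.
    print(geometric_mean([25.9, 54.8, 69.0]))  # approximately 46.0939
```

The geometric mean of the listed R1, R5 and R10 values reproduces the geometric_mean_R1-R5-R10 entries above, so that line can serve as a quick consistency check on a training log.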
After training, a folder with the same name as the configuration json file (e.g., "HCQ_MSRVTT_1kA") will be generated under WWW22-HCQ/exps/, which contains the model checkpoints, logs, TensorBoard files, and so on.
To reproduce the other experiments, see the following tables and replace the config json path in the training command with the one listed for the experiment.
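When several experiments are to be run in sequence, the training command can be generated per config file. The sketch below only assembles the documented command (python -m train --config ...) and prints it; build_train_command and run_all are hypothetical helpers, the configs/ directory is assumed from the example command, and dry_run defaults to True so that nothing is launched by accident.

```python
import subprocess
from typing import List


def build_train_command(config_name: str, config_dir: str = "configs") -> List[str]:
    """Build the documented training command for one config file name."""
    return ["python", "-m", "train", "--config", f"{config_dir}/{config_name}"]


def run_all(config_names: List[str], dry_run: bool = True) -> None:
    """Print each training command and, unless dry_run is set, execute it."""
    for name in config_names:
        command = build_train_command(name)
        print(" ".join(command))
        if not dry_run:
            # Must be run from WWW22-HCQ/; stops at the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Config names taken from the tables in this section.
    run_all(["HCQ_MSRVTT_1kA.json", "HCQ_MSRVTT_1kB.json", "HCQ_LSMDC.json"])
```

Running the experiments one after another fits the single-GPU setup described in the requirements section.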
3.1.1 Main results of HCQ (reported in Tables 1-3 in our paper)
Model | Dataset (+split) | Config json | Log | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Text-to-video R@50 | Text-to-video median rank | Text-to-video mean rank | Text-to-video geometric mean of recall@{1,5,10} | Video-to-text R@1 | Video-to-text R@5 | Video-to-text R@10 | Video-to-text R@50 | Video-to-text median rank | Video-to-text mean rank | Video-to-text geometric mean of recall@{1,5,10} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HCQ | MSRVTT (1k-A) | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
HCQ | MSRVTT (1k-B) | HCQ_MSRVTT_1kB.json | HCQ_MSRVTT_1kB.txt | 22.50 | 51.50 | 65.90 | 86.10 | 5 | 33.65 | 42.43 | 23.70 | 52.20 | 66.90 | 88.10 | 5 | 29.30 | 43.58 |
HCQ | MSRVTT (Full) | HCQ_MSRVTT_full.json | HCQ_MSRVTT_full.txt | 15.15 | 38.53 | 51.00 | 81.34 | 10 | 46.22 | 30.99 | 18.26 | 44.88 | 59.06 | 87.16 | 7 | 30.96 | 36.45 |
HCQ | LSMDC | HCQ_LSMDC.json | HCQ_LSMDC.txt | 14.50 | 33.60 | 43.10 | 68.20 | 18.5 | 75.95 | 27.59 | 13.70 | 33.20 | 42.80 | 66.10 | 17 | 74.28 | 26.90 |
HCQ | ActivityNet Captions | HCQ_ActivityNet.json | HCQ_ActivityNet.txt | 22.19 | 53.69 | 70.12 | 91.21 | 5 | 30.71 | 43.72 | 23.00 | 54.85 | 70.14 | 91.38 | 5 | 29.08 | 44.56 |
3.1.2 Results of Hybrid Contrastive Transformer (HCT), Dual Transformer (DT) + DCMH, and DT + JPQ (reported in Table 4 in our paper)
Model | Dataset (+split) | Config json | Log | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Text-to-video R@50 | Text-to-video median rank | Text-to-video mean rank | Text-to-video geometric mean of recall@{1,5,10} | Video-to-text R@1 | Video-to-text R@5 | Video-to-text R@10 | Video-to-text R@50 | Video-to-text median rank | Video-to-text mean rank | Video-to-text geometric mean of recall@{1,5,10} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HCT | MSRVTT (1k-A) | HCT_MSRVTT_1kA.json | HCT_MSRVTT_1kA.txt | 27.80 | 58.00 | 70.00 | 89.50 | 4 | 26.79 | 48.33 | 27.30 | 57.80 | 72.10 | 90.60 | 4 | 24.38 | 48.46 |
HCT | MSRVTT (1k-B) | HCT_MSRVTT_1kB.json | HCT_MSRVTT_1kB.txt | 25.70 | 53.70 | 67.30 | 88.30 | 5 | 31.09 | 45.29 | 24.70 | 55.50 | 68.70 | 88.80 | 4 | 25.54 | 45.50 |
HCT | MSRVTT (Full) | HCT_MSRVTT_full.json | HCT_MSRVTT_full.txt | 16.76 | 41.87 | 55.79 | 82.44 | 8 | 44.33 | 33.95 | 21.64 | 50.57 | 63.88 | 87.66 | 5 | 29.56 | 41.19 |
HCT | LSMDC | HCT_LSMDC.json | HCT_LSMDC.txt | 16.40 | 34.10 | 43.10 | 69.10 | 17 | 72.39 | 28.89 | 14.10 | 33.70 | 41.40 | 67.40 | 18 | 73.54 | 26.99 |
HCT | ActivityNet Captions | HCT_ActivityNet.json | HCT_ActivityNet.txt | 23.12 | 54.95 | 71.14 | 92.64 | 5 | 24.82 | 44.88 | 22.94 | 55.81 | 70.84 | 92.29 | 4 | 25.35 | 44.93 |
DT+DCMH | MSRVTT (1k-A) | DCMH_MSRVTT_1kA.json | DCMH_MSRVTT_1kA.txt | 19.00 | 48.40 | 62.20 | 85.30 | 6 | 32.40 | 38.53 | 20.00 | 50.20 | 63.30 | 84.90 | 5.5 | 31.69 | 39.91 |
DT+DCMH | MSRVTT (1k-B) | DCMH_MSRVTT_1kB.json | DCMH_MSRVTT_1kB.txt | 15.80 | 41.30 | 57.70 | 83.30 | 8 | 40.42 | 33.52 | 16.60 | 44.10 | 58.10 | 84.10 | 7 | 37.17 | 34.91 |
DT+DCMH | MSRVTT (Full) | DCMH_MSRVTT_full.json | DCMH_MSRVTT_full.txt | 8.46 | 28.16 | 41.51 | 73.48 | 15.75 | 67.90 | 21.46 | 9.57 | 31.30 | 46.62 | 78.13 | 12 | 55.30 | 24.08 |
DT+DCMH | LSMDC | DCMH_LSMDC.json | DCMH_LSMDC.txt | 10.00 | 25.80 | 36.00 | 66.30 | 22 | 75.84 | 21.02 | 9.60 | 25.80 | 36.40 | 65.40 | 22.75 | 78.37 | 20.81 |
DT+DCMH | ActivityNet Captions | DCMH_ActivityNet.json | DCMH_ActivityNet.txt | 12.34 | 38.40 | 55.62 | 84.62 | 8.5 | 63.41 | 29.76 | 12.45 | 39.19 | 55.52 | 84.58 | 8.5 | 65.43 | 30.03 |
DT+JPQ | MSRVTT (1k-A) | JPQ_MSRVTT_1kA.json | JPQ_MSRVTT_1kA.txt | 18.90 | 46.80 | 60.80 | 87.90 | 6 | 29.12 | 37.75 | 18.20 | 47.40 | 63.20 | 87.80 | 6 | 26.63 | 37.92 |
DT+JPQ | MSRVTT (1k-B) | JPQ_MSRVTT_1kB.json | JPQ_MSRVTT_1kB.txt | 14.90 | 42.50 | 57.70 | 86.90 | 7 | 33.05 | 33.18 | 15.30 | 43.50 | 59.10 | 88.30 | 7 | 27.79 | 34.01 |
DT+JPQ | MSRVTT (Full) | JPQ_MSRVTT_full.json | JPQ_MSRVTT_full.txt | 9.30 | 30.00 | 43.44 | 77.49 | 14 | 50.00 | 22.97 | 11.44 | 36.29 | 51.30 | 82.84 | 10 | 37.00 | 27.72 |
DT+JPQ | LSMDC | JPQ_LSMDC.json | JPQ_LSMDC.txt | 9.50 | 23.40 | 34.30 | 63.10 | 25 | 80.27 | 19.68 | 7.80 | 22.80 | 32.80 | 62.50 | 27 | 79.98 | 18.00 |
DT+JPQ | ActivityNet Captions | JPQ_ActivityNet.json | JPQ_ActivityNet.txt | 17.10 | 46.43 | 62.38 | 90.05 | 6 | 28.09 | 36.73 | 17.67 | 46.88 | 62.94 | 90.14 | 6 | 28.21 | 37.36 |
3.1.3 Results of HCQ under different hyper-parameters (reported in Figure 6 in our paper)
Experimental subject | Dataset (+split) | Setting | Config json | Log | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Text-to-video R@50 | Text-to-video median rank | Text-to-video mean rank | Text-to-video geometric mean of recall@{1,5,10} | Video-to-text R@1 | Video-to-text R@5 | Video-to-text R@10 | Video-to-text R@50 | Video-to-text median rank | Video-to-text mean rank | Video-to-text geometric mean of recall@{1,5,10} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-A) | 1 | HCQ_MSRVTT_1kA_L1.json | HCQ_MSRVTT_1kA_L1.txt | 25.10 | 54.10 | 67.30 | 89.10 | 5 | 28.21 | 45.04 | 22.70 | 55.10 | 67.90 | 89.90 | 4 | 25.35 | 43.96 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-A) | 3 | HCQ_MSRVTT_1kA_L3.json | HCQ_MSRVTT_1kA_L3.txt | 25.70 | 52.90 | 66.90 | 89.30 | 5 | 28.39 | 44.97 | 26.70 | 55.00 | 68.50 | 90.50 | 4 | 24.20 | 46.51 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-A) | 7 (default) | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-A) | 15 | HCQ_MSRVTT_1kA_L15.json | HCQ_MSRVTT_1kA_L15.txt | 24.20 | 54.40 | 68.10 | 88.70 | 5 | 27.15 | 44.76 | 23.60 | 55.00 | 69.40 | 90.60 | 4 | 22.79 | 44.83 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-A) | 31 | HCQ_MSRVTT_1kA_L31.json | HCQ_MSRVTT_1kA_L31.txt | 26.20 | 54.50 | 67.90 | 88.00 | 5 | 27.57 | 45.94 | 25.00 | 55.60 | 69.10 | 90.00 | 4 | 24.38 | 45.80 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-B) | 1 | HCQ_MSRVTT_1kB_L1.json | HCQ_MSRVTT_1kB_L1.txt | 22.40 | 51.70 | 64.10 | 87.50 | 5 | 30.79 | 42.03 | 21.90 | 52.50 | 65.90 | 88.10 | 5 | 27.49 | 42.32 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-B) | 3 | HCQ_MSRVTT_1kB_L3.json | HCQ_MSRVTT_1kB_L3.txt | 23.10 | 50.60 | 65.40 | 87.90 | 5 | 31.43 | 42.44 | 22.90 | 51.70 | 66.50 | 88.30 | 5 | 26.82 | 42.86 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-B) | 7 (default) | HCQ_MSRVTT_1kB.json | HCQ_MSRVTT_1kB.txt | 22.50 | 51.50 | 65.90 | 86.10 | 5 | 33.65 | 42.43 | 23.70 | 52.20 | 66.90 | 88.10 | 5 | 29.30 | 43.58 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-B) | 15 | HCQ_MSRVTT_1kB_L15.json | HCQ_MSRVTT_1kB_L15.txt | 22.20 | 51.50 | 64.30 | 87.20 | 5 | 30.98 | 41.89 | 22.00 | 52.40 | 65.50 | 87.90 | 5 | 26.35 | 42.27 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (1k-B) | 31 | HCQ_MSRVTT_1kB_L31.json | HCQ_MSRVTT_1kB_L31.txt | 23.30 | 50.40 | 64.30 | 86.80 | 5 | 34.97 | 42.27 | 22.70 | 53.50 | 65.20 | 88.10 | 5 | 29.55 | 42.94 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (Full) | 1 | HCQ_MSRVTT_full_L1.json | HCQ_MSRVTT_full_L1.txt | 14.31 | 38.63 | 52.24 | 80.94 | 10 | 44.35 | 30.68 | 17.32 | 44.98 | 59.60 | 86.89 | 7 | 31.44 | 35.95 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (Full) | 3 | HCQ_MSRVTT_full_L3.json | HCQ_MSRVTT_full_L3.txt | 14.45 | 39.16 | 51.84 | 80.80 | 10 | 45.37 | 30.84 | 17.56 | 46.19 | 60.37 | 86.82 | 6 | 31.24 | 36.58 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (Full) | 7 (default) | HCQ_MSRVTT_full.json | HCQ_MSRVTT_full.txt | 15.15 | 38.53 | 51.00 | 81.34 | 10 | 46.22 | 30.99 | 18.26 | 44.88 | 59.06 | 87.16 | 7 | 30.96 | 36.45 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (Full) | 15 | HCQ_MSRVTT_full_L15.json | HCQ_MSRVTT_full_L15.txt | 14.01 | 37.53 | 51.47 | 81.74 | 10 | 41.04 | 30.02 | 16.19 | 44.08 | 59.80 | 86.99 | 7 | 29.87 | 34.94 |
L: the number of active cluster(s) in GhostVLAD | MSRVTT (Full) | 31 | HCQ_MSRVTT_full_L31.json | HCQ_MSRVTT_full_L31.txt | 14.48 | 38.56 | 52.64 | 81.61 | 9 | 43.41 | 30.86 | 18.09 | 45.99 | 59.67 | 87.22 | 7 | 30.54 | 36.75 |
L: the number of active cluster(s) in GhostVLAD | LSMDC | 1 | HCQ_LSMDC_L1.json | HCQ_LSMDC_L1.txt | 14.40 | 31.50 | 42.50 | 68.50 | 17 | 73.09 | 26.81 | 13.00 | 30.60 | 40.50 | 68.10 | 19 | 71.16 | 25.26 |
L: the number of active cluster(s) in GhostVLAD | LSMDC | 3 | HCQ_LSMDC_L3.json | HCQ_LSMDC_L3.txt | 14.00 | 33.80 | 44.10 | 68.30 | 17 | 73.91 | 27.53 | 12.90 | 32.80 | 42.80 | 68.50 | 17 | 71.74 | 26.26 |
L: the number of active cluster(s) in GhostVLAD | LSMDC | 7 (default) | HCQ_LSMDC.json | HCQ_LSMDC.txt | 14.50 | 33.60 | 43.10 | 68.20 | 18.5 | 75.95 | 27.59 | 13.70 | 33.20 | 42.80 | 66.10 | 17 | 74.28 | 26.90 |
L: the number of active cluster(s) in GhostVLAD | LSMDC | 15 | HCQ_LSMDC_L15.json | HCQ_LSMDC_L15.txt | 14.10 | 32.60 | 41.90 | 69.80 | 17 | 71.28 | 26.81 | 13.10 | 31.40 | 40.70 | 68.30 | 18 | 71.21 | 25.58 |
L: the number of active cluster(s) in GhostVLAD | LSMDC | 31 | HCQ_LSMDC_L31.json | HCQ_LSMDC_L31.txt | 12.80 | 31.90 | 41.90 | 68.30 | 17 | 72.03 | 25.77 | 12.50 | 32.20 | 42.00 | 67.20 | 17 | 72.26 | 25.66 |
L: the number of active cluster(s) in GhostVLAD | ActivityNet Captions | 1 | HCQ_ActivityNet_L1.json | HCQ_ActivityNet_L1.txt | 19.77 | 50.54 | 65.77 | 89.06 | 5 | 33.26 | 40.35 | 20.03 | 51.33 | 66.36 | 89.40 | 5 | 32.14 | 40.86 |
L: the number of active cluster(s) in GhostVLAD | ActivityNet Captions | 3 | HCQ_ActivityNet_L3.json | HCQ_ActivityNet_L3.txt | 20.95 | 52.21 | 68.35 | 90.54 | 5 | 30.22 | 42.13 | 20.72 | 53.10 | 68.70 | 90.50 | 5 | 29.18 | 42.28 |
L: the number of active cluster(s) in GhostVLAD | ActivityNet Captions | 7 (default) | HCQ_ActivityNet.json | HCQ_ActivityNet.txt | 22.19 | 53.69 | 70.12 | 91.21 | 5 | 30.71 | 43.72 | 23.00 | 54.85 | 70.14 | 91.38 | 5 | 29.08 | 44.56 |
L: the number of active cluster(s) in GhostVLAD | ActivityNet Captions | 15 | HCQ_ActivityNet_L15.json | HCQ_ActivityNet_L15.txt | 21.33 | 52.15 | 68.07 | 90.16 | 5 | 30.00 | 42.31 | 22.07 | 52.92 | 68.31 | 90.46 | 5 | 29.26 | 43.05 |
L: the number of active cluster(s) in GhostVLAD | ActivityNet Captions | 31 | HCQ_ActivityNet_L31.json | HCQ_ActivityNet_L31.txt | 20.56 | 52.45 | 69.07 | 89.91 | 5 | 31.39 | 42.07 | 21.66 | 52.96 | 68.60 | 90.81 | 5 | 29.67 | 42.85 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-A) | 8 | HCQ_MSRVTT_1kA_M8.json | HCQ_MSRVTT_1kA_M8.txt | 23.00 | 52.00 | 65.00 | 87.00 | 5 | 32.93 | 42.68 | 21.40 | 52.40 | 65.50 | 88.20 | 5 | 30.19 | 41.88 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-A) | 16 | HCQ_MSRVTT_1kA_M16.json | HCQ_MSRVTT_1kA_M16.txt | 23.40 | 53.40 | 68.10 | 88.00 | 5 | 30.89 | 43.98 | 23.00 | 55.30 | 68.60 | 89.60 | 4 | 26.62 | 44.35 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-A) | 32 (default) | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-A) | 64 | HCQ_MSRVTT_1kA_M64.json | HCQ_MSRVTT_1kA_M64.txt | 27.20 | 56.80 | 69.10 | 89.30 | 4 | 26.93 | 47.44 | 26.10 | 58.10 | 71.40 | 90.70 | 4 | 23.82 | 47.66 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-B) | 8 | HCQ_MSRVTT_1kB_M8.json | HCQ_MSRVTT_1kB_M8.txt | 20.10 | 47.00 | 60.60 | 84.10 | 6.75 | 37.97 | 38.54 | 18.90 | 47.90 | 63.10 | 86.40 | 6 | 36.00 | 38.51 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-B) | 16 | HCQ_MSRVTT_1kB_M16.json | HCQ_MSRVTT_1kB_M16.txt | 22.50 | 49.50 | 62.70 | 85.90 | 6 | 33.82 | 41.18 | 21.10 | 52.10 | 65.60 | 87.10 | 5 | 32.43 | 41.62 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-B) | 32 (default) | HCQ_MSRVTT_1kB.json | HCQ_MSRVTT_1kB.txt | 22.50 | 51.50 | 65.90 | 86.10 | 5 | 33.65 | 42.43 | 23.70 | 52.20 | 66.90 | 88.10 | 5 | 29.30 | 43.58 |
M: the number of sub-codebooks in each quantization module | MSRVTT (1k-B) | 64 | HCQ_MSRVTT_1kB_M64.json | HCQ_MSRVTT_1kB_M64.txt | 24.50 | 51.60 | 66.20 | 87.70 | 5 | 31.31 | 43.74 | 23.60 | 54.30 | 67.40 | 88.80 | 4.75 | 27.56 | 44.20 |
M: the number of sub-codebooks in each quantization module | MSRVTT (Full) | 8 | HCQ_MSRVTT_full_M8.json | HCQ_MSRVTT_full_M8.txt | 11.61 | 33.44 | 46.86 | 75.82 | 12 | 62.06 | 26.30 | 11.91 | 36.99 | 51.77 | 82.31 | 10 | 44.63 | 28.36 |
M: the number of sub-codebooks in each quantization module | MSRVTT (Full) | 16 | HCQ_MSRVTT_full_M16.json | HCQ_MSRVTT_full_M16.txt | 12.81 | 36.45 | 50.17 | 79.06 | 10 | 52.58 | 28.61 | 14.55 | 41.07 | 55.85 | 84.75 | 8 | 37.39 | 32.20 |
M: the number of sub-codebooks in each quantization module | MSRVTT (Full) | 32 (default) | HCQ_MSRVTT_full.json | HCQ_MSRVTT_full.txt | 15.15 | 38.53 | 51.00 | 81.34 | 10 | 46.22 | 30.99 | 18.26 | 44.88 | 59.06 | 87.16 | 7 | 30.96 | 36.45 |
M: the number of sub-codebooks in each quantization module | MSRVTT (Full) | 64 | HCQ_MSRVTT_full_M64.json | HCQ_MSRVTT_full_M64.txt | 16.02 | 40.97 | 54.25 | 83.01 | 8 | 40.48 | 32.90 | 19.16 | 48.26 | 62.94 | 88.70 | 6 | 26.65 | 38.76 |
M: the number of sub-codebooks in each quantization module | LSMDC | 8 | HCQ_LSMDC_M8.json | HCQ_LSMDC_M8.txt | 12.60 | 29.00 | 38.60 | 64.30 | 22 | 84.53 | 24.16 | 10.40 | 29.20 | 39.10 | 64.20 | 21 | 78.32 | 22.81 |
M: the number of sub-codebooks in each quantization module | LSMDC | 16 | HCQ_LSMDC_M16.json | HCQ_LSMDC_M16.txt | 13.20 | 31.10 | 39.40 | 66.50 | 19 | 79.15 | 25.29 | 12.70 | 31.60 | 39.90 | 65.30 | 21 | 77.42 | 25.21 |
M: the number of sub-codebooks in each quantization module | LSMDC | 32 (default) | HCQ_LSMDC.json | HCQ_LSMDC.txt | 14.50 | 33.60 | 43.10 | 68.20 | 18.5 | 75.95 | 27.59 | 13.70 | 33.20 | 42.80 | 66.10 | 17 | 74.28 | 26.90 |
M: the number of sub-codebooks in each quantization module | LSMDC | 64 | HCQ_LSMDC_M64.json | HCQ_LSMDC_M64.txt | 14.80 | 33.00 | 43.60 | 69.10 | 16 | 72.80 | 27.72 | 14.10 | 32.30 | 40.80 | 67.40 | 19 | 72.64 | 26.49 |
M: the number of sub-codebooks in each quantization module | ActivityNet Captions | 8 | HCQ_ActivityNet_M8.json | HCQ_ActivityNet_M8.txt | 18.77 | 48.44 | 65.08 | 88.75 | 6 | 39.86 | 38.97 | 18.63 | 48.69 | 65.24 | 89.30 | 6 | 38.20 | 38.97 |
M: the number of sub-codebooks in each quantization module | ActivityNet Captions | 16 | HCQ_ActivityNet_M16.json | HCQ_ActivityNet_M16.txt | 20.56 | 51.86 | 67.93 | 89.89 | 5 | 35.07 | 41.68 | 20.68 | 52.10 | 68.09 | 90.44 | 5 | 32.72 | 41.87 |
M: the number of sub-codebooks in each quantization module | ActivityNet Captions | 32 (default) | HCQ_ActivityNet.json | HCQ_ActivityNet.txt | 22.19 | 53.69 | 70.12 | 91.21 | 5 | 30.71 | 43.72 | 23.00 | 54.85 | 70.14 | 91.38 | 5 | 29.08 | 44.56 |
M: the number of sub-codebooks in each quantization module | ActivityNet Captions | 64 | HCQ_ActivityNet_M64.json | HCQ_ActivityNet_M64.txt | 22.96 | 54.59 | 70.80 | 91.80 | 5 | 26.29 | 44.60 | 23.61 | 55.28 | 70.80 | 92.03 | 4 | 25.74 | 45.21 |
Batch size | MSRVTT (1k-A) | 16 | HCQ_MSRVTT_1kA_bs16.json | HCQ_MSRVTT_1kA_bs16.txt | 24.20 | 53.40 | 67.40 | 89.90 | 5 | 25.86 | 44.33 | 23.60 | 54.10 | 67.60 | 89.60 | 4 | 22.96 | 44.19 |
Batch size | MSRVTT (1k-A) | 32 | HCQ_MSRVTT_1kA_bs32.json | HCQ_MSRVTT_1kA_bs32.txt | 24.20 | 54.00 | 67.20 | 89.90 | 5 | 27.50 | 44.45 | 24.00 | 54.30 | 66.90 | 90.10 | 4 | 25.09 | 44.34 |
Batch size | MSRVTT (1k-A) | 64 | HCQ_MSRVTT_1kA_bs64.json | HCQ_MSRVTT_1kA_bs64.txt | 26.20 | 55.90 | 67.90 | 88.70 | 4 | 26.67 | 46.33 | 25.50 | 55.80 | 69.00 | 89.90 | 4 | 23.37 | 46.13 |
Batch size | MSRVTT (1k-A) | 128 (default) | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
Batch size | MSRVTT (1k-A) | 256 | HCQ_MSRVTT_1kA_bs256.json | HCQ_MSRVTT_1kA_bs256.txt | 25.50 | 55.30 | 67.50 | 89.20 | 4 | 26.80 | 45.66 | 26.00 | 55.80 | 68.70 | 90.50 | 4 | 23.47 | 46.36 |
Batch size | MSRVTT (1k-B) | 16 | HCQ_MSRVTT_1kB_bs16.json | HCQ_MSRVTT_1kB_bs16.txt | 22.00 | 49.40 | 64.50 | 87.60 | 6 | 31.45 | 41.23 | 18.50 | 51.80 | 66.20 | 89.60 | 5 | 26.30 | 39.88 |
Batch size | MSRVTT (1k-B) | 32 | HCQ_MSRVTT_1kB_bs32.json | HCQ_MSRVTT_1kB_bs32.txt | 22.60 | 49.20 | 65.10 | 87.10 | 6 | 32.03 | 41.68 | 21.40 | 52.30 | 65.90 | 88.20 | 5 | 28.20 | 41.94 |
Batch size | MSRVTT (1k-B) | 64 | HCQ_MSRVTT_1kB_bs64.json | HCQ_MSRVTT_1kB_bs64.txt | 23.60 | 50.70 | 64.60 | 86.60 | 5 | 33.26 | 42.60 | 21.10 | 51.60 | 64.60 | 89.00 | 5 | 28.00 | 41.28 |
Batch size | MSRVTT (1k-B) | 128 (default) | HCQ_MSRVTT_1kB.json | HCQ_MSRVTT_1kB.txt | 22.50 | 51.50 | 65.90 | 86.10 | 5 | 33.65 | 42.43 | 23.70 | 52.20 | 66.90 | 88.10 | 5 | 29.30 | 43.58 |
Batch size | MSRVTT (1k-B) | 256 | HCQ_MSRVTT_1kB_bs256.json | HCQ_MSRVTT_1kB_bs256.txt | 22.50 | 50.20 | 63.80 | 87.00 | 5 | 30.96 | 41.61 | 21.30 | 52.40 | 65.90 | 88.30 | 5 | 27.50 | 41.90 |
Batch size | MSRVTT (Full) | 16 | HCQ_MSRVTT_full_bs16.json | HCQ_MSRVTT_full_bs16.txt | 13.08 | 37.96 | 52.91 | 82.04 | 9 | 41.76 | 29.72 | 15.95 | 42.44 | 57.59 | 86.09 | 8 | 31.76 | 33.91 |
Batch size | MSRVTT (Full) | 32 | HCQ_MSRVTT_full_bs32.json | HCQ_MSRVTT_full_bs32.txt | 13.75 | 38.39 | 52.37 | 80.80 | 10 | 45.51 | 30.24 | 16.39 | 44.58 | 58.86 | 86.29 | 7 | 32.54 | 35.04 |
Batch size | MSRVTT (Full) | 64 | HCQ_MSRVTT_full_bs64.json | HCQ_MSRVTT_full_bs64.txt | 14.65 | 39.20 | 52.98 | 82.27 | 9 | 44.13 | 31.22 | 17.69 | 46.59 | 61.10 | 87.83 | 6 | 31.56 | 36.93 |
Batch size | MSRVTT (Full) | 128 (default) | HCQ_MSRVTT_full.json | HCQ_MSRVTT_full.txt | 15.15 | 38.53 | 51.00 | 81.34 | 10 | 46.22 | 30.99 | 18.26 | 44.88 | 59.06 | 87.16 | 7 | 30.96 | 36.45 |
Batch size | MSRVTT (Full) | 256 | HCQ_MSRVTT_full_bs256.json | HCQ_MSRVTT_full_bs256.txt | 14.21 | 39.06 | 52.47 | 82.81 | 9 | 40.74 | 30.77 | 16.92 | 46.15 | 59.70 | 87.63 | 7 | 28.24 | 35.99 |
Batch size | LSMDC | 16 | HCQ_LSMDC_bs16.json | HCQ_LSMDC_bs16.txt | 12.30 | 29.70 | 39.40 | 65.30 | 21 | 82.64 | 24.32 | 10.70 | 28.30 | 38.90 | 65.60 | 23 | 80.80 | 22.75 |
Batch size | LSMDC | 32 | HCQ_LSMDC_bs32.json | HCQ_LSMDC_bs32.txt | 12.30 | 30.00 | 38.70 | 66.30 | 20 | 79.95 | 24.26 | 12.10 | 28.70 | 39.10 | 63.50 | 23 | 80.79 | 23.86 |
Batch size | LSMDC | 64 | HCQ_LSMDC_bs64.json | HCQ_LSMDC_bs64.txt | 13.40 | 31.90 | 41.00 | 66.20 | 17 | 75.98 | 25.98 | 13.40 | 31.50 | 40.00 | 66.20 | 20 | 73.14 | 25.65 |
Batch size | LSMDC | 128 (default) | HCQ_LSMDC.json | HCQ_LSMDC.txt | 14.50 | 33.60 | 43.10 | 68.20 | 18.5 | 75.95 | 27.59 | 13.70 | 33.20 | 42.80 | 66.10 | 17 | 74.28 | 26.90 |
Batch size | LSMDC | 256 | HCQ_LSMDC_bs256.json | HCQ_LSMDC_bs256.txt | 14.30 | 34.80 | 43.60 | 69.30 | 16 | 74.04 | 27.89 | 14.30 | 33.50 | 42.50 | 67.70 | 16 | 71.84 | 27.31 |
Batch size | ActivityNet Captions | 16 | HCQ_ActivityNet_bs16.json | HCQ_ActivityNet_bs16.txt | 21.31 | 52.55 | 70.59 | 92.19 | 5 | 27.31 | 42.92 | 22.25 | 53.18 | 70.41 | 92.33 | 5 | 26.57 | 43.68 |
Batch size | ActivityNet Captions | 32 (default) | HCQ_ActivityNet.json | HCQ_ActivityNet.txt | 22.19 | 53.69 | 70.12 | 91.21 | 5 | 30.71 | 43.72 | 23.00 | 54.85 | 70.14 | 91.38 | 5 | 29.08 | 44.56 |
Batch size | ActivityNet Captions | 64 | HCQ_ActivityNet_bs64.json | HCQ_ActivityNet_bs64.txt | 20.62 | 51.60 | 66.91 | 88.94 | 5 | 33.61 | 41.45 | 20.58 | 51.64 | 67.76 | 89.40 | 5 | 31.52 | 41.61 |
Batch size | ActivityNet Captions | 128 | HCQ_ActivityNet_bs128.json | HCQ_ActivityNet_bs128.txt | 19.36 | 48.61 | 64.86 | 88.41 | 6 | 35.38 | 39.37 | 19.22 | 49.68 | 66.04 | 89.12 | 6 | 33.15 | 39.80 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.03 | HCQ_MSRVTT_1kA_t0.03.json | HCQ_MSRVTT_1kA_t0.03.txt | 24.90 | 56.50 | 68.80 | 88.80 | 4 | 26.95 | 45.91 | 25.10 | 53.90 | 69.10 | 89.70 | 4 | 24.91 | 45.39 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.05 | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.07 | HCQ_MSRVTT_1kA_t0.07.json | HCQ_MSRVTT_1kA_t0.07.txt | 25.40 | 52.80 | 67.50 | 88.60 | 5 | 30.40 | 44.90 | 25.90 | 57.00 | 68.00 | 90.00 | 4 | 27.78 | 46.48 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.1 | HCQ_MSRVTT_1kA_t0.1.json | HCQ_MSRVTT_1kA_t0.1.txt | 23.90 | 52.10 | 66.20 | 87.10 | 5 | 32.74 | 43.52 | 22.50 | 54.00 | 67.10 | 87.70 | 5 | 31.09 | 43.36 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.12 | HCQ_MSRVTT_1kA_t0.12.json | HCQ_MSRVTT_1kA_t0.12.txt | 22.60 | 49.60 | 65.00 | 87.90 | 6 | 34.53 | 41.77 | 21.20 | 50.80 | 65.10 | 87.30 | 5 | 33.46 | 41.23 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-A) | 0.15 | HCQ_MSRVTT_1kA_t0.15.json | HCQ_MSRVTT_1kA_t0.15.txt | 18.20 | 44.50 | 60.20 | 86.80 | 7 | 36.74 | 36.53 | 16.50 | 46.80 | 61.40 | 85.80 | 6 | 35.20 | 36.19 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.03 | HCQ_MSRVTT_1kB_t0.03.json | HCQ_MSRVTT_1kB_t0.03.txt | 23.10 | 51.90 | 63.40 | 88.20 | 5 | 30.89 | 42.36 | 22.90 | 51.70 | 65.60 | 88.10 | 5 | 25.72 | 42.67 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.05 | HCQ_MSRVTT_1kB.json | HCQ_MSRVTT_1kB.txt | 22.50 | 51.50 | 65.90 | 86.10 | 5 | 33.65 | 42.43 | 23.70 | 52.20 | 66.90 | 88.10 | 5 | 29.30 | 43.58 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.07 | HCQ_MSRVTT_1kB_t0.07.json | HCQ_MSRVTT_1kB_t0.07.txt | 23.90 | 49.90 | 63.50 | 86.70 | 6 | 34.78 | 42.31 | 22.70 | 52.10 | 65.30 | 87.40 | 5 | 32.91 | 42.59 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.1 | HCQ_MSRVTT_1kB_t0.1.json | HCQ_MSRVTT_1kB_t0.1.txt | 19.90 | 50.70 | 63.80 | 86.80 | 5 | 35.51 | 40.08 | 19.90 | 50.70 | 65.00 | 87.20 | 5 | 34.81 | 40.33 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.12 | HCQ_MSRVTT_1kB_t0.12.json | HCQ_MSRVTT_1kB_t0.12.txt | 19.00 | 46.30 | 61.00 | 86.40 | 7 | 35.89 | 37.72 | 18.30 | 48.20 | 61.30 | 86.60 | 6 | 35.56 | 37.81 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (1k-B) | 0.15 | HCQ_MSRVTT_1kB_t0.15.json | HCQ_MSRVTT_1kB_t0.15.txt | 15.60 | 43.20 | 56.70 | 84.50 | 8 | 40.02 | 33.68 | 14.70 | 44.20 | 57.90 | 85.80 | 7 | 39.38 | 33.51 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.03 | HCQ_MSRVTT_full_t0.03.json | HCQ_MSRVTT_full_t0.03.txt | 14.11 | 38.29 | 50.77 | 80.00 | 10 | 45.90 | 30.16 | 16.32 | 45.45 | 59.80 | 86.86 | 7 | 31.64 | 35.40 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.05 | HCQ_MSRVTT_full.json | HCQ_MSRVTT_full.txt | 15.15 | 38.53 | 51.00 | 81.34 | 10 | 46.22 | 30.99 | 18.26 | 44.88 | 59.06 | 87.16 | 7 | 30.96 | 36.45 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.07 | HCQ_MSRVTT_full_t0.07.json | HCQ_MSRVTT_full_t0.07.txt | 14.15 | 37.89 | 51.17 | 81.30 | 10 | 46.22 | 30.16 | 16.72 | 43.18 | 58.09 | 85.95 | 8 | 33.70 | 34.75 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.1 | HCQ_MSRVTT_full_t0.1.json | HCQ_MSRVTT_full_t0.1.txt | 13.58 | 36.56 | 49.06 | 80.43 | 11 | 49.80 | 28.99 | 14.35 | 39.13 | 53.65 | 84.15 | 9 | 39.70 | 31.11 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.12 | HCQ_MSRVTT_full_t0.12.json | HCQ_MSRVTT_full_t0.12.txt | 12.31 | 34.25 | 49.13 | 79.50 | 11 | 50.45 | 27.46 | 12.24 | 35.65 | 50.64 | 82.98 | 10 | 44.35 | 28.06 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | MSRVTT (Full) | 0.15 | HCQ_MSRVTT_full_t0.15.json | HCQ_MSRVTT_full_t0.15.txt | 10.10 | 30.64 | 43.88 | 76.79 | 14 | 55.40 | 23.86 | 9.16 | 29.90 | 45.69 | 79.00 | 13 | 53.01 | 23.22 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.03 | HCQ_LSMDC_t0.03.json | HCQ_LSMDC_t0.03.txt | 14.90 | 32.00 | 42.50 | 66.20 | 18 | 76.14 | 27.26 | 12.90 | 31.80 | 40.80 | 66.80 | 20 | 72.31 | 25.58 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.05 | HCQ_LSMDC.json | HCQ_LSMDC.txt | 14.50 | 33.60 | 43.10 | 68.20 | 18.5 | 75.95 | 27.59 | 13.70 | 33.20 | 42.80 | 66.10 | 17 | 74.28 | 26.90 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.07 | HCQ_LSMDC_t0.07.json | HCQ_LSMDC_t0.07.txt | 12.80 | 32.30 | 43.40 | 67.70 | 17 | 75.92 | 26.18 | 12.80 | 32.70 | 42.90 | 67.30 | 17 | 76.30 | 26.19 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.1 | HCQ_LSMDC_t0.1.json | HCQ_LSMDC_t0.1.txt | 12.50 | 30.10 | 40.80 | 66.90 | 18 | 81.02 | 24.85 | 11.80 | 29.00 | 40.30 | 64.20 | 19 | 82.29 | 23.98 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.12 | HCQ_LSMDC_t0.12.json | HCQ_LSMDC_t0.12.txt | 12.00 | 28.10 | 38.80 | 66.40 | 20 | 81.93 | 23.56 | 11.90 | 27.60 | 39.60 | 64.80 | 20 | 84.15 | 23.52 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | LSMDC | 0.15 | HCQ_LSMDC_t0.15.json | HCQ_LSMDC_t0.15.txt | 10.70 | 26.10 | 36.00 | 64.90 | 23 | 82.81 | 21.58 | 9.10 | 24.00 | 35.10 | 62.80 | 25 | 88.27 | 19.72 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.03 | HCQ_ActivityNet_t0.03.json | HCQ_ActivityNet_t0.03.txt | 22.15 | 52.78 | 68.58 | 91.38 | 5 | 26.42 | 43.12 | 21.74 | 52.47 | 68.70 | 91.38 | 5 | 26.65 | 42.79 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.05 | HCQ_ActivityNet.json | HCQ_ActivityNet.txt | 21.96 | 53.30 | 68.99 | 90.89 | 5 | 29.67 | 43.23 | 21.94 | 52.94 | 69.21 | 90.69 | 5 | 29.12 | 43.16 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.07 | HCQ_ActivityNet_t0.07.json | HCQ_ActivityNet_t0.07.txt | 22.19 | 53.69 | 70.12 | 91.21 | 5 | 30.71 | 43.72 | 23.00 | 54.85 | 70.14 | 91.38 | 5 | 29.08 | 44.56 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.1 | HCQ_ActivityNet_t0.1.json | HCQ_ActivityNet_t0.1.txt | 22.11 | 52.08 | 68.23 | 91.34 | 5 | 28.34 | 42.83 | 21.72 | 53.33 | 69.60 | 91.60 | 5 | 27.19 | 43.20 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.12 | HCQ_ActivityNet_t0.12.json | HCQ_ActivityNet_t0.12.txt | 19.20 | 50.52 | 67.99 | 91.95 | 5 | 30.12 | 40.40 | 20.09 | 51.66 | 68.23 | 91.89 | 5 | 29.16 | 41.37 |
τ: the temperature factor in contrastive learning loss (Eq.(13)) | ActivityNet Captions | 0.15 | HCQ_ActivityNet_t0.15.json | HCQ_ActivityNet_t0.15.txt | 17.00 | 47.14 | 65.49 | 91.42 | 6 | 31.43 | 37.44 | 18.59 | 48.81 | 65.30 | 91.84 | 6 | 32.65 | 38.99 |
3.1.4 Results of HCQ with different kinds of text encoders ("1k-A" split) (reported in Table 5 in our paper)
Model | Text Encoder | Config json | Log | Text-to-video R@1 | Text-to-video R@5 | Text-to-video R@10 | Text-to-video R@50 | Text-to-video median rank | Text-to-video mean rank | Text-to-video geometric mean of recall@{1,5,10} | Video-to-text R@1 | Video-to-text R@5 | Video-to-text R@10 | Video-to-text R@50 | Video-to-text median rank | Video-to-text mean rank | Video-to-text geometric mean of recall@{1,5,10} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HCQ | bert-base (default) | HCQ_MSRVTT_1kA.json | HCQ_MSRVTT_1kA.txt | 25.90 | 54.80 | 69.00 | 88.80 | 5 | 28.06 | 46.09 | 26.30 | 57.00 | 70.10 | 90.00 | 4 | 25.15 | 47.19 |
HCQ | BERT-large | HCQ_MSRVTT_1kA_bert-large.json | HCQ_MSRVTT_1kA_bert-large.txt | 27.40 | 57.70 | 70.70 | 89.60 | 4 | 27.09 | 48.17 | 26.20 | 59.00 | 71.80 | 89.50 | 4 | 25.47 | 48.06 |
HCQ | DistilBERT-base | HCQ_MSRVTT_1kA_distilbert-base.json | HCQ_MSRVTT_1kA_distilbert-base.txt | 25.40 | 54.20 | 67.30 | 89.80 | 4 | 27.00 | 45.25 | 26.30 | 56.40 | 69.00 | 90.10 | 4 | 24.22 | 46.78 |
HCQ | RoBERTa-base | HCQ_MSRVTT_1kA_roberta-base.json | HCQ_MSRVTT_1kA_roberta-base.txt | 25.50 | 54.70 | 67.80 | 89.20 | 5 | 27.04 | 45.56 | 24.50 | 55.00 | 69.00 | 90.20 | 4 | 23.80 | 45.30 |
HCQ | RoBERTa-large | HCQ_MSRVTT_1kA_roberta-large.json | HCQ_MSRVTT_1kA_roberta-large.txt | 28.00 | 55.40 | 68.50 | 88.10 | 4 | 30.67 | 47.36 | 27.00 | 59.00 | 68.40 | 88.50 | 4 | 27.41 | 47.76 |
HCQ | XLNet-base | HCQ_MSRVTT_1kA_xlnet-base.json | HCQ_MSRVTT_1kA_xlnet-base.txt | 25.80 | 56.20 | 68.70 | 87.50 | 5 | 28.35 | 46.36 | 24.60 | 55.50 | 69.00 | 88.40 | 4 | 25.59 | 45.50 |
HCQ | XLNet-large | HCQ_MSRVTT_1kA_xlnet-large.json | HCQ_MSRVTT_1kA_xlnet-large.txt | 25.00 | 53.00 | 66.60 | 88.20 | 5 | 27.59 | 44.52 | 25.30 | 54.50 | 68.00 | 89.10 | 4 | 23.69 | 45.43 |
If your platform has enough RAM and you want to accelerate training, you can load the whole dataset into RAM with the following modification:
# WWW22-HCQ/base/base_dataset.py:L170
load_in_ram=True, # change from 'False' to 'True'
3.2 Evaluation from checkpoint
We can evaluate the model from the checkpoint without re-training. The evaluation command:
python -m train --config configs/HCQ_MSRVTT_1kA.json --only_eval --load_checkpoint HCQ_MSRVTT_1kA.pth
We provide the checkpoint of HCQ_MSRVTT_1kA.json as an example. You can download this file (~1.6G) from Google Drive and put it in the working directory (WWW22-HCQ/).
3.3 Evaluation for post-compression methods
Take the evaluation on MSRVTT dataset ("1k-A" split) as an example. First, we need to train an HCT.
# working directory: WWW22-HCQ/
python -m train --config configs/HCT_MSRVTT_1kA.json
Then, run get_embed.py and pass the path of the HCT checkpoint to the script:
python -m get_embed configs/HCT_MSRVTT_1kA.json --only_eval --load_checkpoint HCT_MSRVTT_1kA/trained_model.pth
After that, we will get the embedding file embeddings.h5 under WWW22-HCQ/exps/HCT_MSRVTT_1kA/. Run compress_embed.py to get the results:
# compress embeddings with LSH
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type LSH
# compress embeddings with PQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type PQ
# compress embeddings with OPQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type OPQ
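To give a rough idea of what post-compression does to the embeddings, the following sketch implements sign-random-projection hashing, one common form of LSH, in plain Python. It is an illustration only and is not taken from compress_embed.py, whose actual LSH, PQ and OPQ settings may differ; all function names, the code length and the toy vectors are assumptions.

```python
import random
from typing import List


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `n_bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vector: List[float], hyperplanes: List[List[float]]) -> List[int]:
    """One bit per hyperplane, set when the projection is non-negative."""
    return [
        1 if sum(v * w for v, w in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming_distance(code_a: List[int], code_b: List[int]) -> int:
    """Number of differing bits between two equal-length codes."""
    return sum(a != b for a, b in zip(code_a, code_b))


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16)
    query = [0.9, 0.1, 0.0, 0.2]   # toy "text" embedding
    near = [1.0, 0.0, 0.1, 0.2]    # toy "video" embedding, similar direction
    far = [-1.0, 0.3, -0.5, 0.0]   # toy "video" embedding, different direction
    print(hamming_distance(lsh_code(query, planes), lsh_code(near, planes)))
    print(hamming_distance(lsh_code(query, planes), lsh_code(far, planes)))
```

Each embedding is reduced to a short binary code, and retrieval compares codes by Hamming distance instead of comparing full-precision vectors. Vectors that point in similar directions tend to agree on more bits, which is why the compressed codes can still rank candidates, at some cost in accuracy.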
4. References
If you find this code useful or use the toolkit in your work, please consider citing:
@inproceedings{wang22hcq,
author={Wang, Jinpeng and Chen, Bin and Liao, Dongliang and Zeng, Ziyun and Li, Gongfu and Xia, Shu-Tao and Xu, Jin},
title={Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval},
booktitle={Proceedings of the Web Conference 2022},
doi={10.1145/3485447.3512022}
}
5. Acknowledgements
Our code is based on the implementations of nanopq, Multi-Modal Transformer, Collaborative Experts, Transformers and Mixture of Embedding Experts.
6. Contact
If you have any questions, you can raise an issue or email Jinpeng Wang ([email protected]). We will reply to you soon.