Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Overview

Optimizing Dense Retrieval Model Training with Hard Negatives

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

This repo provides code, retrieval results, and trained models for our SIGIR Full paper Optimizing Dense Retrieval Model Training with Hard Negatives. The previous version is Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.

We achieve very impressive retrieval results on both passage and document retrieval bechmarks. The proposed two algorithms (STAR and ADORE) are very efficient. IMHO, they are well worth trying and most likely improve your retriever's performance by a large margin.

The following figure shows the pros and cons of different training methods. You can train an effective Dense Retrieval model in three steps. Firstly, warmup your model using random negatives or BM25 top negatives. Secondly, use our proposed STAR to train the query encoder and document encoder. Thirdly, use our proposed ADORE to train the query encoder. image

Retrieval Results and Trained Models

Passage Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.264 0.837 0.583 Model
Rand-Neg 0.301 0.853 0.612 Model
STAR 0.340 0.867 0.642 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.316 0.860 0.658 Model
ADORE (Rand-Neg) 0.326 0.865 0.661 Model
ADORE (STAR) 0.347 0.876 0.683 Model Train Dev TRECTest Leaderboard
Doc Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.320 0.864 0.544 Model
Rand-Neg 0.330 0.859 0.572 Model
STAR 0.390 0.867 0.605 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.362 0.884 0.580 Model
ADORE (Rand-Neg) 0.361 0.885 0.585 Model
ADORE (STAR) 0.405 0.919 0.628 Model Train Dev TRECTest Leaderboard

If you want to use our first-stage leaderboard runs, contact me and I will send you the file.

If any links fail or the files go wrong, please contact me or open a issue.

Requirements

To install requirements, run the following commands:

git clone [email protected]:jingtaozhan/DRhard.git
cd DRhard
python setup.py install

However, you need to set up a new python enverionment for data preprocessing (see below).

Data Download

To download all the needed data, run:

bash download_data.sh

Data Preprocess

You need to set up a new environment with transformers==2.8.0 to tokenize the text. This is because we find the tokenizer behaves differently among versions 2, 3 and 4. To replicate the results in our paper with our provided trained models, it is necessary to use version 2.8.0 for preprocessing. Otherwise, you may need to re-train the DR models.

Run the following codes.

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Inference

With our provided trained models, you can easily replicate our reported experimental results. Note that minor variance may be observed due to environmental difference.

STAR

The following codes use the provided STAR model to compute query/passage embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev   
python ./star/inference.py --data_type doc --max_doc_length 512 --mode dev   

Run the following code to evaluate on MSMARCO Passage dataset.

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv
Eval Started
#####################
MRR @10: 0.3404237731386721
QueriesRanked: 6980
#####################

Run the following code to evaluate on MSMARCO Document dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/star/dev.rank.tsv 100
Eval Started
#####################
MRR @100: 0.3903422772218344
QueriesRanked: 5193
#####################

ADORE

ADORE computes the query embeddings. The document embeddings are pre-computed by other DR models, like STAR. The following codes use the provided ADORE(STAR) model to compute query embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./adore/inference.py --model_dir ./data/passage/trained_models/adore-star --output_dir ./data/passage/evaluate/adore-star --preprocess_dir ./data/passage/preprocess --mode dev --dmemmap_path ./data/passage/evaluate/star/passages.memmap
python ./adore/inference.py --model_dir ./data/doc/trained_models/adore-star --output_dir ./data/doc/evaluate/adore-star --preprocess_dir ./data/doc/preprocess --mode dev --dmemmap_path ./data/doc/evaluate/star/passages.memmap

Evaluate ADORE(STAR) model on dev passage dataset:

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Evaluate ADORE(STAR) model on dev document dataset:

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/adore-star/dev.rank.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4049777020859768
QueriesRanked: 5193
#####################

Convert QID/PID Back

Our data preprocessing reassigns new ids for each query and document. Therefore, you may want to convert the ids back. We provide a script for this.

The following code shows an example to convert ADORE-STAR's ranking results on the dev passage dataset.

python ./cvt_back.py --input_dir ./data/passage/evaluate/adore-star/ --preprocess_dir ./data/passage/preprocess --output_dir ./data/passage/official_runs/adore-star --mode dev --dataset passage
python ./msmarco_eval.py ./data/passage/dataset/qrels.dev.small.tsv ./data/passage/official_runs/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Train

In the following instructions, we show how to replicate our experimental results on MSMARCO Passage Retrieval task.

STAR

We use the same warmup model as ANCE, the most competitive baseline, to enable a fair comparison. Please download it and extract it at ./data/passage/warmup

Next, we use this warmup model to extract static hard negatives, which will be utilized by STAR.

python ./star/prepare_hardneg.py \
--data_type passage \
--max_query_length 32 \
--max_doc_length 256 \
--mode dev \
--topk 200

It will automatically use all available gpus to retrieve documents. If all available cuda memory is less than 26GB (the index size), you can add --not_faiss_cuda to use CPU for retrieval.

Run the following command to train the DR model with STAR. In our experiments, we only use one GPU to train.

python ./star/train.py --do_train \
    --max_query_length 24 \
    --max_doc_length 120 \
    --preprocess_dir ./data/passage/preprocess \
    --hardneg_path ./data/passage/warmup_retrieve/hard.json \
    --init_path ./data/passage/warmup \
    --output_dir ./data/passage/star_train/models \
    --logging_dir ./data/passage/star_train/log \
    --optimizer_str lamb \
    --learning_rate 1e-4 \
    --gradient_checkpointing --fp16

Although we set number of training epcohs a very large value in the script, it is likely to converge within 50k steps (1.5 days) and you can manually kill the process. Using multiple gpus should speed up a lot, which requires some changes in the codes.

ADORE

Now we show how to use ADORE to finetune the query encoder. Here we use our provided STAR checkpoint as the fixed document encoder. You can also use another document encoder.

The passage embeddings by STAR should be located at ./data/passage/evaluate/star/passages.memmap. If not, follow the STAR inference procedure as shown above.

python ./adore/train.py \
--metric_cut 200 \
--init_path ./data/passage/trained_models/star \
--pembed_path ./data/passage/evaluate/star/passages.memmap \
--model_save_dir ./data/passage/adore_train/models \
--log_dir ./data/passage/adore_train/log \
--preprocess_dir ./data/passage/preprocess \
--model_gpu_index 0 \
--faiss_gpu_index 1 2 3

The above command uses the first gpu for encoding, and the 2nd~4th gpu for dense retrieval. You can change the faiss_gpu_index values based on your available cuda memory. For example, if you have a 32GB gpu, you can set model_gpu_index and faiss_gpu_index both to 0 because the CUDA memory is large enough. But if you only have 11GB gpus, three gpus are required for faiss.

Empirically, ADORE significantly improves retrieval performance after training for only one epoch, which only costs 1 hour if using GPUs to retrieve dynamic hard negatives.

Owner
Jingtao Zhan
IR Researcher, Ph.D student at Tsinghua University.
Jingtao Zhan
I created My own Virtual Artificial Intelligence named genesis, He can assist with my Tasks and also perform some analysis,,

Virtual-Artificial-Intelligence-genesis- I created My own Virtual Artificial Intelligence named genesis, He can assist with my Tasks and also perform

AKASH M 1 Nov 05, 2021
PHOTONAI is a high level python API for designing and optimizing machine learning pipelines.

PHOTONAI is a high level python API for designing and optimizing machine learning pipelines. We've created a system in which you can easily select and

Medical Machine Learning Lab - University of Münster 57 Nov 12, 2022
A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation

Aboleth A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation [1] with stochastic gradient variational Bayes

Gradient Institute 127 Dec 12, 2022
LWCC: A LightWeight Crowd Counting library for Python that includes several pretrained state-of-the-art models.

LWCC: A LightWeight Crowd Counting library for Python LWCC is a lightweight crowd counting framework for Python. It wraps four state-of-the-art models

Matija Teršek 39 Dec 28, 2022
Deep Multimodal Neural Architecture Search

MMNas: Deep Multimodal Neural Architecture Search This repository corresponds to the PyTorch implementation of the MMnas for visual question answering

Vision and Language Group@ MIL 23 Dec 21, 2022
Gradient representations in ReLU networks as similarity functions

Gradient representations in ReLU networks as similarity functions by Dániel Rácz and Bálint Daróczy. This repo contains the python code related to our

1 Oct 08, 2021
Implementing yolov4 target detection and tracking based on nao robot

Implementing yolov4 target detection and tracking based on nao robot

6 Apr 19, 2022
Group R-CNN for Point-based Weakly Semi-supervised Object Detection (CVPR2022)

Group R-CNN for Point-based Weakly Semi-supervised Object Detection (CVPR2022) By Shilong Zhang*, Zhuoran Yu*, Liyang Liu*, Xinjiang Wang, Aojun Zhou,

Shilong Zhang 129 Dec 24, 2022
CSD: Consistency-based Semi-supervised learning for object Detection

CSD: Consistency-based Semi-supervised learning for object Detection (NeurIPS 2019) By Jisoo Jeong, Seungeui Lee, Jee-soo Kim, Nojun Kwak Installation

80 Dec 15, 2022
Scalable Graph Neural Networks for Heterogeneous Graphs

Neighbor Averaging over Relation Subgraphs (NARS) NARS is an algorithm for node classification on heterogeneous graphs, based on scalable neighbor ave

Facebook Research 67 Dec 03, 2022
Official tensorflow implementation for CVPR2020 paper “Learning to Cartoonize Using White-box Cartoon Representations”

Tensorflow implementation for CVPR2020 paper “Learning to Cartoonize Using White-box Cartoon Representations”.

3.7k Dec 31, 2022
Predictive Maintenance LSTM

Predictive-Maintenance-LSTM - Predictive maintenance study for Complex case study, we've obtained failure causes by operational error and more deeply by design mistakes.

Amir M. Sadafi 1 Dec 31, 2021
Tutel MoE: An Optimized Mixture-of-Experts Implementation

Project Tutel Tutel MoE: An Optimized Mixture-of-Experts Implementation. Supported Framework: Pytorch Supported GPUs: CUDA(fp32 + fp16), ROCm(fp32) Ho

Microsoft 344 Dec 29, 2022
Gapmm2: gapped alignment using minimap2 (align transcripts to genome)

gapmm2: gapped alignment using minimap2 This tool is a wrapper for minimap2 to r

Jon Palmer 2 Jan 27, 2022
Official Implementation for Fast Training of Neural Lumigraph Representations using Meta Learning.

Fast Training of Neural Lumigraph Representations using Meta Learning Project Page | Paper | Data Alexander W. Bergman, Petr Kellnhofer, Gordon Wetzst

Alex 39 Oct 08, 2022
Official PyTorch implementation for "Low Precision Decentralized Distributed Training with Heterogenous Data"

Low Precision Decentralized Training with Heterogenous Data Official PyTorch implementation for "Low Precision Decentralized Distributed Training with

Aparna Aketi 0 Nov 23, 2021
BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

Pre-trained checkpoint and bert config json file Location of checkpoint and bert config json file This MLCommons members Google Drive location contain

SAIT (Samsung Advanced Institute of Technology) 12 Apr 27, 2022
Implementation of "RaScaNet: Learning Tiny Models by Raster-Scanning Image" from CVPR 2021.

RaScaNet: Learning Tiny Models by Raster-Scanning Images Deploying deep convolutional neural networks on ultra-low power systems is challenging, becau

SAIT (Samsung Advanced Institute of Technology) 5 Dec 26, 2022
Parasite: a tool allowing you to compress and decompress files, to reduce their size

🦠 Parasite 🦠 Parasite is a tool written in Python3 allowing you to "compress" any file, reducing its size. ⭐ Features ⭐ + Fast + Good optimization,

Billy 30 Nov 25, 2022
On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation (Findings of EMNLP 2021))

PTvsBT On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation (Findings of EMNLP 2021) Citation Please cite a

Sunbow Liu 10 Nov 25, 2022