Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Overview

Optimizing Dense Retrieval Model Training with Hard Negatives

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

This repo provides code, retrieval results, and trained models for our SIGIR Full paper Optimizing Dense Retrieval Model Training with Hard Negatives. The previous version is Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.

We achieve very impressive retrieval results on both passage and document retrieval bechmarks. The proposed two algorithms (STAR and ADORE) are very efficient. IMHO, they are well worth trying and most likely improve your retriever's performance by a large margin.

The following figure shows the pros and cons of different training methods. You can train an effective Dense Retrieval model in three steps. Firstly, warmup your model using random negatives or BM25 top negatives. Secondly, use our proposed STAR to train the query encoder and document encoder. Thirdly, use our proposed ADORE to train the query encoder. image

Retrieval Results and Trained Models

Passage Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.264 0.837 0.583 Model
Rand-Neg 0.301 0.853 0.612 Model
STAR 0.340 0.867 0.642 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.316 0.860 0.658 Model
ADORE (Rand-Neg) 0.326 0.865 0.661 Model
ADORE (STAR) 0.347 0.876 0.683 Model Train Dev TRECTest Leaderboard
Doc Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.320 0.864 0.544 Model
Rand-Neg 0.330 0.859 0.572 Model
STAR 0.390 0.867 0.605 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.362 0.884 0.580 Model
ADORE (Rand-Neg) 0.361 0.885 0.585 Model
ADORE (STAR) 0.405 0.919 0.628 Model Train Dev TRECTest Leaderboard

If you want to use our first-stage leaderboard runs, contact me and I will send you the file.

If any links fail or the files go wrong, please contact me or open a issue.

Requirements

To install requirements, run the following commands:

git clone [email protected]:jingtaozhan/DRhard.git
cd DRhard
python setup.py install

However, you need to set up a new python enverionment for data preprocessing (see below).

Data Download

To download all the needed data, run:

bash download_data.sh

Data Preprocess

You need to set up a new environment with transformers==2.8.0 to tokenize the text. This is because we find the tokenizer behaves differently among versions 2, 3 and 4. To replicate the results in our paper with our provided trained models, it is necessary to use version 2.8.0 for preprocessing. Otherwise, you may need to re-train the DR models.

Run the following codes.

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Inference

With our provided trained models, you can easily replicate our reported experimental results. Note that minor variance may be observed due to environmental difference.

STAR

The following codes use the provided STAR model to compute query/passage embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev   
python ./star/inference.py --data_type doc --max_doc_length 512 --mode dev   

Run the following code to evaluate on MSMARCO Passage dataset.

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv
Eval Started
#####################
MRR @10: 0.3404237731386721
QueriesRanked: 6980
#####################

Run the following code to evaluate on MSMARCO Document dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/star/dev.rank.tsv 100
Eval Started
#####################
MRR @100: 0.3903422772218344
QueriesRanked: 5193
#####################

ADORE

ADORE computes the query embeddings. The document embeddings are pre-computed by other DR models, like STAR. The following codes use the provided ADORE(STAR) model to compute query embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./adore/inference.py --model_dir ./data/passage/trained_models/adore-star --output_dir ./data/passage/evaluate/adore-star --preprocess_dir ./data/passage/preprocess --mode dev --dmemmap_path ./data/passage/evaluate/star/passages.memmap
python ./adore/inference.py --model_dir ./data/doc/trained_models/adore-star --output_dir ./data/doc/evaluate/adore-star --preprocess_dir ./data/doc/preprocess --mode dev --dmemmap_path ./data/doc/evaluate/star/passages.memmap

Evaluate ADORE(STAR) model on dev passage dataset:

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Evaluate ADORE(STAR) model on dev document dataset:

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/adore-star/dev.rank.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4049777020859768
QueriesRanked: 5193
#####################

Convert QID/PID Back

Our data preprocessing reassigns new ids for each query and document. Therefore, you may want to convert the ids back. We provide a script for this.

The following code shows an example to convert ADORE-STAR's ranking results on the dev passage dataset.

python ./cvt_back.py --input_dir ./data/passage/evaluate/adore-star/ --preprocess_dir ./data/passage/preprocess --output_dir ./data/passage/official_runs/adore-star --mode dev --dataset passage
python ./msmarco_eval.py ./data/passage/dataset/qrels.dev.small.tsv ./data/passage/official_runs/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Train

Instructions will be ready this weekend (7.18).

Owner
Jingtao Zhan
Ph.D at THUIR.
Jingtao Zhan
A flexible and extensible framework for gait recognition.

A flexible and extensible framework for gait recognition. You can focus on designing your own models and comparing with state-of-the-arts easily with the help of OpenGait.

Shiqi Yu 335 Dec 22, 2022
Two types of Recommender System : Content-based Recommender System and Colaborating filtering based recommender system

Recommender-Systems Two types of Recommender System : Content-based Recommender System and Colaborating filtering based recommender system So the data

Yash Kumar 0 Jan 20, 2022
Official code repository for Continual Learning In Environments With Polynomial Mixing Times

Official code for Continual Learning In Environments With Polynomial Mixing Times Continual Learning in Environments with Polynomial Mixing Times This

Sharath Raparthy 1 Dec 19, 2021
This repo is to be freely used by ML devs to check the GAN performances without coding from scratch.

GANs for Fun Created because I can! GOAL The goal of this repo is to be freely used by ML devs to check the GAN performances without coding from scrat

Sagnik Roy 13 Jan 26, 2022
A simple code to perform canny edge contrast detection on images.

CECED-Canny-Edge-Contrast-Enhanced-Detection A simple code to perform canny edge contrast detection on images. A simple code to process images using c

Happy N. Monday 3 Feb 15, 2022
This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by Divam Gupta, Wei Pu, Trenton Tabor, Jeff Schneider

SBEVNet: End-to-End Deep Stereo Layout Estimation This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by D

Divam Gupta 19 Dec 17, 2022
Sample Prior Guided Robust Model Learning to Suppress Noisy Labels

PGDF This repo is the official implementation of our paper "Sample Prior Guided Robust Model Learning to Suppress Noisy Labels ". Citation If you use

CVSM Group - email: <a href=[email protected]"> 22 Dec 23, 2022
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis (CVPR2022)

Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis Multi-View Consistent Generative Adversarial Networks for 3D-aware

Xuanmeng Zhang 78 Dec 10, 2022
SAT: 2D Semantics Assisted Training for 3D Visual Grounding, ICCV 2021 (Oral)

SAT: 2D Semantics Assisted Training for 3D Visual Grounding SAT: 2D Semantics Assisted Training for 3D Visual Grounding by Zhengyuan Yang, Songyang Zh

Zhengyuan Yang 22 Nov 30, 2022
Free course that takes you from zero to Reinforcement Learning PRO πŸ¦ΈπŸ»β€πŸ¦ΈπŸ½

The Hands-on Reinforcement Learning course πŸš€ From zero to HERO πŸ¦ΈπŸ»β€πŸ¦ΈπŸ½ Out of intense complexities, intense simplicities emerge. -- Winston Churchi

Pau Labarta Bajo 260 Dec 28, 2022
Python package for downloading ECMWF reanalysis data and converting it into a time series format.

ecmwf_models Readers and converters for data from the ECMWF reanalysis models. Written in Python. Works great in combination with pytesmo. Citation If

TU Wien - Department of Geodesy and Geoinformation 31 Dec 26, 2022
Next-gen Rowhammer fuzzer that uses non-uniform, frequency-based patterns.

Blacksmith Rowhammer Fuzzer This repository provides the code accompanying the paper Blacksmith: Scalable Rowhammering in the Frequency Domain that is

Computer Security Group @ ETH Zurich 173 Nov 16, 2022
You Only Look Once for Panopitic Driving Perception

You Only πŸ‘€ Once for Panoptic πŸš— Perception You Only Look at Once for Panoptic driving Perception by Dong Wu, Manwen Liao, Weitian Zhang, Xinggang Wan

Hust Visual Learning Team 1.4k Jan 04, 2023
Knowledge Distillation Toolbox for Semantic Segmentation

SegDistill: Toolbox for Knowledge Distillation on Semantic Segmentation Networks This repo contains the supported code and configuration files for Seg

9 Dec 12, 2022
FG-transformer-TTS Fine-grained style control in transformer-based text-to-speech synthesis

LST-TTS Official implementation for the paper Fine-grained style control in transformer-based text-to-speech synthesis. Submitted to ICASSP 2022. Audi

Li-Wei Chen 64 Dec 30, 2022
PyTorch source code for Distilling Knowledge by Mimicking Features

LSHFM.detection This is the PyTorch source code for Distilling Knowledge by Mimicking Features. And this project contains code for object detection wi

Guo-Hua Wang 4 Dec 17, 2022
Cascading Feature Extraction for Fast Point Cloud Registration (BMVC 2021)

Cascading Feature Extraction for Fast Point Cloud Registration This repository contains the source code for the paper [Arxive link comming soon]. Meth

7 May 26, 2022
Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation Official PyTorch implementation for the paper Look

Rishabh Jangir 20 Nov 24, 2022
Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

MarioNette | Webpage | Paper | Video MarioNette: Self-Supervised Sprite Learning Dmitriy Smirnov, MichaΓ«l Gharbi, Matthew Fisher, Vitor Guizilini, Ale

Dima Smirnov 28 Nov 18, 2022