Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Related tags

Deep LearningAPR
Overview

APR

The repo for the paper Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Environment setup

To reproduce the results in the paper, we rely on two open-source IR toolkits: Pyserini and tevatron.

We cloned, merged, and modified the two toolkits in this repo and will use them to train and inference the PRF models. We refer to the original github repos to setup the environment:

Install Pyserini: https://github.com/castorini/pyserini/blob/master/docs/installation.md.

Install tevatron: https://github.com/texttron/tevatron#installation.

You also need MS MARCO passage ranking dataset, including the collection and queries. We refer to the official github repo for downloading the data.

To reproduce ANCE-PRF inference results with the original model checkpoint

The code, dataset, and model for reproducing the ANCE-PRF results presented in the original paper:

HongChien Yu, Chenyan Xiong, Jamie Callan. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

have been merged into Pyserini source. Simply just need to follow this instruction, which includes the instructions of downloading the dataset, model checkpoint (provided by the original authors), dense index, and PRF inference.

To train dense retriever PRF models

We use tevatron to train the dense retriever PRF query encodes that we investigated in the paper.

First, you need to have train queries run files to build hard negative training set for each DR.

You can use Pyserini to generate run files for ANCE, TCT-ColBERTv2 and DistilBERT KD TASB by changing the query set flag --topics to queries.train.tsv.

Once you have the run file, cd to /tevatron and run:

python make_train_from_ranking.py \
	--ranking_file /path/to/train/run \
	--model_type (ANCE or TCT or DistilBERT) \
	--output /path/to/save/hard/negative

Apart from the hard negative training set, you also need the original DR query encoder model checkpoints to initial the model weights. You can download them from Huggingface modelhub: ance, tct_colbert-v2-hnp-msmarco, distilbert-dot-tas_b-b256-msmarco. Please use the same name as the link in Huggingface modelhub for each of the folders that contain the model.

After you generated the hard negative training set and downloaded all the models, you can kick off the training for DR-PRF query encoders by:

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir /path/to/save/mdoel/checkpoints \
    --model_name_or_path /path/to/model/folder \
    --do_train \
    --save_steps 5000 \
    --train_dir /path/to/hard/negative \
    --fp16 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-6 \
    --num_train_epochs 10 \
    --train_n_passages 21 \
    --q_max_len 512 \
    --dataloader_num_workers 10 \
    --warmup_steps 5000 \
    --add_pooler

To inference dense retriever PRF models

Install Pyserini by following the instructions within pyserini/README.md

Then run:

python -m pyserini.dsearch --topics /path/to/query/tsv/file \
    --index /path/to/index \
    --encoder /path/to/encoder \ # This encoder is for first round retrieval
    --batch-size 64 \
    --output /path/to/output/run/file \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder /path/to/encoder \ # This encoder is for PRF query generation
    --prf-depth 3

An example would be:

python -m pyserini.dsearch --topics ./data/msmarco-test2020-queries.tsv \
    --index ./dindex-msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder ./tct_colbert_v2_hnp \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

Or one can use pre-built index and models available in Pyserini:

python -m pyserini.dsearch --topics dl19-passage \
    --index msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder castorini/tct_colbert-v2-hnp-msmarco \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

The PRF depth --prf-depth 3 depends on the PRF encoder trained, if trained with PRF 3, here only can use PRF 3.

Where --topics can be: TREC DL 2019 Passage: dl19-passage TREC DL 2020 Passage: dl20 MS MARCO Passage V1: msmarco-passage-dev-subset

--encoder can be: ANCE: castorini/ance-msmarco-passage TCT-ColBERT V2 HN+: castorini/tct_colbert-v2-hnp-msmarco DistilBERT Balanced: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

--index can be: ANCE index with MS MARCO V1 passage collection: msmarco-passage-ance-bf TCT-ColBERT V2 HN+ index with MS MARCO V1 passage collection: msmarco-passage-tct_colbert-v2-hnp-bf DistillBERT Balanced index with MS MARCO V1 passage collection: msmarco-passage-distilbert-dot-tas_b-b256-bf

To evaluate the run:

TREC DL 2019

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage ./runs/tctv2-prf3.res

TREC DL 2020

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage ./runs/tctv2-prf3.res

MS MARCO Passage Ranking V1

python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset ./runs/tctv2-prf3.res
Owner
ielab
The Information Engineering Lab
ielab
Real-CUGAN - Real Cascade U-Nets for Anime Image Super Resolution

Real Cascade U-Nets for Anime Image Super Resolution 中文 | English 🔥 Real-CUGAN

tarsin 111 Dec 28, 2022
Supporting code for the Neograd algorithm

Neograd This repo supports the paper Neograd: Gradient Descent with a Near-Ideal Learning Rate, which introduces the algorithm "Neograd". The paper an

Michael Zimmer 12 May 01, 2022
RLBot Python bindings for the Rust crate rl_ball_sym

RLBot Python bindings for rl_ball_sym 0.6 Prerequisites: Rust & Cargo Build Tools for Visual Studio RLBot - Verify that the file %localappdata%\RLBotG

Eric Veilleux 2 Nov 25, 2022
Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis

TDY-CNN for Text-Independent Speaker Verification Official implementation of Temporal Dynamic Convolutional Neural Network for Text-Independent Speake

Seong-Hu Kim 16 Oct 17, 2022
Official PyTorch Implementation of Convolutional Hough Matching Networks, CVPR 2021 (oral)

Convolutional Hough Matching Networks This is the implementation of the paper "Convolutional Hough Matching Network" by J. Min and M. Cho. Implemented

Juhong Min 70 Nov 22, 2022
A multi-mode modulator for multi-domain few-shot classification (ICCV)

A multi-mode modulator for multi-domain few-shot classification (ICCV)

Yanbin Liu 8 Apr 28, 2022
Official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning (ICML 2021) published at International Conference on Machine Learning

About This repository the official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning. The config files contain the s

Dynamic Vision and Learning Group 41 Dec 10, 2022
Sleep staging from ECG, assisted with EEG

Sleep_Staging_Knowledge Distillation This codebase implements knowledge distillation approach for ECG based sleep staging assisted by EEG based sleep

2 Dec 12, 2022
HAR-stacked-residual-bidir-LSTMs - Deep stacked residual bidirectional LSTMs for HAR

HAR-stacked-residual-bidir-LSTM The project is based on this repository which is presented as a tutorial. It consists of Human Activity Recognition (H

Guillaume Chevalier 287 Dec 27, 2022
An implementation for `Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction`

Text2Event An implementation for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction Please contact Yaojie Lu (@

Roger 153 Jan 07, 2023
PyTorch code for our paper "Gated Multiple Feedback Network for Image Super-Resolution" (BMVC2019)

Gated Multiple Feedback Network for Image Super-Resolution This repository contains the PyTorch implementation for the proposed GMFN [arXiv]. The fram

Qilei Li 66 Nov 03, 2022
Label Mask for Multi-label Classification

LM-MLC 一种基于完型填空的多标签分类算法 1 前言 本文主要介绍本人在全球人工智能技术创新大赛【赛道一】设计的一种基于完型填空(模板)的多标签分类算法:LM-MLC,该算法拟合能力很强能感知标签关联性,在多个数据集上测试表明该算法与主流算法无显著性差异,在该比赛数据集上的dev效果很好,但是由

52 Nov 20, 2022
Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

MarioNette | Webpage | Paper | Video MarioNette: Self-Supervised Sprite Learning Dmitriy Smirnov, Michaël Gharbi, Matthew Fisher, Vitor Guizilini, Ale

Dima Smirnov 28 Nov 18, 2022
Implementation of Convolutional LSTM in PyTorch.

ConvLSTM_pytorch This file contains the implementation of Convolutional LSTM in PyTorch made by me and DavideA. We started from this implementation an

Andrea Palazzi 1.3k Dec 29, 2022
La source de mon module 'pyfade' disponible sur Pypi.

Version: 1.2 Introduction Pyfade est un module permettant de créer des dégradés colorés. Il vous permettra de changer chaque ligne de votre texte par

Billy 20 Sep 12, 2021
Instance-wise Occlusion and Depth Orders in Natural Scenes (CVPR 2022)

Instance-wise Occlusion and Depth Orders in Natural Scenes Official source code. Appears at CVPR 2022 This repository provides a new dataset, named In

27 Dec 27, 2022
Dcf-game-infrastructure-public - Contains all the components necessary to run a DC finals (attack-defense CTF) game from OOO

dcf-game-infrastructure All the components necessary to run a game of the OOO DC

Order of the Overflow 46 Sep 13, 2022
A toolkit for developing and comparing reinforcement learning algorithms.

Status: Maintenance (expect bug fixes and minor updates) OpenAI Gym OpenAI Gym is a toolkit for developing and comparing reinforcement learning algori

OpenAI 29.6k Jan 08, 2023
Official implementation for: Blended Diffusion for Text-driven Editing of Natural Images.

Blended Diffusion for Text-driven Editing of Natural Images Blended Diffusion for Text-driven Editing of Natural Images Omri Avrahami, Dani Lischinski

328 Dec 30, 2022