Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Related tags

Deep LearningAPR
Overview

APR

The repo for the paper Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Environment setup

To reproduce the results in the paper, we rely on two open-source IR toolkits: Pyserini and tevatron.

We cloned, merged, and modified the two toolkits in this repo and will use them to train and inference the PRF models. We refer to the original github repos to setup the environment:

Install Pyserini: https://github.com/castorini/pyserini/blob/master/docs/installation.md.

Install tevatron: https://github.com/texttron/tevatron#installation.

You also need MS MARCO passage ranking dataset, including the collection and queries. We refer to the official github repo for downloading the data.

To reproduce ANCE-PRF inference results with the original model checkpoint

The code, dataset, and model for reproducing the ANCE-PRF results presented in the original paper:

HongChien Yu, Chenyan Xiong, Jamie Callan. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

have been merged into Pyserini source. Simply just need to follow this instruction, which includes the instructions of downloading the dataset, model checkpoint (provided by the original authors), dense index, and PRF inference.

To train dense retriever PRF models

We use tevatron to train the dense retriever PRF query encodes that we investigated in the paper.

First, you need to have train queries run files to build hard negative training set for each DR.

You can use Pyserini to generate run files for ANCE, TCT-ColBERTv2 and DistilBERT KD TASB by changing the query set flag --topics to queries.train.tsv.

Once you have the run file, cd to /tevatron and run:

python make_train_from_ranking.py \
	--ranking_file /path/to/train/run \
	--model_type (ANCE or TCT or DistilBERT) \
	--output /path/to/save/hard/negative

Apart from the hard negative training set, you also need the original DR query encoder model checkpoints to initial the model weights. You can download them from Huggingface modelhub: ance, tct_colbert-v2-hnp-msmarco, distilbert-dot-tas_b-b256-msmarco. Please use the same name as the link in Huggingface modelhub for each of the folders that contain the model.

After you generated the hard negative training set and downloaded all the models, you can kick off the training for DR-PRF query encoders by:

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir /path/to/save/mdoel/checkpoints \
    --model_name_or_path /path/to/model/folder \
    --do_train \
    --save_steps 5000 \
    --train_dir /path/to/hard/negative \
    --fp16 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-6 \
    --num_train_epochs 10 \
    --train_n_passages 21 \
    --q_max_len 512 \
    --dataloader_num_workers 10 \
    --warmup_steps 5000 \
    --add_pooler

To inference dense retriever PRF models

Install Pyserini by following the instructions within pyserini/README.md

Then run:

python -m pyserini.dsearch --topics /path/to/query/tsv/file \
    --index /path/to/index \
    --encoder /path/to/encoder \ # This encoder is for first round retrieval
    --batch-size 64 \
    --output /path/to/output/run/file \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder /path/to/encoder \ # This encoder is for PRF query generation
    --prf-depth 3

An example would be:

python -m pyserini.dsearch --topics ./data/msmarco-test2020-queries.tsv \
    --index ./dindex-msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder ./tct_colbert_v2_hnp \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

Or one can use pre-built index and models available in Pyserini:

python -m pyserini.dsearch --topics dl19-passage \
    --index msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder castorini/tct_colbert-v2-hnp-msmarco \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

The PRF depth --prf-depth 3 depends on the PRF encoder trained, if trained with PRF 3, here only can use PRF 3.

Where --topics can be: TREC DL 2019 Passage: dl19-passage TREC DL 2020 Passage: dl20 MS MARCO Passage V1: msmarco-passage-dev-subset

--encoder can be: ANCE: castorini/ance-msmarco-passage TCT-ColBERT V2 HN+: castorini/tct_colbert-v2-hnp-msmarco DistilBERT Balanced: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

--index can be: ANCE index with MS MARCO V1 passage collection: msmarco-passage-ance-bf TCT-ColBERT V2 HN+ index with MS MARCO V1 passage collection: msmarco-passage-tct_colbert-v2-hnp-bf DistillBERT Balanced index with MS MARCO V1 passage collection: msmarco-passage-distilbert-dot-tas_b-b256-bf

To evaluate the run:

TREC DL 2019

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage ./runs/tctv2-prf3.res

TREC DL 2020

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage ./runs/tctv2-prf3.res

MS MARCO Passage Ranking V1

python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset ./runs/tctv2-prf3.res
Owner
ielab
The Information Engineering Lab
ielab
Tutel MoE: An Optimized Mixture-of-Experts Implementation

Project Tutel Tutel MoE: An Optimized Mixture-of-Experts Implementation. Supported Framework: Pytorch Supported GPUs: CUDA(fp32 + fp16), ROCm(fp32) Ho

Microsoft 344 Dec 29, 2022
Understanding Convolutional Neural Networks from Theoretical Perspective via Volterra Convolution

nnvolterra Run Code Compile first: make compile Run all codes: make all Test xconv: make npxconv_test MNIST dataset needs to be downloaded, converted

1 May 24, 2022
Hierarchical Time Series Forecasting with a familiar API

scikit-hts Hierarchical Time Series with a familiar API. This is the result from not having found any good implementations of HTS on-line, and my work

Carlo Mazzaferro 204 Dec 17, 2022
Probabilistic Tensor Decomposition of Neural Population Spiking Activity

Probabilistic Tensor Decomposition of Neural Population Spiking Activity Matlab (recommended) and Python (in developement) implementations of Soulat e

Hugo Soulat 6 Nov 30, 2022
Multilingual Image Captioning

Multilingual Image Captioning Authors: Bhavitvya Malik, Gunjan Chhablani Demo Link: https://huggingface.co/spaces/flax-community/multilingual-image-ca

Gunjan Chhablani 32 Nov 25, 2022
Official code for "Maximum Likelihood Training of Score-Based Diffusion Models", NeurIPS 2021 (spotlight)

Maximum Likelihood Training of Score-Based Diffusion Models This repo contains the official implementation for the paper Maximum Likelihood Training o

Yang Song 84 Dec 12, 2022
PyTorch implementation of the WarpedGANSpace: Finding non-linear RBF paths in GAN latent space (ICCV 2021)

Authors official PyTorch implementation of the "WarpedGANSpace: Finding non-linear RBF paths in GAN latent space" [ICCV 2021].

Christos Tzelepis 100 Dec 06, 2022
PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

Maria: A Visual Experience Powered Conversational Agent This repository is the Pytorch implementation of our paper "Maria: A Visual Experience Powered

Jokie 22 Dec 12, 2022
Pytorch implementation of the paper "Optimization as a Model for Few-Shot Learning"

Optimization as a Model for Few-Shot Learning This repo provides a Pytorch implementation for the Optimization as a Model for Few-Shot Learning paper.

Albert Berenguel Centeno 238 Jan 04, 2023
deep learning model with only python and numpy with test accuracy 99 % on mnist dataset and different optimization choices

deep_nn_model_with_only_python_100%_test_accuracy deep learning model with only python and numpy with test accuracy 99 % on mnist dataset and differen

0 Aug 28, 2022
Text2Art is an AI art generator powered with VQGAN + CLIP and CLIPDrawer models

Text2Art is an AI art generator powered with VQGAN + CLIP and CLIPDrawer models. You can easily generate all kind of art from drawing, painting, sketch, or even a specific artist style just using a t

Muhammad Fathy Rashad 643 Dec 30, 2022
(CVPR 2021) PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds

PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds by Mutian Xu*, Runyu Ding*, Hengshuang Zhao, and Xiaojuan Qi. Int

CVMI Lab 228 Dec 25, 2022
Cours d'Algorithmique Appliquée avec Python pour BTS SIO SISR

Course: Introduction to Applied Algorithms with Python (in French) This is the source code of the website for the Applied Algorithms with Python cours

Loic Yvonnet 0 Jan 27, 2022
Cleaned up code for DSTC 10: SIMMC 2.0 track: subtask 2: multimodal coreference resolution

UNITER-Based Situated Coreference Resolution with Rich Multimodal Input: arXiv MMCoref_cleaned Code for the MMCoref task of the SIMMC 2.0 dataset. Pre

Yichen (William) Huang 2 Dec 05, 2022
Implementation of Hourglass Transformer, in Pytorch, from Google and OpenAI

Hourglass Transformer - Pytorch (wip) Implementation of Hourglass Transformer, in Pytorch. It will also contain some of my own ideas about how to make

Phil Wang 61 Dec 25, 2022
Cross-modal Deep Face Normals with Deactivable Skip Connections

Cross-modal Deep Face Normals with Deactivable Skip Connections Victoria Fernández Abrevaya*, Adnane Boukhayma*, Philip H. S. Torr, Edmond Boyer (*Equ

72 Nov 27, 2022
Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Nvdiffrast – Modular Primitives for High-Performance Differentiable Rendering Modular Primitives for High-Performance Differentiable Rendering Samuli

NVIDIA Research Projects 675 Jan 06, 2023
Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

Neural Circuit Policies Enabling Auditable Autonomy Online access via SharedIt Neural Circuit Policies (NCPs) are designed sparse recurrent neural net

8 Jan 07, 2023
CLDF dataset derived from Robbeets et al.'s "Triangulation Supports Agricultural Spread" from 2021

CLDF dataset derived from Robbeets et al.'s "Triangulation Supports Agricultural Spread" from 2021 How to cite If you use these data please cite the o

Digital Linguistics 2 Dec 20, 2021
Code for "3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop"

PyMAF This repository contains the code for the following paper: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop Hongwe

Hongwen Zhang 450 Dec 28, 2022