Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Overview


This repo provides a personal, simplified implementation of the paper Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. The code is based on the official ANCE implementation.

Environment

'transformers==2.3.0' 
'pytrec-eval'
'faiss-cpu'
'wget'
'python==3.6.*'

Data Download & Preprocessing

To download all the needed data, run:

bash commands/data_download.sh 

Data Preprocessing

The command to preprocess passage and document data is listed below:

python data/msmarco_data.py \
--data_dir $raw_data_dir \
--out_data_dir $preprocessed_data_dir \
--model_type {use rdot_nll for ANCE FirstP, rdot_nll_multi_chunk for ANCE MaxP} \
--model_name_or_path roberta-base \
--max_seq_length {use 512 for ANCE FirstP, 2048 for ANCE MaxP} \
--data_type {use 1 for passage, 0 for document}
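
The two sequence lengths above can be read as one 512-token chunk for FirstP and several 512-token chunks for MaxP. The sketch below only illustrates that reading and is not the code in data/msmarco_data.py; the function name `split_into_chunks`, the chunk size of 512, and the padding id of 1 are assumptions made for the example.

```python
from typing import List


def split_into_chunks(
    token_ids: List[int],
    chunk_size: int = 512,       # assumed per-chunk length
    max_seq_length: int = 2048,  # 512 for FirstP, 2048 for MaxP
    pad_id: int = 1,             # assumed padding token id
) -> List[List[int]]:
    """Truncate to max_seq_length, then cut into fixed-size, padded chunks."""
    truncated = token_ids[:max_seq_length]
    chunks = []
    for start in range(0, len(truncated), chunk_size):
        chunk = truncated[start:start + chunk_size]
        # Pad the last chunk so every chunk has the same length.
        chunk = chunk + [pad_id] * (chunk_size - len(chunk))
        chunks.append(chunk)
    return chunks
```

With `max_seq_length=512` a document yields at most one chunk (the FirstP case); with 2048 it yields up to four.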

The data preprocessing command is included as the first step in the training command file commands/run_train.sh.

Warmup for Training

ANCE training starts from a pretrained BM25 warmup checkpoint. The command to train this warmup checkpoint, with the parameters we used, is in commands/run_train_warmup.sh and is shown below:

    python3 -m torch.distributed.launch --nproc_per_node=1 ../drivers/run_warmup.py \
    --train_model_type rdot_nll \
    --model_name_or_path roberta-base \
    --task_name MSMarco \
    --do_train \
    --evaluate_during_training \
    --data_dir ${location of your raw data} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=256 \
    --per_gpu_train_batch_size=32 \
    --learning_rate 2e-4  \
    --logging_steps 100   \
    --num_train_epochs 2.0  \
    --output_dir ${location for checkpoint saving} \
    --warmup_steps 1000  \
    --overwrite_output_dir \
    --save_steps 30000 \
    --gradient_accumulation_steps 1 \
    --expected_train_size 35000000 \
    --logging_steps_per_eval 1 \
    --fp16 \
    --optimizer lamb \
    --log_dir ~/tensorboard/${DLWS_JOB_ID}/logs/OSpass

Training

To train the model(s) in the paper, you need to start two commands in the following order:

  1. Run commands/run_train.sh, which does three things in sequence:

    a. Data preprocessing: this is explained in the previous data preprocessing section. This step checks whether the preprocessed data folder exists and is skipped if it does.

    b. Initial ANN data generation: this step uses the pretrained BM25 warmup checkpoint to generate the initial training data. The command is as follows:

     python -m torch.distributed.launch --nproc_per_node=$gpu_no ../drivers/run_ann_data_gen.py \
     --training_dir {# checkpoint location, not used for initial data generation} \
     --init_model_dir {pretrained BM25 warmup checkpoint location} \
     --model_type rdot_nll \
     --output_dir $model_ann_data_dir \
     --cache_dir $model_ann_data_dir_cache \
     --data_dir $preprocessed_data_dir \
     --max_seq_length 512 \
     --per_gpu_eval_batch_size 16 \
     --topk_training {top k candidates for ANN search (e.g. 200)} \
     --negative_sample {negative samples per query (20)} \
     --end_output_num 0 # only set as 0 for initial data generation, do not set this otherwise
    

    c. Training: ANCE training with the most recently generated ANN data. The command is as follows:

     python -m torch.distributed.launch --nproc_per_node=$gpu_no ../drivers/run_ann.py \
     --model_type rdot_nll \
     --model_name_or_path $pretrained_checkpoint_dir \
     --task_name MSMarco \
     --triplet {# default = False, action="store_true", help="Whether to run training"} \
     --data_dir $preprocessed_data_dir \
     --ann_dir {location of the ANN generated training data} \
     --max_seq_length 512 \
     --per_gpu_train_batch_size=8 \
     --gradient_accumulation_steps 2 \
     --learning_rate 1e-6 \
     --output_dir $model_dir \
     --warmup_steps 5000 \
     --logging_steps 100 \
     --save_steps 10000 \
     --optimizer lamb
    
  2. Once training starts, start another job in parallel to fetch the latest checkpoint from the ongoing training and update the training data. To do that, run:

     bash commands/run_ann_data_gen.sh
    

    The command is similar to the initial ANN data generation command explained previously.
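
The `--topk_training` and `--negative_sample` options of the data generation step can be pictured like this: for each query, take the top-k ANN candidates, drop the labeled positives, and sample a fixed number of the remaining candidates as negatives. This is a hedged sketch of that idea, not the logic of run_ann_data_gen.py; `sample_negatives` and its arguments are hypothetical names.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    ann_candidates: Sequence[int],  # passage ids, best ANN match first
    positive_ids: Set[int],         # labeled relevant passages for the query
    num_negatives: int = 20,        # mirrors --negative_sample
    topk: int = 200,                # mirrors --topk_training
    seed: int = 0,
) -> List[int]:
    """Sample hard negatives from the top-k ANN results, excluding positives."""
    rng = random.Random(seed)
    pool = [pid for pid in ann_candidates[:topk] if pid not in positive_ids]
    if len(pool) <= num_negatives:
        # Not enough candidates to sample from: use all of them.
        return pool
    return rng.sample(pool, num_negatives)
```

Because the candidates come from the current model's own retrieval results, the negatives change every time the data generation job picks up a newer checkpoint.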

Inference

The command for inferring query and passage/document embeddings is the same as the initial ANN data generation command described above, because the first step of ANN data generation is inference. However, you need to add --inference to the command so that the program stops after the initial inference step. commands/run_inference.sh provides a sample command.
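
Once query and passage embeddings have been produced, retrieval comes down to finding the passages with the largest inner product against each query. The brute-force sketch below uses plain Python as a stand-in for an ANN index such as the faiss one listed in the environment; `top_k_by_inner_product` is a hypothetical helper and the dot-product scoring is an assumption based on the `rdot_nll` model type.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(
    query: Sequence[float],
    passages: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = ((idx, dot(query, emb)) for idx, emb in enumerate(passages))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

An exhaustive scan like this is only practical for small collections; an index library does the same ranking far more efficiently.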

Evaluation

The evaluation is done through "Calculate Metrics.ipynb". This notebook calculates the full ranking and reranking metrics used in the paper, including NDCG, MRR, hole rate, and recall, for passage or document retrieval on the dev or eval set specified by the user. To run it, define the following parameters at the beginning of the Jupyter notebook.

    checkpoint_path = {location of the dumped query and passage/document embeddings, which is output_dir from run_ann_data_gen.py}
    checkpoint =  {embedding from which checkpoint (e.g. 200000)}
    data_type =  {0 for document, 1 for passage}
    test_set =  {0 for MSMARCO dev_set, 1 for TREC eval_set}
    raw_data_dir = 
    processed_data_dir = 
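
For readers unfamiliar with two of the metrics named above, here is a minimal standalone sketch of MRR@k and Recall@k. It is not the notebook's code; the function names and the dictionary-based input layout are assumptions for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant result within the top k."""
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked) if ranked else 0.0


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 1000) -> float:
    """Average fraction of relevant documents found in the top k."""
    total, counted = 0.0, 0
    for qid, docs in ranked.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue  # skip queries without labels
        total += len(rel & set(docs[:k])) / len(rel)
        counted += 1
    return total / counted if counted else 0.0
```

Here `ranked` maps each query id to its retrieved ids in rank order, and `relevant` maps each query id to the set of labeled relevant ids.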

ANCE VS DPR on OpenQA Benchmarks

We also evaluate ANCE on the OpenQA benchmark used in a parallel work (DPR). At the time of our experiment, only the pre-processed NQ and TriviaQA data had been released. Our experiments use these two released tasks and inherit the DPR retriever evaluation. The evaluation metric is top-20/100 coverage, that is, whether the top-20/100 retrieved passages include the answer. This section explains the steps to reproduce our results on the OpenQA benchmarks.
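
The top-20/100 metric described above can be sketched as the fraction of questions for which at least one of the top-k retrieved passages contains an answer string. The version below uses a plain case-insensitive substring match, which is a simplifying assumption and not necessarily how the DPR evaluation matches answers; `top_k_coverage` is a hypothetical name.

```python
from typing import Sequence


def top_k_coverage(
    retrieved_passages: Sequence[Sequence[str]],  # per question, passages in rank order
    answers: Sequence[Sequence[str]],             # per question, acceptable answer strings
    k: int = 20,
) -> float:
    """Fraction of questions whose top-k passages contain any answer string."""
    hits = 0
    for passages, golds in zip(retrieved_passages, answers):
        top = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for g in golds for p in top):
            hits += 1
    return hits / len(answers) if answers else 0.0
```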

Download data

commands/data_download.sh takes care of this step.

ANN data generation & ANCE training

Following the same training philosophy discussed before, ANN data generation and ANCE training for OpenQA require two parallel jobs.

  1. We need to preprocess data and generate an initial training set for ANCE to start training. The command for that is provided in:
commands/run_ann_data_gen_dpr.sh

We keep this data generation job running after it creates an initial training set as it will later keep generating training data with newest checkpoints from the training process.

  2. After an initial training set is generated, we start an ANCE training job with commands provided in:
commands/run_train_dpr.sh

During training, the evaluation metrics are written to TensorBoard each time the training job receives new training data. Alternatively, you can check the metrics in the dumped file "ann_ndcg_#" in the directory specified by "model_ann_data_dir" in commands/run_ann_data_gen_dpr.sh each time new training data is generated.
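
The data generation job described above has to locate the newest checkpoint written by the training job. A minimal sketch of that lookup follows; it assumes checkpoints are saved as subdirectories named `checkpoint-<step>`, which is an assumption about the output layout, and `latest_checkpoint` is a hypothetical helper rather than a function from this repo.

```python
import os
import re
from typing import Optional

# Assumed naming scheme for checkpoint folders, e.g. "checkpoint-200000".
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the checkpoint folder with the highest step, if any."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing the step as an integer matters here: a plain string sort would rank `checkpoint-90000` above `checkpoint-200000`.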

Results

The run_train.sh and run_ann_data_gen.sh files contain the commands with the parameters we used for passage ANCE(FirstP), document ANCE(FirstP) and document ANCE(MaxP). Our model achieves the following performance on the MSMARCO dev set and the TREC eval set:

MSMARCO Dev Passage Retrieval        MRR@10     Recall@1k    Steps
ANCE(FirstP)                         0.330      0.959        600K
ANCE(MaxP)                           -          -            -

TREC DL Passage NDCG@10              Rerank     Retrieval    Steps
ANCE(FirstP)                         0.677      0.648        600K
ANCE(MaxP)                           -          -            -

TREC DL Document NDCG@10             Rerank     Retrieval    Steps
ANCE(FirstP)                         0.641      0.615        210K
ANCE(MaxP)                           0.671      0.628        139K

MSMARCO Dev Passage Retrieval        MRR@10     Steps
pretrained BM25 warmup checkpoint    0.311      60K

ANCE Single-task Training            Top-20     Top-100      Steps
NQ                                   81.9       87.5         136K
TriviaQA                             80.3       85.3         100K

ANCE Multi-task Training             Top-20     Top-100      Steps
NQ                                   82.1       87.9         300K
TriviaQA                             80.3       85.2         300K

Click the steps in the table to download the corresponding checkpoints.

Our result for document ANCE(FirstP) TREC eval set top 100 retrieved document per query could be downloaded here. Our result for document ANCE(MaxP) TREC eval set top 100 retrieved document per query could be downloaded here.

The TREC eval set query embedding and their ids for our passage ANCE(FirstP) experiment could be downloaded here. The TREC eval set query embedding and their ids for our document ANCE(FirstP) experiment could be downloaded here. The TREC eval set query embedding and their ids for our document 2048 ANCE(MaxP) experiment could be downloaded here.

The t-SNE plots for all the queries in the TREC document eval set for ANCE(FirstP) could be viewed here.

The run_train.sh and run_ann_data_gen.sh files contain the commands with the parameters we used for passage ANCE(FirstP), document ANCE(FirstP) and document 2048 ANCE(MaxP) to reproduce the results in this section. run_train_warmup.sh contains the commands to reproduce the results for the pretrained BM25 warmup checkpoint in this section.

Note that the number of steps needed to reproduce results similar to those in the table may differ slightly, because of differences in synchronization between the training and ANN data generation processes and other possible differences in the user's environment.

Owner
John