Scalable training for dense retrieval models.

Overview

Scalable implementation of dense retrieval.

Training on cluster

By default it trains locally:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py trainer.gpus=1

SLURM Training

To train the model on SLURM, run:

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m trainer=slurm trainer.num_nodes=2 trainer.gpus=2

Reproduce DPR on 8 gpus

PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m --config-name nq.yaml  +hydra.launcher.name=dpr_stl_nq_reproduce

Generate embeddings on Wikipedia

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_embeddings.py -m --config-name nq.yaml datamodule=generate datamodule.test_path=psgs_w100.tsv +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH>

Get retrieval results

Currently this runs on 1 GPU. Use CTX_EMBEDDINGS_DIR from above.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/run_retrieval.py --config-name nq.yaml trainer=gpu_1_host trainer.gpus=1 +task.output_path=<PATH_TO_OUTPUT_JSON> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.passages=psgs_w100.tsv datamodule.test_path=<PATH_TO_QUERIES_JSONL>

Generate query embeddings

Alternatively, query embedding generation and retrieval can be separated. After query embeddings are generated using the following command, the run_retrieval_fb.py or run_retrieval_multiset.py script can be used to perform retrieval.

PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_query_embeddings.py -m --config-name nq.yaml trainer.gpus=1 datamodule.test_path=<PATH_TO_QUERIES_JSONL> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.query_emb_output_path=<OUTPUT_TO_QUERY_EMB>

Get evaluation metrics for a given JSON output file

python dpr_scale/eval_dpr.py --retrieval <PATH_TO_OUTPUT_JSON> --topk 1 5 10 20 50 100 

Get evaluation metrics for MSMARCO

python dpr_scale/msmarco_eval.py ~data/msmarco/qrels.dev.small.tsv PATH_TO_OUTPUT_JSON

Domain-matched Pre-training Tasks for Dense Retrieval

Paper: https://arxiv.org/abs/2107.13602

The sections below provide links to datasets and pretrained models, as well as, instructions to prepare datasets, pretrain and fine-tune them.

Q&A Datasets

PAQ

Download the dataset from here

Conversational Datasets

You can download the dataset from the respective tables.

Reddit

File Download Link
train download
dev download

ConvAI2

File Download Link
train download
dev download

DSTC7

File Download Link
train download
dev download
test download

Prepare by downloading the tar ball linked here, and using the command below.

DSTC7_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset dstc7 \
    --in_file_path $DSTC7_DATA_ROOT/ubuntu_train_subtask_1_augmented.json \
    --out_file_path $DSTC7_DATA_ROOT/ubuntu_train.jsonl

Ubuntu V2

File Download Link
train download
dev download
test download

Prepare by downloading the tar ball linked here, and using the command below.

UBUNTUV2_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
    --dataset ubuntu2 \
    --in_file_path $UBUNTUV2_DATA_ROOT/train.csv \
    --out_file_path $UBUNTUV2_DATA_ROOT/train.jsonl

Pretraining DPR

Pretrained Checkpoints

Pretrained Model Dataset Download Link
BERT-base PAQ download
BERT-large PAQ download
BERT-base Reddit download
BERT-large Reddit download
RoBERTa-base Reddit download
RoBERTa-large Reddit download

Pretraining on PAQ dataset

DPR_ROOT=<path_of_your_repo's_root>
MODEL="bert-large-uncased"
NODES=8
BSZ=16
MAX_EPOCHS=20
LR=1e-5
TIMOUT_MINS=4320
EXP_DIR=<path_of_the_experiment_dir>
TRAIN_PATH=<path_of_the_training_data_file>
mkdir -p ${EXP_DIR}/logs
PYTHONPATH=$DPR_ROOT python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.timeout_min=$TIMOUT_MINS \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.train_path=$TRAIN_PATH \
    datamodule.batch_size=${BSZ} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=10 \
    datamodule.num_test_negative=50 > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Pretraining on Reddit dataset

# Use a batch size of 16 for BERT and RoBERTa base models.
BSZ=4
NODES=8
MAX_EPOCHS=5
WARMUP_STEPS=10000
LR=1e-5
MODEL="roberta-large"
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name reddit.yaml \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    task.optim.lr=${LR} \
    task.model.model_path=${MODEL} \
    trainer.max_epochs=${MAX_EPOCHS} \
    task.warmup_steps=${WARMUP_STEPS} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tuning DPR on downstream tasks/datasets

Fine-tune the pretrained PAQ checkpoint

# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
# Use a batch size of 32 for BERT and RoBERTa base models.
BSZ=12
MODEL="bert-large-uncased"
MAX_EPOCHS=40
WARMUP_STEPS=1000
NODES=1
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=. python dpr_scale/main.py -m \
    --config-dir ${DPR_ROOT}/dpr_scale/conf \
    --config-name nq.yaml \
    hydra.launcher.name=${NAME} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    datamodule.num_negative=1 \
    datamodule.num_val_negative=25 \
    datamodule.num_test_negative=50 \
    +trainer.val_check_interval=150 \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.model.model_path=${MODEL} \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

Fine-tune the pretrained Reddit checkpoint

Batch sizes that worked on Volta 32GB GPUs for respective model and datasets.

Model Dataset Batch Size
BERT/RoBERTa base ConvAI2 64
RBERT/RoBERTa base ConvAI2 16
BERT/RoBERTa base DSTC7 24
BERT/RoBERTa base DSTC7 8
BERT/RoBERTa base Ubuntu V2 64
BERT/RoBERTa large Ubuntu V2 16
# Change the config file name to convai2.yaml or dstc7.yaml for the respective datasets.
CONFIG_FILE_NAME=ubuntuv2.yaml
# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
BSZ=16
NODES=1
MAX_EPOCHS=5
WARMUP_STEPS=10000
MODEL="roberta-large"
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=${DPR_ROOT} python ${DPR_ROOT}/dpr_scale/main.py -m \
    --config-dir=${DPR_ROOT}/dpr_scale/conf \
    --config-name=$CONFIG_FILE_NAME \
    hydra.launcher.nodes=${NODES} \
    hydra.sweep.dir=${EXP_DIR} \
    trainer.num_nodes=${NODES} \
    trainer.max_epochs=${MAX_EPOCHS} \
    +trainer.val_check_interval=150 \
    task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
    task.warmup_steps=${WARMUP_STEPS} \
    task.optim.lr=${LR} \
    task.model.model_path=$MODEL \
    datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &

License

dpr-scale is CC-BY-NC 4.0 licensed as of now.

Owner
Facebook Research
Facebook Research
Image super-resolution (SR) is a fast-moving field with novel architectures attracting the spotlight

Revisiting RCAN: Improved Training for Image Super-Resolution Introduction Image super-resolution (SR) is a fast-moving field with novel architectures

Zudi Lin 76 Dec 01, 2022
SAFL: A Self-Attention Scene Text Recognizer with Focal Loss

SAFL: A Self-Attention Scene Text Recognizer with Focal Loss This repository implements the SAFL in pytorch. Installation conda env create -f environm

6 Aug 24, 2022
Codes and Data Processing Files for our paper.

Code Scripts and Processing Files for EEG Sleep Staging Paper 1. Folder Tree ./src_preprocess (data preprocessing files for SHHS and Sleep EDF) sleepE

Chaoqi Yang 18 Dec 12, 2022
Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research

Welcome to AirSim AirSim is a simulator for drones, cars and more, built on Unreal Engine (we now also have an experimental Unity release). It is open

Microsoft 13.8k Jan 05, 2023
A computational optimization project towards the goal of gerrymandering the results of a hypothetical election in the UK.

A computational optimization project towards the goal of gerrymandering the results of a hypothetical election in the UK.

Emma 1 Jan 18, 2022
Pytorch implementation of CVPR2021 paper "MUST-GAN: Multi-level Statistics Transfer for Self-driven Person Image Generation"

MUST-GAN Code | paper The Pytorch implementation of our CVPR2021 paper "MUST-GAN: Multi-level Statistics Transfer for Self-driven Person Image Generat

TianxiangMa 46 Dec 26, 2022
Attention-guided gan for synthesizing IR images

SI-AGAN Attention-guided gan for synthesizing IR images This repository contains the Tensorflow code for "Pedestrian Gender Recognition by Style Trans

1 Oct 25, 2021
The official PyTorch implementation of paper BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition

BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition Boyan Zhou, Quan Cui, Xiu-Shen Wei*, Zhao-Min Chen This repo

Megvii-Nanjing 616 Dec 21, 2022
This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).

MoEBERT This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022). Installation Create an

Simiao Zuo 34 Dec 24, 2022
Alphabetical Letter Recognition

DecisionTrees-Image-Classification Alphabetical Letter Recognition In these demo we are using "Decision Trees" Our database is composed by Learning Im

Mohammed Firass 4 Nov 30, 2021
SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

SmallInitEmb LayerNorm(SmallInit(Embedding)) in a Transformer I find that when t

PENG Bo 11 Dec 25, 2022
TransGAN: Two Transformers Can Make One Strong GAN

[Preprint] "TransGAN: Two Transformers Can Make One Strong GAN", Yifan Jiang, Shiyu Chang, Zhangyang Wang

VITA 1.5k Jan 07, 2023
Generating Fractals on Starknet with Cairo

StarknetFractals Generating the mandelbrot set on Starknet Current Implementation generates 1 pixel of the fractal per call(). It takes a few minutes

Orland0x 10 Jul 16, 2022
[CVPR 2021] Official PyTorch Implementation for "Iterative Filter Adaptive Network for Single Image Defocus Deblurring"

IFAN: Iterative Filter Adaptive Network for Single Image Defocus Deblurring Checkout for the demo (GUI/Google Colab)! The GUI version might occasional

Junyong Lee 173 Dec 30, 2022
Multi-layer convolutional LSTM with Pytorch

Convolution_LSTM_pytorch Thanks for your attention. I haven't got time to maintain this repo for a long time. I recommend this repo which provides an

Zijie Zhuang 734 Jan 03, 2023
Code for the paper: Sketch Your Own GAN

Sketch Your Own GAN Project | Paper | Youtube | Slides Our method takes in one or a few hand-drawn sketches and customizes an off-the-shelf GAN to mat

677 Dec 28, 2022
A Simulated Optimal Intrusion Response Game

Optimal Intrusion Response An OpenAI Gym interface to a MDP/Markov Game model for optimal intrusion response of a realistic infrastructure simulated u

Kim Hammar 10 Dec 09, 2022
This is the repository for the paper "Have I done enough planning or should I plan more?"

Metacognitive Learning Tool box https://re.is.mpg.de What Is This? This repository contains two modules used to analyse metacognitive learning in huma

0 Dec 01, 2021
This is the codebase for the ICLR 2021 paper Trajectory Prediction using Equivariant Continuous Convolution

Trajectory Prediction using Equivariant Continuous Convolution (ECCO) This is the codebase for the ICLR 2021 paper Trajectory Prediction using Equivar

Spatiotemporal Machine Learning 45 Jul 22, 2022
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

DeepCam Shenzhen 1.4k Jan 07, 2023