EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

Overview

EntityQuestions

This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the the paper Simple Entity-centric Questions Challenge Dense Retrievers by Chris Sciavolino*, Zexuan Zhong*, Jinhyuk Lee, and Danqi Chen (* equal contribution).

[9/16/21] This repo is not yet set in stone, we're still putting finishing touches on the tooling and documentation :) Thanks for your patience!

Quick Links

Installation

You can download a .zip file of the dataset here, or using wget with the command:

$ wget https://nlp.cs.princeton.edu/projects/entity-questions/dataset.zip

We include the dependencies needed to run the code in this repository. We recommend having a separate miniconda environment for running DPR code. You can create the environment using the following commands:

$ conda create -n EntityQ python=3.6
$ conda activate EntityQ
$ pip install -r requirements.txt

Dataset Overview

The unzipped dataset directory should have the following structure:

dataset/
    | train/
        | P*.train.json     // all randomly sampled training files 
    | dev/
        | P*.dev.json       // all randomly sampled development files
    | test/
        | P*.test.json      // all randomly sampled testing files
    | one-off/
        | common-random-buckets/
            | P*/
                | bucket*.test.json
        | no-overlap/
            | P*/
                | P*_no_overlap.{train,dev,test}.json
        | nq-seen-buckets/
            | P*/
                bucket*.test.json
        | similar/
            | P*
                | P*_similar.{train,dev,test}.json

The main dataset is included in dataset/ under train/, dev/, and test/, each containing the randomly sampled training, development, and testing subsets, respectively. For example, the evaluation set for place-of-birth (P19) can be found in the dataset/test/P19.test.json file.

We also include all of the one-off datasets we used to generate the tables/figures presented in the paper under dataset/one-off/, explained below:

  • one-off/common-random-buckets/ contains buckets of 1,000 randomly sampled examples, used to produce Fig. 1 of the paper (specifically for rand-ent).
  • one-off/no-overlap/ contains the training/development splits for our analyses in Section 4.1 of the paper (we do not use the testing split in our analysis). These training/development sets have subject entities with no token overlap with subject entities of the randomly sampled test set (specifically for all fine-tuning in Table 2).
  • one-off/nq-seen-buckets/ contains buckets of questions with subject entities that overlap with subject entities seen in the NQ training set, used to produce Fig. 1 of the paper (specifically for train-ent).
  • one-off/similar contains the training/development splits for the syntactically different but symantically equal question sets, used for our analyses in Section 4.1 (specifically the similar rows). Again, we do not use the testing split in our analysis. These questions are identical to one-off/no-overlap/ but use a different question template.

Retrieving DPR Results

Our analysis is based on a previous version of the DPR repository (specifically the Oct. 5 version w. hash 27a8436b070861e2fff481e37244009b48c29c09), so our commands may not be up-to-date with the March 2021 release. That said, most of the commands should be clearly transferable.

First, we recommend following the setup guide from the official DPR repository. Once set up, you can download the relevant pre-trained models/indices using their download_data.py script. For our analysis, we used the DPR-NQ model and the DPR-Multi model. To run retrieval using a pre-trained model, you'll minimally need to download:

  1. The pre-trained model
  2. The Wikipedia passage splits
  3. The encoded Wikipedia passage FAISS index
  4. A question/answer dataset

With this, you can use the following python command:

python dense_retriever.py \
    --batch_size 512 \
    --model_file "path/to/pretrained/model/file.cp" \
    --qa_file "path/to/qa/dataset/to/evaluate.json" \
    --ctx_file "path/to/wikipedia/passage/splits.tsv" \
    --encoded_ctx_file "path/to/encoded/wikipedia/passage/index/" \
    --save_or_load_index \
    --n-docs 100 \
    --validation_workers 1 \
    --out_file "path/to/desired/output/location.json"

We had access to a single 11Gb Nvidia RTX 2080Ti GPU w. 128G of RAM when running DPR retrieval.

Retrieving BM25 Results

We use the Pyserini implementation of BM25 for our analysis. We use the default settings and index on the same passage splits downloaded from the DPR repository. We include steps to re-create our BM25 results below.

First, we need to pre-process the DPR passage splits into the proper format for BM25 indexing. We include this file in bm25/build_bm25_ctx_passages.py. Rather than writing all passages into a single file, you can optionally shard the passages into multiple files (specified by the n_shards argument). It also creates a mapping from the passage ID to the title of the article the passage is from. You can use this file as follows:

python bm25/build_bm25_ctx_passages.py \
    --wiki_passages_file "path/to/wikipedia/passage/splits.tsv" \
    --outdir "path/to/desired/output/directory/" \
    --title_index_path "path/to/desired/output/directory/.json" \
    --n_shards number_of_shards_of_passages_to_write

Now that you have all the passages in files, you can build the BM25 index using the following command:

python -m pyserini.index -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -input "path/to/generated/passages/folder/" \
    -index "path/to/desired/index/folder/" \
    -storePositions -storeDocvectors -storeRaw

Once the index is built, you can use it in the bm25/bm25_retriever.py script to get retrieval results for an input file:

python bm25/bm25_retriever.py \
    --index_path "path/to/built/bm25/index/directory/" \
    --passage_id_to_title_path "path/to/title_index_path/from_step_1.json" \
    --input "path/to/input/qa/file.json" \
    --output_dir "path/to/output/directory/"

By default, the script will retrieve 100 passages (--n_docs), use string matching to determine answer presence (--answer_type), and take in .json files (--input_file_type). You can optionally provide a glob using the --glob flag. The script writes the results to the file with the same name as the input file, but in the output directory.

Evaluating Retriever Results

We provide an evaluation script in utils/accuracy.py. The expected format is equivalent to DPR's output format. It either accepts a single file to evaluate, or a glob of multiple files if the --glob option is set. To evaluate a single file, you can use the following command:

python utils/accuracy.py \
    --results "path/to/retrieval/results.json" \
    --k_values 1,5,20,100

or with a glob with:

python utils/accuracy.py \
    --results="path/to/glob*.test.json" \
    --glob \
    --k_values 1,5,20,100

Bugs or Questions?

Feel free to open an issue on this GitHub repository and we'd be happy to answer your questions as best we can!

Citation

If you use our dataset or code in your research, please cite our work:

@inproceedings{sciavolino2021simple,
   title={Simple Entity-centric Questions Challenge Dense Retrievers},
   author={Sciavolino, Christopher and Zhong, Zexuan and Lee, Jinhyuk and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Recursive Bayesian Networks

Recursive Bayesian Networks This repository contains the code to reproduce the results from the NeurIPS 2021 paper Lieck R, Rohrmeier M (2021) Recursi

Robert Lieck 11 Oct 18, 2022
A lightweight tool to get an AI Infrastructure Stack up in minutes not days.

K3ai will take care of setup K8s for You, deploy the AI tool of your choice and even run your code on it.

k3ai 105 Dec 04, 2022
U-Net Brain Tumor Segmentation

U-Net Brain Tumor Segmentation 🚀 :Feb 2019 the data processing implementation in this repo is not the fastest way (code need update, contribution is

Hao 448 Jan 02, 2023
[ICCV21] Self-Calibrating Neural Radiance Fields

Self-Calibrating Neural Radiance Fields, ICCV, 2021 Project Page | Paper | Video Author Information Yoonwoo Jeong [Google Scholar] Seokjun Ahn [Google

381 Dec 30, 2022
The repository offers the official implementation of our paper in PyTorch.

Cloth Interactive Transformer (CIT) Cloth Interactive Transformer for Virtual Try-On Bin Ren1, Hao Tang1, Fanyang Meng2, Runwei Ding3, Ling Shao4, Phi

Bingoren 49 Dec 01, 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 803 Dec 28, 2022
Official implementation of the PICASO: Permutation-Invariant Cascaded Attentional Set Operator

PICASO Official PyTorch implemetation for the paper PICASO:Permutation-Invariant Cascaded Attentive Set Operator. Requirements Python 3 torch = 1.0 n

Samira Zare 0 Dec 23, 2021
Official pytorch implementation of the AAAI 2021 paper Semantic Grouping Network for Video Captioning

Semantic Grouping Network for Video Captioning Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo. AAAI 2021. [arxiv] Environment Ubuntu 16.04 CU

Hobin Ryu 43 Nov 25, 2022
CLOOB training (JAX) and inference (JAX and PyTorch)

cloob-training Pretrained models There are two pretrained CLOOB models in this repo at the moment, a 16 epoch and a 32 epoch ViT-B/16 checkpoint train

Katherine Crowson 64 Nov 27, 2022
An educational tool to introduce AI planning concepts using mobile manipulator robots.

JEDAI Explains Decision-Making AI Virtual Machine Image The recommended way of using JEDAI is to use pre-configured Virtual Machine image that is avai

Autonomous Agents and Intelligent Robots 13 Nov 15, 2022
Pure python implementations of popular ML algorithms.

Minimal ML algorithms This repo includes minimal implementations of popular ML algorithms using pure python and numpy. The purpose of these notebooks

Alexis Gidiotis 3 Jan 10, 2022
This is the official repository for our paper: ''Pruning Self-attentions into Convolutional Layers in Single Path''.

Pruning Self-attentions into Convolutional Layers in Single Path This is the official repository for our paper: Pruning Self-attentions into Convoluti

Zhuang AI Group 77 Dec 26, 2022
These are the materials for the paper "Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations"

Few-shot-NLEs These are the materials for the paper "Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations". You can find the smal

Yordan Yordanov 0 Oct 21, 2022
Process text, including tokenizing and representing sentences as vectors and Applying some concepts like RNN, LSTM and GRU to create a classifier can detect the language in which a sentence is written from among 17 languages.

Language Identifier What is this ? The goal of this project is to create a model that is able to predict a given sentence language through text proces

Hossam Asaad 9 Dec 15, 2022
An atmospheric growth and evolution model based on the EVo degassing model and FastChem 2.0

EVolve Linking planetary mantles to atmospheric chemistry through volcanism using EVo and FastChem. Overview EVolve is a linked mantle degassing and a

Pip Liggins 2 Jan 17, 2022
Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

567 Dec 26, 2022
FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data

FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data. Flexible EM-Inspired Discriminant Analysis is a robust supervised classification algorithm that performs well i

0 Sep 06, 2022
In this project, we develop a face recognize platform based on MTCNN object-detection netcwork and FaceNet self-supervised network.

模式识别大作业——人脸检测与识别平台 本项目是一个简易的人脸检测识别平台,提供了人脸信息录入和人脸识别的功能。前端采用 html+css+js,后端采用 pytorch,

Xuhua Huang 5 Aug 02, 2022
Towards Long-Form Video Understanding

Towards Long-Form Video Understanding Chao-Yuan Wu, Philipp Krähenbühl, CVPR 2021 [Paper] [Project Page] [Dataset] Citation @inproceedings{lvu2021,

Chao-Yuan Wu 69 Dec 26, 2022
This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model inference.

PyTorch Infer Utils This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model infer

Alex Gorodnitskiy 11 Mar 20, 2022