Official repository with code and data accompanying the NAACL 2021 paper "Hurdles to Progress in Long-form Question Answering" (https://arxiv.org/abs/2103.06332).

Overview

Hurdles to Progress in Long-form Question Answering

This repository contains the official scripts and datasets accompanying our NAACL 2021 paper, "Hurdles to Progress in Long-form Question Answering". This repository supports inference from the pretrained retriever / generator & includes evaluation scripts.

Specifically, this codebase contains the model checkpoints, inference scripts for the retriever / generator model, generated outputs from model using c-REALM retrievals and random retrievals, scripts to compute ROUGE-L / R-Prec scores using the generations, scripts for question paraphrase classification, scripts for computing ROUGE-L bounds. You can also find the original Routing Transformer model's codebase here.

Setup

pip install transformers
pip install tensor2tensor

For GPU support, you might need to change the version of your TensorFlow depending on the CUDA / CuDNN installation (details). GPU support is strongly recommended for faster inference.

Model Checkpoints & Generations

Routing Transformer finetuned on ELI5: link
c-REALM TF Hub model + encoded retrieval corpora: link
c-REALM tokenized KILT Wikipedia data: link
c-REALM tokenized ELI5 training data: link
Pre-computed generations & QQP classifier: link

The original Routing Transformer model (pretrained on PG-19) and a local attention version of it can be found in the main repository (link).

Generating from the Routing Transformer

(We have provided the pre-computed retrievals from c-REALM on ELI5, so no need to run the c-REALM retriever)

  1. Download the "Routing Transformer finetuned on ELI5" model listed above and place it inside models.
wget https://storage.googleapis.com/rt-checkpoint/eli5_checkpoint.zip
unzip eli5_checkpoint.zip -d models
rm eli5_checkpoint.zip
  1. Download the generations folder from the Google Drive link listed as "Pre-computed generations & QQP classifier" above.

  2. Run eval_generate_eli5.py to generate from the model. We have provided c-REALM retrieval outputs in the script for the ELI5 validation / test split. For custom inputs, you will need to load the retriever and wikipedia corpus (see next section). Generation is on the slower side (~4 minutes per ELI5 QA pair on a 1080ti GPU), we hope to switch to the faster decoding mode in the Routing Transformer model in the near future.

Retrievals from c-REALM

(This script only tests the retriever, it doesn't depend on the Routing Transformer generator model)

  1. Download the "c-REALM TF Hub model + encoded retrieval corpora" model listed above. Place it inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/retriever.zip
unzip retriever.zip -d models
rm retriever.zip
  1. Download "c-REALM tokenized KILT Wikipedia data" if you are interested in retrieving from the KILT Wikipedia corpus and/or "c-REALM tokenized ELI5 training data" if you are interested in retrieving question paraphrases from the ELI5 training set. Place them inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/eli5_retrieval_train.zip
unzip eli5_retrieval_train.zip -d models
rm eli5_retrieval_train.zip
  1. Run eval_retriever_eli5.py to retrieve using c-REALM. Modify the --retrieval_corpus flag to choose the retrieval corpus.

Evaluation of Outputs

Setup

  1. Download the generations folder from the Google Drive link into this root folder.

  2. Clone the KILT repository in this folder and run the installation in a virtual environment.

git clone https://github.com/facebookresearch/KILT
cd KILT
virtualenv -p python3.7 kilt-venv
pip install -r requirements.txt
pip install --editable .
  1. If you are interested in using the Quora Question Paraphrase classifier (used in Section 3.2 of the paper), download the roberta-large-finetuned-qqp folder from "Pre-computed generations & QQP classifier" listed above. This model was built by Tu Vu.

  2. Download the ELI5 train, validation and test splits.

cd KILT
wget http://dl.fbaipublicfiles.com/KILT/eli5-train-kilt.jsonl -O train.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-dev-kilt.jsonl -O valid.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-test_without_answers-kilt.jsonl -O test.jsonl

Running evaluations

Enter the KILT folder and run the following command for evaluating p=0.6 with c-REALM retrievals on the validation set:

python kilt/eval_downstream.py ../generations/final_guess_eli5_0.6_predicted_retrieval.jsonl ../generations/final_gold_eli5_0.6_predicted_retrieval.jsonl

which should give you the output (partly reported in Table 6 of the paper),

{   'downstream': {   'accuracy': 0.0,
                      'em': 0.0,
                      'f1': 0.25566078582652935,
                      'rougel': 0.24417152125142375},
    'kilt': {   'KILT-accuracy': 0.0,
                'KILT-em': 0.0,
                'KILT-f1': 0.03414819887348917,
                'KILT-rougel': 0.03205580975169385},
    'retrieval': {'Rprec': 0.13258897418004187, '[email protected]': 0.2122586648057688}}

To evaluate other configurations, modify the paths in the command above. You can replace 0.6 with 0.9 for higher entropy generations, and replace predicted with random for randomly selected retrieval paragraphs (Hurdle #1 or Section 3.1 in the paper). Note that you should make this change for both the guess and gold files, to ensure correct alignment. We have only provided generations for the validation set since the test set answers / retrievals for ELI5 are hidden behind the KILT leaderboard.

Question paraphrase classification using QQP Classifier

In Section 3.2 of our paper, we used a Quora Question Paraphrase classifier to find question paraphrases amoung similar questions retrieved by c-REALM. To run this, make sure you have downloaded the QQP checkpoint (step 3 in Setup) and run,

python run_qqp.py --input_file generations/final_guess_eli5_0.6_similar_questions.jsonl

You should get a score of 43.6%. Note that this is a lower-bound --- qualitatively we found this classifier missed several paraphrase pairs with low lexical overlap, or cases where the retrieved training set question will have a super-set of the information needed to answer the validation set question.

Lower and Upper Bounds on ROUGE-L

Run the following to evaluate bounds on ROUGE-L. Make sure you have completed steps 1, 4 in the setup above. Scripts to evaluate other bounds involving training set retrieval coming soon!

cp generate_final_guess_bounds.py KILT/
cd KILT

# Copy input lowerbound, should get 20.0 ROUGE-L
python generate_final_guess_bounds.py --bound_type copy_input

# Random training set answer, should get 15.8-16.2 ROUGE-L depending on randomness
python generate_final_guess_bounds.py --bound_type random_train_ans

# "Performance" can be further boosted by randomly selecting from only longest answers
# for each training set question, up to ~16.7 ROUGE-L. This result is not reported in
# paper, but can be run using:
python generate_final_guess_bounds.py --bound_type random_train_ans_longest

# Longest gold answer upperbound, should get 21.2 ROUGE-L
python generate_final_guess_bounds.py --bound_type longest_gold

# Best gold answer upperbound, should get 26.2 ROUGE-L (takes a while to run, 45 min on single core)
python generate_final_guess_bounds.py --bound_type best_gold

Citation

If you found our paper or this repository useful, please cite:

@inproceedings{lfqa21,
author={Kalpesh Krishna and Aurko Roy and Mohit Iyyer},
Booktitle = {North American Association for Computational Linguistics},
Year = "2021",
Title={Hurdles to Progress in Long-form Question Answering},
}
Owner
Kalpesh Krishna
PhD student in Computer Science at UMass Amherst. Formerly IIT Bombay, @google-research, @mozilla, TTIC, @wncc.
Kalpesh Krishna
Signals-backend - A suite of card games written in Python

Card game A suite of card games written in the Python language. Features coming

1 Feb 15, 2022
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

AugMix Introduction We propose AugMix, a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented ima

Google Research 876 Dec 17, 2022
Repository for the semantic WMI loss

Installation: pip install -e . Installing DL2: First clone DL2 in a separate directory and install it using the following commands: git clone https:/

Nick Hoernle 4 Sep 15, 2022
A PyTorch re-implementation of Neural Radiance Fields

nerf-pytorch A PyTorch re-implementation Project | Video | Paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall

Krishna Murthy 709 Jan 09, 2023
Adversarial Learning for Modeling Human Motion

Adversarial Learning for Modeling Human Motion This repository contains the open source code which reproduces the results for the paper: Adversarial l

wangqi 6 Jun 15, 2021
Benchmark tools for Compressive LiDAR-to-map registration

Benchmark tools for Compressive LiDAR-to-map registration This repo contains the released version of code and datasets used for our IROS 2021 paper: "

Allie 9 Nov 24, 2022
Exemplo de implementação do padrão circuit breaker em python

fast-circuit-breaker Circuit breakers existem para permitir que uma parte do seu sistema falhe sem destruir todo seu ecossistema de serviços. Michael

James G Silva 17 Nov 10, 2022
Code for "Intra-hour Photovoltaic Generation Forecasting based on Multi-source Data and Deep Learning Methods."

pv_predict_unet-lstm Code for "Intra-hour Photovoltaic Generation Forecasting based on Multi-source Data and Deep Learning Methods." IEEE Transactions

FolkScientistInDL 8 Oct 08, 2022
Simultaneous Detection and Segmentation

Simultaneous Detection and Segmentation This is code for the ECCV Paper: Simultaneous Detection and Segmentation Bharath Hariharan, Pablo Arbelaez,

Bharath Hariharan 96 Jul 20, 2022
Differentiable Optimizers with Perturbations in Pytorch

Differentiable Optimizers with Perturbations in PyTorch This contains a PyTorch implementation of Differentiable Optimizers with Perturbations in Tens

Jake Tuero 54 Jun 22, 2022
Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

LADA This repo contains codes for the following paper: Jiaao Chen*, Zhenghui Wang*, Ran Tian, Zichao Yang, Diyi Yang: Local Additivity Based Data Augm

GT-SALT 36 Dec 02, 2022
Fbone (Flask bone) is a Flask (Python microframework) starter/template/bootstrap/boilerplate application.

Fbone (Flask bone) is a Flask (Python microframework) starter/template/bootstrap/boilerplate application.

Wilson 1.7k Dec 30, 2022
Implementation for paper "STAR: A Structure-aware Lightweight Transformer for Real-time Image Enhancement" (ICCV 2021).

STAR-pytorch Implementation for paper "STAR: A Structure-aware Lightweight Transformer for Real-time Image Enhancement" (ICCV 2021). CVF (pdf) STAR-DC

43 Dec 21, 2022
Backdoor Attack through Frequency Domain

Backdoor Attack through Frequency Domain DEPENDENCIES python==3.8.3 numpy==1.19.4 tensorflow==2.4.0 opencv==4.5.1 idx2numpy==1.2.3 pytorch==1.7.0 Data

5 Jun 18, 2022
Callable PyTrees and filtered JIT/grad transformations => neural networks in JAX.

Equinox Callable PyTrees and filtered JIT/grad transformations = neural networks in JAX Equinox brings more power to your model building in JAX. Repr

Patrick Kidger 909 Dec 30, 2022
Here is the implementation of our paper S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations.

S2VC Here is the implementation of our paper S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations. In thi

81 Dec 15, 2022
Official implementation of the network presented in the paper "M4Depth: A motion-based approach for monocular depth estimation on video sequences"

M4Depth This is the reference TensorFlow implementation for training and testing depth estimation models using the method described in M4Depth: A moti

Michaël Fonder 76 Jan 03, 2023
[CVPR 2022 Oral] Balanced MSE for Imbalanced Visual Regression https://arxiv.org/abs/2203.16427

Balanced MSE Code for the paper: Balanced MSE for Imbalanced Visual Regression Jiawei Ren, Mingyuan Zhang, Cunjun Yu, Ziwei Liu CVPR 2022 (Oral) News

Jiawei Ren 267 Jan 01, 2023
[CoRL 21'] TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo

TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo Lukas Koestler1*    Nan Yang1,2*,†    Niclas Zeller2,3    Daniel Cremers1

TUM Computer Vision Group 744 Jan 04, 2023
Ejemplo Algoritmo Viterbi - Example of a Viterbi algorithm applied to a hidden Markov model on DNA sequence

Ejemplo Algoritmo Viterbi Ejemplo de un algoritmo Viterbi aplicado a modelo ocul

Mateo Velásquez Molina 1 Jan 10, 2022