Fusion-in-Decoder Distilling Knowledge from Reader to Retriever for Question Answering

Related tags

Deep LearningFiD
Overview

This repository contains code for:

  • Fusion-in-Decoder models
  • Distilling Knowledge from Reader to Retriever

Dependencies

  • Python 3
  • PyTorch (currently tested on version 1.6.0)
  • Transformers (version 3.0.2, unlikely to work with a different version)
  • NumPy

Data

Download data

NaturalQuestions and TriviaQA data can be downloaded using get-data.sh. Both datasets are obtained from the original source and the wikipedia dump is downloaded from the DPR repository. In addition to the question and answers, this script retrieves the Wikipedia passages used to trained the released pretrained models.

Data format

The expected data format is a list of entry examples, where each entry example is a dictionary containing

  • id: example id, optional
  • question: question text
  • target: answer used for model training, if not given, the target is randomly sampled from the 'answers' list
  • answers: list of answer text for evaluation, also used for training if target is not given
  • ctxs: a list of passages where each item is a dictionary containing - title: article title - text: passage text

Entry example:

{
  'id': '0',
  'question': 'What element did Marie Curie name after her native land?',
  'target': 'Polonium',
  'answers': ['Polonium', 'Po (chemical element)', 'Po'],
  'ctxs': [
            {
                "title": "Marie Curie",
                "text": "them on visits to Poland. She named the first chemical element that she discovered in 1898 \"polonium\", after her native country. Marie Curie died in 1934, aged 66, at a sanatorium in Sancellemoz (Haute-Savoie), France, of aplastic anemia from exposure to radiation in the course of her scientific research and in the course of her radiological work at field hospitals during World War I. Maria Sk\u0142odowska was born in Warsaw, in Congress Poland in the Russian Empire, on 7 November 1867, the fifth and youngest child of well-known teachers Bronis\u0142awa, \"n\u00e9e\" Boguska, and W\u0142adys\u0142aw Sk\u0142odowski. The elder siblings of Maria"
            },
            {
                "title": "Marie Curie",
                "text": "was present in such minute quantities that they would eventually have to process tons of the ore. In July 1898, Curie and her husband published a joint paper announcing the existence of an element which they named \"polonium\", in honour of her native Poland, which would for another twenty years remain partitioned among three empires (Russian, Austrian, and Prussian). On 26 December 1898, the Curies announced the existence of a second element, which they named \"radium\", from the Latin word for \"ray\". In the course of their research, they also coined the word \"radioactivity\". To prove their discoveries beyond any"
            }
          ]
}

Pretrained models.

Pretrained models can be downloaded using get-model.sh. Currently availble models are [nq_reader_base, nq_reader_large, nq_retriever, tqa_reader_base, tqa_reader_large, tqa_retriever].

bash get-model.sh -m model_name

Performance of the pretrained models:

Mode size NaturalQuestions TriviaQA
dev test dev test
base 49.2 50.1 68.7 69.3
large 52.7 54.4 72.5 72.5

I. Fusion-in-Decoder

Fusion-in-Decoder models can be trained using train_reader.py and evaluated with test_reader.py.

Train

train_reader.py provides the code to train a model. An example usage of the script is given below:

python train_reader.py \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --model_size base \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_experiment \
        --checkpoint_dir checkpoint \

Training these models with 100 passages is memory intensive. To alleviate this issue we use checkpointing with the --use_checkpoint option. Tensors of variable sizes lead to memory overhead. Encoder input tensors have a fixed size by default, but not the decoder input tensors. The tensor size on the decoder side can be fixed using --answer_maxlength. The large readers have been trained on 64 GPUs with the following hyperparameters:

python train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \

Test

You can evaluate your model or a pretrained model with test_reader.py. An example usage of the script is provided below.

python test_reader.py \
        --model_path checkpoint_dir/my_experiment/my_model_dir/checkpoint/best_dev \
        --eval_data eval_data.json \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \

II. Distilling knowledge from reader to retriever for question answering

This repository also contains code to train a retriever model following the method proposed in our paper: Distilling knowledge from reader to retriever for question answering. This code is heavily inspired by the DPR codebase and reuses parts of it. The proposed method consists in several steps:

1. Obtain reader cross-attention scores

Assuming that we have already retrieved relevant passages for each question, the first step consists in generating cross-attention scores. This can be done using the option --write_crossattention_scores in test.py. It saves the dataset with cross-attention scores in checkpoint_dir/name/dataset_wscores.json. To retrieve the initial set of passages for each question, different options can be considered, such as DPR or BM25.

python test.py \
        --model_path my_model_path \
        --eval_data data.json \
        --per_gpu_batch_size 4 \
        --n_context 100 \
        --name my_test \
        --checkpoint_dir checkpoint \
        --write_crossattention_scores \

2. Retriever training

train_retriever.py provides the code to train a retriever using the scores previously generated.

python train_retriever.py \
        --lr 1e-4 \
        --optim adamw \
        --scheduler linear \
        --train_data train_data.json \
        --eval_data eval_data.json \
        --n_context 100 \
        --total_steps 20000 \
        --scheduler_steps 30000 \

3. Knowldege source indexing

Then the trained retriever is used to index a knowldege source, Wikipedia in our case.

python3 generate_retriever_embedding.py \
        --model_path <model_dir> \ #directory
        --passages passages.tsv \ #.tsv file
        --output_path wikipedia_embeddings \
        --shard_id 0 \
        --num_shards 1 \
        --per_gpu_batch_size 500 \

4. Passage retrieval

After indexing, given an input query, passages can be efficiently retrieved:

python passage_retrieval.py \
    --model_path <model_dir> \
    --passages psgs_w100.tsv \
    --data_path data.json \
    --passages_embeddings "wikipedia_embeddings/wiki_*" \
    --output_path retrieved_data.json \
    --n-docs 100 \

We found that iterating the four steps here can improve performances, depending on the initial set of documents.

References

[1] G. Izacard, E. Grave Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

@misc{izacard2020leveraging,
      title={Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2007.01282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

[2] G. Izacard, E. Grave Distilling Knowledge from Reader to Retriever for Question Answering

@misc{izacard2020distilling,
      title={Distilling Knowledge from Reader to Retriever for Question Answering},
      author={Gautier Izacard and Edouard Grave},
      year={2020},
      eprint={2012.04584},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

See the LICENSE file for more details.

Owner
Meta Research
Meta Research
Buffon’s needle: one of the oldest problems in geometric probability

Buffon-s-Needle Buffon’s needle is one of the oldest problems in geometric proba

3 Feb 18, 2022
这是一个yolox-pytorch的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Pytorch当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤

Bubbliiiing 613 Jan 05, 2023
Implementation of a protein autoregressive language model, but with autoregressive infilling objective (editing subsequences capability)

Protein GLM (wip) Implementation of a protein autoregressive language model, but with autoregressive infilling objective (editing subsequences capabil

Phil Wang 17 May 06, 2022
Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance

Models for natural language understanding (NLU) tasks often rely on the idiosyncratic biases of the dataset, which make them brittle against test cases outside the training distribution.

Ubiquitous Knowledge Processing Lab 22 Jan 02, 2023
Neighborhood Reconstructing Autoencoders

Neighborhood Reconstructing Autoencoders The official repository for Neighborhood Reconstructing Autoencoders (Lee, Kwon, and Park, NeurIPS 2021). T

Yonghyeon Lee 24 Dec 14, 2022
Implementation for Curriculum DeepSDF

Curriculum-DeepSDF This repository is an implementation for Curriculum DeepSDF. Full paper is available here. Preparation Please follow original setti

Haidong Zhu 69 Dec 29, 2022
Image to Image translation, image generataton, few shot learning

Semi-supervised Learning for Few-shot Image-to-Image Translation [paper] Abstract: In the last few years, unpaired image-to-image translation has witn

yaxingwang 49 Nov 18, 2022
A light and fast one class detection framework for edge devices. We provide face detector, head detector, pedestrian detector, vehicle detector......

A Light and Fast Face Detector for Edge Devices Big News: LFD, which is a big update of LFFD, now is released (2021.03.09). It is strongly recommended

YonghaoHe 1.3k Dec 25, 2022
CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning

CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning This repository contains the code and relevant instructions

XiaoMing 5 Aug 19, 2022
Dataset for the Research2Clinics @ NeurIPS 2021 Paper: What Do You See in this Patient? Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022
Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms

AdvancedHMC.jl AdvancedHMC.jl provides a robust, modular and efficient implementation of advanced HMC algorithms. An illustrative example for Advanced

The Turing Language 167 Jan 01, 2023
Kaggle | 9th place (part of) solution for the Bristol-Myers Squibb – Molecular Translation challenge

Part of the 9th place solution for the Bristol-Myers Squibb – Molecular Translation challenge translating images containing chemical structures into I

Erdene-Ochir Tuguldur 22 Nov 30, 2022
Activity tragle - Google is tracking everything, we just look at it

activity_tragle Google is tracking everything, we just look at it here. You need

BERNARD Guillaume 1 Feb 15, 2022
The code for Bi-Mix: Bidirectional Mixing for Domain Adaptive Nighttime Semantic Segmentation

BiMix The code for Bi-Mix: Bidirectional Mixing for Domain Adaptive Nighttime Semantic Segmentation arxiv Framework: visualization results: Requiremen

stanley 18 Sep 18, 2022
EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration

EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration Ruikang Xu, Zeyu Xiao, Jie Huang, Yueyi Zhang, Zhiwei Xiong. EDPN: Enhanced Deep Pyra

69 Dec 15, 2022
PyG (PyTorch Geometric) - A library built upon PyTorch to easily write and train Graph Neural Networks (GNNs)

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

PyG 16.5k Jan 08, 2023
MultiTaskLearning - Multi Task Learning for 3D segmentation

Multi Task Learning for 3D segmentation Perception stack of an Autonomous Drivin

2 Sep 22, 2022
CLEAR algorithm for multi-view data association

CLEAR: Consistent Lifting, Embedding, and Alignment Rectification Algorithm The Matlab, Python, and C++ implementation of the CLEAR algorithm, as desc

MIT Aerospace Controls Laboratory 30 Jan 02, 2023
Deep Learning Theory

Deep Learning Theory 整理了一些深度学习的理论相关内容,持续更新。 Overview Recent advances in deep learning theory 总结了目前深度学习理论研究的六个方向的一些结果,概述型,没做深入探讨(2021)。 1.1 complexity

fq 103 Jan 04, 2023
Motion Reconstruction Code and Data for Skills from Videos (SFV)

Motion Reconstruction Code and Data for Skills from Videos (SFV) This repo contains the data and the code for motion reconstruction component of the S

268 Dec 01, 2022