Entity-Based Knowledge Conflicts in Question Answering.

Overview

Run Instructions | Paper | Citation | License

This repository provides the Substitution Framework described in Section 2 of our paper, Entity-Based Knowledge Conflicts in Question Answering. Given a question answering dataset, we derive a new dataset in which the context passages have been modified to contain new answers to their questions. By training on the original examples and evaluating on the derived examples, we simulate a parametric-contextual knowledge conflict, which is useful for understanding how models employ sources of knowledge to arrive at a decision.

Our dataset derivation follows two steps: (1) identifying named entity answers, and (2) replacing all occurrences of the answer in the context with a substituted entity, effectively changing the answer. The answer substitutions depend on the chosen substitution policy.
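The second step above can be illustrated with a minimal sketch. It assumes the answer span has already been identified (step 1) and that a substitute entity has already been chosen. The function name substitute_answer and the example dictionary layout are hypothetical and are not taken from this repository's code.

```python
import re

def substitute_answer(context: str, answer: str, substitute: str) -> str:
    """Replace every literal occurrence of `answer` in `context` with `substitute`."""
    # re.escape makes the answer match as plain text, not as a regex pattern.
    pattern = re.compile(re.escape(answer))
    # A callable replacement keeps backslashes in `substitute` from being interpreted.
    return pattern.sub(lambda _match: substitute, context)

# Toy example (hypothetical field names, not the repository's schema).
example = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy by William Shakespeare. "
               "William Shakespeare wrote it around 1600.",
    "answer": "William Shakespeare",
}

new_answer = "Jane Austen"
substituted = {
    **example,
    "context": substitute_answer(example["context"], example["answer"], new_answer),
    "answer": new_answer,
}
print(substituted["context"])
```

A model trained on the original example and evaluated on the substituted one must choose between what it memorized and what the modified context now says.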

Run Instructions

1. Setup

Set up requirements and download the SpaCy and Wikidata dependencies.

bash setup.sh

2. (Optional) Download and Process Wikidata

This optional stage reproduces wikidata/entity_info.json.gz, downloaded during Setup.

Download the Wikidata dump from October 2020 here and the Wikipedia pageviews from June 2, 2021 here.

NOTE: We don't use the newest Wikidata dump because Wikidata doesn't keep old dumps, which makes reproducibility an issue. If you'd like to use the newest dump, it is available here. Wikipedia pageviews, on the other hand, are kept around and can be found here. Be sure to download the *-user.bz2 file and not the *-automatic.bz2 or *-spider.bz2 files.

To extract the Wikidata information, run the following (takes ~8 hours):

python extract_wikidata_info.py --wikidata_dump wikidata-20201026-all.json.bz2 --popularity_dump pageviews-20210602-user.bz2 --output_file entity_info.json.gz

The output file of this step is available here.

3. Load and Preprocess Dataset

PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsTrain -w wikidata/entity_info.json.gz
PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsDev -w wikidata/entity_info.json.gz

4. Generate Substitutions

PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsTrain.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsTrain.jsonl <substitution-command> -n 1 ...
PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsDev.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsDev.jsonl <substitution-command> -n 1 ...

See the descriptions of the substitution policies (substitution commands) we provide under Our Substitution Functions. Inspect the argparse parser and the substitution-specific subparsers in generate_substitutions.py to see additional arguments.

Our Substitution Functions

Here we define the substitution functions we provide. Each function ingests a QADataset and modifies the context passage according to defined rules, so that the context now supports a new answer to the question. Greater detail is provided in our paper.

  • Alias Substitution (sub-command: alias-substitution): Here we replace an answer with one of its Wikidata aliases. Since the substituted answer is always semantically equivalent, answer type preservation is naturally maintained.
  • Popularity Substitution (sub-command: popularity-substitution): Here we replace answers with a Wikidata answer of the same type, within a specified popularity bracket (according to monthly page views).
  • Corpus Substitution (sub-command: corpus-substitution): Here we replace answers with other answers of the same type, sampled from the same corpus.
  • Type Swap Substitution (sub-command: type-swap-substitution): Here we replace answers with other answers of a different type, sampled from the same corpus.
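The contrast between corpus substitution and type swap substitution can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the helper names, the (text, type) tuple layout, and the type labels "LOC", "PER", and "DATE" are all hypothetical.

```python
import random
from collections import defaultdict

def build_type_index(answers):
    """Group answer strings from the corpus by their answer type label."""
    by_type = defaultdict(list)
    for text, answer_type in answers:
        by_type[answer_type].append(text)
    return by_type

def corpus_substitute(answer, answer_type, by_type, rng):
    """Sample a different corpus answer that has the SAME type as the original."""
    candidates = [a for a in by_type[answer_type] if a != answer]
    return rng.choice(candidates) if candidates else None

def type_swap_substitute(answer_type, by_type, rng):
    """Sample a corpus answer whose type DIFFERS from the original's type."""
    candidates = [
        a for t, items in by_type.items() if t != answer_type for a in items
    ]
    return rng.choice(candidates) if candidates else None

# Toy corpus of (answer text, hypothetical type label) pairs.
corpus_answers = [
    ("Paris", "LOC"),
    ("Berlin", "LOC"),
    ("Marie Curie", "PER"),
    ("1969", "DATE"),
]
by_type = build_type_index(corpus_answers)
rng = random.Random(0)  # seeded so that sampling is repeatable

same_type = corpus_substitute("Paris", "LOC", by_type, rng)
other_type = type_swap_substitute("LOC", by_type, rng)
print(same_type, "|", other_type)
```

Both helpers return None when no candidate exists, so a caller can skip examples that cannot be substituted under the chosen policy.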

How to Add Your own Dataset / Substitution Fn / NER Models

Use your own Dataset

To add your own dataset, create your own subclass of QADataset (in src/classes/qadataset.py).

  1. Override the read_original_dataset function to read your dataset, creating a List of QAExample objects.
  2. Add your class and the url/filepath to the DATASETS variable in src/load_dataset.py.

See MRQANaturalQuetsionsDataset in src/classes/qadataset.py as an example.
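The two steps above might look roughly like the sketch below. To keep it self-contained, QAExample and QADataset are simplified stand-ins defined inline. Their fields, constructor arguments, and the shape of DATASETS are assumptions and will differ from the real definitions in src/classes/ and src/load_dataset.py, which should be checked before copying anything. The JSON Lines field names ("id", "question", "context", "answers") are also hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Stand-ins for the repository's classes (assumed shapes, NOT the real definitions).
@dataclass
class QAExample:
    uid: str
    query: str
    context: str
    answers: List[str]

class QADataset:
    def __init__(self, name: str, original_path: str):
        self.name = name
        self.original_path = original_path

    def read_original_dataset(self, file_path: str) -> List[QAExample]:
        raise NotImplementedError

class MyJsonlDataset(QADataset):
    """Hypothetical dataset stored as one JSON object per line."""

    def read_original_dataset(self, file_path: str) -> List[QAExample]:
        examples = []
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                examples.append(
                    QAExample(
                        uid=row["id"],
                        query=row["question"],
                        context=row["context"],
                        answers=row["answers"],
                    )
                )
        return examples

# Step 2: register the class and its file path (the registry shape is an assumption).
DATASETS = {"MyJsonlDataset": (MyJsonlDataset, "data/my_dataset.jsonl")}
```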

Use your own Substitution Function

We define 5 different substitution functions in src/generate_substitutions.py. These are described under Our Substitution Functions. Inspect their docstrings and feel free to add your own, leveraging any of the Wikidata, derived answer type, or other info we populate for examples and answers. Here are the steps to create your own:

  1. Add a subparser in src/generate_substitutions.py for your new function, with any relevant parameters. See alias_sub_parser as an example.
  2. Add your own substitution function to src/substitution_fns.py, ensuring the signature arguments match those specified in the subparser. See alias_substitution_fn as an example.
  3. Add a reference to your new function to SUBSTITUTION_FNS in src/generate_substitutions.py. Ensure the dictionary key matches the subparser name.
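The three steps above follow a common argparse pattern: one subparser per policy, plus a dictionary that maps the subparser name to a function. The sketch below shows that pattern in isolation. The reverse-substitution policy, its function signature, and the meaning given to -n here are invented for illustration; the real parser and SUBSTITUTION_FNS in src/generate_substitutions.py may be structured differently.

```python
import argparse

def reverse_substitution_fn(answers, num_samples):
    """Hypothetical policy for illustration: reverse each answer string."""
    return [[answer[::-1]] * num_samples for answer in answers]

# Step 3: the dictionary key must match the subparser name registered below.
SUBSTITUTION_FNS = {"reverse-substitution": reverse_substitution_fn}

def build_parser():
    parser = argparse.ArgumentParser(description="Toy substitution CLI")
    parser.add_argument("--inpath")
    parser.add_argument("--outpath")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Step 1: a subparser carrying the parameters specific to the new policy.
    reverse_sub_parser = subparsers.add_parser("reverse-substitution")
    reverse_sub_parser.add_argument(
        "-n", "--num-samples", dest="num_samples", type=int, default=1
    )
    return parser

args = build_parser().parse_args(
    ["--inpath", "in.jsonl", "--outpath", "out.jsonl", "reverse-substitution", "-n", "2"]
)

# Step 2: the function's keyword arguments mirror the subparser's arguments.
substitution_fn = SUBSTITUTION_FNS[args.command]
result = substitution_fn(["Paris"], num_samples=args.num_samples)
print(result)
```

Keeping the dictionary key identical to the subparser name lets the dispatch stay a single lookup, so adding a policy never requires touching the dispatch code.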

Use your own Named Entity Recognition and/or Entity Linking Model

Our SpaCy NER model is trained and used mainly to categorize answer text into answer types. Only substitutions that preserve answer type are likely to be coherent.

The functions which need to be changed are:

  1. run_ner_linking in utils.py, which loads the NER model and populates info for each answer (see function docstring).
  2. Answer._select_answer_type() in src/classes/answer.py, which uses the NER answer type label and Wikidata type labels to categorize the answer into a type category.

Citation

Please cite the following if you found this resource or our paper useful.

@misc{longpre2021entitybased,
      title={Entity-Based Knowledge Conflicts in Question Answering}, 
      author={Shayne Longpre and Kartik Perisetla and Anthony Chen and Nikhil Ramesh and Chris DuBois and Sameer Singh},
      year={2021},
      eprint={2109.05052},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The Knowledge Conflicts repository and the entity-based substitution framework are licensed according to the LICENSE file.

Contact Us

To contact us, feel free to email the authors listed in the paper or create an issue in this repository.

Owner
Apple