Improving Factual Consistency of Abstractive Text Summarization

Overview

Improving Factual Consistency of Abstractive Text Summarization

We provide the code for the papers:

  1. "Entity-level Factual Consistency of Abstractive Text Summarization", EACL 2021.
    • We provide a set of new metrics to quantify the entity-level factual consistency of generated summaries. We also provide code for the two methods in our paper:
      • JAENS: joint entity and summary generation, and
      • Summary-worthy entity classification with summarization (multi-task learning)
  2. "Improving Factual Consistency of Abstractive Summarization via Question Answering", ACL-IJCNLP 2021
    • QUALS, a new automatic metric for factual consistency.
    • CONSEQ, a new contrastive learning algorithm for Seq2seq models to optimize sequence level objectives such as QUALS.

Our code is based on the fairseq library and we added support for model training on Sagemaker.

Requirements and setup

  • python==3.6: conda create -n entity_fact python=3.6
  • pytorch==1.4.0: pip install torch==1.4.0 torchvision==0.5.0
  • run pip install --editable ./
  • install file2rouge following instructions here
  • download en_core_web_lg: python -m spacy download en_core_web_lg

Entity-level Factual Consistency of Abstractive Text Summarization

Data preprocessing:

We provide three options to preprocess summarization data through the filter_level option.

  • filter_level=0: no special processing
  • filter_level=1: remove corruption text in source articles and summaries. (Undesirable texts included as a result of imperfect data collection. e.g. "Share this with Email, Facebook, Messenger". Undesirable summaries such as "Collection of all USATODAY.com coverage of People, including articles, videos, photos, and quotes.")
  • filter_level=2: entity hallucination filtering in addition to corruption text removal. A summary sentence is removed if it contains a named entity not in the source document.

XSUM:

  1. Follow the instructions here to download and extract text from HTML files and establish the xsum-extracts-from-downloads directory.
  2. Let be the directory that contains the xsum-extracts-from-downloads directory and XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json.
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_xsum --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

CNNDM:

  1. Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files in a directory /raw_stories .
  2. Download the url files mapping_train.txt, mapping_test.txt and mapping_valid.txt from here to .
  3. Run python preprocess/data_prepro_clean.py --mode preprocess_cnndm --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

NEWSROOM:

  1. Download the datasets following instructions from here.
  2. Run python preprocess/data_prepro_clean.py --mode preprocess_newsroom --input_dir --output_dir /processed-data --filter_level 0 with filter_level set to 0, 1, or 2.

Tokenize and binarize the data:

Download bpe encoder.json, vocabulary and fairseq dictionary to a directory, say ; then tokenize and binarize the data.

wget -O <bpe-dir>/encoder.json 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -O <bpe-dir>/vocab.bpe 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N <bpe-dir>/dict.txt' https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

cd preprocess
python data_prepro_clean.py --mode bpe_binarize --input_dir <processed-data-dir> --tokenizer_dir <bpe-dir>

This generates the binary input files as well as the dictionaries under /data_bin for fairseq.

JAENS: Joint Entity and Summary Generation

The idea is to train the seq2seq model to generate .

1. prepare the train/val data by generating entity augmented targets:

python preprocess/create_entity_classification_labels.py --base_dir <processed-data-dir> --type entity_augment --tokenizer_dir <bpe-dir>

Binarize the augmented targets: cd preprocess

python data_prepro_clean.py --mode binarize --input_dir 
   
    /entity_augment --tokenizer_dir 
    

    
   

Since we already binarized the source documents, we just need to create symbolic links to put all binary input files together for fairseq training:

ln -s <processed-data-dir>/data_bin/train.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/train.source-target.source.idx
ln -s <processed-data-dir>/data_bin/train.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/train.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.bin <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.bin
ln -s <processed-data-dir>/data_bin/valid.source-target.source.idx <processed-data-dir>/entity_augment/data_bin/valid.source-target.source.idx

2. Fine-tune the BART-large model on the generated data:

Run the launch scripts scripts/launch_xsum.py, scripts/launch_cnndm.py or scripts/launch_newsroom.py to fine-tune the BART-large model. Note you need to modify the following in the scripts:

  • hyperparameters.
  • train_path: location of the binary input files. e.g. /entity_augment/data_bin .
  • init_path: location of the pre-trained BART-large model checkpoint. Please rename the checkpoint to pretrained_model.pt
  • output_path: location for the model outputs.

If training locally, you need to specify ngpus - the number of GPUS in the local machine. Example command:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type local

If training on Sagemaker, you need to specify the docker image name (image_name) as well as execution role (role). To create Sagemaker docker container and push to ECR:

./build_and_push.sh 
   

   

To launch training job:

python scripts/launch_xsum.py --datatype ner_filtered --epoch 8 --exp_type sagemaker

3. Generate the summaries from the fine-tuned models:

preprocess/multi_gpu_generate.py is used to generate summaries.

Since the JAENS models generates the named entities before the summaries, we need to remove the named entities before evaluating the summaries. Example command:

python evaluate_hypo.py --mode remove_ent_from_hypo --base_dir 
   
     --sub_dir 
    
      --split val --pattern .*.hypo

    
   

4. To evaluate the generated summaries for ROUGE as well as entity level factual scores:

We use the tokenizer from Stanford CoreNLP package. Example command:

export CLASSPATH=path/to/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar
python evaluate_hypo.py --mode evaluate_summary --base_dir <output-dir> --sub_dir <output-sub-dir> --split val --pattern .*.hypo

See preprocess/run_*_eval.sh for examples.

Summary-worthy entity classification with summarization (multi-task learning)

We perform summary-worthy entity classification at a classification head on the encoder while keeping the seq2seq objective at the decoder. For training, we need to preprocess the input document by create B-I-O labels to identify summary-worthy entities:

python create_entity_classification_labels.py --base_dir 
   
     --type cls_labels --tokenizer_dir 
    
     
python data_prepro_clean.py --mode binarize_cls_labels --input_dir 
     
       --output_dir 
      
       /data_bin --tokenizer_dir 
        
       
      
     
    
   

Launch training jobs using scripts scripts/launch_multitask_*.py.

Improving Factual Consistency of Abstractive Summarization via Question Answering

QUALS (QUestion Answering with Language model score for Summarization)

To evaluate the QUALS of summaries (e.g. test.target) given original input (e.g. test.source), we execute the following steps in the preprocess sub-directory.

0. Prepare summaries into jsonl format

python evaluate_hypo.py --mode convert_hypo_to_json --base_dir 
   
     --sub_dir 
    
      --split test --pattern .target

    
   

1. Generating question and answer pairs from summaries

python sm_inference_asum.py --task gen_qa --base_dir 
   
     --source_dir 
    
      --output_dir 
     
       --num_workers 
      
        --bsz 5 --beam 60 --max_len 60 --min_len 8 --checkpoint_dir 
       
         --ckp_file checkpoint2.pt --bin_dir 
        
         /data_bin --diverse_beam_groups 60 --diverse_beam_strength 0.5 --batch_lines True --input_file test.target.hypo --return_token_scores True 
        
       
      
     
    
   

Here, we use diverse beam search to generate 60 question-answer pairs for each summary. The batch_lines option is set to True to batch bsz input summaries together for efficient generation. The QAGen model is trained by fine-tuning BART on the SQuAD and NewsQA datasets by concatenating the question-answer pairs using a separator.

To train the QAGen model, place the dev-v1.1.json and train-v1.1.json of SQuAD and the combined-newsqa-data-v1.json of the NewsQA under . The following command generates the binarized input for fine-tuning BART using Fairseq.

python data_prepro_clean.py --mode newsqa_squad_prepro --input_dir 
   
     --output_dir 
    

    
   

You can also download our trained QAGen model from s3 by running:

aws s3 cp s3://fact-check-summarization/newsqa-squad-qagen-checkpoint/checkpoint2.pt 
   
    /

   

Alternatively, you can download here if you don't have awscli.

2. Filter the generated question and answer for high quality pairs

python evaluate_hypo.py --mode filter_qas_dataset_lm_score --base_dir 
   
     --sub_dir 
    
      --pattern test.target.hypo.beam60.qas

    
   

3. Evaluate the generated question and answer pairs using the source document as input

python sm_inference_asum.py --task qa_eval --base_dir 
   
     --output_dir 
    
      --num_workers 
     
       --bsz 30 --checkpoint_dir 
      
        --ckp_file checkpoint2.pt --bin_dir 
       
        /data_bin --qas_dir 
        
          --source_file test.source --target_file test.target --input_file test.target.qas_filtered --prepend_target False 
        
       
      
     
    
   

4. Compute QUALS scores for each summary

python evaluate_hypo.py --mode compute_hypos_lm_score --base_dir 
   
     --sub_dir 
    
      --pattern test.*.source_eval_noprepend

    
   

CONSEQ (CONtrastive SEQ2seq learning)

To use QUALS to improve factual consistency of the summarization model using the CONSEQ algorithm, we follow the steps:

  1. Obtain the MLE summarization baseline by fine-tuning the BART model. Note that in the ACL paper, we used the corruption-filtered CNNDM and XSUM datasets (filter_level=1).
  2. Use the MLE summarization model to sample summaries on the training data.
  3. Evaluate the QUALS for the generated summaries as well as the ground truth summaries of the training data.
  4. Form the positive and negative sets for contrastive learning.
  5. Fine-tune the MLE summarization model using the positive and negative examples. Example scripts for launching training jobs locally or on Sagemaker are preprocess/run_generate_unlikelihood_train_cnndm.sh and preprocess/run_generate_unlikelihood_train_xsum.sh.

We provide an example script preprocess/run_generate_unlikelihood_train_xsum.sh to illustrate steps 2-4. Note that

  • To avoid running for a long time and encountering OOM errors and then restarting the whole process, we split the input files into smaller ones. We do this by splitting the source file by line (e.g. each sub-file has 10000 lines):
split -l 10000 train.source train.source.split   
  • The script has to make repeated calls of python sm_inference_asum.py --task gen_qa to generate question-ansewr pairs, as many times as there are sub-files as a result of line splits. The python function automatically checks which sub-files have been processed (based on output files) so it always processes the next available sub-file. If all sub-files have been processed, it will simply do nothing so it's safe if it's called more times than there are available sub-files.
  • Similarly, sm_inference_asum.py --task qa_eval needs to be repeated called to cover all sub-files.
  • The speed of question-answer pairs generation depends on the batch size setting. Depending on the summary file and the batch_lines setting, batching is handled differently. If the summary file contains only a single summary per input document, batch_lines should be set to True and bsz number of input lines are batched together as input to the QAGen model. If the summary file contains multiple summaries per input document, batch_lines should be set to False and the batching is done using bsz within each input example (line). For example, if there are 6 summaries per line in the summary file, we should set batch_lines to False; setting bsz to 7 will batch all the 7 summaries in a line together, which gives the best speed. (setting it higher won't improve speed since we do not do batching over different lines of input as batch_lines is False). On CNN-DM, bsz of 7 would sometimes result in OOM errors with 16G GPU memory so I use 3 or 4; on XSUM, it is safe to use 7.
  • num_workers should be the number of GPUs available on the machine. The lines in each input files will be distributed per GPU.
  • Finally, run the following to concatenate the QUALS scores from the sub-files: cat *.quals > train.source.source_eval_noprepend.quals

Citations

@inproceedings{nan-etal-2021-entity,
    title = "Entity-level Factual Consistency of Abstractive Text Summarization",
    author = "Nan, Feng  and
      Nallapati, Ramesh  and
      Wang, Zhiguo  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Zhang, Dejiao  and
      McKeown, Kathleen  and
      Xiang, Bing",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.235",
    pages = "2727--2733",
}

@inproceedings{nan-etal-2021-improving,
    title = {Improving Factual Consistency of Abstractive Summarization via Question Answering},
    author = "Nan, Feng  and
      Nogueira dos Santos, Cicero  and
      Zhu, Henghui  and
      Ng, Patrick  and
      McKeown, Kathleen  and
      Nallapati, Ramesh  and
      Zhang, Dejiao  and
      Wang, Zhiguo  and
      Arnold, Andrew  and
      Xiang, Bing",
    booktitle = {Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP)},
    address = {Virtual},
    month = {August},
    url = {https://arxiv.org/abs/2105.04623},
    year = {2021}
}
Official implementation of the NRNS paper: No RL, No Simulation: Learning to Navigate without Navigating

No RL No Simulation (NRNS) Official implementation of the NRNS paper: No RL, No Simulation: Learning to Navigate without Navigating NRNS is a heriarch

Meera Hahn 20 Nov 29, 2022
We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and StiCa This is an official pytorch implementation of papers: Multi-modal Self-Supervision from Generalized D

Facebook Research 42 Dec 09, 2022
Lightweight tool to perform MITM attack on local network

ARPSpy - A lightweight tool to perform MITM attack Using many library to perform ARP Spoof and auto-sniffing HTTP packet containing credential. (Never

MinhItachi 8 Aug 28, 2022
MQBench Quantization Aware Training with PyTorch

MQBench Quantization Aware Training with PyTorch I am using MQBench(Model Quantization Benchmark)(http://mqbench.tech/) to quantize the model for depl

Ling Zhang 29 Nov 18, 2022
My solutions for Stanford University course CS224W: Machine Learning with Graphs Fall 2021 colabs (GNN, GAT, GraphSAGE, GCN)

machine-learning-with-graphs My solutions for Stanford University course CS224W: Machine Learning with Graphs Fall 2021 colabs Course materials can be

Marko Njegomir 7 Dec 14, 2022
Code for the paper "Reinforced Active Learning for Image Segmentation"

Reinforced Active Learning for Image Segmentation (RALIS) Code for the paper Reinforced Active Learning for Image Segmentation Dependencies python 3.6

Arantxa Casanova 79 Dec 19, 2022
StyleGAN - Official TensorFlow Implementation

StyleGAN — Official TensorFlow Implementation Picture: These people are not real – they were produced by our generator that allows control over differ

NVIDIA Research Projects 13.1k Jan 09, 2023
PyTorch Implementation of Small Lesion Segmentation in Brain MRIs with Subpixel Embedding (ORAL, MICCAIW 2021)

Small Lesion Segmentation in Brain MRIs with Subpixel Embedding PyTorch implementation of Small Lesion Segmentation in Brain MRIs with Subpixel Embedd

22 Oct 21, 2022
The official PyTorch implementation of recent paper - SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

This repository is the official PyTorch implementation of SAINT. Find the paper on arxiv SAINT: Improved Neural Networks for Tabular Data via Row Atte

Gowthami Somepalli 284 Dec 21, 2022
MLJetReconstruction - using machine learning to reconstruct jets for CMS

MLJetReconstruction - using machine learning to reconstruct jets for CMS The C++ data extraction code used here was based heavily on that foundv here.

ALPhA Davidson 0 Nov 17, 2021
Background-Click Supervision for Temporal Action Localization

Background-Click Supervision for Temporal Action Localization This repository is the official implementation of BackTAL. In this work, we study the te

LeYang 221 Oct 09, 2022
LIVECell - A large-scale dataset for label-free live cell segmentation

LIVECell dataset This document contains instructions of how to access the data associated with the submitted manuscript "LIVECell - A large-scale data

Sartorius Corporate Research 112 Jan 07, 2023
Set of methods to ensemble boxes from different object detection models, including implementation of "Weighted boxes fusion (WBF)" method.

Set of methods to ensemble boxes from different object detection models, including implementation of "Weighted boxes fusion (WBF)" method.

1.4k Jan 05, 2023
🕹️ Official Implementation of Conditional Motion In-betweening (CMIB) 🏃

Conditional Motion In-Betweening (CMIB) Official implementation of paper: Conditional Motion In-betweeening. Paper(arXiv) | Project Page | YouTube in-

Jihoon Kim 81 Dec 22, 2022
Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks

Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks - Official Project Page This repository contains the code develope

Amirsina Torfi 1.7k Dec 18, 2022
Learning an Adaptive Meta Model-Generator for Incrementally Updating Recommender Systems

Learning an Adaptive Meta Model-Generator for Incrementally Updating Recommender Systems This is our experimental code for RecSys 2021 paper "Learning

11 Jul 28, 2022
Everything you need to know about NumPy( Creating Arrays, Indexing, Math,Statistics,Reshaping).

Everything you need to know about NumPy( Creating Arrays, Indexing, Math,Statistics,Reshaping).

1 Feb 14, 2022
Defending graph neural networks against adversarial attacks (NeurIPS 2020)

GNNGuard: Defending Graph Neural Networks against Adversarial Attacks Authors: Xiang Zhang ( Zitnik Lab @ Harvard 44 Dec 07, 2022

RoFormer_pytorch

PyTorch RoFormer 原版Tensorflow权重(https://github.com/ZhuiyiTechnology/roformer) chinese_roformer_L-12_H-768_A-12.zip (提取码:xy9x) 已经转化为PyTorch权重 chinese_r

yujun 283 Dec 12, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022