Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

Overview

PEGASUS library

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on arXiv. ICML 2020 accepted.

If you use this code or these models, please cite the following paper:

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Results update

We train a pegasus model with sampled gap sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated the results are reported in this table.

dataset C4 HugeNews Mixed & Stochastic
xsum 45.20/22.06/36.99 47.21/24.56/39.25 47.60/24.83/39.64
cnn_dailymail 43.90/21.20/40.76 44.17/21.47/41.11 44.16/21.56/41.30
newsroom 45.07/33.39/41.28 45.15/33.51/41.33 45.98/34.20/42.18
multi_news 46.74/17.95/24.26 47.52/18.72/24.91 47.65/18.75/24.95
gigaword 38.75/19.96/36.14 39.12/19.86/36.24 39.65/20.47/36.76
wikihow 43.07/19.70/34.79 41.35/18.51/33.42 46.39/22.12/38.41 *
reddit_tifu 26.54/8.94/21.64 26.63/9.01/21.60 27.99/9.81/22.94
big_patent 53.63/33.16/42.25 53.41/32.89/42.07 52.29/33.08/41.66 *
arxiv 44.70/17.27/25.80 44.67/17.18/25.73 44.21/16.95/25.67
pubmed 45.49/19.90/27.69 45.09/19.56/27.42 45.97/20.15/28.25
aeslc 37.69/21.85/36.84 37.40/21.22/36.45 37.68/21.25/36.51
billsum 57.20/39.56/45.80 57.31/40.19/45.82 59.67/41.58/47.59

The "Mixed & Stochastic" model has the following changes:

  • trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
  • trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
  • the model uniformly sample a gap sentence ratio between 15% and 45%.
  • importance sentences are sampled using a 20% uniform noise to importance scores.
  • the sentencepiece tokenizer is updated to be able to encode newline character.

(*) the numbers of wikihow and big_patent datasets are not comparable because of change in tokenization and data:

  • wikihow dataset contains newline characters which is useful for paragraph segmentation, the C4 and HugeNews model's sentencepiece tokenizer doesn't encode newline and loose this information.
  • we update the BigPatent dataset to preserve casing, some format cleanings are also changed, please refer to change in TFDS.

Setup

create an instance on google cloud with GPU (optional)

Please create a project first and create an instance

gcloud compute instances create \
  ${VM_NAME} \
  --zone=${ZONE} \
  --machine-type=n1-highmem-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --boot-disk-size=500GB \
  --image-project=ml-images \
  --image-family=tf-1-15 \
  --maintenance-policy TERMINATE --restart-on-failure

install library and dependencies

Clone library on github and install requirements.

git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt

Download vocab, pretrained and fine-tuned checkpoints of all experiments from Google Cloud.

Alternatively in terminal, follow the instruction and install gsutil. Then

mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/

Finetuning on downstream datasets

on existing dataset

Finetune on an existing dataset aeslc.

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

If you would like to finetune on a subset of dataset, please refer to the example of input pattern.

Evaluate on the finetuned dataset.

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Note that the above example is using a single GPU so the batch_size is much smaller than the results reported in the paper.

add new finetuning dataset

Two types of dataset format are supported: TensorFlow Datasets (TFDS) or TFRecords.

This tutorial shows how to add a new dataset in TFDS. (The fine-tuning dataset is expected to be supervised, please provide supervised_keys in dataset info).

Tfrecords format requires each record to be a tf example of {"inputs":tf.string, "targets":tf.string}.

For example, if you registered a TFDS dataset called new_tfds_dataset for training and evaluation, and have some files in tfrecord format called new_dataset_files.tfrecord* for test, they can be registered in /pegasus/params/public_params.py.

@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)

Evaluation metrics.

Evaluation results can be found in mode_dir. Summarization metrics are automatically calculated for each evaluation point.

  • ROUGE is the main metric for summarization quality.

  • BLEU is an alternative quality metric for language generation.

  • Extractive Fragments Coverage & Density are metrics that measures the abstractiveness of the summary.

  • Repetition Rates measures generation repetition failure modes.

  • Length statistics measures the length distribution of decodes comparing to gold summary.

Several types of output files can be found in model_dir

  • text_metrics-*.txt: above metrics in text format. Each row contains metric name, 95% lower bound value, mean value, 95% upper bound value.
  • inputs-.txt, targets-.txt, predictions-*.txt: raw text files of model inputs/outputs.

Pre-training

Pretraining (on C4 or any other corpus) requires a customly built tensorflow that includes ops for on-the-fly parsing that processes raw text document into model inputs and targets ids. Please refer to pegasus/ops/pretrain_parsing_ops.cc and pegasus/data/parsers.py for details.

Acknowledgements

Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich [email protected].

Owner
Google Research
Google Research
Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

CIRPLANT This repository contains the code and pre-trained models for Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT) For d

Zheyuan (David) Liu 29 Nov 17, 2022
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

OpenBMB 377 Jan 02, 2023
Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

MLP Singer Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis. Audio samples are available on our demo page.

Neosapience 103 Dec 23, 2022
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
An open-source NLP research library, built on PyTorch.

An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. Quic

AI2 11.4k Jan 01, 2023
Various Algorithms for Short Text Mining

Short Text Mining in Python Introduction This package shorttext is a Python package that facilitates supervised and unsupervised learning for short te

Kwan-Yuet 466 Dec 06, 2022
SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Erre Quadro Srl 384 Dec 12, 2022
Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification"

PTR Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification" If you use the code, please cite the following paper: @art

THUNLP 118 Dec 30, 2022
Mastering Transformers, published by Packt

Mastering Transformers This is the code repository for Mastering Transformers, published by Packt. Build state-of-the-art models from scratch with adv

Packt 195 Jan 01, 2023
Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Bloomberg 8 Nov 09, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023