BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Related tags

Deep Learningbooksum
Overview

BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

Introduction

The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.

Paper link: https://arxiv.org/abs/2105.08209

Table of Contents

  1. Updates
  2. Citation
  3. Legal Note
  4. License
  5. Usage
  6. Get Involved

Updates

4/15/2021

Initial commit

Citation

@article{kryscinski2021booksum,
      title={BookSum: A Collection of Datasets for Long-form Narrative Summarization}, 
      author={Wojciech Kry{\'s}ci{\'n}ski and Nazneen Rajani and Divyansh Agarwal and Caiming Xiong and Dragomir Radev},
      year={2021},
      eprint={2105.08209},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Legal Note

By downloading or using the resources, including any code or scripts, shared in this code repository, you hereby agree to the following terms, and your use of the resources is conditioned on and subject to these terms.

  1. You may only use the scripts shared in this code repository for research purposes. You may not use or allow others to use the scripts for any other purposes and other uses are expressly prohibited.
  2. You will comply with all terms and conditions, and are responsible for obtaining all rights, related to the services you access and the data you collect.
  3. We do not make any representations or warranties whatsoever regarding the sources from which data is collected. Furthermore, we are not liable for any damage, loss or expense of any kind arising from or relating to your use of the resources shared in this code repository or the data collected, regardless of whether such liability is based in tort, contract or otherwise.

License

The code is released under the BSD-3 License (see LICENSE.txt for details).

Usage

1. Chapterized Project Guteberg Data

The chapterized book text from Gutenberg, for the books we use in our work, has been made available through a public GCP bucket. It can be fetched using:

gsutil cp gs://sfr-books-dataset-chapters-research/all_chapterized_books.zip .

or downloaded directly here.

2. Data Collection

Data collection scripts for the summary text are organized by the different sources that we use summaries from. Note: At the time of collecting the data, all links in literature_links.tsv were working for the respective sources.

For each data source, run get_works.py to first fetch the links for each book, and then run get_summaries.py to get the summaries from the collected links.

python scripts/data_collection/cliffnotes/get_works.py
python scripts/data_collection/cliffnotes/get_summaries.py

3. Data Cleaning

Data Cleaning is performed through the following steps:

First script for some basic cleaning operations, like removing parentheses, links etc from the summary text

python scripts/data_cleaning_scripts/basic_clean.py

We use intermediate alignments in summary_chapter_matched_all_sources.jsonl to identify which summaries are separable, and separates them, creating new summaries (eg. Chapters 1-3 summary separated into 3 different files - Chapter 1 summary, Chapter 2 summary, Chapter 3 summary)

python scripts/data_cleaning_scripts/split_aggregate_chaps_all_sources.py

Lastly, our final cleaning script using various regexes to separate out analysis/commentary text, removes prefixes, suffixes etc.

python scripts/data_cleaning_scripts/clean_summaries.py

Data Alignments

Generating paragraph alignments from the chapter-level-summary-alignments, is performed individually for the train/test/val splits:

Gather the data from the summaries and book chapters into a single jsonl

python paragraph-level-summary-alignments/gather_data.py

Generate alignments of the paragraphs with sentences from the summary using the bi-encoder paraphrase-distilroberta-base-v1

python paragraph-level-summary-alignments/align_data_bi_encoder_paraphrase.py

Aggregate the generated alignments for cases where multiple sentences from chapter-summaries are matched to the same paragraph from the book

python paragraph-level-summary-alignments/aggregate_paragraph_alignments_bi_encoder_paraphrase.py

Troubleshooting

  1. The web archive links we collect the summaries from can often be unreliable, taking a long time to load. One way to fix this is to use higher sleep timeouts when one of the links throws an exception, which has been implemented in some of the scripts.
  2. Some links that constantly throw errors are aggregated in a file called - 'section_errors.txt'. This is useful to inspect which links are actually unavailable and re-running the data collection scripts for those specific links.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
Code for "Typilus: Neural Type Hints" PLDI 2020

Typilus A deep learning algorithm for predicting types in Python. Please find a preprint here. This repository contains its implementation (src/) and

47 Nov 08, 2022
Contrastive Language-Image Pretraining

CLIP [Blog] [Paper] [Model Card] [Colab] CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pair

OpenAI 11.5k Jan 08, 2023
Multi-Task Learning as a Bargaining Game

Nash-MTL Official implementation of "Multi-Task Learning as a Bargaining Game". Setup environment conda create -n nashmtl python=3.9.7 conda activate

Aviv Navon 87 Dec 26, 2022
Steerable discovery of neural audio effects

Steerable discovery of neural audio effects Christian J. Steinmetz and Joshua D. Reiss Abstract Applications of deep learning for audio effects often

Christian J. Steinmetz 182 Dec 29, 2022
xitorch: differentiable scientific computing library

xitorch is a PyTorch-based library of differentiable functions and functionals that can be widely used in scientific computing applications as well as deep learning.

24 Apr 15, 2021
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 36 Oct 31, 2022
This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

DropEdge: Towards Deep Graph Convolutional Networks on Node Classification This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Con

401 Dec 16, 2022
PixelPyramids: Exact Inference Models from Lossless Image Pyramids (ICCV 2021)

PixelPyramids: Exact Inference Models from Lossless Image Pyramids This repository contains the PyTorch implementation of the paper PixelPyramids: Exa

Visual Inference Lab @TU Darmstadt 8 Dec 11, 2022
ACL'2021: LM-BFF: Better Few-shot Fine-tuning of Language Models

LM-BFF (Better Few-shot Fine-tuning of Language Models) This is the implementation of the paper Making Pre-trained Language Models Better Few-shot Lea

Princeton Natural Language Processing 607 Jan 07, 2023
Pure python implementations of popular ML algorithms.

Minimal ML algorithms This repo includes minimal implementations of popular ML algorithms using pure python and numpy. The purpose of these notebooks

Alexis Gidiotis 3 Jan 10, 2022
Styled Handwritten Text Generation with Transformers (ICCV 21)

⚡ Handwriting Transformers [PDF] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan & Mubarak Shah Abstract: We

Ankan Kumar Bhunia 85 Dec 22, 2022
Optical machine for senses sensing using speckle and deep learning

# Senses-speckle [Remote Photonic Detection of Human Senses Using Secondary Speckle Patterns](https://doi.org/10.21203/rs.3.rs-724587/v1) paper Python

Zeev Kalyuzhner 0 Sep 26, 2021
Open source Python implementation of the HDR+ photography pipeline

hdrplus-python Open source Python implementation of the HDR+ photography pipeline, originally developped by Google and presented in a 2016 article. Th

77 Jan 05, 2023
The PyTorch re-implement of a 3D CNN Tracker to extract coronary artery centerlines with state-of-the-art (SOTA) performance. (paper: 'Coronary artery centerline extraction in cardiac CT angiography using a CNN-based orientation classifier')

The PyTorch re-implement of a 3D CNN Tracker to extract coronary artery centerlines with state-of-the-art (SOTA) performance. (paper: 'Coronary artery centerline extraction in cardiac CT angiography

James 135 Dec 23, 2022
One implementation of the paper "DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing".

Introduction One implementation of the paper "DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing". Users

seq-to-mind 18 Dec 11, 2022
Ensembling Off-the-shelf Models for GAN Training

Data-Efficient GANs with DiffAugment project | paper | datasets | video | slides Generated using only 100 images of Obama, grumpy cats, pandas, the Br

MIT HAN Lab 1.2k Dec 26, 2022
AISTATS 2019: Confidence-based Graph Convolutional Networks for Semi-Supervised Learning

Confidence-based Graph Convolutional Networks for Semi-Supervised Learning Source code for AISTATS 2019 paper: Confidence-based Graph Convolutional Ne

MALL Lab (IISc) 56 Dec 03, 2022
LieTransformer: Equivariant Self-Attention for Lie Groups

LieTransformer This repository contains the implementation of the LieTransformer used for experiments in the paper LieTransformer: Equivariant Self-At

OxCSML (Oxford Computational Statistics and Machine Learning) 50 Dec 28, 2022
🚗 INGI Dakar 2K21 - Be the first one on the finish line ! 🚗

🚗 INGI Dakar 2K21 - Be the first one on the finish line ! 🚗 This year's first semester Club Info challenge will put you at the head of a car racing

ClubINFO INGI (UCLouvain) 6 Dec 10, 2021