Two-stage text summarization with BERT and BART

Two-Stage Text Summarization

Description

We experiment with a two-stage summarization model on the CNN/DailyMail dataset. The model combines the ability to filter informative sentences (as in extractive summarization) with the ability to paraphrase (as in abstractive summarization). Our best model achieves a ROUGE-L F1 score of 39.82, outperforming both the strong Lead-3 baseline and BERTSumEXT. Qualitative analysis indicates better readability and factual accuracy. Further, fine-tuning both stages with our oracle as the gold reference shows the potential to outperform BART.
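For readers unfamiliar with the two reference points mentioned above, the following is a minimal sketch of the Lead-3 baseline (take the first three sentences of the article) and of a sentence-level ROUGE-L F1 score based on the longest common subsequence. It is an illustration only and not the evaluation code used in this repository: the function names `lcs_length`, `rouge_l_f1`, and `lead_3` are hypothetical, tokenization is plain lowercased whitespace splitting, and there is no stemming, so its numbers will not match scores from a standard ROUGE toolkit.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 (assumption: whitespace tokens, no stemming)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def lead_3(sentences: List[str]) -> str:
    """Lead-3 baseline: the first three sentences of the article."""
    return " ".join(sentences[:3])
```

A baseline summary would then be scored with something like `rouge_l_f1(lead_3(article_sentences), reference_summary)`, where `article_sentences` and `reference_summary` are placeholders for an article already split into sentences and its gold summary.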

Results

Environment

conda create -n text-sum python=3.8
conda activate text-sum
pip install -r src/requirements.txt

Extraction stage

See here

Abstraction stage

See here

Owner
Yukai Yang (Alexis)