A guide to tackling text summarization

Overview

awesome-text-summarization

Motivation

To take appropriate action, we need the latest information.
However, the amount of information keeps growing. There are many categories of information (economy, sports, health, technology...) and many sources (news sites, blogs, SNS...).

growth_of_data

from THE HISTORICAL GROWTH OF DATA: WHY WE NEED A FASTER TRANSFER SOLUTION FOR LARGE DATA SETS

So a feature that automatically produces accurate summaries will help us understand topics and shorten the time needed to do so.

Task Definition

Basically, we can regard "summarization" as a "function" whose input is a document and whose output is a summary. The types of its input and output help us categorize the various summarization tasks.

  • Single document summarization
    • summary = summarize(document)
  • Multi-document summarization
    • summary = summarize(document_1, document_2, ...)

We can also pass a query to add a viewpoint to the summarization.

  • Query focused summarization
    • summary = summarize(document, query)

This type of summarization is called "query-focused summarization", in contrast to "generic summarization". In particular, a type that sets the viewpoint to the "difference" (update) is called "update summarization".

  • Update summarization
    • summary = summarize(document, previous_document_or_summary)

The "summary" itself also comes in several varieties.

  • Indicative summary
    • It resembles the summary of a book. It describes what kind of story it is but does not tell the whole story, especially its ending (so an indicative summary carries only partial information).
  • Informative summary
    • In contrast to the indicative summary, the informative summary includes full information of the document.
  • Keyword summary
    • Not the text, but the words or phrases from the input document.
  • Headline summary
    • A one-line summary.

Discussion

Is generic summarization really useful? Sparck Jones argued that summarization should not be done in a vacuum, but rather according to the purpose of the summary (2). She argued that generic summarization is not necessary and is in fact wrong-headed. On the other hand, the headlines and 3-line summaries in newspapers do help us.

Basic Approach

There are mainly two ways to make a summary: extractive and abstractive.

Extractive

  • Select relevant phrases of the input document and concatenate them to form a summary (like "copy-and-paste").
    • Pros: They are quite robust since they use existing natural-language phrases that are taken straight from the input.
    • Cons: They lack flexibility since they cannot use novel words or connectors. They also cannot paraphrase as people sometimes do.

Below are some categories of extractive summarization.

Graph Base

The graph-based model builds a graph from the document, then summarizes it by considering the relations between the nodes (text units). TextRank is the typical graph-based method.

TextRank

TextRank is based on the PageRank algorithm used by the Google search engine. Its basic concept is "a page that is linked to is good, and even more so if the links come from pages that are themselves linked to by many pages". The links between pages are expressed as a matrix (like a round-robin table). We can convert this matrix to a transition probability matrix by dividing each entry by the total number of links on its page. The page surfer then moves between pages according to this matrix.

page_rank.png Page Rank Algorithm
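
The conversion to a transition probability matrix and the surfer's movement can be sketched with NumPy as follows. This is a minimal illustration, not Google's implementation: the `damping` factor (random-jump probability) and the uniform handling of pages without outgoing links are assumptions that the text above does not mention.

```python
import numpy as np

def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """links[i][j] = 1 if page i links to page j (assumed input layout)."""
    links = np.asarray(links, dtype=float)
    n = links.shape[0]
    # Divide each row by the number of links on that page.
    out_degree = links.sum(axis=1, keepdims=True)
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    # Assumption: a page with no outgoing links jumps to any page uniformly.
    transition = np.where(out_degree > 0, links / safe_degree, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # The surfer follows a link with probability `damping`,
        # otherwise jumps to a random page.
        new_rank = (1 - damping) / n + damping * rank @ transition
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Page 0 is linked from pages 1 and 2, so it should rank highest.
ranks = pagerank([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
```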

TextRank treats words or sentences as the pages in PageRank. So when you use TextRank, the following points are important.

  • Define the "text units" and add them as the nodes in the graph.
  • Define the "relation" between the text units and add them as the edges in the graph.
    • You can also set a weight on each edge.

Then, solve the graph with the PageRank algorithm. LexRank uses sentences as nodes and similarity as the relation/weight (similarity is calculated by IDF-modified cosine similarity).
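
As a rough sketch of these steps, the example below uses sentences as text units and word overlap as the edge weight, then ranks the nodes with a PageRank-style iteration. The Jaccard similarity, the whitespace tokenizer, and the function names are assumptions chosen for brevity; they are neither the similarity from the TextRank paper nor the IDF-modified cosine similarity of LexRank.

```python
import numpy as np

def textrank_sentences(sentences, damping=0.85, iterations=50):
    # Text units: sentences, represented as sets of lowercased words.
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Relation: Jaccard word overlap as the edge weight (an assumption).
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = tokens[i] | tokens[j]
            if i != j and union:
                weights[i, j] = len(tokens[i] & tokens[j]) / len(union)
    # Row-normalize; isolated sentences jump uniformly.
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, weights / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * scores @ transition
    return scores

def summarize(sentences, k=1):
    scores = textrank_sentences(sentences)
    # Keep the k best sentences in their original order.
    top = sorted(np.argsort(-scores)[:k])
    return [sentences[i] for i in top]
```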

If you want to use TextRank, the following tools support it.

Feature Base

The feature-based model extracts features from each sentence, then evaluates its importance. Here is a representative study.

Sentence Extraction Based Single Document Summarization

In this paper, the following features are used.

  • Position of the sentence in input document
  • Presence of the verb in the sentence
  • Length of the sentence
  • Term frequency
  • Named entity (NE) tag
  • Font style

...etc. All the features are accumulated into the score.

feature_base_score.png

The "No. of coreferences" is the number of pronouns that refer to the previous sentence. It is calculated simply by counting the pronouns that occur in the first half of the sentence. So this term of the score reflects references to the previous sentence.

Now we can evaluate each sentence. The next step is to select sentences while avoiding duplicated information. In this paper, the words shared between a new sentence and the already selected sentences are considered. Finally, a refinement step is executed to connect the selected sentences.
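
The selection step might look like the sketch below: take sentences in score order and skip any whose words mostly appear in the sentences already chosen. The `max_overlap` threshold, the overlap ratio, and the function name are hypothetical and are not taken from the paper.

```python
def select_sentences(sentences, scores, k=2, max_overlap=0.5):
    # Visit sentences from the highest score to the lowest.
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen, chosen_words = [], set()
    for i in order:
        words = set(sentences[i].lower().split())
        if not words:
            continue
        # Share of this sentence's words already covered by the selection.
        overlap = len(words & chosen_words) / len(words)
        if overlap <= max_overlap:
            chosen.append(i)
            chosen_words |= words
        if len(chosen) == k:
            break
    # Return the selection in document order.
    return [sentences[i] for i in sorted(chosen)]
```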

Luhn's Algorithm is also feature-based. It evaluates the "significance" of each word, which is calculated from its frequency.
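
A simplified sketch of this idea follows. Words that occur at least `min_freq` times and are not stopwords count as significant, and each sentence is scored by the squared number of significant words divided by the span they cover. The tiny stopword list, the threshold, and the use of a single span per sentence are simplifying assumptions, so treat it as an illustration rather than a faithful reimplementation.

```python
from collections import Counter

# A tiny illustrative stopword list (an assumption, not Luhn's list).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "it"}

def luhn_scores(sentences, min_freq=2):
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for words in tokenized for w in words if w not in STOPWORDS)
    # Significant words: frequent enough across the whole document.
    significant = {w for w, count in freq.items() if count >= min_freq}
    scores = []
    for words in tokenized:
        positions = [i for i, w in enumerate(words) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Span from the first to the last significant word in the sentence.
        span = positions[-1] - positions[0] + 1
        scores.append(len(positions) ** 2 / span)
    return scores
```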

You can try feature-based text summarization with TextTeaser (PyTeaser is available for Python users).

Of course, you can use a deep learning model to extract sentence features. SummaRuNNer is a representative model for extractive summarization by DNN.

summa_runner.png

  1. Make a sentence feature (vector) with a bidirectional RNN (a GRU in the paper) from word vectors (word2vec).
  2. Make a document feature with a bidirectional RNN from the sentence vectors (1).
  3. Calculate the selection probability from 1 & 2.

Topic Base

The topic-based model calculates the topics of the document and evaluates each sentence by which topics it includes (the "main" topic is weighted highly when scoring a sentence).

Latent Semantic Analysis (LSA) is usually used to detect topics. It is based on SVD (Singular Value Decomposition).
The following paper is a good starting point for an overview of LSA (topic) based summarization.

Text summarization using Latent Semantic Analysis

topic_base.png
The simple LSA base sentence selection

There are many variations in how to calculate scores and select sentences according to the SVD values. The simplest method is to select sentences by topic (= V, eigenvectors/principal axes) and its score.
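
A minimal sketch of this simplest variant: build a term-by-sentence count matrix, decompose it with `numpy.linalg.svd`, and for each of the top topics (rows of V^T) pick the sentence with the largest absolute weight. The raw-count weighting, the whitespace tokenizer, and the use of absolute values (to sidestep the sign ambiguity of SVD) are assumptions made for this example.

```python
import numpy as np

def lsa_select(sentences, num_topics=2):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Rows are terms, columns are sentences (raw counts).
    matrix = np.zeros((len(vocab), len(sentences)))
    for j, sentence in enumerate(sentences):
        for w in sentence.lower().split():
            matrix[index[w], j] += 1
    # Each row of vt describes one topic across the sentences,
    # ordered from the largest singular value to the smallest.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    selected = []
    for topic in vt[:num_topics]:
        # Pick the best not-yet-selected sentence for this topic.
        for j in np.argsort(-np.abs(topic)):
            if j not in selected:
                selected.append(int(j))
                break
    return [sentences[j] for j in sorted(selected)]
```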

If you want to use LSA, gensim supports it.

Grammar Base

The grammar-based model parses the text and constructs a grammatical structure, then selects/reorders substructures.

Title Generation with Quasi-Synchronous Grammar

grammer_base.png

This model can produce meaningful "paraphrases" based on the grammatical structure. For example, the image above shows the phrase "in the disputed territory of East Timor" being converted to "in East Timor". Analyzing the grammatical structure is useful for reconstructing a phrase while keeping its meaning.

Neural Network Base

The main themes of extractive summarization by neural networks are the following two points.

  1. How to get good sentence representation.
  2. How to predict the selection of sentence.

1 is an encoding problem, and 2 is an objective function problem. These are summarized in A Survey on Neural Network-Based Summarization Methods.

extractive_by_nn.png

Abstractive

  • Generate a summary that keeps the original intent, just as humans do.
    • Pros: They can use words that were not in the original input, which enables more fluent and natural summaries.
    • Cons: But it is also a much harder problem as you now require the model to generate coherent phrases and connectors.

Extractive and abstractive are not conflicting approaches. You can use both to generate a summary. There is also a way to collaborate with humans.

  • Aided Summarization
    • Combines automatic methods with human input.
    • The computer suggests important information from the document, and the human decides whether to use it. It uses information retrieval and text mining techniques.

In the early days of abstractive summarization, Banko et al. (2000) suggested using a machine translation model as an abstractive summarization model. Just as a machine translation model converts a source language text into a target one, the summarization system converts a source document into a target summary.

Nowadays, the encoder-decoder model, one of the neural network models, is mainly used in machine translation. So this model is also widely used for abstractive summarization. The first summarization model to use an encoder-decoder achieved state-of-the-art results on two sentence-level summarization datasets, DUC-2004 and Gigaword.

If you want to try the encoder-decoder summarization model, TensorFlow offers a basic model.

Encoder-Decoder Model

The encoder-decoder model is composed of an encoder and a decoder, as its name suggests. The encoder converts an input document to a latent representation (vector), and the decoder generates a summary by using it.

encoder_decoder.png

But the encoder-decoder model is not a silver bullet. Many issues remain.

  • How to set the focus on the important sentence, keyword.
  • How to handle the novel/rare (but important) word in source document.
  • How to handle the long document.
  • Want to make more human-readable summary.
  • Want to use large vocabulary.

Researches

A Neural Attention Model for Sentence Summarization

  • How to set the focus on the important sentence, keyword.
    • use Attention (sec 3.2)
  • How to handle the novel/rare (but important) word in source document.
    • add n-gram match term to the loss function (sec 5)
  • Other features
    • use 1D convolution to capture the local context
    • use beam-search to generate summary
  • Implementation

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

  • How to set the focus on the important sentence, keyword.
    • use enhanced feature such as POS, Named Entity tag, TF, IDF (sec 2.2)
  • How to handle the novel/rare (but important) word in source document.
    • switch between the decoder (generate a word) and the pointer (copy from the original text). (sec 2.3)
  • How to handle the long document.
    • use sentence level attention (sec 2.4)
  • Want to use large vocabulary.

Combination Approach

Rather than using only the extractive or only the abstractive approach, combine them to generate summaries.

Pointer-Generator Network

Combine the extractive and abstractive models through a switching probability.

Researches

Get To The Point: Summarization with Pointer-Generator Networks

  • How to set the focus on the important sentence, keyword.
    • use Attention (sec 2.1)
  • How to handle the novel/rare (but important) word in source document.
    • switch between the decoder (generator) and the pointer network (by the p_gen probability).
    • combine the vocabulary distribution and the attention distribution with weights p_gen and (1 - p_gen) (please refer to the following picture).
  • Implementation

get_to_the_point
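
The mixture of the two distributions can be sketched as below. The vocabulary is extended with slots for out-of-vocabulary source words, and the attention mass of each source position is added to the id of the word at that position. The function name, the argument layout, and the `num_oov` parameter are assumptions for illustration, not the authors' code.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, source_ids, num_oov=0):
    """Mix generation and copying.

    vocab_dist: probabilities over the fixed vocabulary.
    attention:  attention weights over the source positions.
    source_ids: extended-vocabulary id of the word at each source position.
    """
    vocab_dist = np.asarray(vocab_dist, dtype=float)
    attention = np.asarray(attention, dtype=float)
    # Generator part, padded with zeros for out-of-vocabulary source words.
    final = np.concatenate([p_gen * vocab_dist, np.zeros(num_oov)])
    # Pointer part: np.add.at accumulates correctly when a word repeats.
    np.add.at(final, source_ids, (1.0 - p_gen) * attention)
    return final

# Word id 3 is outside the 3-word vocabulary but can still be copied.
dist = final_distribution(0.8, [0.5, 0.3, 0.2], [0.6, 0.4], [1, 3], num_oov=1)
```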

A Deep Reinforced Model for Abstractive Summarization

  • How to set the focus on the important sentence, keyword.
    • use intra-temporal attention (attention over specific parts of the encoded input sequence) (sec 2.1)
  • How to handle the novel/rare (but important) word in source document.
    • use pointer network to copy input token instead of generating it. (sec 2.3)
  • Want to make more human-readable summary.
    • use reinforcement learning (ROUGE-optimized RL) with supervised learning. (sec 3.2)
  • Other features
    • use intra-decoder attention (attention to decoded context) to suppress the repetition of the same phrases. (sec 2.2)
    • constrain duplicate trigram to avoid repetition. (sec 2.5)
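
The trigram constraint can be illustrated with a small check that a decoder could run before accepting a candidate token during beam search. The function name and the way it would plug into a decoder are hypothetical.

```python
def repeats_trigram(tokens, candidate):
    """Return True if appending `candidate` would repeat an earlier trigram."""
    if len(tokens) < 2:
        return False
    new_trigram = (tokens[-2], tokens[-1], candidate)
    # All trigrams that already occur in the generated sequence.
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return new_trigram in seen

generated = "the cat sat on the cat".split()
blocked = repeats_trigram(generated, "sat")   # "the cat sat" already occurred
allowed = not repeats_trigram(generated, "ran")
```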

rl_model_for_as

from Your tldr by an ai: a deep reinforced model for abstractive summarization

The pointer-generator network is theoretically beautiful, but you have to pay attention to its behavior.
Weber et al. (2018) report that a pointer-generator model depends heavily on "copying" (the pointer) at test time.
They use a penalty term on the pointer/generator mixture rate to overcome this phenomenon and control the degree of abstraction.

Extract then Abstract model

Use an extractive model to select sentences from the documents, then apply an abstractive model to the selected sentences.

Researches

Generating Wikipedia by Summarizing Long Sequences

  • How to set the focus on the important sentence, keyword.
    • use the encoder-less self-attention network (sec 4.2.3~). It concatenates input & output, and predicts the next token from the previous sequence.
  • How to handle the long document.
    • use the extractive model to extract tokens from long document first, then execute the abstractive model.

generating wikipedia summarization

Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models

This model combines a query-focused extractive model and an abstractive model. It extracts sentences and calculates the relevance score of each word according to the query, then feeds them to a pre-trained abstractive model.

  • How to set the focus on the important sentence, keyword.
    • use attention and query relevance score
  • How to handle the long document.
    • use query to extract document/sentences from multiple documents.

query focused model and abstractive model

Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

The extractor is a reinforcement-learning agent, and the abstractor rewrites the selected sentences.

query focused model and abstractive model

  • How to set the focus on the important sentence, keyword.
    • implement the extractor as a CNN-then-RNN model (Figure 1), and train it with the ROUGE score.
  • How to handle the novel/rare (but important) word in source document.
    • use copy mechanism

Transfer Learning

It is useful to reuse a trained model to build a new summarization model. This is called "transfer learning". It enables building a model with little data and in a short time, which makes domain-specific summarization feasible.

Researches

BERT is a representative model that produces good sentence representations. Several methods have been proposed to create summaries with BERT.

Fine-tune BERT for Extractive Summarization

  • How to use the pre-trained model?
    • Getting good sentence representation.
    • BERT generates token-based features, so they need to be converted into a sentence-based representation. They use "Segmentation Embedding" to recognize sentence boundaries and use the first token of each segment as the sentence embedding.

Pretraining-Based Natural Language Generation for Text Summarization

  • How to use the pre-trained model?
    • Getting good sentence representation and refine a generated sentence.
    • BERT is trained to predict masked tokens, so it cannot generate a sequence. In this research, a first summary is generated by an ordinary transformer model, then some of its tokens are dropped and filled in by BERT. The final summary is created from the BERT representation of the input and the completed (refined) sentence representation produced by BERT.

Evaluation

ROUGE-N

ROUGE-N is the count of word N-grams that match between the model summary and the gold summary. It is similar to "recall" because it evaluates the coverage rate of the gold summary and does not consider n-grams that are absent from it.

ROUGE-1 and ROUGE-2 are usually used. ROUGE-1 is word-based, so word order is not considered: "apple pen" and "pen apple" get the same ROUGE-1 score. With ROUGE-2, however, "apple pen" becomes a single unit, so "apple pen" and "pen apple" do not match. As you increase the "N" in ROUGE-N, it eventually evaluates whether the summaries match exactly.
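
The recall-style computation and the "apple pen" example can be checked with a short sketch. It assumes whitespace tokenization and a single gold summary, and it is not a replacement for an official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    # Divide by the n-gram count of the gold summary (recall).
    return overlap / total

rouge_n("pen apple", "apple pen", n=1)  # 1.0: word order is ignored
rouge_n("pen apple", "apple pen", n=2)  # 0.0: the bigrams differ
```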

BLEU

BLEU is a modified form of "precision" that is commonly used in machine translation evaluation. BLEU is basically calculated from the n-gram co-occurrence between the generated summary and the gold summary (unlike ROUGE, you don't need to specify the "n").
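
A compact sketch of the idea, assuming a single gold summary, whitespace tokenization, uniform weights over 1-grams to 4-grams, and no smoothing (so any n-gram order without a match gives a score of 0). The function names are chosen for this example only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # "Modified": each n-gram is credited at most as often as it occurs in the gold.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if not cand or min(precisions) == 0:
        return 0.0
    # Geometric mean of the precisions for n = 1..max_n.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty discourages overly short candidates.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_mean)
```

The clipping is what distinguishes this from plain precision: a candidate that repeats "the" four times against a gold summary containing it once gets a unigram precision of 1/4, not 1.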

Resources

Datasets

Libraries

Articles

Papers

Overview

  1. A. Nenkova and K. McKeown, "Automatic summarization," Foundations and Trends in Information Retrieval, 5(2-3):103–233, 2011.
  2. K. Sparck Jones, "Automatic summarizing: factors and directions," Advances in Automatic Text Summarization, pp. 1–12, MIT Press, 1998.
  3. Y. Dong, "A Survey on Neural Network-Based Summarization Methods," arXiv preprint arXiv:1804.04589, 2018.

Extractive Summarization

  1. R. Mihalcea and P. Tarau, "TextRank: Bringing order into texts," In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
  2. G. Erkan and D. R. Radev, "LexRank: graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, v.22 n.1, p.457-479, July 2004.
  3. J. Jagadeesh, P. Pingali, and V. Varma, "Sentence Extraction Based Single Document Summarization," Workshop on Document Summarization, 19th and 20th March, 2005.
  4. H. P. Luhn, "Automatic creation of literature abstracts," IBM Journal, pages 159-165, 1958.
  5. M. G. Ozsoy, F. N. Alpaslan, and I. Cicekli, "Text summarization using Latent Semantic Analysis," Journal of Information Science, vol. 37, pp. 405-417, Aug. 2011.
  6. K. Woodsend, Y. Feng, and M. Lapata, "Title generation with quasi-synchronous grammar," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, p.513-523, October 09-11, 2010.
  7. R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents," In AAAI, 2017.

Abstractive Summarization

  1. M. Banko, V. O. Mittal, and M. J. Witbrock, "Headline Generation Based on Statistical Translation," In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 318–325. Association for Computational Linguistics, 2000.
  2. A. M. Rush, S. Chopra, and J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," In EMNLP, 2015.
  3. S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," In North American Chapter of the Association for Computational Linguistics, 2016.
  4. R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," In Computational Natural Language Learning, 2016.
  5. S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," CoRR, abs/1412.2007, 2014.

Combination

  1. A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," In ACL, 2017.
  2. N. Weber, L. Shekhar, N. Balasubramanian, and K. Cho, "Controlling Decoding for More Abstractive Summaries with Copy-Based Networks," arXiv preprint arXiv:1803.07038, 2018.
  3. R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," arXiv preprint arXiv:1705.04304, 2017.
  4. P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, "Generating Wikipedia by Summarizing Long Sequences," arXiv preprint arXiv:1801.10198, 2018.
  5. T. Baumel, M. Eyal, and M. Elhadad, "Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models," arXiv preprint arXiv:1801.07704, 2018.
  6. Y. Chen and M. Bansal, "Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting," In ACL, 2018.
  7. I. Cachola, K. Lo, A. Cohan, and D. S. Weld, "TLDR: Extreme Summarization of Scientific Documents"

Transfer Learning

  1. Y. Liu, "Fine-tune BERT for Extractive Summarization," arXiv preprint arXiv:1903.10318, 2019.
  2. H. Zhang, J. Xu, and J. Wang, "Pretraining-Based Natural Language Generation for Text Summarization," arXiv preprint arXiv:1902.09243, 2019.