Text Summarization

WCN — Weighted Contextual N-gram method for evaluation of Text Summarization

In this project, I fine-tune a T5 model on the Extreme Summarization (XSum) dataset, achieving a ROUGE-2 F-score of 9.5% on the test data. I then discuss the drawbacks of n-gram-based metrics as well as of contextual word metrics.

Finally, I propose the Weighted Contextual N-gram (WCN) method, an alternative metric that can be more effective for evaluating text generation tasks.
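
To make the n-gram drawback concrete, here is a toy example (not from the project notebook) showing how ROUGE-2 can score a meaning-preserving paraphrase far below a near-copy that contradicts the reference. It assumes the Hugging Face evaluate library; the sentences are made up purely for illustration.

```python
# Toy illustration (not from the notebook) of why bigram overlap can mislead.
import evaluate

rouge = evaluate.load("rouge")
reference = ["uk inflation fell to its lowest level in two years"]

# A faithful paraphrase: almost no shared bigrams, so ROUGE-2 is near zero.
paraphrase = ["british price growth dropped to a two-year low"]

# A near-copy that flips the meaning: most bigrams match, so ROUGE-2 is high.
contradiction = ["uk inflation rose to its lowest level in two years"]

print(rouge.compute(predictions=paraphrase, references=reference)["rouge2"])
print(rouge.compute(predictions=contradiction, references=reference)["rouge2"])
```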

The complete documentation of the project can be found here.

Dataset

I use the Extreme Summarization (XSum) dataset. The dataset can be downloaded from here.

The dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (i.e., the summary), which is professionally written, typically by the author of the article.

There are two features in this dataset:
(1) document: the input news article.
(2) summary: a one-sentence summary of the article.

The goal is to generate a short, one-sentence news summary that answers the question "What is the article about?". In total there are about 226k samples: 204,045 for training, 11,332 for validation, and 11,334 for testing. The average document contains 431.07 words (19.77 sentences), and the average summary contains 23.26 words.
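
As a quick sanity check, the dataset and the split sizes quoted above can be inspected with the Hugging Face datasets library. This is a minimal sketch, not the notebook's exact loading code; depending on the library version, the dataset identifier or loading options may differ.

```python
# Load XSum and verify the split sizes and features described above.
from datasets import load_dataset

# Depending on the `datasets` version, the identifier may instead be
# "EdinburghNLP/xsum" or require trust_remote_code=True.
xsum = load_dataset("xsum")

# Expected split sizes: 204,045 train / 11,332 validation / 11,334 test.
for split in ("train", "validation", "test"):
    print(split, len(xsum[split]))

# Each example pairs a full BBC article with its one-sentence summary.
example = xsum["train"][0]
print(example["document"][:300])
print(example["summary"])
```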

Code

The source code for this project can be found at text_summarization.ipynb.
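
For orientation, below is a minimal sketch of what the fine-tuning pipeline in the notebook roughly looks like, assuming the Hugging Face transformers and datasets libraries. The model size, sequence lengths, and hyperparameters here are illustrative assumptions, not the notebook's exact settings.

```python
# Minimal T5 fine-tuning sketch for XSum (illustrative configuration only).
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

xsum = load_dataset("xsum")
checkpoint = "t5-small"  # assumption: the notebook may use a larger T5 variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # T5 is a text-to-text model, so the summarization task is signalled
    # with a "summarize: " prefix on the input article.
    model_inputs = tokenizer(
        ["summarize: " + doc for doc in batch["document"]],
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = xsum.map(
    preprocess, batched=True, remove_columns=xsum["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="t5-xsum",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

ROUGE-2 on the test split can then be computed by generating summaries with the fine-tuned model and passing the predictions and reference summaries to evaluate.load("rouge").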
