ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

This repo is for our paper "Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization". Our implementation is built on top of the Hugging Face Transformers seq2seq examples; see https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

Local Setup

Tested with Python 3.7 in a virtual environment. Clone the repo, go to the repo folder, set up the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Install apex

Based on the recommendation from Hugging Face, both fine-tuning and evaluation are 30% faster with --fp16. For that you need to install apex:

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
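
A quick, optional sanity check that apex imports cleanly (assumes the CUDA extensions above built without errors):

$ python -c "from apex import amp; print('apex OK')"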

Data

Create a directory named data for the datasets used in this work:

$ mkdir data

CNN/DM

$ wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
$ tar -xzvf cnn_dm_v2.tgz
$ mv cnn_cln data/cnndm

XSUM

$ wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
$ tar -xzvf xsum.tar.gz
$ mv xsum data/xsum
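
After extraction, each dataset directory should contain the aligned plain-text source/target files used by the Hugging Face seq2seq examples (layout shown for XSUM; CNN/DM is analogous):

data/xsum
├── train.source   # one article per line
├── train.target   # one reference summary per line
├── val.source
├── val.target
├── test.source
└── test.target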

Generate Augmented Dataset

$ python generate_augmentation.py \
    --dataset xsum \
    --n 5 \
    --augmentation1 randomdelete \
    --augmentation2 randomswap
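
The script pairs two sentence-level augmentations per document. For intuition, here is a minimal sketch of what such operators typically do (hypothetical code; the actual implementations live in generate_augmentation.py):

import random

def random_delete(sentences, p=0.1):
    # drop each sentence with probability p, keeping at least one
    kept = [s for s in sentences if random.random() > p]
    return kept or [random.choice(sentences)]

def random_swap(sentences, n_swaps=1):
    # swap n_swaps random pairs of sentence positions
    out = list(sentences)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out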

Training

CNN/DM

Our model is warm-started from sshleifer/distilbart-cnn-12-6:

$ DATA_DIR=./data/cnndm-augmented/RandominsertionRandominsertion-NumSent-3
$ OUTPUT_DIR=./log/cnndm

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-cnn-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder_layer 6 \
  --prediction_loss_only \
  --fp16
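
The --alpha and --temperature flags control the contrastive objective. For intuition, a minimal NT-Xent-style sketch over paired encoder embeddings of two augmented views (hypothetical code; the actual loss and how alpha weights the two terms are defined in cl_finetune_trainer.py):

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same inputs
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    # view i in z1 is positive with view i in z2, and vice versa
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(0, batch)]).to(z.device)
    return F.cross_entropy(sim, targets)

# hypothetical weighting; see cl_finetune_trainer.py for the real combination
# loss = ce_loss + alpha * nt_xent_loss(z1, z2, temperature=0.5)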

XSUM

$ DATA_DIR=./data/xsum-augmented/RandomdeleteRandomswap-NumSent-3
$ OUTPUT_DIR=./log/xsum

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-xsum-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder \
  --prediction_loss_only \
  --fp16

Evaluation

We have released checkpoints for the pre-trained models described in the paper.

CNN/DM

CNN/DM requires an extra postprocessing step.

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name sshleifer/distilbart-cnn-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
    
$ python postprocess_cnndm.py \
    --src_file $OUTPUT_DIR/test_generations.txt \
    --tgt_file $DATA_DIR/test.target
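
One way to score the postprocessed generations is the rouge_score package (a sketch, assuming generations and references are aligned line by line; run_distributed_eval.py also reports metrics on its own):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

with open("output/cnndm/test_generations.txt") as f_hyp, \
     open("data/cnndm/test.target") as f_ref:
    scores = [scorer.score(ref.strip(), hyp.strip())
              for hyp, ref in zip(f_hyp, f_ref)]

for key in ["rouge1", "rouge2", "rougeL"]:
    mean_f = sum(s[key].fmeasure for s in scores) / len(scores)
    print(f"{key}: {mean_f:.4f}")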

XSUM

$ export DATA=xsum
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=3  run_distributed_eval.py \
    --model_name sshleifer/distilbart-xsum-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR