
Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS)

The code for the ACL 2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources (Paper).

Some code is borrowed from PreSumm (https://github.com/nlpyang/PreSumm).

Environments

Python version: this code is written for Python 3.7

Package Requirements: torch==1.1.0, transformers, tensorboardX, multiprocess, pyrouge

A few changes are needed for compatibility with torch 1.4.0–1.8.0, mainly tensor-type (bool) bugs.
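For reference, the incompatibility is usually of the following kind (a minimal sketch, not code from this repo): newer torch versions expect boolean masks where torch 1.1.0 code often builds uint8 (byte) masks.

import torch

scores = torch.randn(2, 5)
pad_mask = torch.tensor([[0, 0, 1, 1, 1],
                         [0, 0, 0, 0, 1]], dtype=torch.uint8)

# torch >= 1.2 warns or errors on byte masks; casting to bool restores compatibility.
scores = scores.masked_fill(pad_mask.bool(), float("-inf"))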

Data Preparation

To improve training efficiency, we preprocess both the concatenated dataset (with target "monolingual summary + [LSEP] + cross-lingual summary") and the normal dataset (with target "cross-lingual summary") in advance.
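To make the two target formats concrete, here is a purely illustrative sketch (the example sentences are made up; [LSEP] is the separator token mentioned above):

mono_summary = "the president met his counterpart on monday"
cross_summary = "der praesident traf seinen amtskollegen am montag"

concatenated_target = mono_summary + " [LSEP] " + cross_summary  # MCLAS-style target
normal_target = cross_summary                                    # NCLS-style target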

You can build your own dataset or download our preprocessed dataset.

Download the preprocessed datasets.

  1. En2De dataset: Google Drive Link.
  2. En2EnDe (concatenated) dataset: Google Drive Link.
  3. En2Zh dataset: Google Drive Link.
  4. En2EnZh (concatenated) dataset: Google Drive Link.

PS: Our implementation filters out some invalid samples (those whose target is too short). Hence the number of training samples may be smaller than what is reported in the paper.
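A minimal sketch of that kind of filtering (the threshold and field names below are assumptions, not the repo's actual values):

MIN_TGT_TOKENS = 5  # assumed threshold; the repo's actual value may differ

samples = [
    {"src": "a long source article ...", "tgt": "a reasonably long reference summary here"},
    {"src": "another article ...", "tgt": "too short"},
]
filtered = [s for s in samples if len(s["tgt"].split()) >= MIN_TGT_TOKENS]
print(len(filtered))  # keeps only samples whose target is long enough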

Build Your Own Dataset.

This part remains to be organized. Some of the code still needs debugging, so please use it carefully.

Build tokenized files.

Please refer to the functions tokenize_xgiga() or tokenize_new() in ./src/data_builder.py and write your own code to preprocess your training, validation, and test datasets. Then run the following command:

python preprocess.py -mode tokenize_xgiga -raw_path PATH_TO_YOUR_RAW_DATA -save_path PATH_TO_YOUR_SAVE_PATH
  • Stanford CoreNLP needs to be installed.

Please substitute tokenize_xgiga with your own processing function.
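If you need a starting point, below is a minimal sketch of such a function, modeled on PreSumm's CoreNLP-based preprocessing; the paths, file extension, and options are assumptions that you should adapt to your own data.

import os
import subprocess

def tokenize_my_corpus(raw_dir, save_dir):
    """Tokenize every *.story file in raw_dir with Stanford CoreNLP, writing JSON to save_dir."""
    stories = [f for f in os.listdir(raw_dir) if f.endswith(".story")]
    # CoreNLP reads the files listed in a mapping file, one path per line.
    with open("mapping_for_corenlp.txt", "w") as f:
        for s in stories:
            f.write(os.path.join(raw_dir, s) + "\n")
    command = [
        "java", "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit",
        "-ssplit.newlineIsSentenceBreak", "always",
        "-filelist", "mapping_for_corenlp.txt",
        "-outputFormat", "json",
        "-outputDirectory", save_dir,
    ]
    subprocess.call(command)
    os.remove("mapping_for_corenlp.txt")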

In our case, the raw data directory is organized as follows:

.
└── raw_directory
    ├── train
    |   ├── 1.story
    |   ├── 2.story
    |   ├── 3.story
    |   └── ...
    ├── test
    |   ├── 1.story
    |   ├── 2.story
    |   ├── 3.story
    |   └── ...
    └─ dev
        ├── 1.story
        ├── 2.story
        ├── 3.story
        └── ...

Correspondingly, the tokenized data directory is organized as follows:

.
└── tokenized_directory
    ├── train
    |   ├── 1.story.json
    |   ├── 2.story.json
    |   ├── 3.story.json
    |   └── ...
    ├── test
    |   ├── 1.story.json
    |   ├── 2.story.json
    |   ├── 3.story.json
    |   └── ...
    └─ dev
        ├── 1.story.json
        ├── 2.story.json
        ├── 3.story.json
        └── ...

Convert tokenized files to JSON files.

python preprocess.py -mode format_to_lines_new -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH -shard_size 3000

The shard size is quite important and needs to be selected carefully: this implementation uses a shard as the base data unit for low-resource training. In our setting, the shard sizes of En2Zh, Zh2En, and En2De are 1.5k, 5k, and 3k, respectively.
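As a rough illustration (not repo code), the amount of parallel data seen in a low-resource run is roughly few_shot_rate × shard_size:

shard_sizes = {"En2Zh": 1500, "Zh2En": 5000, "En2De": 3000}
for direction, shard_size in shard_sizes.items():
    for few_shot_rate in (1, 5, 10):
        print(direction, few_shot_rate, "->", few_shot_rate * shard_size, "samples (approx.)")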

Convert JSON files to PyTorch (.pt) files.

python preprocess.py -mode format_to_bert_new -raw_path JSON_PATH -save_path BERT_DATA_PATH  -lower -n_cpus 1 -log_file ../logs/preprocess.log
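To sanity-check the output, you can load one of the produced shards; the exact field names depend on format_to_bert_new, and the file pattern below is an assumption based on PreSumm-style naming:

import glob
import torch

pt_files = sorted(glob.glob("BERT_DATA_PATH/*.pt"))  # adjust to your -save_path
shard = torch.load(pt_files[0])
print(type(shard), len(shard))   # PreSumm-style shards are lists of example dicts
print(sorted(shard[0].keys()))   # inspect the fields of one example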

Model Training

Full dataset scenario training

To train our model in the full-dataset scenario, please use the following command. Change the data path to switch the trained model between NCLS and MCLAS.

When using NCLS-type datasets, the argument --multi_task enables training the NCLS+MS model.

 python train.py  \
 -task abs -mode train \
 -temp_dir ../tmp \
 -bert_data_path PATH_TO_DATA/ncls \
 -dec_dropout 0.2  \
 -model_path ../model_abs_en2zh_noseg \
 -sep_optim true \
 -lr_bert 0.005 -lr_dec 0.2 \
 -save_checkpoint_steps 5000 \
 -batch_size 1300 \
 -train_steps 400000 \
 -report_every 50 -accum_count 5 \
 -use_bert_emb true -use_interval true \
 -warmup_steps_bert 20000 -warmup_steps_dec 10000 \
 -max_pos 512 -visible_gpus 0  -max_length 1000 -max_tgt_len 1000 \
 -log_file ../logs/abs_bert_en2zh  
 # --multi_task

Low-resource scenario training

Monolingual summarization pretraining

First, we should train a monolingual summarization model using the following command:

You can change the trained model type using the same methods mentioned above (change the dataset or --multi_task).

python train.py  \
-task abs -mode train \
-dec_dropout 0.2  \
-model_path ../model_abs_en2en_de/ \
-bert_data_path PATH_TO_DATA/xgiga.en \
-temp_dir ../tmp \
-sep_optim true \
-lr_bert 0.002 -lr_dec 0.2 \
-save_checkpoint_steps 2000 \
-batch_size 210 \
-train_steps 200000 \
-report_every 50 -accum_count 5 \
-use_bert_emb true -use_interval true \
-warmup_steps_bert 25000 -warmup_steps_dec 15000 \
-max_pos 512 -visible_gpus 0,1,2 -max_length 1000 -max_tgt_len 1000 \
-log_file ../logs/abs_bert_mono_enen_de \
--train_first  

# -train_from is used to continue training from a certain checkpoint.
# example:
# -train_from ../model_abs_en2en_de/model_step_70000.pt \
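Before resuming with -train_from, it can help to take a quick look at the checkpoint you are pointing to; a small sketch (the key names are assumptions based on PreSumm-style checkpoints):

import torch

ckpt = torch.load("../model_abs_en2en_de/model_step_70000.pt", map_location="cpu")
print(ckpt.keys())  # PreSumm-style checkpoints store the model state dict plus the training opts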

Low-resource scenario fine-tuning

After obtaining the monolingual model, we use it to initialize the low-resource models and continue the training process.

Note:

-train_from should be omitted if you want to train a model without monolingual initialization.

--new_optim is necessary since we need to restart warm-up and learning rate decay during this process.

--few_shot controls whether to use limited resources to train the model. Meanwhile, -few_shot_rate controls the amount of data you want to use; more specifically, the number of dataset chunks (shards).

For the scenarios in our paper (using our preprocessed dataset), few_shot_rate is set to 1, 5, and 10.
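You can check how many training chunks your preprocessed data actually contains, and hence which few_shot_rate values make sense; the file naming pattern below is an assumption based on PreSumm-style output:

import glob

train_chunks = sorted(glob.glob("PATH_TO_YOUR_DATA/xgiga.en.train.*.pt"))  # assumed naming pattern
print(len(train_chunks), "training chunks available; few_shot_rate should not exceed this")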

python train.py  \
-task abs -mode train \
-dec_dropout 0.2  \
-model_path ../model_abs_enende_fewshot1/ \
-train_from ../model_abs_en2en_de/model_step_50000.pt \
-bert_data_path PATH_TO_YOUR_DATA/xgiga.en \
-temp_dir ../tmp \
-sep_optim true \
-lr_bert 0.002 -lr_dec 0.2 \
-save_checkpoint_steps 1000 \
-batch_size 270 \
-train_steps 10000 \
-report_every 50 -accum_count 5 \
-use_bert_emb true -use_interval true \
-warmup_steps_bert 25000 -warmup_steps_dec 15000 \
-max_pos 512 -visible_gpus 0,2,3 -max_length 1000 -max_tgt_len 1000 \
-log_file ../logs/abs_bert_enende_fewshot1 \
--few_shot -few_shot_rate 1 --new_optim

Model Evaluation

To evaluate a model, use the following command:

python train.py -task abs \
-mode validate \
-batch_size 5 \
-test_batch_size 5 \
-temp_dir ../tmp \
-bert_data_path PATH_TO_YOUR_DATA/xgiga.en \
-log_file ../results/val_abs_bert_enende_fewshot1_noinit \
-model_path ../model_abs_enende_fewshot1_noinit -sep_optim true \
-use_interval true -visible_gpus 1 \
-max_pos 512 -max_length 150 \
-alpha 0.95 -min_length 20 \
-max_tgt_len 1000 \
-result_path ../logs/abs_bert_enende_fewshot1 -test_all \
--predict_2language

If you are not evaluating an MCLAS model, please remove --predict_2language.

If you are predicting Chinese summaries, please add --predict_chinese to the command.

If you are evaluating an NCLS+MS model, please add --multi_task to the command.

Using the following two arguments will slightly improve all models' performance.

--language_limit means that the predictor will only predict words appearing in the summaries of the training data.

--tgt_mask is a list recording all the words appearing in the summaries of the training set. We provide Chinese and English dictionaries in the ./src directory.
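If you want to build such a word list for your own training set, here is a minimal sketch (the file names and on-disk format are assumptions; the provided dictionaries in ./src may use a different format):

import json

# Collect every word that appears in the training-set summaries.
tgt_words = set()
with open("train.tgt.txt", encoding="utf-8") as f:   # assumed: one reference summary per line
    for line in f:
        tgt_words.update(line.strip().split())

with open("my_tgt_mask.json", "w", encoding="utf-8") as f:
    json.dump(sorted(tgt_words), f, ensure_ascii=False)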

Other Notable Commands

Please ignore these arguments; they were added and then abandoned while trying out new ideas. I will delete the related code in the future.

  • --sep_decoder
  • --few_sep_decoder
  • --tgt_seg
  • -bart

Besides, --batch_verification is used for debugging; it prints all the attributes in a training batch.
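The same kind of inspection can also be done by hand in a debugger or a small script; a hypothetical sketch, assuming the batch object stores its tensors as instance attributes:

def print_batch_attributes(batch):
    """Print every attribute of a training batch, e.g. src, tgt, masks, segment ids."""
    for name, value in vars(batch).items():
        shape = getattr(value, "shape", None)
        print(name, shape if shape is not None else type(value))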

Owner

Yu Bai (https://ybai-nlp.github.io/)