PRIMER
The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization.
PRIMER is a pre-trained model for multi-document representation with a focus on summarization, which reduces the need for dataset-specific architectures and large amounts of labeled fine-tuning data. In extensive experiments on 6 multi-document summarization datasets from 3 different domains, covering the zero-shot, few-shot, and fully-supervised settings, PRIMER outperforms current state-of-the-art models in most of these settings by large margins.
Setup
- Create a new virtual environment by
conda create --name primer python=3.7
conda activate primer
conda install cudatoolkit=10.0
- Install Longformer by
pip install git+https://github.com/allenai/longformer.git
- Install the requirements needed to run the summarization and data generation scripts by
pip install -r requirements.txt
Usage of PRIMER
- Download the pre-trained PRIMER model here and place it under ./PRIMER_model
- Load the tokenizer and model by
from transformers import AutoTokenizer
from longformer import LongformerEncoderDecoderForConditionalGeneration
from longformer import LongformerEncoderDecoderConfig
tokenizer = AutoTokenizer.from_pretrained('./PRIMER_model/')
config = LongformerEncoderDecoderConfig.from_pretrained('./PRIMER_model/')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
'./PRIMER_model/', config=config)
Make sure the documents in the input are separated with <doc-sep>.
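For example, a minimal generation sketch building on the snippet above could look like the following; the example documents, the 4096-token truncation length, and the beam-search settings are illustrative assumptions rather than values prescribed by this repo (see script/primer_main.py for how the scripts actually prepare inputs, e.g. global attention on the <doc-sep> tokens as described in the paper).
# Illustrative sketch: the documents, max_length=4096 and the decoding settings
# below are assumptions; check script/primer_main.py for the settings used.
docs = [
    "First source document about the event ...",
    "Second source document covering the same event ...",
]
input_text = " <doc-sep> ".join(docs)  # separate the documents with <doc-sep>

inputs = tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))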
Summarization Scripts
You can use script/primer_main.py to pre-train/train/test PRIMER, and script/compared_model_main.py to train/test BART/PEGASUS/LED.
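The exact arguments (model/data paths, pre-train/train/test mode, hyperparameters) are defined in each script's argument parser rather than listed here; a safe way to discover them is argparse's built-in help:
python script/primer_main.py --help
python script/compared_model_main.py --help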
Pre-training Data Generation
Newshead: we crawled the NewSHead dataset using the original code and cleaned up the crawled data; the final NewSHead dataset can be found here.
You can use utils/pretrain_preprocess.py to generate the pre-training data.
- Generate data with scores and entities with --mode compute_all_scores
- Generate pre-training data with --mode pretraining_data_with_score, using one of the following strategies (example commands follow this list):
  - Pegasus: --strategy greedy --metric pegasus_score
  - Entity_Pyramid: --strategy greedy_entity_pyramid --metric pyramid_rouge
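For instance, a two-step sketch of the Entity_Pyramid pipeline could look like the commands below; only --mode, --strategy and --metric come from this README, while the remaining arguments (input and output locations, elided with ...) depend on the script's other options.
# Step 1: compute salience scores and entities over the crawled NewSHead data
python utils/pretrain_preprocess.py --mode compute_all_scores ...
# Step 2: build the pre-training data with the Entity_Pyramid strategy
python utils/pretrain_preprocess.py --mode pretraining_data_with_score \
    --strategy greedy_entity_pyramid --metric pyramid_rouge ...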
Datasets
- Multi-News and Multi-XScience will be downloaded automatically from Hugging Face (see the loading sketch after this list).
- WCEP-10: the preprocessed version can be found here
- Wikisum: we only use a small subset for few-shot training (10/100 examples) and testing (3200 examples). The subset we used can be found here. Note that train.pt and valid.pt contain significantly more examples than we actually use, since we sample the 10/100 examples multiple times in the few-shot setting and need a large pool to sample from.
- DUC2003/2004: you need to apply for access based on the instructions.
- arXiv: you can find the data we used in this repo
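As a standalone illustration of the Hugging Face route mentioned above (the summarization scripts in this repo perform the download themselves), the sketch below assumes the public dataset IDs multi_news and multi_x_science_sum:
from datasets import load_dataset

# Assumed dataset IDs on the Hugging Face Hub; used here only for quick inspection.
multi_news = load_dataset("multi_news")
multi_xscience = load_dataset("multi_x_science_sum")
print(multi_news["train"][0].keys())
print(multi_xscience["train"][0].keys())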