The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

Related tags

Deep LearningPRIMER
Overview

PRIMER

The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization.

PRIMER is a pre-trained model for multi-document representation with focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. With extensive experiments on 6 multi-document summarization datasets from 3 different domains on the zero-shot, few-shot and full-supervised settings, PRIMER outperforms current state-of-the-art models on most of these settings with large margins.

Set up

  1. Create new virtual environment by
conda create --name primer python=3.7
conda activate primer
conda install cudatoolkit=10.0
  1. Install Longformer by
pip install git+https://github.com/allenai/longformer.git
  1. Install requirements to run the summarization scripts and data generation scripts by
pip install -r requirements.txt

Usage of PRIMER

  1. Download the pre-trained PRIMER model here to ./PRIMER_model
  2. Load the tokenizer and model by
from transformers import AutoTokenizer
from longformer import LongformerEncoderDecoderForConditionalGeneration
from longformer import LongformerEncoderDecoderConfig

tokenizer = AutoTokenizer.from_pretrained('./PRIMER_model/')
config = LongformerEncoderDecoderConfig.from_pretrained('./PRIMER_model/')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
            './PRIMER_model/', config=config)

Make sure the documents separated with <doc-sep> in the input.

Summarization Scripts

You can use script/primer_main.py for pre-train/train/test PRIMER, and script/compared_model_main.py for train/test BART/PEGASUS/LED.

Pre-training Data Generation

Newshead: we crawled the newshead dataset using the original code, and cleaned up the crawled data, the final newshead dataset can be found here.

You can use utils/pretrain_preprocess.py to generate pre-training data.

  1. Generate data with scores and entities with --mode compute_all_scores
  2. Generate pre-training data with --mode pretraining_data_with_score:
    • Pegasus: --strategy greedy --metric pegasus_score
    • Entity_Pyramid: --strategy greedy_entity_pyramid --metric pyramid_rouge

Datasets

  • For Multi-News and Multi-XScience, it will automatically download from Huggingface.
  • WCEP-10: the preprocessed version can be found here
  • Wikisum: we only use a small subset for few-shot training(10/100) and testing(3200). The subset we used can be found here. Note we have significantly more examples than we used in train.pt and valid.pt, as we sample 10/100 examples multiple times in the few-shot setting, and we need to make sure it has a large pool to sample from.
  • DUC2003/2004: You need to apply for access based on the instruction
  • arXiv: you can find the data we used in this repo
Code for You Only Cut Once: Boosting Data Augmentation with a Single Cut

You Only Cut Once (YOCO) YOCO is a simple method/strategy of performing augmenta

88 Dec 28, 2022
Official code of Team Yao at Multi-Modal-Fact-Verification-2022

Official code of Team Yao at Multi-Modal-Fact-Verification-2022 A Multi-Modal Fact Verification dataset released as part of the De-Factify workshop in

Wei-Yao Wang 11 Nov 15, 2022
Code to produce syntactic representations that can be used to study syntax processing in the human brain

Can fMRI reveal the representation of syntactic structure in the brain? The code base for our paper on understanding syntactic representations in the

Aniketh Janardhan Reddy 4 Dec 18, 2022
PyTorch experiments with the Zalando fashion-mnist dataset

zalando-pytorch PyTorch experiments with the Zalando fashion-mnist dataset Project Organization ├── LICENSE ├── Makefile - Makefile with co

Federico Baldassarre 31 Sep 25, 2021
CONetV2: Efficient Auto-Channel Size Optimization for CNNs

CONetV2: Efficient Auto-Channel Size Optimization for CNNs Exciting News! CONetV2: Efficient Auto-Channel Size Optimization for CNNs has been accepted

Mahdi S. Hosseini 3 Dec 13, 2021
competitions-v2

Codabench (formerly Codalab Competitions v2) Installation $ cp .env_sample .env $ docker-compose up -d $ docker-compose exec django ./manage.py migrat

CodaLab 21 Dec 02, 2022
Video-face-extractor - Video face extractor with Python

Python face extractor Setup Create the srcvideos and faces directories Put your

2 Feb 03, 2022
Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces(ICML 2021)

Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces(ICML 2021) This repository contains the code

149 Dec 15, 2022
Detectorch - detectron for PyTorch

Detectorch - detectron for PyTorch (Disclaimer: this is work in progress and does not feature all the functionalities of detectron. Currently only inf

Ignacio Rocco 558 Dec 23, 2022
Official Implementation of PCT

Official Implementation of PCT Prerequisites python == 3.8.5 Please make sure you have the following libraries installed: numpy torch=1.4.0 torchvisi

32 Nov 21, 2022
Analysis code and Latex source of the manuscript describing the conditional permutation test of confounding bias in predictive modelling.

Git repositoty of the manuscript entitled Statistical quantification of confounding bias in predictive modelling by Tamas Spisak The manuscript descri

PNI - Predictive Neuroimaging Lab, University Hospital Essen, Germany 0 Nov 22, 2021
The implementation of CVPR2021 paper Temporal Query Networks for Fine-grained Video Understanding, by Chuhan Zhang, Ankush Gupta and Andrew Zisserman.

Temporal Query Networks for Fine-grained Video Understanding 📋 This repository contains the implementation of CVPR2021 paper Temporal_Query_Networks

55 Dec 21, 2022
Concept drift monitoring for HA model servers.

{Fast, Correct, Simple} - pick three Easily compare training and production ML data & model distributions Goals Boxkite is an instrumentation library

98 Dec 15, 2022
Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression", TIP 2020

Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multil

Xuefeng 5 Jan 15, 2022
PyTorch trainer and model for Sequence Classification

PyTorch-trainer-and-model-for-Sequence-Classification After cloning the repository, modify your training data so that the training data is a .csv file

NhanTieu 2 Dec 09, 2022
Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

On the Generative Utility of Cyclic Conditionals This repository is the official implementation of "On the Generative Utility of Cyclic Conditionals"

Chang Liu 44 Nov 16, 2022
Collection of TensorFlow2 implementations of Generative Adversarial Network varieties presented in research papers.

TensorFlow2-GAN Collection of tf2.0 implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will

41 Apr 28, 2022
E2e music remastering system - End-to-end Music Remastering System Using Self-supervised and Adversarial Training

End-to-end Music Remastering System This repository includes source code and pre

Junghyun (Tony) Koo 37 Dec 15, 2022
PyTorch implementation for MINE: Continuous-Depth MPI with Neural Radiance Fields

MINE: Continuous-Depth MPI with Neural Radiance Fields Project Page | Video PyTorch implementation for our ICCV 2021 paper. MINE: Towards Continuous D

Zijian Feng 325 Dec 29, 2022
Code for classifying international patents based on the text of their titles/abstracts

Patent Classification Goal: To train a machine learning classifier that can automatically classify international patents downloaded from the WIPO webs

Prashanth Rao 1 Nov 08, 2022