Code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”

Last update: Nov 24, 2022

Overview

GATER

This repository contains the code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”.

Our implementation is built on the source code from keyphrase-generation-rl and fastNLP. Thanks for their work.

If you use this code, please cite our paper:

@inproceedings{ye2021heterogeneous,
  title={Heterogeneous Graph Neural Networks for Keyphrase Generation},
  author={Ye, Jiacheng and Cai, Ruijian and Gui, Tao and Zhang, Qi},
  booktitle={Proceedings of EMNLP},
  year={2021}
}

Dependency

python 3.5+
pytorch 1.0+
dgl 0.4.3
sentence_transformers 1.1.0
faiss 1.6.3

Dataset

The datasets can be downloaded from here, which are the tokenized version of the datasets provided by Ken Chen:

The testsets directory contains the five datasets for testing (i.e., inspec, krapivin, nus, and semeval and kp20k), where each of the datasets contains test_src.txt and test_trg.txt.
The kp20k_sorted directory contains the training and validation files (i.e., train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt).
Each line of the *_src.txt file is the source document, which contains the tokenized words of title <eos> abstract .
Each line of the *_trg.txt file contains the target keyphrases separated by an ; character. For example, each line can be like present keyphrase one;present keyphrase two;absent keyprhase one;absent keyphrase two.

Quick Start

The whole process includes the following steps:

Build tfidf: The retrievers/build_tfidf.py script is used to build the index for document retrieval.
Preprocessing: The preprocess.py script numericalizes the train_src.txt, train_trg.txt,valid_src.txt and valid_trg.txt files, and produces train.one2many.pt, valid.one2many.pt and vocab.pt.
Training: The train.py script loads the train.one2many.pt, valid.one2many.pt and vocab.pt file and performs training. We evaluate the model every 8000 batches on the valid set, and the model will be saved if the valid loss is lower than the previous one.
Decoding: The predict.py script loads the trained model and performs decoding on the five test datasets. The prediction file will be saved, which is like predicted keyphrase one;predicted keyphrase two;….
Evaluation: The evaluate_prediction.py script loads the ground-truth and predicted keyphrases, and calculates the [email protected]$ and [email protected]$ metrics.

For the sake of simplicity, we provide an one-click script in the script directory. You can run the following command to run the whole process with Gater model:

# under `One2One` paradigm
bash scripts/run_gater_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_gater_one2seq.sh

You can also run the baseline model with the following command:

# under `One2One` paradigm
bash scripts/run_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_one2seq.sh

Note:

Please download and unzip the datasets in the ./data directory first.
The Preprocessing procedure takes time because we have to pre-retrieve similiar references for each samples, and we also store them for the preparation of the training stage.
To run all the bash files smoothly, you may need to specify the correct home_dir (i.e., the absolute path to kg_gater dictionary) and the gpu id for CUDA_VISIBLE_DEVICES. We provide a small amount of data to quickly test whether your running environment is correct. You can test by running the following command:

bash scripts/run_small_gater_one2seq.sh

Code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”

Related tags

Overview

GATER

Dependency

Dataset

Quick Start

Owner

Jiacheng Ye

AgML is a comprehensive library for agricultural machine learning

A Keras implementation of CapsNet in the paper: Sara Sabour, Nicholas Frosst, Geoffrey E Hinton. Dynamic Routing Between Capsules

Set of methods to ensemble boxes from different object detection models, including implementation of "Weighted boxes fusion (WBF)" method.

This repository introduces a short project about Transfer Learning for Classification of MRI Images.

Code of the paper "Shaping Visual Representations with Attributes for Few-Shot Learning (ASL)".

EPSANet：An Efficient Pyramid Split Attention Block on Convolutional Neural Network

Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms applied on Continuous Control Tasks

Implementation of OpenAI paper with Simple Noise Scale on Fastai V2

A set of tests for evaluating large-scale algorithms for Wasserstein-2 transport maps computation.

Google Brain - Ventilator Pressure Prediction

Coursera - Quiz & Assignment of Coursera

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

Joint Gaussian Graphical Model Estimation: A Survey

Experiments with Fourier layers on simulation data.

RM Operation can equivalently convert ResNet to VGG, which is better for pruning; and can help RepVGG perform better when the depth is large.

Filtering variational quantum algorithms for combinatorial optimization

Angora is a mutation-based fuzzer. The main goal of Angora is to increase branch coverage by solving path constraints without symbolic execution.

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

kullanışlı ve işinizi kolaylaştıracak bir araç

Dirty Pixels: Towards End-to-End Image Processing and Perception