Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and code for paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

This project analyzes how adding the titles of cited references to the text of a scientific paper affects keyphrase extraction performance. It uses three datasets, SemEval-2010, PubMed and LIS-2000, which are located in the Dataset folder. Five methods are evaluated: two unsupervised methods (TF-IDF and TextRank) and three supervised methods (Naive Bayes, CRF and BiLSTM-CRF). The first four are traditional keyphrase extraction methods and are located in the ML folder; BiLSTM-CRF is a deep learning method and is located in the DL folder.
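To make the idea concrete, the sketch below ranks the words of one document by TF-IDF, once on the paper's own text and once with reference titles appended. It is a minimal illustration and not the repository's tf_idf.py: the function names, the word-level scoring (single words instead of phrases), and the toy documents are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def add_references(text, reference_titles):
    """Append reference titles to a paper's text (hypothetical helper)."""
    return " ".join([text] + list(reference_titles))


def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF over the whole corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    paper = "graph ranking for keyphrase extraction"
    other = "neural networks for image classification"
    refs = ["keyphrase extraction survey", "keyphrase extraction evaluation"]

    print(tf_idf_keywords([paper, other], 0))
    print(tf_idf_keywords([add_references(paper, refs), other], 0))
```

Words repeated in the reference titles gain term frequency, so they move up the ranking; that is the effect the experiments in this project measure at a larger scale.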

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run deep learning model
│  ml.bat:                            Batch commands to run traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of CRF algorithm implementation (uses the PyTorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of CRF algorithm implementation (uses the CRF++ Toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of Naive Bayes algorithm implementation (uses the KEA-3.0 Toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit
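The CRF and BiLSTM-CRF models treat keyphrase extraction as sequence labeling, which is why the DL folder carries a tags entry. The sketch below shows one common way to turn annotated keyphrases into per-token labels. The B/I/O scheme and the function name bio_labels are assumptions for illustration; the label set actually configured under datas/tags may differ.

```python
def bio_labels(tokens, keyphrases):
    """Label each token B (begins a keyphrase), I (inside one) or O (outside)."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in keyphrases:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(lowered) - n + 1):
            # Only label a span that matches the phrase and is still unlabeled.
            span_free = all(label == "O" for label in labels[start:start + n])
            if lowered[start:start + n] == words and span_free:
                labels[start] = "B"
                labels[start + 1:start + n] = ["I"] * (n - 1)
    return labels


if __name__ == "__main__":
    tokens = "We study keyphrase extraction with conditional random fields".split()
    phrases = ["keyphrase extraction", "conditional random fields"]
    for token, label in zip(tokens, bio_labels(tokens, phrases)):
        print(token, label)
```

A sequence labeling model is then trained to predict these labels from the tokens, and keyphrases are read back from each B followed by its run of I labels.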

Dataset Description

The dataset includes the following three JSON files:

  • SemEval-2010: The SemEval-2010 Task 5 dataset, which contains 244 scientific papers. It is available at https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

Each line of a JSON file includes the following fields:

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): The reference list; only the title of each reference is provided.
  • keywords (K): The keywords of the paper, annotated manually.
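The fields listed above can be read and combined as in the following sketch. It assumes one JSON object per line, as described, and makes no assumption about whether a field holds a string or a list of strings (both are handled). The helper names and the file path in the comment are hypothetical and are not taken from the repository.

```python
import json


def read_records(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def as_text(value):
    """Join list-valued fields into one string; convert anything else with str()."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(v) for v in value)
    return str(value)


def build_corpus_text(record, fields):
    """Concatenate the chosen fields of one paper, e.g. title + abstract + references."""
    return " ".join(as_text(record[f]) for f in fields if f in record)


# Typical use with a file (the path is an assumption, adjust it to your checkout):
#
#     with open("Dataset/LIS-2000.json", encoding="utf-8") as f:
#         records = read_records(f)
#     text = build_corpus_text(records[0], ["title", "abstract", "references"])
```

Choosing different field lists, for example with and without references, gives the different corpora that the experiments compare.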

Quick Start

To make the experimental results easier to reproduce, the project uses batch (.bat) files to run the programs in a uniform way (Windows only). The dl.bat file runs the deep learning model, and the ml.bat file runs the traditional algorithms.

How does it work?

On Windows, press Win + R and enter cmd to open a command prompt, then switch to the project's root directory (Keyphrase_Extraction). Enter dl.bat to run the deep learning model and obtain its keyphrase extraction results; enter ml.bat to run the traditional algorithms and obtain theirs.

Experimental results

The following tables show the influence of reference information on the keyphrase extraction results of TF-IDF, TextRank, NB, CRF and BiLSTM-CRF.

Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the SemEval-2010 dataset

Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the PubMed dataset

Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the LIS-2000 dataset

Note: The yellow, green and blue bold fonts in the tables mark the largest P, R and F1 values, respectively, obtained from different corpora using the same model.
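For readers unfamiliar with the P, R and F1 columns, the sketch below computes precision, recall and F1 for one paper from its predicted and manually annotated keyphrases. Exact matching after lowercasing is an assumption made here; the repository's evaluate.py may normalize phrases (for example by stemming) or average over papers differently.

```python
def precision_recall_f1(predicted, gold):
    """Return (P, R, F1) for one paper using case-insensitive exact matching."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["keyphrase extraction", "neural network", "TextRank", "citation"]
    gold = ["Keyphrase Extraction", "textrank", "reference information"]
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here two of the four predictions are correct and two of the three gold keyphrases are found, so P = 2/4, R = 2/3 and F1 = 4/7.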

Dependency packages

Before running this project, check that the following Python packages are installed in your runtime environment.

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2
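One way to check the list above is a small script that compares the installed versions with the listed ones. This is a sketch and not part of the repository. It assumes Python 3.8 or later (for importlib.metadata) and that PyTorch is installed under its PyPI distribution name, torch; the function name check_packages is made up for this example.

```python
from importlib import metadata

# Versions as listed in this README; "torch" is the PyPI name of PyTorch.
REQUIRED = {
    "torch": "1.7.1",
    "nltk": "3.5",
    "numpy": "1.19.2",
    "pandas": "1.1.3",
    "tqdm": "4.50.2",
}


def check_packages(required):
    """Return {name: (listed_version, installed_version or None if missing)}."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_packages(REQUIRED).items():
        status = "ok" if installed == wanted else "check"
        print(f"{name}: listed {wanted}, installed {installed} [{status}]")
```

A version that differs from the listed one is flagged only for review; other versions may still work, which this README does not state either way.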

Citation

Please cite the following paper if you use this code or dataset in your work.

Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]
