Entity Disambiguation as text extraction (ACL 2022)

Overview

ExtEnD: Extractive Entity Disambiguation

Python Python PyTorch plugin: spacy Code style: black

This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e. the task of linking a mention in context with its most suitable entity in a reference knowledge base) where we reformulate this task as a text extraction problem. This work was accepted at ACL 2022.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2021-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}

ExtEnD Image

ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend checking first its introduction, although it is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models:

Setup the environment

Requirements:

  • Debian-based (e.g. Debian, Ubuntu, ...) system
  • conda installed

To quickly setup the environment to use ExtEnd/replicate our experiments, you can use the bash script setup.sh. The only requirements needed here is to have a Debian-based system (Debian, Ubuntu, ...) and conda installed.

bash setup.sh

Checkpoints

We release the following checkpoints:

Model Training Dataset Avg Score
Longformer Large AIDA 85.8

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz

Data

All the datasets used to train and evaluate ExtEnD can be downloaded using the following script from the facebook GENRE repository.

We strongly recommend you organize them in the following structure under the data folder as it is used by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

can be any folder containing exactly 3 files:

  • train.aida
  • validation.aida
  • test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that can be any folder, as long as:

  • it contains these 3 files
  • they are in the same format as the files in data/aida

So if you want to train on these different datasets, just create the corresponding directory and you are ready to go!

In case you want to modify some training hyperparameter, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can take a look to the modifiable parameters by adding the parameter --print to the training command. You can find more on this in classy official documentation.

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing its path as an input argument. This will evaluate the model provided against both AIDA and OOD resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that respects the same format of the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend

spaCy

You can also use ExtEnD with spaCy, allowing you to use our system with a seamless interface that tackles full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you will be able to add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Footbal Team), (Syria, Syria National Footbal Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

  • <ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
  • <inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats for <inventory-path>:

  • tsv:
    $ head -1 <inventory-path>
    Rome \[TAB\] Rome City \[TAB\] Rome Football Team \[TAB\] Roman Empire \[TAB\] ...
    That is, <inventory-path> is a tab-separated file where, for each row, we have the mention (Rome) followed by its possible entities.
  • sqlite: a sqlite3 database with a candidate table with two columns:
    • mention (text PRIMARY KEY)
    • entities (text). This must be a tab-separated list of the corresponding entities.

We release 6 possible pre-computed <inventory-path> that you could use (we recommend creating a folder data/inventories/ and placing the files downloaded there inside, e.g., = data/inventories/le-and-titov-2018-inventory.min-count-2.sqlite3):

Inventory Number of Mentions Source
le-and-titov-2018-inventory.min-count-2.tsv 12090972 Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
[Recommended] le-and-titov-2018-inventory.min-count-2.sqlite3 12090972 Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
le-and-titov-2018-inventory.tsv 21571265 The candidate set released by Le and Titov (2018)
le-and-titov-2018-inventory.sqlite3 21571265 The candidate set released by Le and Titov (2018)

Note that, as far as you respect either of these two formats, you can also create and use your own inventory!

Docker container

Finally, we also release a docker image running two services, a streamlit demo and a REST service:

$ docker run -p 22001:22001 -p 22002:22002 --rm -itd poccio/extend:1.0.1
<container id>

Now you can:

  • checkout the streamlit demo at http://127.0.0.1:22001/
  • invoke the REST service running at http://127.0.0.1:22002/ (http://127.0.0.1:22002/docs you can find the OpenAPI documentation):
    $ curl -X POST http://127.0.0.1:22002/ -H 'Content-Type: application/json' -d '[{"text": "Rome is in Italy"}]'
    [{"text":"Rome is in Italy","disambiguated_entities":[{"char_start":0,"char_end":4,"mention":"Rome","entity":"Rome"},{"char_start":11,"char_end":16,"mention":"Italy","entity":"Italy"}]}]

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Sapienza NLP group
The NLP group at the Sapienza University of Rome
Sapienza NLP group
CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985 赛题描述详见:https://www.datafountain.cn/competitions/474 文件说明 data: 存放训练数据和测试数据以及预处理代码 model_bert.py: 网络模型结构定义 adv_train

shuo 40 Sep 28, 2022
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE Baseline Korean(한국어) KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark. See our paper fo

74 Dec 13, 2022
BERN2: an advanced neural biomedical namedentity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
This repository is home to the Optimus data transformation plugins for various data processing needs.

Transformers Optimus's transformation plugins are implementations of Task and Hook interfaces that allows execution of arbitrary jobs in optimus. To i

Open Data Platform 37 Dec 14, 2022
Text editor on python to convert english text to malayalam(Romanization/Transiteration).

Manglish Text Editor This is a simple transiteration (romanization ) program which is used to convert manglish to malayalam (converts njaan to ഞാൻ ).

Merin Rose Tom 1 May 11, 2022
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

Universal Adversarial Triggers for Attacking and Analyzing NLP This is the official code for the EMNLP 2019 paper, Universal Adversarial Triggers for

Eric Wallace 248 Dec 17, 2022
This is a project built for FALLABOUT2021 event under SRMMIC, This project deals with NLP poetry generation.

FALLABOUT-SRMMIC 21 POETRY-GENERATION HINGLISH DESCRIPTION We have developed a NLP(natural language processing) model which automatically generates a

7 Sep 28, 2021
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms

FNet: Mixing Tokens with Fourier Transforms Pytorch implementation of Fnet : Mixing Tokens with Fourier Transforms. Citation: @misc{leethorp2021fnet,

Rishikesh (ऋषिकेश) 217 Dec 05, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
BiNE: Bipartite Network Embedding

BiNE: Bipartite Network Embedding This repository contains the demo code of the paper: BiNE: Bipartite Network Embedding. Ming Gao, Leihui Chen, Xiang

leihuichen 214 Nov 24, 2022
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022
Code to reprudece NeurIPS paper: Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Accelerated Sparse Neural Training: A Provable and Efficient Method to FindN:M Transposable Masks Recently, researchers proposed pruning deep neural n

itay hubara 4 Feb 23, 2022
NLP Core Library and Model Zoo based on PaddlePaddle 2.0

PaddleNLP 2.0拥有丰富的模型库、简洁易用的API与高性能的分布式训练的能力,旨在为飞桨开发者提升文本建模效率,并提供基于PaddlePaddle 2.0的NLP领域最佳实践。

6.9k Jan 01, 2023
Quick insights from Zoom meeting transcripts using Graph + NLP

Transcript Analysis - Graph + NLP This program extracts insights from Zoom Meeting Transcripts (.vtt) using TigerGraph and NLTK. In order to run this

Advit Deepak 7 Sep 17, 2022
Utilizing RBERT model for KLUE Relation Extraction task

RBERT for Relation Extraction task for KLUE Project Description Relation Extraction task is one of the task of Korean Language Understanding Evaluatio

snoop2head 14 Nov 15, 2022
Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Lau 1 Dec 17, 2021
:P Some basic stuff I'm gonna use for my upcoming Agile Software Development and Devops

reverse-image-search-py bash script.sh img_name.jpg Requirements pip install requests pip install pyshorteners Dry run [ Sudhanva M 3 Dec 18, 2021

Japanese NLP Library

Japanese NLP Library Back to Home Contents 1 Requirements 1.1 Links 1.2 Install 1.3 History 2 Libraries and Modules 2.1 Tokenize jTokenize.py 2.2 Cabo

Pulkit Kathuria 144 Dec 27, 2022