Entity Disambiguation as text extraction (ACL 2022)

Overview

ExtEnD: Extractive Entity Disambiguation

Python Python PyTorch plugin: spacy Code style: black

This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e. the task of linking a mention in context with its most suitable entity in a reference knowledge base) where we reformulate this task as a text extraction problem. This work was accepted at ACL 2022.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2021-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}

ExtEnD Image

ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend checking first its introduction, although it is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models:

Setup the environment

Requirements:

  • Debian-based (e.g. Debian, Ubuntu, ...) system
  • conda installed

To quickly setup the environment to use ExtEnd/replicate our experiments, you can use the bash script setup.sh. The only requirements needed here is to have a Debian-based system (Debian, Ubuntu, ...) and conda installed.

bash setup.sh

Checkpoints

We release the following checkpoints:

Model Training Dataset Avg Score
Longformer Large AIDA 85.8

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz

Data

All the datasets used to train and evaluate ExtEnD can be downloaded using the following script from the facebook GENRE repository.

We strongly recommend you organize them in the following structure under the data folder as it is used by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

can be any folder containing exactly 3 files:

  • train.aida
  • validation.aida
  • test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that can be any folder, as long as:

  • it contains these 3 files
  • they are in the same format as the files in data/aida

So if you want to train on these different datasets, just create the corresponding directory and you are ready to go!

In case you want to modify some training hyperparameter, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can take a look to the modifiable parameters by adding the parameter --print to the training command. You can find more on this in classy official documentation.

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing its path as an input argument. This will evaluate the model provided against both AIDA and OOD resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that respects the same format of the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend

spaCy

You can also use ExtEnD with spaCy, allowing you to use our system with a seamless interface that tackles full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you will be able to add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Footbal Team), (Syria, Syria National Footbal Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

  • <ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
  • <inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats for <inventory-path>:

  • tsv:
    $ head -1 <inventory-path>
    Rome \[TAB\] Rome City \[TAB\] Rome Football Team \[TAB\] Roman Empire \[TAB\] ...
    That is, <inventory-path> is a tab-separated file where, for each row, we have the mention (Rome) followed by its possible entities.
  • sqlite: a sqlite3 database with a candidate table with two columns:
    • mention (text PRIMARY KEY)
    • entities (text). This must be a tab-separated list of the corresponding entities.

We release 6 possible pre-computed <inventory-path> that you could use (we recommend creating a folder data/inventories/ and placing the files downloaded there inside, e.g., = data/inventories/le-and-titov-2018-inventory.min-count-2.sqlite3):

Inventory Number of Mentions Source
le-and-titov-2018-inventory.min-count-2.tsv 12090972 Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
[Recommended] le-and-titov-2018-inventory.min-count-2.sqlite3 12090972 Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
le-and-titov-2018-inventory.tsv 21571265 The candidate set released by Le and Titov (2018)
le-and-titov-2018-inventory.sqlite3 21571265 The candidate set released by Le and Titov (2018)

Note that, as far as you respect either of these two formats, you can also create and use your own inventory!

Docker container

Finally, we also release a docker image running two services, a streamlit demo and a REST service:

$ docker run -p 22001:22001 -p 22002:22002 --rm -itd poccio/extend:1.0.1
<container id>

Now you can:

  • checkout the streamlit demo at http://127.0.0.1:22001/
  • invoke the REST service running at http://127.0.0.1:22002/ (http://127.0.0.1:22002/docs you can find the OpenAPI documentation):
    $ curl -X POST http://127.0.0.1:22002/ -H 'Content-Type: application/json' -d '[{"text": "Rome is in Italy"}]'
    [{"text":"Rome is in Italy","disambiguated_entities":[{"char_start":0,"char_end":4,"mention":"Rome","entity":"Rome"},{"char_start":11,"char_end":16,"mention":"Italy","entity":"Italy"}]}]

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Sapienza NLP group
The NLP group at the Sapienza University of Rome
Sapienza NLP group
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Espresso Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning libra

Yiming Wang 919 Jan 03, 2023
NLPShala , the best IDE for all Natural language processing tasks.

The revolutionary IDE for all NLP (Natural language processing) stuffs on the internet.

Abhi 3 Aug 08, 2021
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
Concept Modeling: Topic Modeling on Images and Text

Concept is a technique that leverages CLIP and BERTopic-based techniques to perform Concept Modeling on images.

Maarten Grootendorst 120 Dec 27, 2022
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

DANeS - Open-source E-newspaper dataset Source: Technology vector created by macrovector - www.freepik.com. DANeS is an open-source E-newspaper datase

DATASET .JSC 64 Aug 17, 2022
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体,包括上市公司所属行业关系、行业上级关系、产品上游原材料关系、产品下游产品关系、公司主营产品、产品小类共6大类。 上市公司4,654家,行业511个,产品95,559条、上游材料56,824条,上级行业480条,下游产品390条,产品小类52,937条,所属行业3,946条。

liuhuanyong 415 Jan 06, 2023
XLNet: Generalized Autoregressive Pretraining for Language Understanding

Introduction XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective.

Zihang Dai 6k Jan 07, 2023
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
Asr abc - Automatic speech recognition(ASR),中文语音识别

语音识别的简单示例,主要在课堂演示使用 创建python虚拟环境 在linux 和macos 上验证通过 # 如果已经有pyhon3.6 环境,跳过该步骤,使用

LIyong.Guo 8 Nov 11, 2022
Intent parsing and slot filling in PyTorch with seq2seq + attention

PyTorch Seq2Seq Intent Parsing Reframing intent parsing as a human - machine translation task. Work in progress successor to torch-seq2seq-intent-pars

Sean Robertson 159 Apr 04, 2022
Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

SpeechMix Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together. Introduction For the same input: from datas

Eric Lam 31 Nov 07, 2022
This is the offline-training-pipeline for our project.

offline-training-pipeline This is the offline-training-pipeline for our project. We adopt the offline training and online prediction Machine Learning

0 Apr 22, 2022
This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe

Advent-of-cyber-2019-writeup This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe https://tryhackme.com/shivam007/badges/c

shivam danawale 5 Jul 17, 2022
A NLP program: tokenize method, PoS Tagging with deep learning

IRIS NLP SYSTEM A NLP program: tokenize method, PoS Tagging with deep learning Report Bug · Request Feature Table of Contents About The Project Built

Zakaria 7 Dec 13, 2022
:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

R²SQL The PyTorch implementation of paper Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. (AAAI 2021) Requirement

huybery 60 Dec 31, 2022
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

♗CσNϙUҽRσR_MҽSƙEƚҽҽR 1 Oct 31, 2021
Pytorch implementation of Tacotron

Tacotron-pytorch A pytorch implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model. Requirements Install python 3 Install pytorc

soobin seo 203 Dec 02, 2022
A Telegram bot to add notes to Flomo.

flomo bot 使用 Telegram 机器人发送笔记到你的 Flomo. 你需要有一台可访问 Telegram 的服务器。 Steps @BotFather 新建机器人,获取 token Flomo 官网获取 API,链接 https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022