Entity Disambiguation as text extraction (ACL 2022)

Overview

ExtEnD: Extractive Entity Disambiguation


This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e. the task of linking a mention in context with its most suitable entity in a reference knowledge base) where we reformulate this task as a text extraction problem. This work was accepted at ACL 2022.

If you find our paper, code or framework useful, please cite this work in your paper:

@inproceedings{barba-etal-2021-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}


ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend first checking its introduction, although this is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models: a spaCy component and a Docker image that runs a streamlit demo and a REST service.

Set up the environment

Requirements:

  • Debian-based (e.g. Debian, Ubuntu, ...) system
  • conda installed

To quickly set up the environment to use ExtEnD or replicate our experiments, you can use the bash script setup.sh. The only requirements are a Debian-based system (Debian, Ubuntu, ...) and an existing conda installation.

bash setup.sh

Checkpoints

We release the following checkpoints:

Model | Training Dataset | Avg Score
Longformer Large | AIDA | 85.8

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz

Data

All the datasets used to train and evaluate ExtEnD can be downloaded using the dataset download script in the Facebook GENRE repository.

We strongly recommend you organize them in the following structure under the data folder as it is used by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed
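
Since several scripts rely on this layout, it can help to check it before launching anything. The following is a minimal sketch, not part of the repository: the helper name missing_data_files is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
EXPECTED_FILES = {
    "aida": ["train.aida", "validation.aida", "test.aida"],
    "out_of_domain": [
        "ace2004-test-kilt.ed",
        "aquaint-test-kilt.ed",
        "clueweb-test-kilt.ed",
        "msnbc-test-kilt.ed",
        "wiki-test-kilt.ed",
    ],
}


def missing_data_files(data_dir="data"):
    """Return the relative paths of expected dataset files that are absent.

    Hypothetical helper: it only checks that the files exist, not their format.
    """
    root = Path(data_dir)
    return [
        f"{subdir}/{name}"
        for subdir, names in EXPECTED_FILES.items()
        for name in names
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_data_files("data")
    print("All dataset files found." if not missing else f"Missing: {missing}")
```

An empty list means every file in the tree above is in place.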

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

<folder> can be any folder containing exactly 3 files:

  • train.aida
  • validation.aida
  • test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train qa data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that <folder> can be any folder, as long as:

  • it contains these 3 files
  • they are in the same format as the files in data/aida

So if you want to train on different datasets, just create the corresponding directory and you are ready to go!

If you want to modify some training hyperparameters, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can inspect the modifiable parameters by adding the --print parameter to the training command. You can find more on this in the official classy documentation.

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing the checkpoint path as an input argument. This evaluates the given model on both the AIDA and the out-of-domain (OOD) resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that follows the same format as the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend
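
The evaluation commands above take a checkpoint path of the form experiments/<name>/<date>/<time>/checkpoints/best.ckpt. As a convenience, the sketch below looks up the most recent such file for an experiment. It assumes that every run follows the folder layout seen in the example path, which this README does not state explicitly, and latest_best_checkpoint is a hypothetical helper, not part of classy or ExtEnD.

```python
from pathlib import Path


def latest_best_checkpoint(experiment_dir):
    """Return the newest <date>/<time>/checkpoints/best.ckpt, or None.

    Assumption: run folders are named like 2021-10-22/09-11-39, so sorting
    the path components alphabetically also sorts them chronologically.
    """
    candidates = sorted(
        Path(experiment_dir).glob("*/*/checkpoints/best.ckpt"),
        key=lambda p: p.parts,
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    print(latest_best_checkpoint("experiments/extend-longformer-large"))
```

The returned path can then be passed to scripts/full_evaluation.sh or classy evaluate.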

spaCy

You can also use ExtEnD with spaCy, allowing you to use our system with a seamless interface that tackles full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you will be able to add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Football Team), (Syria, Syria National Football Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

  • <ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
  • <inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats for <inventory-path>:

  • tsv:
    $ head -1 <inventory-path>
    Rome [TAB] Rome City [TAB] Rome Football Team [TAB] Roman Empire [TAB] ...
    That is, <inventory-path> is a tab-separated file where, for each row, we have the mention (Rome) followed by its possible entities.
  • sqlite: a sqlite3 database with a candidate table with two columns:
    • mention (text PRIMARY KEY)
    • entities (text). This must be a tab-separated list of the corresponding entities.
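
To make the relation between the two formats concrete, here is a minimal sketch that converts a tsv inventory into the sqlite layout described above. It assumes the table is literally named candidate, as the description suggests. The helpers tsv_to_sqlite and lookup_candidates are hypothetical and are not the loader that ExtEnD itself uses.

```python
import sqlite3


def tsv_to_sqlite(tsv_path, sqlite_path):
    """Convert a tab-separated inventory into a sqlite3 inventory.

    Each tsv row is: mention, then its candidate entities, all tab-separated.
    Returns the number of rows written.
    """
    conn = sqlite3.connect(sqlite_path)
    written = 0
    try:
        # Assumed schema: table name and columns follow the description above.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS candidate "
            "(mention text PRIMARY KEY, entities text)"
        )
        with open(tsv_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                mention, *entities = line.split("\t")
                conn.execute(
                    "INSERT OR REPLACE INTO candidate (mention, entities) "
                    "VALUES (?, ?)",
                    (mention, "\t".join(entities)),
                )
                written += 1
        conn.commit()
    finally:
        conn.close()
    return written


def lookup_candidates(sqlite_path, mention):
    """Return the list of candidate entities stored for a mention."""
    conn = sqlite3.connect(sqlite_path)
    try:
        row = conn.execute(
            "SELECT entities FROM candidate WHERE mention = ?", (mention,)
        ).fetchone()
    finally:
        conn.close()
    return row[0].split("\t") if row else []
```

A lookup on a mention that is not in the inventory returns an empty list.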

We release the following pre-computed inventories that you can use as <inventory-path> (we recommend creating a folder data/inventories/ and placing the downloaded files there, e.g., <inventory-path> = data/inventories/le-and-titov-2018-inventory.min-count-2.sqlite3):

Inventory | Number of Mentions | Source
le-and-titov-2018-inventory.min-count-2.tsv | 12090972 | Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
[Recommended] le-and-titov-2018-inventory.min-count-2.sqlite3 | 12090972 | Cleaned version of the candidate set released by Le and Titov (2018). We discard mentions whose count is less than 2.
le-and-titov-2018-inventory.tsv | 21571265 | The candidate set released by Le and Titov (2018)
le-and-titov-2018-inventory.sqlite3 | 21571265 | The candidate set released by Le and Titov (2018)

Note that, as long as you respect either of these two formats, you can also create and use your own inventory!

Docker container

Finally, we also release a docker image running two services, a streamlit demo and a REST service:

$ docker run -p 22001:22001 -p 22002:22002 --rm -itd poccio/extend:1.0.1
<container id>

Now you can:

  • check out the streamlit demo at http://127.0.0.1:22001/
  • invoke the REST service running at http://127.0.0.1:22002/ (the OpenAPI documentation is available at http://127.0.0.1:22002/docs):
    $ curl -X POST http://127.0.0.1:22002/ -H 'Content-Type: application/json' -d '[{"text": "Rome is in Italy"}]'
    [{"text":"Rome is in Italy","disambiguated_entities":[{"char_start":0,"char_end":4,"mention":"Rome","entity":"Rome"},{"char_start":11,"char_end":16,"mention":"Italy","entity":"Italy"}]}]
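
The same call can be made from Python. The sketch below uses only the standard library and mirrors the curl example above. The function names are hypothetical, and the response fields (disambiguated_entities, mention, entity, char_start, char_end) are taken from the sample output rather than from the service's OpenAPI schema.

```python
import json
import urllib.request

# Default address of the REST service started by the docker command above.
SERVICE_URL = "http://127.0.0.1:22002/"


def build_request(texts, url=SERVICE_URL):
    """Build the POST request: a JSON list with one {"text": ...} item per input."""
    payload = json.dumps([{"text": text} for text in texts]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_links(response_body):
    """Map each input text to its (mention, entity) pairs.

    Field names are assumed from the sample response shown above.
    """
    return [
        (
            item["text"],
            [(e["mention"], e["entity"]) for e in item["disambiguated_entities"]],
        )
        for item in json.loads(response_body)
    ]


def disambiguate(texts, url=SERVICE_URL):
    """Send the texts to a running service and return the parsed links."""
    with urllib.request.urlopen(build_request(texts, url)) as response:
        return extract_links(response.read().decode("utf-8"))
```

With the container running, disambiguate(["Rome is in Italy"]) sends the same request as the curl command above.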

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Sapienza NLP group
The NLP group at the Sapienza University of Rome