LUKE -- Language Understanding with Knowledge-based Embeddings

Related tags

Text Data & NLPluke
Overview

LUKE

CircleCI


LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transformer. It was proposed in our paper LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. It achieves state-of-the-art results on important NLP benchmarks including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).

This repository contains the source code to pre-train the model and fine-tune it to solve downstream tasks.

News

November 24, 2021: Entity disambiguation example is available

The example code of entity disambiguation based on LUKE has been added to this repository. This model was originally proposed in our paper, and achieved state-of-the-art results on five standard entity disambiguation datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI.

For further details, please refer to the example directory.

August 3, 2021: New example code based on Hugging Face Transformers and AllenNLP is available

New fine-tuning examples of three downstream tasks, i.e., NER, relation classification, and entity typing, have been added to LUKE. These examples are developed based on Hugging Face Transformers and AllenNLP. The fine-tuning models are defined using simple AllenNLP's Jsonnet config files!

The example code is available in the examples_allennlp directory.

May 5, 2021: LUKE is added to Hugging Face Transformers

LUKE has been added to the master branch of the Hugging Face Transformers library. You can now solve entity-related tasks (e.g., named entity recognition, relation classification, entity typing) easily using this library.

For example, the LUKE-large model fine-tuned on the TACRED dataset can be used as follows:

>>> from transformers import LukeTokenizer, LukeForEntityPairClassification
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicted class: per:cities_of_residence

We also provide the following three Colab notebooks that show how to reproduce our experimental results on CoNLL-2003, TACRED, and Open Entity datasets using the library:

Please refer to the official documentation for further details.

November 5, 2021: LUKE-500K (base) model

We released LUKE-500K (base), a new pretrained LUKE model which is smaller than existing LUKE-500K (large). The experimental results of the LUKE-500K (base) and LUKE-500K (large) on SQuAD v1 and CoNLL-2003 are shown as follows:

Task Dataset Metric LUKE-500K (base) LUKE-500K (large)
Extractive Question Answering SQuAD v1.1 EM/F1 86.1/92.3 90.2/95.4
Named Entity Recognition CoNLL-2003 F1 93.3 94.3

We tuned only the batch size and learning rate in the experiments based on LUKE-500K (base).

Comparison with State-of-the-Art

LUKE outperforms the previous state-of-the-art methods on five important NLP tasks:

Task Dataset Metric LUKE-500K (large) Previous SOTA
Extractive Question Answering SQuAD v1.1 EM/F1 90.2/95.4 89.9/95.1 (Yang et al., 2019)
Named Entity Recognition CoNLL-2003 F1 94.3 93.5 (Baevski et al., 2019)
Cloze-style Question Answering ReCoRD EM/F1 90.6/91.2 83.1/83.7 (Li et al., 2019)
Relation Classification TACRED F1 72.7 72.0 (Wang et al. , 2020)
Fine-grained Entity Typing Open Entity F1 78.2 77.6 (Wang et al. , 2020)

These numbers are reported in our EMNLP 2020 paper.

Installation

LUKE can be installed using Poetry:

$ poetry install

The virtual environment automatically created by Poetry can be activated by poetry shell.

Released Models

We initially release the pre-trained model with 500K entity vocabulary based on the roberta.large model.

Name Base Model Entity Vocab Size Params Download
LUKE-500K (base) roberta.base 500K 253 M Link
LUKE-500K (large) roberta.large 500K 483 M Link

Reproducing Experimental Results

The experiments were conducted using Python3.6 and PyTorch 1.2.0 installed on a server with a single or eight NVidia V100 GPUs. We used NVidia's PyTorch Docker container 19.02. For computational efficiency, we used mixed precision training based on APEX library which can be installed as follows:

$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

The APEX library is not needed if you do not use --fp16 option or reproduce the results based on the trained checkpoint files.

The commands that reproduce the experimental results are provided as follows:

Entity Typing on Open Entity Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-typing run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=2 \
    --learning-rate=1e-5 \
    --num-train-epochs=3 \
    --fp16

Relation Classification on TACRED Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    relation-classification run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=4 \
    --gradient-accumulation-steps=8 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Named Entity Recognition on CoNLL-2003 Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli\
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    ner run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=2 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Cloze-style Question Answering on ReCoRD Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    entity-span-qa run \
    --data-dir=<DATA_DIR> \
    --train-batch-size=1 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=2 \
    --fp16

Extractive Question Answering on SQuAD 1.1 Dataset

Dataset: Link
Checkpoint file (compressed): Link
Wikipedia data files (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --checkpoint-file=<CHECKPOINT_FILE> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir=<OUTPUT_DIR> \
    reading-comprehension run \
    --data-dir=<DATA_DIR> \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --train-batch-size=2 \
    --gradient-accumulation-steps=3 \
    --learning-rate=15e-6 \
    --num-train-epochs=2 \
    --fp16

Citation

If you use LUKE in your work, please cite the original paper:

@inproceedings{yamada2020luke,
  title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},
  author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},
  booktitle={EMNLP},
  year={2020}
}

Contact Info

Please submit a GitHub issue or send an e-mail to Ikuya Yamada ([email protected]) for help or issues using LUKE.

Owner
Studio Ousia
Studio Ousia
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Jifan Chen 22 Oct 21, 2022
YACLC - Yet Another Chinese Learner Corpus

汉语学习者文本多维标注数据集YACLC V1.0 中文 | English 汉语学习者文本多维标注数据集(Yet Another Chinese Learner

BLCU-ICALL 47 Dec 15, 2022
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
null

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
[AAAI 21] Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

◥ Curriculum Labeling ◣ Revisiting Pseudo-Labeling for Semi-Supervised Learning Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez. In the

UVA Computer Vision 113 Dec 15, 2022
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

Maluuba Inc. 309 Oct 19, 2022
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
Creating a Feed of MISP Events from ThreatFox (by abuse.ch)

ThreatFox2Misp Creating a Feed of MISP Events from ThreatFox (by abuse.ch) What will it do? This will fetch IOCs from ThreatFox by Abuse.ch, convert t

17 Nov 22, 2022
Code for paper "Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features"

Role-oriented Network Embedding Based on Adversarial Learning between Higher-order and Local Features Train python main.py --dataset brazil-flights C

wang zhang 0 Jun 28, 2022
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

PEGASUS library Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised

Google Research 1.4k Dec 22, 2022
Problem: Given a nepali news find the category of the news

Classification of category of nepali news catorgory using different algorithms Problem: Multiclass Classification Approaches: TFIDF for vectorization

pudasainishushant 2 Jan 09, 2022
Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

PyStanfordDependencies Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies. Example usage Start by

David McClosky 64 May 08, 2022
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
Unsupervised Language Model Pre-training for French

FlauBERT and FLUE FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the n

GETALP 212 Dec 10, 2022
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Diaformer Diaformer: Automatic Diagnosis via Symptoms Sequence Generation (AAAI 2022) Diaformer is an efficient model for automatic diagnosis via symp

Junying Chen 20 Dec 13, 2022