Convolutional 2D Knowledge Graph Embeddings resources

Related tags

Text Data & NLPConvE
Overview

ConvE

Convolutional 2D Knowledge Graph Embeddings resources.

Paper: Convolutional 2D Knowledge Graph Embeddings

Used in the paper, but do not use these datasets for your research: FB15k and WN18. Please also note that the Kinship and Nations datasets have a high number of inverse relationships which makes them unsuitable for research. Nations has +95% inverse relationships and Kinship about 48%.

ConvE key facts

Predictive performance

Dataset MR MRR [email protected] [email protected] [email protected]
FB15k 64 0.75 0.87 0.80 0.67
WN18 504 0.94 0.96 0.95 0.94
FB15k-237 246 0.32 0.49 0.35 0.24
WN18RR 4766 0.43 0.51 0.44 0.39
YAGO3-10 2792 0.52 0.66 0.56 0.45
Nations 2 0.82 1.00 0.88 0.72
UMLS 1 0.94 0.99 0.97 0.92
Kinship 2 0.83 0.98 0.91 0.73

Run time performance

For an embedding size of 200 and batch size 128, a single batch takes on a GTX Titan X (Maxwell):

  • 64ms for 100,000 entities
  • 80ms for 1,000,000 entities

Parameter efficiency

Parameters ConvE/DistMult MRR ConvE/DistMult [email protected] ConvE/DistMult [email protected]
~5.0M 0.32 / 0.24 0.49 / 0.42 0.24 / 0.16
1.89M 0.32 / 0.23 0.49 / 0.41 0.23 / 0.15
0.95M 0.30 / 0.22 0.46 / 0.39 0.22 / 0.14
0.24M 0.26 / 0.16 0.39 / 0.31 0.19 / 0.09

ConvE with 8 times less parameters is still more powerful than DistMult. Relational Graph Convolutional Networks use roughly 32x more parameters to have the same performance as ConvE.

Installation

This repo supports Linux and Python installation via Anaconda.

  1. Install PyTorch using Anaconda.
  2. Install the requirements pip install -r requirements.txt
  3. Download the default English model used by spaCy, which is installed in the previous step python -m spacy download en
  4. Run the preprocessing script for WN18RR, FB15k-237, YAGO3-10, UMLS, Kinship, and Nations: sh preprocess.sh
  5. You can now run the model

Running a model

Parameters need to be specified by white-space tuples for example:

CUDA_VISIBLE_DEVICES=0 python main.py --model conve --data FB15k-237 \
                                      --input-drop 0.2 --hidden-drop 0.3 --feat-drop 0.2 \
                                      --lr 0.003 --preprocess

will run a ConvE model on FB15k-237.

To run a model, you first need to preprocess the data once. This can be done by specifying the --preprocess parameter:

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME --preprocess

After the dataset is preprocessed it will be saved to disk and this parameter can be omitted.

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME

The following parameters can be used for the --model parameter:

conve
distmult
complex

The following datasets can be used for the --data parameter:

FB15k-237
WN18RR
YAGO3-10
umls
kinship
nations

And here a complete list of parameters.

Link prediction for knowledge graphs

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        input batch size for training (default: 128)
  --test-batch-size TEST_BATCH_SIZE
                        input batch size for testing/validation (default: 128)
  --epochs EPOCHS       number of epochs to train (default: 1000)
  --lr LR               learning rate (default: 0.003)
  --seed S              random seed (default: 17)
  --log-interval LOG_INTERVAL
                        how many batches to wait before logging training
                        status
  --data DATA           Dataset to use: {FB15k-237, YAGO3-10, WN18RR, umls,
                        nations, kinship}, default: FB15k-237
  --l2 L2               Weight decay value to use in the optimizer. Default:
                        0.0
  --model MODEL         Choose from: {conve, distmult, complex}
  --embedding-dim EMBEDDING_DIM
                        The embedding dimension (1D). Default: 200
  --embedding-shape1 EMBEDDING_SHAPE1
                        The first dimension of the reshaped 2D embedding. The
                        second dimension is infered. Default: 20
  --hidden-drop HIDDEN_DROP
                        Dropout for the hidden layer. Default: 0.3.
  --input-drop INPUT_DROP
                        Dropout for the input embeddings. Default: 0.2.
  --feat-drop FEAT_DROP
                        Dropout for the convolutional features. Default: 0.2.
  --lr-decay LR_DECAY   Decay the learning rate by this factor every epoch.
                        Default: 0.995
  --loader-threads LOADER_THREADS
                        How many loader threads to use for the batch loaders.
                        Default: 4
  --preprocess          Preprocess the dataset. Needs to be executed only
                        once. Default: 4
  --resume              Resume a model.
  --use-bias            Use a bias in the convolutional layer. Default: True
  --label-smoothing LABEL_SMOOTHING
                        Label smoothing value to use. Default: 0.1
  --hidden-size HIDDEN_SIZE
                        The side of the hidden layer. The required size
                        changes with the size of the embeddings. Default: 9728
                        (embedding size 200).

To reproduce most of the results in the ConvE paper, you can use the default parameters and execute the command below:

CUDA_VISIBLE_DEVICES=0 python main.py --data DATASET_NAME

For the reverse model, you can run the provided file with the name of the dataset name and a threshold probability:

python inverse_model.py WN18RR 0.9

Changing the embedding size for ConvE

If you want to change the embedding size you can do that via the ``--embedding-dim parameter. However, for ConvE, since the embedding is reshaped as a 2D embedding one also needs to pass the first dimension of the reshaped embedding (--embedding-shape1`) while the second dimension is infered. When once changes the embedding size, the hidden layer size `--hidden-size` also needs to be different but it is difficult to determine before run time. The easiest way to determine the hidden size is to run the model, let it run on an error due to wrong shape, and then reshape according to the dimension in the error message.

Example: Change embedding size to be 100. We want 10x10 2D embeddings. We run python main.py --embedding-dim 100 --embedding-shape1 10 and we run on an error due to wrong hidden dimension:

   ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [128 x 4608], m2: [9728 x 100] at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/generic/THCTensorMathBlas.cu:273

Now we change the hidden dimension to 4608 accordingly: python main.py --embedding-dim 100 --embedding-shape1 10 --hidden-size 4608. Now the model runs with an embedding size of 100 and 10x10 2D embeddings.

Adding new datasets

To run it on a new datasets, copy your dataset folder into the data folder and make sure your dataset split files have the name train.txt, valid.txt, and test.txt which contain tab separated triples of a knowledge graph. Then execute python wrangle_KG.py FOLDER_NAME, afterwards, you can use the folder name of your dataset in the dataset parameter.

Adding your own model

You can easily write your own knowledge graph model by extending the barebone model MyModel that can be found in the model.py file.

Quirks

There are some quirks of this framework.

  1. The model currently ignores data that does not fit into the specified batch size, for example if your batch size is 100 and your test data is 220, then 20 samples will be ignored. This is designed in that way to improve performance on small datasets. To test on the full test-data you can save the model checkpoint, load the model (with the --resume True variable) and then evaluate with a batch size that fits the test data (for 220 you could use a batch size of 110). Another solution is to just use a fitting batch size from the start, that is, you could train with a batch size of 110.

Issues

It has been noted that #6 WN18RR does contain 212 entities in the test set that do not appear in the training set. About 6.7% of the test set is affected. This means that most models will find it impossible to make any reasonable predictions for these entities. This will make WN18RR appear more difficult than it really is, but it should not affect the usefulness of the dataset. If all researchers compared to the same datasets the scores will still be comparable.

Logs

Some log files of the original research are included in the repo (logs.tar.gz). These log files are mostly unstructured in names and might be created from checkpoints so that it is difficult to comprehend them. Nevertheless, it might help to replicate the results or study the behavior of the training under certain conditions and thus I included them here.

Citation

If you found this codebase or our work useful please cite us:

@inproceedings{dettmers2018conve,
	Author = {Dettmers, Tim and Pasquale, Minervini and Pontus, Stenetorp and Riedel, Sebastian},
	Booktitle = {Proceedings of the 32th AAAI Conference on Artificial Intelligence},
	Title = {Convolutional 2D Knowledge Graph Embeddings},
	Url = {https://arxiv.org/abs/1707.01476},
	Year = {2018},
        pages  = {1811--1818},
  	Month = {February}
}



Owner
Tim Dettmers
Tim Dettmers
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

DV Lab 408 Dec 29, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

AI2 1.3k Jan 03, 2023
Korea Spell Checker

한국어 문서 koSpellPy Korean Spell checker How to use Install pip install kospellpy Use from kospellpy import spell_init spell_checker = spell_init() # d

kangsukmin 2 Oct 20, 2021
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
Repository for Graph2Pix: A Graph-Based Image to Image Translation Framework

Graph2Pix: A Graph-Based Image to Image Translation Framework Installation Install the dependencies in env.yml $ conda env create -f env.yml $ conda a

18 Nov 17, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

HiFi DeepVariant + WhatsHap workflow Workflow steps align HiFi reads to reference with pbmm2 call small variants with DeepVariant, using two-pass meth

William Rowell 2 May 14, 2022
Spert NLP Relation Extraction API deployed with torchserve for inference

URLMask Python program for Linux users to change a URL to ANY domain. A program than can take any url and mask it to any domain name you like. E.g. ne

Zichu Chen 1 Nov 24, 2021
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
Tools for curating biomedical training data for large-scale language modeling

Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop 242 Dec 25, 2022
Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁

TGCLOUD 🪁 Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁 Features Easy to Deploy Heroku Supp

Mr.Acid dev 6 Oct 18, 2022
Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022