Compositional and Parameter-Efficient Representations for Large Knowledge Graphs

Overview

NodePiece - Compositional and Parameter-Efficient Representations for Large Knowledge Graphs

NodePiece is a "tokenizer" for reducing entity vocabulary size in knowledge graphs. Instead of shallow embedding every node to a vector, we first "tokenize" each node by K anchor nodes and M relation types in its relational context. Then, the resulting hash sequence is encoded through any injective function, e.g., MLP or Transformer.

Similar to Byte-Pair Encoding and WordPiece tokenizers commonly used in NLP, NodePiece can tokenize unseen nodes attached to the seen graph using the same anchor and relation vocabulary, which allows NodePiece to work out-of-the-box in the inductive settings using all the well-known scoring functions in the classical KG completion (like TransE or RotatE). NodePiece also works with GNNs (we tested on node classification, but not limited to it, of course).

NodePiece source code

The repo contains the code and experimental setups for reproducibility studies.

Each experiment resides in the respective folder:

  • LP_RP - link prediction and relation prediction
  • NC - node classification
  • OOS_LP - out-of-sample link prediction

The repo is based on Python 3.8. wandb is an optional requirement in case you have an existing account there and would like to track experimental results. If you have a wandb account, the repo assumes you've performed

wandb login <your_api_key>

Using a GPU is recommended.

First, run a script which will download all the necessary pre-processed data and datasets. It takes approximately 1 GB.

sh download_data.sh

We packed the pre-processed data for faster experimenting with the repo. Note that there are two NodePiece tokenization modes (-tkn_mode [option]): path and bfs:

  • path is an old tokenization strategy (based on finding shortest paths between each node and all anchors) under which we performed the experiments and packed the data for reproducibility;
  • bfs is a new strategy (based on iterative expansion of node's neighborhood until a desired number of anchors is reached) which is 5-50x faster and takes 4-5x less space depending on the KG. Currently, works for transductive LP/RP tasks;

Pre-processing times tested on M1 MacBook Pro / 8 GB:

mode FB15k-237 / vocab size WN18RR / vocab size YAGO 3-10 / vocab size
path 2 min / 28 MB 5 min / 140 MB ~ 5 hours / 240 MB
bfs 8 sec / 7.5 MB 30 sec / 20 MB 4.5 min / 40 MB

CoDEx-Large and YAGO path pre-processing is better run on a server with 16-32 GB RAM and might take 2-5 hours depending on the chosen number of anchors.

NB: we seek to further improve the algorithms to make the tokenization process even faster than the bfs strategy.

Second, install the dependencies in requirements.txt. Note that when installing Torch-Geometric you might want to use pre-compiled binaries for a certain version of python and torch. Check the manual here.

In the link prediction tasks, all the necessary datasets will be downloaded upon first script execution.

Link Prediction

The link prediction (LP) and relation prediction (RP) tasks use models, datasets, and evaluation protocols from PyKEEN.

Navigate to the lp_rp folder: cd lp_rp.

The list of CLI params can be found in run_lp.py.

  • Run the fb15k-237 experiment
python run_lp.py -loop lcwa -loss bce -b 512 -data fb15k237 -anchors 1000 -sp 100 -lr 0.0005 -ft_maxp 20 -pool cat -embedding 200 -sample_rels 15 -smoothing 0.4 -epochs 401
  • Run the wn18rr experiment
python run_lp.py -loop slcwa -loss nssal -margin 15 -b 512 -data wn18rr -anchors 500 -sp 100 -lr 0.0005 -ft_maxp 50 -pool cat -embedding 200 -negs 20 -subbatch 2000 -sample_rels 4 -epochs 601
  • Run the codex-l experiment
python run_lp.py -loop lcwa -loss bce -b 256 -data codex_l -anchors 7000 -sp 100 -lr 0.0005 -ft_maxp 20 -pool cat -embedding 200 -subbatch 10000 -sample_rels 6 -smoothing 0.3 -epochs 120
  • Run the yago 3-10 experiment
python run_lp.py -loop slcwa -loss nssal -margin 50 -b 512 -data yago -anchors 10000 -sp 100 -lr 0.00025 -ft_maxp 20 -pool cat -embedding 200 -subbatch 2000 -sample_rels 5 -negs 10 -epochs 601

Test evaluation reproducibility patch

PyKEEN 1.0.5 used in this repo has been identified to have issues at the filtering stage when evaluating on the test set. In order to fully reproduce the reported test set numbers for transductive LP/RP experiments from the paper and resolve this issue, please apply the patch from the lp_rp/patch folder:

  1. Locate pykeen in your environment installation:
<path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen
  1. Replace the evaluation/evaluator.py with the one from the patch folder
cp ./lp_rp/patch/evaluator.py <path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen/evaluation/
  1. Replace the stoppers/early_stopping.py with the one from the patch folder
cp ./lp_rp/patch/early_stopping.py <path_to_env>/lib/python3.<NUMBER>/site-packages/pykeen/stoppers/

This won't be needed once we port the codebase to newest versions of PyKEEN (1.4.0+) where this was fixed

Relation Prediction

The setup is very similar to that of link prediction (LP) but we predict relations (h,?,t) now.

Navigate to the lp_rp folder: cd lp_rp.

The list of CLI params can be found in run_lp.py

  • Run the fb15k-237 experiment
python run_lp.py -loop slcwa -loss nssal -b 512 -data fb15k237 -anchors 1000 -sp 100 -lr 0.0005 -ft_maxp 20 -margin 15 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 15 -epochs 21 --rel-prediction True
  • Run the wn18rr experiment
python run_lp.py -loop slcwa -loss nssal -b 512 -data wn18rr -anchors 500 -sp 100 -lr 0.0005 -ft_maxp 50 -margin 12 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 4 -epochs 151 --rel-prediction True
  • Run the yago 3-10 experiment
python run_lp.py -loop slcwa -loss nssal -b 512 -data yago -anchors 10000 -sp 100 -lr 0.0005 -ft_maxp 20 -margin 25 -subbatch 2000 -pool cat -embedding 200 -negs 20 -sample_rels 5 -epochs 7 --rel-prediction True

Node Classification

Navigate to the nc folder: cd nc .

The list of CLI params can be found in run_nc.py

If you have a GPU, use DEVICE cuda otherwise DEVICE cpu.

The run on 5% of labeled data:

python run_nc.py DATASET wd50k MAX_QPAIRS 3 STATEMENT_LEN 3 LABEL_SMOOTHING 0.1 EVAL_EVERY 5 DEVICE cpu WANDB False EPOCHS 4001 GCN_HID_DROP2 0.5 GCN_HID_DROP 0.5 GCN_FEAT_DROP 0.5 EMBEDDING_DIM 100 GCN_GCN_DIM 100 LEARNING_RATE 0.001 GCN_ATTENTION True GCN_GCN_DROP 0.3 GCN_ATTENTION_DROP 0.3 GCN_LAYERS 3 DS_TYPE transductive MODEL_NAME stare TR_RATIO 0.05 USE_FEATURES False TOKENIZE True NUM_ANCHORS 50 MAX_PATHS 10 USE_TEST True

The run on 10% of labeled data:

python run_nc.py DATASET wd50k MAX_QPAIRS 3 STATEMENT_LEN 3 LABEL_SMOOTHING 0.1 EVAL_EVERY 5 DEVICE cpu WANDB False EPOCHS 4001 GCN_HID_DROP2 0.5 GCN_HID_DROP 0.5 GCN_FEAT_DROP 0.5 EMBEDDING_DIM 100 GCN_GCN_DIM 100 LEARNING_RATE 0.001 GCN_ATTENTION True GCN_GCN_DROP 0.3 GCN_ATTENTION_DROP 0.3 GCN_LAYERS 3 DS_TYPE transductive MODEL_NAME stare TR_RATIO 0.1 USE_FEATURES False TOKENIZE True NUM_ANCHORS 50 MAX_PATHS 10 USE_TEST True

Out-of-sample Link Prediction

Navigate to the oos_lp folder: cd oos_lp/src.

The list of CLI params can be found in main.py.

  • Run the oos fb15k-237 experiment
python main.py -dataset FB15k-237 -model_name DM_NP_fb -ne 41 -lr 0.0005 -emb_dim 200 -batch_size 256 -simulated_batch_size 256 -save_each 100 -tokenize True -opt adam -pool trf -use_custom_reg False -reg_lambda 0.0 -loss_fc spl -margin 15 -neg_ratio 5 -wandb False -eval_every 20 -anchors 1000 -sample_rels 15
  • Run the oos yago3-10 experiment
python main.py -dataset YAGO3-10 -model_name DM_NP_yago -ne 41 -lr 0.0005 -emb_dim 200 -batch_size 256 -simulated_batch_size 256 -save_each 100 -tokenize True -opt adam -pool trf -use_custom_reg False -reg_lambda 0.0 -loss_fc spl -margin 15 -neg_ratio 5 -wandb False -eval_every 20 -anchors 10000 -sample_rels 5

Citation

If you find this work useful, please consider citing the paper:

@misc{galkin2021nodepiece,
    title={NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs},
    author={Mikhail Galkin and Jiapeng Wu and Etienne Denis and William L. Hamilton},
    year={2021},
    eprint={2106.12144},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Owner
Michael Galkin
Michael Galkin
Code and experiments for "Deep Neural Networks for Rank Consistent Ordinal Regression based on Conditional Probabilities"

corn-ordinal-neuralnet This repository contains the orginal model code and experiment logs for the paper "Deep Neural Networks for Rank Consistent Ord

Raschka Research Group 14 Dec 27, 2022
DeepFaceLab fork which provides IPython Notebook to use DFL with Google Colab

DFL-Colab — DeepFaceLab fork for Google Colab This project provides you IPython Notebook to use DeepFaceLab with Google Colaboratory. You can create y

779 Jan 05, 2023
Customised to detect objects automatically by a given model file(onnx)

LabelImg LabelImg is a graphical image annotation tool. It is written in Python and uses Qt for its graphical interface. Annotations are saved as XML

Heeone Lee 1 Jun 07, 2022
Hashformers is a framework for hashtag segmentation with transformers.

Hashtag segmentation is the task of automatically inserting the missing spaces between the words in a hashtag. Hashformers applies Transformer models

Ruan Chaves 41 Nov 09, 2022
Implementation of ConvMixer for "Patches Are All You Need? 🤷"

Patches Are All You Need? 🤷 This repository contains an implementation of ConvMixer for the ICLR 2022 submission "Patches Are All You Need?" by Asher

CMU Locus Lab 934 Jan 08, 2023
School of Artificial Intelligence at the Nanjing University (NJU)School of Artificial Intelligence at the Nanjing University (NJU)

F-Principle This is an exercise problem of the digital signal processing (DSP) course at School of Artificial Intelligence at the Nanjing University (

Thyrix 5 Nov 23, 2022
LBK 26 Dec 28, 2022
This repository contains the official code of the paper Equivariant Subgraph Aggregation Networks (ICLR 2022)

Equivariant Subgraph Aggregation Networks (ESAN) This repository contains the official code of the paper Equivariant Subgraph Aggregation Networks (IC

Beatrice Bevilacqua 59 Dec 13, 2022
A Closer Look at Invalid Action Masking in Policy Gradient Algorithms

A Closer Look at Invalid Action Masking in Policy Gradient Algorithms This repo contains the source code to reproduce the results in the paper A Close

Costa Huang 73 Dec 24, 2022
Send text to girlfriend in the morning

Girlfriend Text Send text to girlfriend (or really anyone with a phone number) in the morning 1. Configure your settings in utils.py. phone_number = "

Paras Adhikary 199 Oct 25, 2022
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Pretrained Language Model This repository provides the latest pretrained language models and its related optimization techniques developed by Huawei N

HUAWEI Noah's Ark Lab 2.6k Jan 01, 2023
Real-time pose estimation accelerated with NVIDIA TensorRT

trt_pose Want to detect hand poses? Check out the new trt_pose_hand project for real-time hand pose and gesture recognition! trt_pose is aimed at enab

NVIDIA AI IOT 803 Jan 06, 2023
Replication attempt for the Protein Folding Model

RGN2-Replica (WIP) To eventually become an unofficial working Pytorch implementation of RGN2, an state of the art model for MSA-less Protein Folding f

Eric Alcaide 36 Nov 29, 2022
OpenFace – a state-of-the art tool intended for facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation.

OpenFace 2.2.0: a facial behavior analysis toolkit Over the past few years, there has been an increased interest in automatic facial behavior analysis

Tadas Baltrusaitis 5.8k Dec 31, 2022
基于PaddleOCR搭建的OCR server... 离线部署用

开头说明 DangoOCR 是基于大家的 CPU处理器 来运行的,CPU处理器 的好坏会直接影响其速度, 但不会影响识别的精度 ,目前此版本识别速度可能在 0.5-3秒之间,具体取决于大家机器的配置,可以的话尽量不要在运行时开其他太多东西。需要配合团子翻译器 Ver3.6 及其以上的版本才可以使用!

胖次团子 131 Dec 25, 2022
Aerial Single-View Depth Completion with Image-Guided Uncertainty Estimation (RA-L/ICRA 2020)

Aerial Depth Completion This work is described in the letter "Aerial Single-View Depth Completion with Image-Guided Uncertainty Estimation", by Lucas

ETHZ V4RL 70 Dec 22, 2022
Over9000 optimizer

Optimizers and tests Every result is avg of 20 runs. Dataset LR Schedule Imagenette size 128, 5 epoch Imagewoof size 128, 5 epoch Adam - baseline OneC

Mikhail Grankin 405 Nov 27, 2022
Codes for 'Dual Parameterization of Sparse Variational Gaussian Processes'

Dual Parameterization of Sparse Variational Gaussian Processes Documentation | Notebooks | API reference Introduction This repository is the official

AaltoML 7 Dec 23, 2022
The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Deep High-Resolution Representation Learning for Human Pose Estimation (CVPR 2019) News [2020/07/05] A very nice blog from Towards Data Science introd

Leo Xiao 3.9k Jan 05, 2023
Implementation for "Seamless Manga Inpainting with Semantics Awareness" (SIGGRAPH 2021 issue)

Seamless Manga Inpainting with Semantics Awareness [SIGGRAPH 2021](To appear) | Project Website | BibTex Introduction: Manga inpainting fills up the d

101 Jan 01, 2023