BERTMap: A BERT-Based Ontology Alignment System

About

BERTMap is a BERT-based ontology alignment system that uses the textual knowledge of ontologies to fine-tune BERT and make predictions. It also incorporates sub-word inverted indices for candidate selection, and graph-based extension and logic-based repair modules for mapping refinement.
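
As a toy illustration (not BERTMap's actual code) of how a sub-word inverted index can serve candidate selection, the sketch below indexes class labels by their WordPiece tokens and ranks candidates by token overlap; all names and the choice of tokenizer are hypothetical:

from collections import defaultdict
from transformers import AutoTokenizer

# Hypothetical illustration: map each sub-word token to the classes
# whose labels contain it, then retrieve candidates by token overlap.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_inverted_index(class_labels):
    """class_labels: dict mapping class IRI -> list of label strings."""
    index = defaultdict(set)
    for iri, labels in class_labels.items():
        for label in labels:
            for token in tokenizer.tokenize(label):
                index[token].add(iri)
    return index

def select_candidates(index, query_label, limit=50):
    """Rank classes by the number of shared sub-word tokens."""
    scores = defaultdict(int)
    for token in tokenizer.tokenize(query_label):
        for iri in index.get(token, ()):
            scores[iri] += 1
    return sorted(scores, key=scores.get, reverse=True)[:limit]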

Essential dependencies

The following packages are necessary but not sufficient for running BERTMap:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch  # pytorch
pip install cython  # the optimized parser of owlready2 relies on Cython
pip install owlready2  # for managing ontologies
pip install tensorboard  # tensorboard logging (optional)
pip install transformers  # huggingface library
pip install datasets  # huggingface datasets
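
To sanity-check the environment, a minimal script like the following (an illustration, not part of the repository) confirms that the dependencies import and that a GPU is visible to PyTorch:

# Sanity check: all imports succeed and (optionally) a GPU is visible.
import torch
import transformers
import datasets
import owlready2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("owlready2 imported successfully")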

Running BERTMap

IMPORTANT NOTICE: BERTMap relies on class labels for training, but different ontologies use different annotation properties to define aliases (synonyms), so preprocessing is required to add all synonyms to rdf:label before running BERTMap. The preprocessed ontologies involved in our paper, together with their reference mappings, are available in data.zip.
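
As an illustration of such preprocessing, the following owlready2 sketch copies synonyms into rdfs:label, assuming they are stored under oboInOwl:hasExactSynonym; the file paths and the synonym property IRI are examples and must be adapted to your ontologies:

from owlready2 import get_ontology, IRIS

# Hypothetical example: copy oboInOwl exact synonyms into rdfs:label.
# Adjust the ontology path and the synonym property IRI to your own data.
onto = get_ontology("file://./data/source.owl").load()
SYNONYM_IRI = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"
syn_prop = IRIS[SYNONYM_IRI]  # None if the property is absent from the ontology

for cls in onto.classes():
    for synonym in list(syn_prop[cls]):
        if synonym not in cls.label:
            cls.label.append(synonym)  # rdfs:label is exposed as .label

onto.save(file="./data/source.preprocessed.owl", format="rdfxml")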

Clone the repository and run:

# fine-tune and evaluate bertmap predictions
python run_bertmap.py -c config.json -m bertmap

# mapping extension (-e specifies which mapping set {src, tgt, combined} to extend)
python extend_bertmap.py -c config.json -e src

# evaluate the extended bertmap
python eval_bertmap.py -c config.json -e src

# repair and evaluate the final outputs (-t specifies the best validation threshold)
python repair_bertmap.py -c config.json -e src -t 0.999

# baseline models (edit similarity and pretrained bert embeddings)
python run_bertmap.py -c config.json -m nes
python run_bertmap.py -c config.json -m bertembeds

Once the data has been constructed on the first run, the script skips this step, ensuring that all of the models share the same set of pre-processed data.

The fine-tuning model is implemented with the Hugging Face Trainer, which by default uses all available GPUs. To restrict it to GPUs of specified indices, run (for example):

# only devices 1 and 2 are visible to the script
CUDA_VISIBLE_DEVICES=1,2 python run_bertmap.py -c config.json -m bertmap 

Configurations

The following explains the variables used in config.json for customizing a BERTMap run; an illustrative skeleton is given after this list.

  • data:
    • task_dir: directory for saving all the output files.
    • src_onto: source ontology name.
    • tgt_onto: target ontology name.
    • task_suffix: any suffix of the task if needed, e.g. the LargeBio track has 'small' and 'whole'.
    • src_onto_file: source ontology file in .owl format.
    • tgt_onto_file: target ontology file in .owl format.
    • properties: list of textual properties used for constructing semantic data; default is class labels: ["label"].
    • cut: threshold length for the keys of the sub-word inverted index; keys are preserved only if their lengths > cut, default is 0.
  • corpora:
    • sample_rate: number of (soft) negative samples for each positive sample generated in the corpora (not the ultimate fine-tuning data).
    • src2tgt_mappings_file: reference mapping file for evaluation and the semi-supervised learning setting, in .tsv format with columns "Entity1", "Entity2" and "Value".
    • ignored_mappings_file: file in .tsv format that stores mappings to be ignored by the evaluator.
    • train_map_ratio: proportion of training mappings used in the semi-supervised setting, default is 0.2.
    • val_map_ratio: proportion of validation mappings used in the semi-supervised setting, default is 0.1.
    • test_map_ratio: proportion of test mappings used in the semi-supervised setting, default is 0.7.
    • io_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • io_hard_neg_rate: number of hard negative samples for each positive sample generated in the fine-tuning data at the intra-ontology level.
    • co_soft_neg_rate: number of soft negative samples for each positive sample generated in the fine-tuning data at the cross-ontology level.
    • depth_threshold: classes at depths larger than this threshold will not be considered in hard negative generation, default is null.
    • depth_strategy: strategy for computing the depths of the classes if any threshold is set, default is max, choices are max and min.
  • bert:
    • pretrained_path: local or Hugging Face Hub path for the pretrained BERT model, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
    • tokenizer_path: local or Hugging Face Hub path for the BERT tokenizer, e.g. "emilyalsentzer/Bio_ClinicalBERT" (BioClinicalBERT).
  • fine-tune:
    • include_ids: whether or not to include identity synonyms in the positive samples.
    • learning: choice of learning setting: ss (semi-supervised) or us (unsupervised).
    • warm_up_ratio: proportion of warm-up steps.
    • max_length: maximum length for the tokenizer (highly important for large tasks!).
    • num_epochs: number of training epochs, default is 3.0.
    • batch_size: batch size for fine-tuning BERT.
    • early_stop: whether or not to apply early stopping (patience has been set to 10), default is false.
    • resume_checkpoint: path to a previous checkpoint if any, default is null.
  • map:
    • candidate_limits: list of candidate limits used for mapping computation, suggested values are [25, 50, 100, 150, 200].
    • batch_size: batch size used for mapping computation.
    • nbest: number of top results to be considered.
    • string_match: whether or not to apply string matching before other scoring.
    • strategy: strategy for the classifier scoring method, default is mean.
  • eval:
    • automatic: whether or not to evaluate the mappings automatically.
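
For orientation, here is one illustrative way to generate a config.json skeleton from Python; every value below is an example only (the field names follow the list above) and should be adjusted to your task and ontologies:

import json

# Illustrative config.json skeleton; all values are examples only.
config = {
    "data": {
        "task_dir": "./bertmap_output",
        "src_onto": "snomed",
        "tgt_onto": "fma",
        "task_suffix": "small",
        "src_onto_file": "./data/snomed.small.owl",
        "tgt_onto_file": "./data/fma.small.owl",
        "properties": ["label"],
        "cut": 0,
    },
    "corpora": {
        "sample_rate": 10,
        "src2tgt_mappings_file": "./data/refs.tsv",
        "ignored_mappings_file": "./data/ignored.tsv",
        "train_map_ratio": 0.2,
        "val_map_ratio": 0.1,
        "test_map_ratio": 0.7,
        "io_soft_neg_rate": 1,
        "io_hard_neg_rate": 1,
        "co_soft_neg_rate": 2,
        "depth_threshold": None,   # serialized as null
        "depth_strategy": "max",
    },
    "bert": {
        "pretrained_path": "emilyalsentzer/Bio_ClinicalBERT",
        "tokenizer_path": "emilyalsentzer/Bio_ClinicalBERT",
    },
    "fine-tune": {
        "include_ids": False,
        "learning": "us",
        "warm_up_ratio": 0.0,
        "max_length": 128,
        "num_epochs": 3.0,
        "batch_size": 32,
        "early_stop": False,
        "resume_checkpoint": None,
    },
    "map": {
        "candidate_limits": [25, 50, 100, 150, 200],
        "batch_size": 512,
        "nbest": 1,
        "string_match": True,
        "strategy": "mean",
    },
    "eval": {"automatic": True},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)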

Should you need any further customizations, especially on the evaluation part, please set eval: automatic to false and use your own evaluation script.
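
For instance, a minimal custom evaluator might compute precision, recall and F1 from two .tsv mapping files with the "Entity1", "Entity2" and "Value" columns described above (file names here are hypothetical):

import csv

# Minimal sketch of a custom evaluator over .tsv mapping files.
def read_mappings(path, threshold=0.0):
    with open(path, newline="") as f:
        rows = csv.DictReader(f, delimiter="\t")
        return {(r["Entity1"], r["Entity2"])
                for r in rows if float(r["Value"]) >= threshold}

preds = read_mappings("predicted.tsv", threshold=0.999)  # hypothetical file names
refs = read_mappings("reference.tsv")

tp = len(preds & refs)
precision = tp / len(preds) if preds else 0.0
recall = tp / len(refs) if refs else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")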

Acknowledgements

The repair module is credited to Ernesto Jiménez-Ruiz et al., and the code can be found here.
