A PyTorch implementation of "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation" (WSDM 2019).

Overview

SimGNN

PWC codebeat badge repo sizebenedekrozemberczki⠀⠀

A PyTorch implementation of SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (WSDM 2019).

Abstract

Graph similarity search is among the most important graph-based applications, e.g. finding the chemical compounds that are most similar to a query compound. Graph similarity/distance computation, such as Graph Edit Distance (GED) and Maximum Common Subgraph (MCS), is the core operation of graph similarity search and many other applications, but very costly to compute in practice. Inspired by the recent success of neural network approaches to several graph applications, such as node or graph classification, we propose a novel neural network based approach to address this classic yet challenging graph problem, aiming to alleviate the computational burden while preserving a good performance. The proposed approach, called SimGNN, combines two strategies. First, we design a learnable embedding function that maps every graph into an embedding vector, which provides a global summary of a graph. A novel attention mechanism is proposed to emphasize the important nodes with respect to a specific similarity metric. Second, we design a pairwise node comparison method to sup plement the graph-level embeddings with fine-grained node-level information. Our model achieves better generalization on unseen graphs, and in the worst case runs in quadratic time with respect to the number of nodes in two graphs. Taking GED computation as an example, experimental results on three real graph datasets demonstrate the effectiveness and efficiency of our approach. Specifically, our model achieves smaller error rate and great time reduction compared against a series of baselines, including several approximation algorithms on GED computation, and many existing graph neural network based models. Our study suggests SimGNN provides a new direction for future research on graph similarity computation and graph similarity search.

This repository provides a PyTorch implementation of SimGNN as described in the paper:

SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, Wei Wang. WSDM, 2019. [Paper]

A reference Tensorflow implementation is accessible [here] and another implementation is [here].

Requirements

The codebase is implemented in Python 3.5.2. package versions used for development are just below.

networkx          2.4
tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
torch             1.1.0
torch-scatter     1.4.0
torch-sparse      0.4.3
torch-cluster     1.4.5
torch-geometric   1.3.2
torchvision       0.3.0
scikit-learn      0.20.0

Datasets

The code takes pairs of graphs for training from an input folder where each pair of graph is stored as a JSON. Pairs of graphs used for testing are also stored as JSON files. Every node id and node label has to be indexed from 0. Keys of dictionaries are stored strings in order to make JSON serialization possible.

Every JSON file has the following key-value structure:

{"graph_1": [[0, 1], [1, 2], [2, 3], [3, 4]],
 "graph_2":  [[0, 1], [1, 2], [1, 3], [3, 4], [2, 4]],
 "labels_1": [2, 2, 2, 2],
 "labels_2": [2, 3, 2, 2, 2],
 "ged": 1}

The **graph_1** and **graph_2** keys have edge list values which descibe the connectivity structure. Similarly, the **labels_1** and **labels_2** keys have labels for each node which are stored as list - positions in the list correspond to node identifiers. The **ged** key has an integer value which is the raw graph edit distance for the pair of graphs.

Options

Training a SimGNN model is handled by the `src/main.py` script which provides the following command line arguments.

Input and output options

  --training-graphs   STR    Training graphs folder.      Default is `dataset/train/`.
  --testing-graphs    STR    Testing graphs folder.       Default is `dataset/test/`.

Model options

  --filters-1             INT         Number of filter in 1st GCN layer.       Default is 128.
  --filters-2             INT         Number of filter in 2nd GCN layer.       Default is 64. 
  --filters-3             INT         Number of filter in 3rd GCN layer.       Default is 32.
  --tensor-neurons        INT         Neurons in tensor network layer.         Default is 16.
  --bottle-neck-neurons   INT         Bottle neck layer neurons.               Default is 16.
  --bins                  INT         Number of histogram bins.                Default is 16.
  --batch-size            INT         Number of pairs processed per batch.     Default is 128. 
  --epochs                INT         Number of SimGNN training epochs.        Default is 5.
  --dropout               FLOAT       Dropout rate.                            Default is 0.5.
  --learning-rate         FLOAT       Learning rate.                           Default is 0.001.
  --weight-decay          FLOAT       Weight decay.                            Default is 10^-5.
  --histogram             BOOL        Include histogram features.              Default is False.

Examples

The following commands learn a neural network and score on the test set. Training a SimGNN model on the default dataset.

python src/main.py

Training a SimGNN model for a 100 epochs with a batch size of 512.

python src/main.py --epochs 100 --batch-size 512

Training a SimGNN with histogram features.

python src/main.py --histogram

Training a SimGNN with histogram features and a large bin number.

python src/main.py --histogram --bins 32

Increasing the learning rate and the dropout.

python src/main.py --learning-rate 0.01 --dropout 0.9

You can save the trained model by adding the --save-path parameter.

python src/main.py --save-path /path/to/model-name

Then you can load a pretrained model using the --load-path parameter; note that the model will be used as-is, no training will be performed.

python src/main.py --load-path /path/to/model-name

License

Comments
  • Model test error is too high

    Model test error is too high

    I'm sorry to bother you.But when I tried to replicate your work,I ran into some difficulties. Here is the problem I met:

    python src/main.py +---------------------+------------------+ | Batch size | 128 | +=====================+==================+ | Bins | 16 | +---------------------+------------------+ | Bottle neck neurons | 16 | +---------------------+------------------+ | Dropout | 0.500 | +---------------------+------------------+ | Epochs | 5 | +---------------------+------------------+ | Filters 1 | 128 | +---------------------+------------------+ | Filters 2 | 64 | +---------------------+------------------+ | Filters 3 | 32 | +---------------------+------------------+ | Histogram | 0 | +---------------------+------------------+ | Learning rate | 0.001 | +---------------------+------------------+ | Tensor neurons | 16 | +---------------------+------------------+ | Testing graphs | ./dataset/test/ | +---------------------+------------------+ | Training graphs | ./dataset/train/ | +---------------------+------------------+ | Weight decay | 0.001 | +---------------------+------------------+

    Enumerating unique labels.

    100%|██████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2533.57it/s]

    Model training.

    Epoch: 0%| | 0/5 [00:00<?, ?it/s] /home/jovyan/SimGNN/src/simgnn.py:212: UserWarning: Using a target size (torch.Size([1, 1])) that is different to the input size (torch.Size([1])). This will likely lead to incorrect results due to broadcasting. Please ensure they havethe same size. losses = losses + torch.nn.functional.mse_loss(data["target"], prediction) Epoch (Loss=3.87038): 100%|██████████████████████████████████████████████████████████████████| 5/5 [00:16<00:00, 3.23s/it] Batches: 100%|███████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.68s/it]

    Model evaluation.

    100%|█████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 102.39it/s]

    Baseline error: 0.41597.

    Model test error: 0.94024.

    I found the model test error too high! The only thing I changed was the version of the libraries,which I replaced with the latest. Could you help me with this problem?

    opened by Alice314 7
  • About dataset

    About dataset

    I am really interested in this amazing work, but I don't understand how datasets are generated (or processed), or are training data and test data generated from public data sets (just like Linux, AIDS, mentioned in the paper)? I desire to know how syngen.py works and the output of this function.

    Thanks a lot.

    opened by BenedictWongBJUT 6
  • Extracting Latent Space

    Extracting Latent Space

    How would you recommend we approach creating network embeddings (entire network is single point/vector) using this library?

    I was thinking of modifying the forward pass to output similarity and running MDS on the similarity matrix if I'm doing all vs all on the test set.

    I am hoping to compare a couple hundred generated graphs via ERGM and latent ERGM as well as other network approaches to the original graph and output an embedding of all of the graphs.

    Please let me know your recommendations before this becomes a time sink, else, I can find a way to hack it. Thanks!

    opened by jlevy44 4
  • Questions about the model.

    Questions about the model.

    Sorry for using this section to ask technical question rather than code-wise. Had a few questions.

    • Do you have the pretrained model on any of the dataset (like IMDB) the paper talks about? The size of sample in the github dataset is small.
    • Am I right that there is no early stopping in the model? As there is no validation set. [the reason I am asking this question is as I use a bigger dataset, with higher number of epochs, most of the ged predicted values are around 1]
    opened by BehroozMansouri 3
  • Error adapting code

    Error adapting code

    Hey! I am facing some issues adapting your SimGNN code into my graph dataset. I keep encountering an error that says:

    RuntimeError: Invalid index in scatterAdd at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:520

    I even tried to change one of the training JSON files to the example of the labels and graph structure you gave. I also see a warning about size.

    Am I missing something? Any help would be greatly appreciated.

    opened by kalkidann 3
  • Notice on package versions

    Notice on package versions

    Hello,

    Trying (with some struggle unfortunately) to get this to run, I noticed that the requirements' package versions in the README file are different to some package versions in the requirements.txt.

    You may want to check that out.

    Best.

    opened by Chuhtra 2
  • Performance of SimGNN

    Performance of SimGNN

    Hi @benedekrozemberczki

    We recently used this implementation of SimGNN and noticed that the performance does not match to the one that is outlined in the paper. In particular, for AIDS dataset we get an order of magnitude higher MSE that in the original paper. Did you verify that this implementation match the performance of the paper?

    P.S. we also benchmarked the orignal repo of SimGNN and noticed that it produces slightly better results, even though it's very long to run until convergence.

    opened by nd7141 2
  • Visualizing Attention

    Visualizing Attention

    Hi, I want to visualize the attention with respect to the input graph (like Figure 5 in the paper). Can you please guide me how to visualize the attention weight with the input graphs?

    opened by sajjadriaj 2
  • Add ability to save and load a trained model

    Add ability to save and load a trained model

    To achieve repeatable results, it was also necessary to keep the label in fixed order, so that the resulting one-hot encoding vectors are the same across different runs.

    Explanation of this feature has been added to the README.

    opened by Carlovan 1
  • Batch processing required.

    Batch processing required.

    Hi.

    Is there a version of the code where we can send batched data into the model? Current version works on one graph pair at a time. This is taking too long for larger training data with each data point being fed one at a time into the model via for loop.

    Thanks.

    opened by Indradyumna 1
  • Import error for scatter_cuda

    Import error for scatter_cuda

    Hi there,

    I am having a “no module named touchy_scatter .scatter_cuda” import error. I am sure I have the necessary toolkits and driver installed.

    Does the code require a GPU environment?

    opened by jonathan-goh 1
  • Error in the example from readme

    Error in the example from readme

    In the example in the repo: {"graph_1": [[0, 1], [1, 2], [2, 3], [3, 4]], "graph_2": [[0, 1], [1, 2], [1, 3], [3, 4], [2, 4]], "labels_1": [2, 2, 2, 2], "labels_2": [2, 3, 2, 2, 2], "ged": 1} I think there is a label missing in labels_1. Also the ged for the example is 2 if we consider the missing label is 2. Am I missing smth?

    opened by merascu 1
Releases(v_00001)
Owner
Benedek Rozemberczki
Machine Learning Engineer at AstraZeneca | PhD from The University of Edinburgh.
Benedek Rozemberczki
This repo is to present various code demos on how to use our Graph4NLP library.

Deep Learning on Graphs for Natural Language Processing Demo The repository contains code examples for DLG4NLP tutorials at NAACL 2021, SIGIR 2021, KD

Graph4AI 143 Dec 23, 2022
Ascend your Jupyter Notebook usage

Jupyter Ascending Sync Jupyter Notebooks from any editor About Jupyter Ascending lets you edit Jupyter notebooks from your favorite editor, then insta

Untitled AI 254 Jan 08, 2023
Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research

Welcome to AirSim AirSim is a simulator for drones, cars and more, built on Unreal Engine (we now also have an experimental Unity release). It is open

Microsoft 13.8k Jan 05, 2023
Plaything for Autistic Children (demo for PaddlePaddle/Wechaty/Mixlab project)

星星的孩子 - 一款为孤独症孩子设计的聊天机器人游戏 孤独症儿童是目前常常被忽视的一类群体。他们有着类似性格内向的特征,实际却受着广泛性发育障碍的折磨。 项目背景 这类儿童在与人交往时存在着沟通障碍,其特点表现在: 社交交流差,互动障碍明显 认知能力有限,被动认知 兴趣狭窄,重复刻板,缺乏变化和想象

Tianyi Pan 35 Nov 24, 2022
Noether Networks: meta-learning useful conserved quantities

Noether Networks: meta-learning useful conserved quantities This repository contains the code necessary to reproduce experiments from "Noether Network

Dylan Doblar 33 Nov 23, 2022
Unofficial Implementation of MLP-Mixer, Image Classification Model

MLP-Mixer Unoffical Implementation of MLP-Mixer, easy to use with terminal. Train and test easly. https://arxiv.org/abs/2105.01601 MLP-Mixer is an arc

Oğuzhan Ercan 6 Dec 05, 2022
Turning SymPy expressions into PyTorch modules.

sympytorch A micro-library as a convenience for turning SymPy expressions into PyTorch Modules. All SymPy floats become trainable parameters. All SymP

Patrick Kidger 89 Dec 13, 2022
Materials for my scikit-learn tutorial

Scikit-learn Tutorial Jake VanderPlas email: [email protected] twitter: @jakevdp gith

Jake Vanderplas 1.6k Dec 30, 2022
AISTATS 2019: Confidence-based Graph Convolutional Networks for Semi-Supervised Learning

Confidence-based Graph Convolutional Networks for Semi-Supervised Learning Source code for AISTATS 2019 paper: Confidence-based Graph Convolutional Ne

MALL Lab (IISc) 56 Dec 03, 2022
Implement of homography net by pytorch

HomographyNet Implement of homography net by pytorch Brief Introduction This project is based on the work Homography-Net: @article{detone2016deep, t

ronghao_CN 4 May 19, 2022
This repo includes our code for evaluating and improving transferability in domain generalization (NeurIPS 2021)

Transferability for domain generalization This repo is for evaluating and improving transferability in domain generalization (NeurIPS 2021), based on

gordon 9 Nov 29, 2022
Official PyTorch Implementation of SSMix (Findings of ACL 2021)

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021) Official PyTorch Implementation of SSMix | Paper Abstract Data augment

Clova AI Research 52 Dec 27, 2022
PassAPI is a password generator in hash format and fully developed in Python, with the aim of teaching how to handle and build

simple, elegant and safe Introduction PassAPI is a password generator in hash format and fully developed in Python, with the aim of teaching how to ha

Johnsz 2 Mar 02, 2022
Code for Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions

EMS-COLS-recourse Initial Code for Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions Folder structure: data folder contains raw an

Prateek Yadav 1 Nov 25, 2022
Learning High-Speed Flight in the Wild

Learning High-Speed Flight in the Wild This repo contains the code associated to the paper Learning Agile Flight in the Wild. For more information, pl

Robotics and Perception Group 391 Dec 29, 2022
HHP-Net: A light Heteroscedastic neural network for Head Pose estimation with uncertainty

HHP-Net: A light Heteroscedastic neural network for Head Pose estimation with uncertainty Giorgio Cantarini, Francesca Odone, Nicoletta Noceti, Federi

18 Aug 02, 2022
“英特尔创新大师杯”深度学习挑战赛 赛道3:CCKS2021中文NLP地址相关性任务

ccks2021-track3 CCKS2021中文NLP地址相关性任务-赛道三-冠军方案 团队:我的加菲鱼- wodejiafeiyu 初赛第二/复赛第一/决赛第一 前言 19年开始,陆陆续续参加了一些比赛,拿到过一些top,比较懒一直都没分享过,这次比较幸运又拿了top1,打算分享下 分类的任务

shaochenjie 131 Dec 31, 2022
Volumetric parameterization of the placenta to a flattened template

placenta-flattening A MATLAB algorithm for volumetric mesh parameterization. Developed for mapping a placenta segmentation derived from an MRI image t

Mazdak Abulnaga 12 Mar 14, 2022
Implementation of ETSformer, state of the art time-series Transformer, in Pytorch

ETSformer - Pytorch Implementation of ETSformer, state of the art time-series Transformer, in Pytorch Install $ pip install etsformer-pytorch Usage im

Phil Wang 121 Dec 30, 2022
Differential fuzzing for the masses!

NEZHA NEZHA is an efficient and domain-independent differential fuzzer developed at Columbia University. NEZHA exploits the behavioral asymmetries bet

147 Dec 05, 2022