MPLP: Metapath-Based Label Propagation for Heterogeneous Graphs

Overview

MPLP: Metapath-Based Label Propagation for Heterogeneous Graphs

Results on MAG240M

Here we report the following performance on the MAG240M dataset from OGB-LSC at KDD Cup 2021.

Model     | Test Acc | Validation Acc                     | Parameters | Hardware
Our Model | 0.7447   | 0.7669 ± 0.0003 (ensemble: 0.7696) | 743,449    | Tesla V100 (21GB)

Reproducing results

0. Requirements

The Python 3 packages we used in this competition are listed below:

numpy==1.19.2
torch==1.5.1+cu101
dgl-cu101==0.6.0.post1
ogb==1.3.1
sklearn==0.23.2
tqdm==4.46.1
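
If you need to install them, a minimal pip sketch is given below; the extra index URL for the CUDA 10.1 PyTorch wheel is an assumption and may need adjusting for your environment, and sklearn is installed via the scikit-learn package:

pip install numpy==1.19.2 tqdm==4.46.1 ogb==1.3.1 scikit-learn==0.23.2
pip install torch==1.5.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install dgl-cu101==0.6.0.post1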

1. Prepare Graph and Features

The preprocessing code is modified from the DGL baseline. We create a graph with 6 different edge types instead of 5.

# Time cost: 3hours,30mins

python3 $MAG_CODE_PATH/preprocess.py \
        --rootdir $MAG_INPUT_PATH \
        --author-output-path $MAG_PREP_PATH/author.npy \
        --inst-output-path $MAG_PREP_PATH/inst.npy \
        --graph-output-path $MAG_PREP_PATH \
        --graph-as-homogeneous \
        --full-output-path $MAG_PREP_PATH/full_feat.npy

The graphs and features will be saved to $MAG_PREP_PATH, which is specified in run.sh.
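
All commands in this guide rely on a set of path variables. The block below is a hedged example of what run.sh might export; the directory names are placeholders, so adjust them to your own setup:

export MAG_CODE_PATH=/path/to/this/repo
export MAG_INPUT_PATH=/path/to/mag240m_kddcup2021   # raw OGB-LSC dataset root
export MAG_PREP_PATH=/path/to/preprocessed          # graphs and full features
export MAG_FEAT_PATH=/path/to/metapath_features
export MAG_RGAT_PATH=/path/to/rgat_outputs
export MAG_MPLP_PATH=/path/to/mplp_workdir
export MAG_SUBM_PATH=/path/to/submission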

Calculate features

The metapath-based features are generated by this script; details can be found in our technical report.

# Time cost: 2hours,20mins (only generate label related features)

python3 $MAG_CODE_PATH/feature.py \
        $MAG_INPUT_PATH \
        $MAG_PREP_PATH/dgl_graph_full_heterogeneous_csr.bin \
        $MAG_FEAT_PATH \
        --seed=42
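
For intuition, the block below is a conceptual sketch of one such label-related feature, not the actual feature.py: assuming a paper-cites-paper adjacency in scipy, each paper is described by the label distribution of the papers it cites, using train labels only so that validation/test labels never leak.

# Toy example: propagate one-hot train labels along the "paper -> cites -> paper" metapath.
import numpy as np
import scipy.sparse as sp

num_papers, num_classes = 5, 3                      # toy sizes
cites = sp.csr_matrix(                              # paper-cites-paper adjacency
    ([1, 1, 1, 1], ([0, 0, 1, 2], [1, 2, 3, 3])),
    shape=(num_papers, num_papers), dtype=np.float32)
labels = np.array([0, 2, 1, 2, 0])                  # class of each paper
train_mask = np.array([False, True, True, True, False])

# one-hot labels, zeroed outside the train split
onehot = np.zeros((num_papers, num_classes), dtype=np.float32)
onehot[train_mask, labels[train_mask]] = 1.0

# propagate along the metapath and row-normalize to a label distribution
feat = cites @ onehot
row_sum = feat.sum(axis=1, keepdims=True)
feat = feat / np.clip(row_sum, 1e-12, None)
print(feat)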

Train RGAT model and prepare RGAT features

The RGAT model is modified from the DGL baseline. The validation accuracy is 0.701, the same as reported in the DGL baseline GitHub repository.

# Time cost: 33hours,40mins (20mins for each epoch)

python3 $MAG_CODE_PATH/rgat.py \
        --rootdir $MAG_INPUT_PATH \
        --graph-path $MAG_PREP_PATH/dgl_graph_full_homogeneous_csc.bin \
        --full-feature-path $MAG_PREP_PATH/full_feat.npy \
        --output-path $MAG_RGAT_PATH/ \
        --epochs=100 \
        --model-path $MAG_RGAT_PATH/model.pt \
        --submission-path $MAG_RGAT_PATH/

This produces embeddings that are used as input features for the MPLP models below.
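
As a hedged illustration (the exact file names written by rgat.py and feature.py may differ), the saved embeddings can be loaded and concatenated with the metapath features to form the MPLP input:

import numpy as np

rgat_emb = np.load("rgat_embeddings.npy")      # hypothetical output of rgat.py
label_feat = np.load("label_features.npy")     # hypothetical output of feature.py
mplp_input = np.concatenate([rgat_emb, label_feat], axis=1)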

2. Train MPLP models

The training process is split into two steps:

  1. train the model on all training samples with a large learning rate (here we use StepLR(lr=0.01, step_size=100, gamma=0.25));
  2. then fine-tune the model on the most recent training samples (e.g., papers with year >= 2018) with a small learning rate (0.000625); a sketch of this schedule is shown below.
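
A minimal PyTorch sketch of this two-stage schedule follows; the model, data, and epoch counts are placeholders, and mplp.py implements the real loop with its own arguments.

import torch

model = torch.nn.Linear(768, 153)   # placeholder for the actual MPLP model and feature/class sizes
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.25)

# Stage 1: train on all labelled papers with the decaying learning rate.
for epoch in range(200):
    ...  # forward/backward pass over the full training set
    scheduler.step()

# Stage 2: fine-tune on recent papers (year >= 2018) with a small constant rate.
for group in optimizer.param_groups:
    group["lr"] = 0.000625
for epoch in range(50):
    ...  # forward/backward pass over the year >= 2018 subset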

You can train the MPLP model by running the following commands:

# Time cost: 2hours,40mins for each seed

for seed in $(seq 0 7);
do
    python3 $MAG_CODE_PATH/mplp.py \
            $MAG_INPUT_PATH \
            $MAG_MPLP_PATH/data/ \
            $MAG_MPLP_PATH/output/seed${seed} \
            --gpu \
            --seed=${seed} \
            --batch_size=10240 \
            --epochs=200 \
            --num_layers=2 \
            --learning_rate=0.01 \
            --dropout=0.5 \
            --num_splits=5
done

3. Ensemble MPLP results

Once you have the results of k-fold cross-validation training under 8 different seeds, you can average them by running the command below:

python3 $MAG_CODE_PATH/ensemble.py $MAG_MPLP_PATH/output/ $MAG_SUBM_PATH
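
Conceptually, the ensembling amounts to averaging the predicted class probabilities across seeds; the sketch below uses assumed file names and is not the exact ensemble.py logic.

import glob
import numpy as np

prob_files = sorted(glob.glob("output/seed*/test_prob.npy"))   # hypothetical per-seed probability files
probs = np.stack([np.load(f) for f in prob_files], axis=0)     # (num_seeds, num_papers, num_classes)
prediction = probs.mean(axis=0).argmax(axis=1)                 # final label per paper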
Owner: Qiuying Peng (Astrophysics -> Data Science)