GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training @ KDD 2020

Last update: Dec 27, 2022

Overview

Original implementation for the paper GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training.

GCC is a contrastive learning framework for unsupervised pre-training of structural graph representations. It achieves state-of-the-art results on 10 datasets across 3 graph mining tasks.

Installation

Requirements

Linux with Python ≥ 3.6
PyTorch ≥ 1.4.0
DGL ≥ 0.4.3 and < 0.5
Install the remaining Python dependencies with pip install -r requirements.txt.
Install RDKit with conda install -c conda-forge rdkit=2019.09.2.
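
The version bounds above can be checked before launching anything. The following is a minimal sketch, not part of the repository: `parse_version` and `satisfies` are hypothetical helper names, and the parser only looks at the leading numeric components of a version string.

```python
import re


def parse_version(text):
    """Turn '1.4.0+cu101' into (1, 4, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, minimum, below=None):
    """True if minimum <= installed, and installed < below when a bound is given."""
    version = parse_version(installed)
    if version < parse_version(minimum):
        return False
    return below is None or version < parse_version(below)
```

With PyTorch and DGL installed, `satisfies(torch.__version__, "1.4.0")` and `satisfies(dgl.__version__, "0.4.3", below="0.5")` should both be true, and `sys.version_info >= (3, 6)` covers the Python requirement.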

Quick Start

Pretraining

Pre-training datasets

python scripts/download.py --url https://drive.google.com/open?id=1JCHm39rf7HAJSp-1755wa32ToHCn2Twz --path data --fname small.bin
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/b37eed70207c468ba367/?dl=1 --path data --fname small.bin

E2E

Pretrain E2E with K = 255 (one less than the batch size of 256):

bash scripts/pretrain.sh <gpu> --batch-size 256

MoCo

Pretrain MoCo with K = 16384; m = 0.999:

bash scripts/pretrain.sh <gpu> --moco --nce-k 16384
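
For readers unfamiliar with the two MoCo settings, here is a toy illustration of what K and m control in the general MoCo scheme: a fixed-size queue of K negative keys, and a key encoder that follows the query encoder with momentum m. It is only a sketch on plain Python floats, not the repository's training code; `momentum_update` and `enqueue` are hypothetical names.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move each key parameter a small step toward its query counterpart."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def enqueue(queue, new_keys):
    """Append a batch of keys; a deque with maxlen drops the oldest entries."""
    queue.extend(new_keys)
    return queue


K = 16384                 # dictionary size, as passed via --nce-k
queue = deque(maxlen=K)   # holds at most K encoded keys used as negatives
```

A large m keeps the key encoder changing slowly, so keys that were encoded many steps ago and are still in the queue remain roughly comparable with fresh ones.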

Download Pretrained Models

Instead of pretraining from scratch, you can download our pretrained models.

python scripts/download.py --url https://drive.google.com/open?id=1lYW_idy9PwSdPEC7j9IH5I5Hc7Qv-22- --path saved --fname pretrained.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/cabec37002a9446d9b20/?dl=1 --path saved --fname pretrained.tar.gz

Downstream Tasks

Downstream datasets

python scripts/download.py --url https://drive.google.com/open?id=12kmPV3XjVufxbIVNx5BQr-CFM9SmaFvM --path data --fname downstream.tar.gz
# For regions where Google is not accessible, use
# python scripts/download.py --url https://cloud.tsinghua.edu.cn/f/2535437e896c4b73b6bb/?dl=1 --path data --fname downstream.tar.gz

Generate embeddings on multiple datasets with

bash scripts/generate.sh <gpu> <load_path> <dataset_1> <dataset_2> ...

For example:

bash scripts/generate.sh 0 saved/Pretrain_moco_True_dgl_gin_layer_5_lr_0.005_decay_1e-05_bsz_32_hid_64_samples_2000_nce_t_0.07_nce_k_16384_rw_hops_256_restart_prob_0.8_aug_1st_ft_False_deg_16_pos_32_momentum_0.999/current.pth usa_airport kdd imdb-binary
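
Since run directory names are long, it can be easier to look the checkpoint up than to type it. The sketch below assumes the layout seen in the example above, that is one directory per run under `saved/` with a `current.pth` inside; `find_checkpoints` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path


def find_checkpoints(root="saved", filename="current.pth"):
    """Return every <root>/<run_dir>/<filename>, sorted by path."""
    return sorted(Path(root).glob("*/" + filename))
```

Any returned path can be passed as `<load_path>`, for example `str(find_checkpoints()[0])`.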

Node Classification

Unsupervised (Table 2 freeze)

Run baselines on multiple datasets with bash scripts/node_classification/baseline.sh <hidden_size> <baseline:prone/graphwave> usa_airport h-index.

Evaluate GCC on multiple datasets:

bash scripts/generate.sh <gpu> <load_path> usa_airport h-index
bash scripts/node_classification/ours.sh <load_path> <hidden_size> usa_airport h-index
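
The two steps share the same checkpoint and dataset list, so they can be driven from one place. This is a minimal sketch that only assembles the two command lines shown above; `freeze_eval_commands` and `run_all` are hypothetical names, and the hidden size must match the pretrained model (the example checkpoint name above contains `hid_64`).

```python
import subprocess


def freeze_eval_commands(gpu, load_path, hidden_size, datasets,
                         task="node_classification"):
    """Build the generate + evaluate command lines for the frozen setting."""
    datasets = list(datasets)
    generate = ["bash", "scripts/generate.sh", str(gpu), load_path] + datasets
    evaluate = ["bash", "scripts/{}/ours.sh".format(task),
                load_path, str(hidden_size)] + datasets
    return [generate, evaluate]


def run_all(commands):
    """Run the commands in order from the repository root, stopping on failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(freeze_eval_commands(0, load_path, 64, ["usa_airport", "h-index"]))` reproduces the two commands above with GPU 0.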

Supervised (Table 2 full)

Finetune GCC on multiple datasets:

bash scripts/finetune.sh <load_path> <gpu> usa_airport

Note that this finetunes the whole network and will take much longer than the frozen experiments above.

Graph Classification

Unsupervised (Table 3 freeze)

bash scripts/generate.sh <gpu> <load_path> imdb-binary imdb-multi collab rdt-b rdt-5k
bash scripts/graph_classification/ours.sh <load_path> <hidden_size> imdb-binary imdb-multi collab rdt-b rdt-5k

Supervised (Table 3 full)

bash scripts/finetune.sh <load_path> <gpu> imdb-binary

Similarity Search (Table 4)

Run baseline (graphwave) on multiple datasets with bash scripts/similarity_search/baseline.sh <hidden_size> graphwave kdd_icdm sigir_cikm sigmod_icde.

Run GCC:

bash scripts/generate.sh <gpu> <load_path> kdd icdm sigir cikm sigmod icde
bash scripts/similarity_search/ours.sh <load_path> <hidden_size> kdd_icdm sigir_cikm sigmod_icde
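
Note that `generate.sh` is given individual dataset names while `ours.sh` is given dataset pairs. Judging only from the commands above, a pair name appears to be the two dataset names joined by an underscore. Under that assumption, the list for `generate.sh` can be derived from the pairs; `datasets_from_pairs` is a hypothetical helper.

```python
def datasets_from_pairs(pairs):
    """Split 'kdd_icdm' into 'kdd' and 'icdm', keeping order and dropping repeats."""
    names = []
    for pair in pairs:
        for name in pair.split("_"):
            if name not in names:
                names.append(name)
    return names
```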

❗ Common Issues

"XXX file not found" when running pretraining/downstream tasks.

Please make sure you've downloaded the pretraining dataset or downstream task datasets according to GETTING_STARTED.md.

Server crashes/hangs after launching pretraining experiments.

In addition to a GPU, our pretraining stage requires a lot of computational resources, including CPU and RAM. A crash or hang usually means that the CPU or RAM on your machine is exhausted. You can decrease `--num-workers` (number of dataloader workers using the CPU) and `--num-copies` (number of dataset copies residing in RAM). For the lowest-resource profile, try `--num-workers 1 --num-copies 1`.

If this still fails, please upgrade your machine :). In the meantime, you can still download our pretrained model and evaluate it on downstream tasks.

Having difficulty installing RDKit.

See the P.S. section in [this](https://github.com/THUDM/GCC/issues/12#issue-752080014) post.

Citing GCC

If you use GCC in your research or wish to refer to the baseline results, please use the following BibTeX.

@article{qiu2020gcc,
  title={GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training},
  author={Qiu, Jiezhong and Chen, Qibin and Dong, Yuxiao and Zhang, Jing and Yang, Hongxia and Ding, Ming and Wang, Kuansan and Tang, Jie},
  journal={arXiv preprint arXiv:2006.09963},
  year={2020}
}

Acknowledgements

Part of this code is inspired by Yonglong Tian et al.'s CMC: Contrastive Multiview Coding.

Owner

THUDM
