Recommendation algorithms for large graphs


pygrank

Fast recommendation algorithms for large graphs based on link analysis.

License: Apache Software License
Author: Emmanouil (Manios) Krasanakis
Dependencies: networkx, numpy, scipy, sklearn, wget (required); tensorflow, torch (optional)


Roadmap for 0.2.X

The following roadmap overviews short-term development goals and will be updated appropriately.

✔️ Reach a stable architecture with comprehensive development management (achieved as of 0.2.3; no longer backwards compatible with 0.1.17, the most important change being the renaming of to_scipy to preprocessor)
✔️ Graph neural network support with dropout, renormalization and tensorflow backend (achieved as of 0.2.3)
✔️ PyTorch backend (achieved as of 0.2.4)
PyTorch GNNs
100% code coverage
100% documentation completeness
Automatic download for all related publication datasets
Updated reference docs and automated citation discovery for algorithms
Enable Arnoldi and Lanczos optimizations in non-numpy backends
Transparent handling of float and double precisions (as of 0.2.4 everything is in float32, but this will change)

🛠️ Installation

pygrank is meant to work with Python 3.9 or later. It can be installed with pip per:

pip install pygrank

To use the machine learning backends tensorflow and pytorch (e.g. to integrate the package in machine learning projects), manually change the automatically created configuration file whose path is displayed in the error console. If you want others to run your code that depends on pygrank with specific backends, add the following recipe at your code's entry point to override other configurations:

import pygrank as pg
pg.load_backend("pytorch")

Quickstart

As a quick start, let us construct a networkx graph and a set of seed nodes.

>>> import networkx as nx
>>> graph = nx.Graph()
>>> graph.add_edge("A", "B")
>>> graph.add_edge("B", "C")
>>> graph.add_edge("C", "D")
>>> graph.add_edge("D", "E")
>>> graph.add_edge("A", "C")
>>> graph.add_edge("C", "E")
>>> graph.add_edge("B", "E")
>>> seeds = {"A", "B"}

We now run a personalized PageRank graph filter to score the structural relatedness of graph nodes to those of the given set. We start by importing the library:

>>> import pygrank as pg

For instructional purposes, we select the PageRank filter. This and more filters can be found in the module pygrank.algorithms.filters, but for ease of use they can also be accessed from the top-level import. We also set the values of some parameters: the graph diffusion rate alpha required by the algorithm, a numerical tolerance tol at the convergence point, and the "auto" normalization strategy for preprocessing the graph adjacency matrix, which selects symmetric normalization for undirected graphs (as in this example) and column-based normalization for directed ones.

>>> ranker = pg.PageRank(alpha=0.85, tol=1.E-6, normalization="auto")
>>> ranks = ranker(graph, {v: 1 for v in seeds})

Node ranking output is always organized into graph signals, which can be used like dictionaries. For example, we can print the scores of some nodes per:

>>> print(ranks["B"], ranks["D"], ranks["E"])
0.25865456609095644 0.12484722044728883 0.17079023174039495

We can alter this outcome so that it outputs node order instead, where higher node scores are assigned lower (better) order. This is achieved by wrapping a postprocessor around the algorithm. There are various postprocessors, including ones that make scores fairness-aware. Again, postprocessors can be found in pygrank.algorithms.postprocess, but can also be accessed from the top-level package import.

>>> ordinals = pg.Ordinals(ranker).rank(graph, {v: 1 for v in seeds})
>>> print(ordinals["B"], ordinals["D"], ordinals["E"])
1 5 4

How much time did it take for the base ranker to converge? (Depends on backend and device characteristics.)

>>> print(ranker.convergence)
19 iterations (0.001831000000009908 sec)

Since only the node order is important, we can use a different way to specify convergence:

>>> convergence = pg.RankOrderConvergenceManager(pagerank_alpha=0.85, confidence=0.98) 
>>> early_stop_ranker = pg.PageRank(alpha=0.85, convergence=convergence)
>>> ordinals = pg.Ordinals(early_stop_ranker).rank(graph, {v: 1 for v in seeds})
>>> print(early_stop_ranker.convergence)
2 iterations (0.0006313000000091051 sec)
>>> print(ordinals["B"], ordinals["D"], ordinals["E"])
1 5 4

Close to the previous results at a fraction of the time! Note that convergence time measurements do not take into account the time needed to preprocess graphs.

🧠 Overview

Analyzing graph edges (links) between nodes can help rank/score graph nodes based on their structural proximity to structural or attribute-based communities of nodes. With the introduction of graph signal processing and decoupled graph neural networks, the importance of node ranking has drastically increased, as its ability to perform inductive learning by quickly spreading node information through edges has been theoretically and experimentally corroborated. For example, it can be used to make predictions based on a few known node attributes or to refine the predictions of low-quality feature-based machine learning models.

pygrank is a collection of node ranking algorithms and practices that support real-world conditions, such as large graphs and heterogeneous preprocessing and postprocessing requirements. Thus, it provides ready-to-use tools that simplify the deployment of theoretical advancements and the testing of new algorithms.

Some of the library's advantages are:

  1. Compatibility with networkx, tensorflow and pytorch.
  2. Datacentric interfaces that do not require transformations to identifiers.
  3. Large graph support with sparse representations and fast algorithms.
  4. Seamless pipelines, from graph preprocessing up to benchmarking and evaluation.
  5. Modular combination of components (sketched below).
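
For instance, postprocessors can wrap filters and other postprocessors interchangeably. A minimal sketch, reusing the quickstart's graph and seeds and only components shown elsewhere on this page:

import pygrank as pg

# wrap a PageRank filter with two postprocessors: conversion to ordinals,
# then normalization of the resulting ordinal scores
algorithm = pg.Normalize(pg.Ordinals(pg.PageRank(alpha=0.85)))
scores = algorithm(graph, {v: 1 for v in seeds})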

🔗 Material

Tutorials & Documentation

Quick links
Measures
Graph Filters
Postprocessors
Tuners
Downloadable Datasets

🔥 Features

  • Graph filters
  • Community detection
  • Graph normalization
  • Convergence criteria
  • Postprocessing (e.g. fairness awareness)
  • Evaluation measures
  • Benchmarks
  • Autotuning (see the sketch below)
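
As a taste of autotuning, the following sketch assumes the default configuration of pg.ParameterTuner; treat the exact defaults as an assumption rather than documented behavior:

import pygrank as pg

# a tuner behaves like a node ranking algorithm, but internally searches
# for well-performing filter parameters before producing scores
tuned = pg.ParameterTuner()  # assumed to fall back to sensible defaults
scores = tuned(graph, {v: 1 for v in seeds})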

👍 Contributing

Feel free to contribute in any way, for example through the issue tracker or by participating in discussions. Please check out the contribution guidelines to bring modifications to the code base. When doing so, make sure to follow the pull request checklist described in the guidelines.

📓 Citation

If pygrank has been useful in your research and you would like to cite it in a scientific publication, please refer to the following paper:

TBD

To publish research that uses provided methods, please cite the appropriate publications.

Comments
  • Performant use of sparse matrices?

    I'm trying to use pygrank with larger graphs: 100k-1m nodes, hundreds of millions of edges. My graphs are in sparse matrix format. So far I've just converted to networkx and used those:

    import networkx as nx
    import pygrank as pg

    # A is my scipy sparse adjacency matrix, seeds my seed node indices,
    # and algorithm any pygrank node ranking algorithm, e.g. pg.PageRank()
    g = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
    signal_dict = {i: 1.0 for i in seeds}
    signal = pg.to_signal(g, signal_dict)

    # normalize signal
    signal.np /= signal.np.sum()

    result = algorithm(signal).np
    

    Is there a more performant option available?

    opened by deklanw 5
  • Tune on non-seeds?

    Is it possible to run the tuners with non-seed nodes? For example, if I have a seed_set and a target_set, can I run the tuner diffusions with the signal from the former but optimize for metrics defined with respect to the latter? In this case I have a desired ranking of the nodes in the target_set.

    opened by deklanw 2
  • Seamless verbosity

    Automatically display verbose progress (e.g. for dataset downloading) that disappears once tasks are complete (e.g. to resume normal benchmarking prints).

    feature 
    opened by maniospas 2
  • Automatically switching between backend interpretations

    Currently, graph signal .np attributes need to be manually converted between backends. Switch to a @property getter interface that automatically performs needed conversions to remove the burden of checking for backend compliance from developers.
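
    A minimal sketch of the proposed pattern, using hypothetical stand-in names rather than the library's actual classes:

    import numpy as np

    def _to_active_backend(values):
        # stand-in conversion routine; here the "active backend" is just
        # numpy, so anything array-like becomes an ndarray
        return np.asarray(values)

    class GraphSignal:
        # hypothetical stand-in for the library's graph signal class
        def __init__(self, values):
            self._values = values  # raw values from whichever backend made them

        @property
        def np(self):
            # conversion happens transparently on attribute access, removing
            # the burden of backend checks from callers
            return _to_active_backend(self._values)

    print(GraphSignal([0.2, 0.5, 0.3]).np.sum())  # 1.0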

    feature 
    opened by maniospas 1
  • Numeric graph signal operations

    Currently, there needs to be a clear distinction between graph signal objects and the extraction of their .np fields. Reframe code so that, when signal operations are employed, their respective .np fields are used in their place. This can help write comprehensible high-level code.
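
    One way to realize this, sketched here with a hypothetical stand-in class rather than the library's implementation, is to overload arithmetic operators so that they delegate to the .np fields:

    import numpy as np

    class GraphSignal:
        # hypothetical stand-in demonstrating the proposed reframing
        def __init__(self, values):
            self.np = np.asarray(values, dtype=float)

        def __add__(self, other):
            # use the other operand's .np field when it is also a signal
            other = other.np if isinstance(other, GraphSignal) else other
            return GraphSignal(self.np + other)

        def __truediv__(self, other):
            other = other.np if isinstance(other, GraphSignal) else other
            return GraphSignal(self.np / other)

    a, b = GraphSignal([1.0, 2.0]), GraphSignal([3.0, 4.0])
    print(((a + b) / 2).np)  # [2. 3.] without extracting .np by hand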

    feature 
    opened by maniospas 1
  • Torch GNN support

    Current helper methods for GNNs are centered on tensorflow and keras. Create backend operations to abstract them so that they can also be implemented through torch. This requires adding a new test to make sure everything works.
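
    For illustration, a sketch of the kind of dispatch such backend operations could perform; the helper name and dispatch mechanism are hypothetical, though the underlying torch and tensorflow calls exist:

    # hypothetical dispatch helper; pygrank actually selects backends globally
    def gnn_dropout(tensor, rate, backend="torch"):
        if backend == "torch":
            import torch.nn.functional as F
            return F.dropout(tensor, p=rate)
        import tensorflow as tf
        return tf.nn.dropout(tensor, rate=rate)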

    Related tests: tests.test_gnn.test_appnp

    feature 
    opened by maniospas 1
  • Citation discovery for postprocessors

    Recommend how to cite postprocessors through their cite() method.

    Related tests: test_autorefs.test_autorefs, test_autorefs.test_postprocessor_citations

    feature 
    opened by maniospas 1
  • Citation discovery for graph filters

    Recommend how to cite graph filters and tuners through their cite() method.

    Related tests: test_autorefs.test_autorefs, test_autorefs.test_filter_citations

    feature 
    opened by maniospas 1
  • Automatic citation discovery

    Since node ranking algorithms can comprise multiple components, some of which are implicitly determined (e.g. through default instantiation or specific argument parameters), create methods that can summarize used components and provide citations.

    Usefulness: This can help streamline citation practices.
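
    A minimal sketch of what such summarization could look like, with hypothetical classes (the library's actual cite() mechanics may differ):

    class Component:
        # hypothetical component holding a citation and nested sub-components
        def __init__(self, citation="", parts=()):
            self.citation = citation
            self.parts = parts

        def cite(self):
            # collect this component's citation plus those of implicit parts
            citations = [self.citation] if self.citation else []
            for part in self.parts:
                citations.extend(part.cite())
            return citations

    algorithm = Component("postprocessor paper", parts=(Component("filter paper"),))
    print(algorithm.cite())  # ['postprocessor paper', 'filter paper']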

    Related tests: None

    feature 
    opened by maniospas 1
  • optimization_dict not improving performance

    The optimization_dict argument to the ClosedFormGraphFilter class does not seem to produce an improvement in running time. This could indicate either a bug or bottlenecks in other parts of the pipeline, e.g. in graph signal instantiation.

    Version: run with version 0.2.3, adjusted to run experiments 50 times when measuring time

    Demonstration:

    >>> import pygrank as pg
    >>> optimization_dict = dict()
    >>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel(optimization_dict=optimization_dict)}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
                   	 HK 
    bigraph0       	 3.06
    bigraph1       	 3.36
    >>> pg.benchmark_print(pg.benchmark({"HK": pg.HeatKernel()}, pg.load_datasets_all_communities(["bigraph"]), metric="time"))
                   	 HK 
    bigraph0       	 2.98
    bigraph1       	 2.96
    

    Related tests: None

    bug fixed 
    opened by maniospas 1
  • Implementing PageRank as a GenericGraphFilter does not seem to work

    In the following code, the two algorithms should be close to equivalent, yet there is a significant Mabs error between them.

    >>> import pygrank as pg
    >>> graph = next(pg.load_datasets_graph(["graph9"]))
    >>> ranks1 = pg.Normalize(pg.PageRank(0.85, tol=1.E-12, max_iters=1000)).rank(graph, {"A": 1})
    >>> ranks2 = pg.Normalize(pg.GenericGraphFilter([0.85**i for i in range(20)], tol=1.E-12)).rank(graph, {"A": 1})
    >>> print(pg.Mabs(ranks1)(ranks2))
    0.025585056372903574
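
    A plausible cause, stated as an assumption rather than the recorded resolution: with alpha=0.85, truncating the power series at 20 terms discards a tail carrying about 0.85**20 ≈ 4% of the total geometric weight, which is consistent with the error above. Extending the coefficient list should bring the two outputs closer:

    >>> # assumption: 200 terms shrink the truncation tail to a negligible 0.85**200
    >>> ranks2 = pg.Normalize(pg.GenericGraphFilter([0.85**i for i in range(200)], tol=1.E-12)).rank(graph, {"A": 1})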
    

    Related tests: tests.test_filters.test_custom_runs

    bug invalid 
    opened by maniospas 1
  • Potential issue in the GNN demonstrator example with tensorflow backend

    During the review process of the library's paper, a reviewer pointed out that the following error occurs in their local system with TensorFlow 2.9.2 and Python 3.10.6.

    TypeError: Sequential.call() got multiple values for argument 'training'
    

    This occurs when running the code of the APPNP example. The issue lies with the example itself and not with any library functionality; it will not motivate a hotfix.

    Investigate whether this issue is unique to the version of TensorFlow or whether the latter has yet again updated something that will break the example in all future versions. At the very least, this error should not occur in github actions.

    If this is not the case, investigate whether this issue is platform-dependent.

    bug 
    opened by maniospas 0
  • Possible sparse_dot_mkl integration?

    I saw that you have your own sparse matrix library called matvec which parallelizes sparse-dense multiplication. There is an existing Python library called sparse_dot which does the same but with scipy csr/csc matrices: https://github.com/flatironinstitute/sparse_dot.

    I benchmarked the two with a matrix of size

    <6357204x6357204 sparse matrix of type '<class 'numpy.float32'>'
    	with 3614017927 stored elements in Compressed Sparse Column format>
    

    With the 32-core/64-thread server CPU I'm testing on, the times for 10 matrix-vector multiplications on the right and left are:

    matvec
    right-vec  25.15
    left-vec  19.47
    
    sparse_dot csc
    right-vec  40.17
    left-vec  14.91
    
    sparse_dot csr
    right-vec  10.38
    left-vec  28.53
    

    The times look competitive. I'm not sure if matvec has some other advantages I'm not considering here, but that sparse_dot works with the existing scipy types would be a huge benefit (for my use case, at least). sparse_dot does require installing the mkl library and for giant matrices like the one above requires the environment variable MKL_INTERFACE_LAYER=ILP64.

    opened by deklanw 4
  • Convergence management tracking

    Implement a high-level way of summarizing convergence analysis, for example to help measure running time and iterations when algorithms are wrapped by postprocessors (including iterative schemes). For example, a list of all convergence manager run outcomes could be obtained. Perhaps this could be achieved with some combination of dependent algorithm discovery and keeping convergence manager history on restart.
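
    A rough sketch of the "keeping convergence manager history on restart" idea, with hypothetical names:

    class TrackingConvergenceManager:
        # hypothetical manager that archives one outcome per algorithm run
        def __init__(self, tol=1.E-6):
            self.tol = tol
            self.history = []  # iteration counts of completed runs
            self.iterations = 0

        def start(self):
            # called when a wrapped algorithm (re)starts; archive the last run
            if self.iterations:
                self.history.append(self.iterations)
            self.iterations = 0

        def has_converged(self, error):
            self.iterations += 1
            return error < self.tol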

    Related tests: tests.test_filters.test_convergence_string_conversion

    feature 
    opened by maniospas 0
  • Check FairWalk correctness

    FairWalk does not achieve the same level of fairness (as high a pRule) as other fairness-aware heuristics during tests. This could arise from an erroneous implementation. If the implementation is found to be correct, separate its tests from other heuristics to account for the lower expected improvement.

    Related tests: tests.test_fairness.test_fair_heuristics

    invalid 
    opened by maniospas 0
Releases(0.2.10)
Owner
Multimedia Knowledge and Social Analytics Lab
MKLab is part of the Information Technologies Institute.