Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Last update: Jul 09, 2022

Overview

Fast Laplacian Eigenmaps in python

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMSlib to compute approximate-nearest-neighbors. Performs several times faster than the default scikit-learn implementation.

Installation

You'll need NMSlib for using this package properly. Installing it with no binaries is recommended if your CPU supports advanced instructions (it problably does):

pip3 install --no-binary :all: nmslib
# Along with requirements:
pip3 install numpy pandas scipy scikit-learn

Then you can install this package with pip:

pip3 install fastlapmap

Usage

See the following example with the handwritten digits data. Here, I visually compare results from the scikit-learn Laplacian Eigenmaps implementation to those from my implementation.

Note that this implementation contains two similarity-learning algorithms: anisotropic diffusion maps and fuzzy simplicial sets.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from fastlapmap import LapEigenmap

# Load some data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS=2
N_NEIGHBORS=10
N_JOBS=10

sk_se = SpectralEmbedding(n_components=N_EIGS, n_neighbors=N_NEIGHBORS, n_jobs=N_JOBS).fit_transform(data)

flapmap_diff = LapEigenmap(data, n_eigs=2, similarity='diffusion', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)
flapmap_fuzzy = LapEigenmap(data, n_eigs=2, similarity='fuzzy', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle('Handwritten digits data:', fontsize=24)
ax1.scatter(sk_se[:, 0], sk_se[:, 1], c=digits.target, cmap='Spectral', s=5)
ax1.set_title('Sklearn\'s Laplacian Eigenmaps', fontsize=20)
ax2.scatter(flapmap_diff[:, 0], flapmap_diff[:, 1], c=digits.target, cmap='Spectral', s=5)
ax2.set_title('Fast Laplacian Eigenmaps with diffusion harmonics', fontsize=20)
ax3.scatter(flapmap_fuzzy[:, 0], flapmap_fuzzy[:, 1], c=digits.target, cmap='Spectral', s=5)
ax3.set_title('Fast Laplacian Eigenmaps with fuzzy simplicial sets', fontsize=20)
plt.show()

As we can see, results are nearly identical.

Benchmark

See the runtime comparison between this implementation and scikit-learn:

## Load benchmark function:
from fastlapmap.benchmark import runtime_benchmark

# Load data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS = 2
N_NEIGHBORS = 10
N_JOBS = 10
SIZES = [1000, 5000, 10000, 25000, 50000, 100000]
N_RUNS = 3

runtime_benchmark(data,
                  n_eigs=N_EIGS,
                  n_neighbors=N_NEIGHBORS,
                  n_jobs=N_JOBS,
                  sizes=SIZES,
                  n_runs=N_RUNS)

As you can see, the diffusion harmoics model is the fastest, followed closely by fuzzy simplicial sets. Both outperform scikit-learn default implementation and escalate linearly with sample size.

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Related tags

Overview

Fast Laplacian Eigenmaps in python

Installation

Usage

Benchmark

Owner

Useful tool for inserting DataFrames into the Excel sheet.

Instant search for and access to many datasets in Pyspark.

Zipline, a Pythonic Algorithmic Trading Library

X-news - Pipeline data use scrapy, kafka, spark streaming, spark ML and elasticsearch, Kibana

CubingB is a timer/analyzer for speedsolving Rubik's cubes, with smart cube support

MotorcycleParts DataAnalysis python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

Multiple Pairwise Comparisons (Post Hoc) Tests in Python

Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Important dataframe statistics with a single command

Additional tools for particle accelerator data analysis and machine information

This is a python script to navigate and extract the FSD50K dataset

A Python adaption of Augur to prioritize cell types in perturbation analysis.

Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods

Pyspark project that able to do joins on the spark data frames.

Single machine, multiple cards training; mix-precision training; DALI data loader.

A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Python reader for Linked Data in HDF5 files

Python beta calculator that retrieves stock and market data and provides linear regressions.