Approximate Nearest Neighbor Search for Sparse Data in Python!

Last update: Jan 01, 2023

PySparNN

PySparNN is a library for approximate nearest neighbor search on sparse data in Python. It is well suited to finding nearest neighbors in sparse, high-dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 - cosine_similarity).
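
To make the metric concrete, here is a minimal sketch of cosine distance on sparse vectors. It is an illustration only, not PySparNN's implementation: the function name cosine_distance and the dict-based vector representation (feature index mapped to non-zero value) are assumptions chosen to keep the example dependency-free, whereas the library itself works on scipy sparse matrices.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine_similarity for two sparse vectors.

    Each vector is a dict {feature_index: non_zero_value}. This layout is an
    assumption of the sketch, not the format PySparNN uses internally.
    """
    # Iterate over the shorter vector: only shared features affect the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention for this sketch: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Vectors pointing in the same direction get a distance of 0, and vectors with no features in common get a distance of 1, regardless of their magnitudes.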

PySparNN benefits:

Designed to be efficient on sparse data (memory & CPU).
Implemented leveraging existing Python libraries (scipy & numpy).
Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
Supports incremental insertion of elements.
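
For reference, the other metrics named in the list above can be written for sparse vectors as follows. This is a hedged sketch of the distance definitions only. It does not show PySparNN's extension interface, and the function names and the dict-based vectors (feature index mapped to non-zero value) are assumptions made for illustration.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences; missing features count as 0."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the sets of non-zero features."""
    features_a, features_b = set(a), set(b)
    union = features_a | features_b
    if not union:
        return 0.0  # convention for this sketch: two empty vectors are identical
    return 1.0 - len(features_a & features_b) / len(union)
```

All three only touch features that are non-zero in at least one of the two vectors, which is what keeps them cheap on sparse data.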

If your data is NOT SPARSE, please consider faiss or annoy. They use similar methods and I am a big fan of both. You should expect better performance on dense vectors from both of those projects.

The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~4x faster on the 20newsgroups dataset (represented as sparse vectors). More robust benchmarking on sparse data is desired. The project provides this comparison, along with another on the larger Enron email dataset.

Example Usage

Simple Example

import pysparnn.cluster_index as ci

import numpy as np
from scipy.sparse import csr_matrix

features = np.random.binomial(1, 0.01, size=(1000, 20000))
features = csr_matrix(features)

# build the search index!
data_to_return = range(1000)
cp = ci.MultiClusterIndex(features, data_to_return)

# query with the first five rows; each row should come back as its own nearest neighbor
cp.search(features[:5], k=1, return_distance=False)
# >> [[0], [1], [2], [3], [4]]

Text Example

import pysparnn.cluster_index as ci

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)

# return the single closest document for each query, looking in the 2 closest clusters
cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
# >> [['oh hello there'], ['Play it again Sam']]

Requirements

PySparNN requires numpy and scipy. Tested with numpy 1.11.2 and scipy 0.18.1.

Installation

# clone pysparnn
cd pysparnn 
pip install -r requirements.txt 
python setup.py install

How PySparNN works

Searching for a document in a collection of D documents is naively O(D) (assuming documents are constant-sized).
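
The naive O(D) scan can be sketched as below: the query is compared against every document, and the k closest are kept. This is an illustrative baseline rather than code from PySparNN. The names brute_force_search and cosine_distance, and the use of dicts as sparse vectors, are assumptions of the sketch.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity for dict-based sparse vectors (sketch assumption)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, docs, k=1):
    """Return the indices of the k documents closest to query.

    Performs exactly len(docs) distance computations, i.e. O(D).
    """
    scored = ((cosine_distance(query, doc), i) for i, doc in enumerate(docs))
    return [i for _, i in heapq.nsmallest(k, scored)]
```

Every query pays for D distance computations here, which is the cost the tree structure is meant to avoid.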

However, we can create a tree structure where the first level holds O(sqrt(D)) items and each leaf also holds O(sqrt(D)) items on average.

We randomly pick sqrt(D) candidate items to be in the top level. Then each document in the full list of D documents is assigned to the closest candidate in the top level.

This breaks up one O(D) search into two O(sqrt(D)) searches, which is much faster when D is big!
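
The two-level scheme described above can be sketched in a few lines. This is a simplified illustration of cluster pruning under stated assumptions, not PySparNN's actual code: the names build_index and search, the dict-based sparse vectors, and the seed parameter are all hypothetical. The k_clusters argument mirrors the idea of checking more than one top-level candidate to trade speed for accuracy.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity for dict-based sparse vectors (sketch assumption)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_index(docs, seed=0):
    """Pick about sqrt(D) random leaders and assign every doc to its closest leader."""
    rng = random.Random(seed)
    num_leaders = max(1, math.isqrt(len(docs)))
    leader_ids = rng.sample(range(len(docs)), num_leaders)
    clusters = {leader_id: [] for leader_id in leader_ids}
    for doc_id, doc in enumerate(docs):
        closest = min(leader_ids, key=lambda lid: cosine_distance(doc, docs[lid]))
        clusters[closest].append(doc_id)
    return clusters


def search(query, docs, clusters, k=1, k_clusters=1):
    """Scan the leaders first, then only the members of the closest cluster(s)."""
    ranked_leaders = sorted(clusters, key=lambda lid: cosine_distance(query, docs[lid]))
    candidates = [doc_id for lid in ranked_leaders[:k_clusters] for doc_id in clusters[lid]]
    scored = sorted((cosine_distance(query, docs[doc_id]), doc_id) for doc_id in candidates)
    return [doc_id for _, doc_id in scored[:k]]
```

Because only the closest clusters are scanned, the true nearest neighbor can be missed when it was assigned to a different leader. That is what makes the search approximate, and raising k_clusters reduces the chance of a miss at the cost of more comparisons.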

This generalizes to h levels. The runtime becomes O(h * D^(1/h)), where D^(1/h) is the h-th root of D.
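
To get a feel for the formula, the snippet below evaluates h * D^(1/h) for a collection of one million documents. It is back-of-the-envelope arithmetic that ignores constant factors and uneven cluster sizes, and the function name estimated_comparisons is made up for this example.

```python
def estimated_comparisons(num_docs, levels):
    """Approximate distance computations per query: h * D^(1/h)."""
    return levels * num_docs ** (1.0 / levels)


for levels in (1, 2, 3, 4):
    estimate = estimated_comparisons(1_000_000, levels)
    print(f"h={levels}: about {estimate:,.0f} comparisons")
```

With one level this is the full scan of 1,000,000 comparisons. With two levels it drops to about 2,000, and with three levels to about 300.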

Further Information

http://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

See the CONTRIBUTING file for how to help out.

License

PySparNN is BSD-licensed. We also provide an additional patent grant.

Owner

Meta Research
