Approximate Nearest Neighbor Search for Sparse Data in Python!

Related tags

Data Analysispysparnn
Overview

PySparNN

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 - cosine_similarity).

PySparNN benefits:

  • Designed to be efficient on sparse data (memory & cpu).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidian, Jaccard, etc.
  • Supports incremental insertion of elements.

If your data is NOT SPARSE - please consider faiss or annoy. They use similar methods and I am a big fan of both. You should expect better performance on dense vectors from both of those projects.

The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~4x faster on the 20newsgroups dataset (as a sparse vector). A more robust benchmarking on sparse data is desired. Here is the comparison. Here is another comparison on the larger Enron email dataset.

Example Usage

Simple Example

import pysparnn.cluster_index as ci

import numpy as np
from scipy.sparse import csr_matrix

features = np.random.binomial(1, 0.01, size=(1000, 20000))
features = csr_matrix(features)

# build the search index!
data_to_return = range(1000)
cp = ci.MultiClusterIndex(features, data_to_return)

cp.search(features[:5], k=1, return_distance=False)
>> [[0], [1], [2], [3], [4]]

Text Example

import pysparnn.cluster_index as ci

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]    

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)

cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
>> [['oh hello there'], ['Play it again Sam']]

Requirements

PySparNN requires numpy and scipy. Tested with numpy 1.11.2 and scipy 0.18.1.

Installation

# clone pysparnn
cd pysparnn 
pip install -r requirements.txt 
python setup.py install

How PySparNN works

Searching for a document in an collection of D documents is naively O(D) (assuming documents are constant sized).

However! we can create a tree structure where the first level is O(sqrt(D)) and each of the leaves are also O(sqrt(D)) - on average.

We randomly pick sqrt(D) candidate items to be in the top level. Then -- each document in the full list of D documents is assigned to the closest candidate in the top level.

This breaks up one O(D) search into two O(sqrt(D)) searches which is much much faster when D is big!

This generalizes to h levels. The runtime becomes: O(h * h_root(D))

Further Information

http://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

See the CONTRIBUTING file for how to help out.

License

PySparNN is BSD-licensed. We also provide an additional patent grant.

Owner
Meta Research
Meta Research
A collection of learning outcomes data analysis using Python and SQL, from DQLab.

Data Analyst with PYTHON Data Analyst berperan dalam menghasilkan analisa data serta mempresentasikan insight untuk membantu proses pengambilan keputu

6 Oct 11, 2022
follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 02, 2022
Open source platform for Data Science Management automation

Hydrosphere examples This repo contains demo scenarios and pre-trained models to show Hydrosphere capabilities. Data and artifacts management Some mod

hydrosphere.io 6 Aug 10, 2021
Import, connect and transform data into Excel

xlwings_query Import, connect and transform data into Excel. Description The concept is to apply data transformations to a main query object. When the

George Karakostas 1 Jan 19, 2022
GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors

GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors. GWpy provides a user-f

GWpy 342 Jan 07, 2023
A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful.

How useful is the aswer? A Streamlit web-app for a data-science project that aims to evaluate if the answer to a question is helpful. If you want to l

1 Dec 17, 2021
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Pyspark project that able to do joins on the spark data frames.

SPARK JOINS This project is to perform inner, all outer joins and semi joins. create_df.py: load_data.py : helps to put data into Spark data frames. d

Joshua 1 Dec 14, 2021
ToeholdTools is a Python package and desktop app designed to facilitate analyzing and designing toehold switches, created as part of the 2021 iGEM competition.

ToeholdTools Category Status Repository Package Build Quality A library for the analysis of toehold switch riboregulators created by the iGEM team Cit

0 Dec 01, 2021
Developed for analyzing the covariance for OrcVIO

about This repo is developed for analyzing the covariance for OrcVIO environment setup platform ubuntu 18.04 using conda conda env create --file envir

Sean 1 Dec 08, 2021
BasstatPL is a package for performing different tabulations and calculations for descriptive statistics.

BasstatPL is a package for performing different tabulations and calculations for descriptive statistics. It provides: Frequency table constr

Angel Chavez 1 Oct 31, 2021
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 01, 2022
Semi-Automated Data Processing

Perform semi automated exploratory data analysis, feature engineering and feature selection on provided dataset by visualizing every possibilities on each step and assisting the user to make a meanin

Arun Singh Babal 1 Jan 17, 2022
Intercepting proxy + analysis toolkit for Second Life compatible virtual worlds

Hippolyzer Hippolyzer is a revival of Linden Lab's PyOGP library targeting modern Python 3, with a focus on debugging issues in Second Life-compatible

Salad Dais 6 Sep 01, 2022
Anomaly Detection with R

AnomalyDetection R package AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the pre

Twitter 3.5k Dec 27, 2022
Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Gabriele 3 Jul 05, 2022
A stock analysis app with streamlit

StockAnalysisApp A stock analysis app with streamlit. You select the ticker of the stock and the app makes a series of analysis by using the price cha

Antonio Catalano 50 Nov 27, 2022
Shot notebooks resuming the main functions of GeoPandas

Shot notebooks resuming the main functions of GeoPandas, 2 notebooks written as Exercises to apply these functions.

1 Jan 12, 2022
Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt Labs 6.3k Jan 08, 2023