An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out-of-the-box when you have fewer than 100K stored Documents. U100K means "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect roughly 300 ms for a single query, or 20 to 120 QPS for batched queries (batch sizes 8 to 64). Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to run the nearest-neighbour search. It returns the full Documents as-is, so there is no need to chain it with a separate key-value indexer to retrieve Documents.
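To make the idea of exhaustive nearest-neighbour search concrete, here is a minimal, dependency-free sketch. It is not the code behind .match(); the helper names `cosine_distance` and `exhaustive_match`, the dict-based documents, and the choice of cosine distance are all assumptions made for illustration. It shows the two properties described above: every stored item is scored against the query, and the full stored record is returned rather than just an id.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_match(
    query: Sequence[float],
    stored: List[Dict[str, Any]],
    limit: int = 3,
) -> List[Dict[str, Any]]:
    """Score the query against every stored document and keep the closest ones."""
    scored = [(cosine_distance(query, doc["embedding"]), doc) for doc in stored]
    scored.sort(key=lambda pair: pair[0])  # ascending distance
    # Return the complete stored records, not just their ids.
    return [doc for _, doc in scored[:limit]]


# Hypothetical toy data standing in for stored Documents.
stored_docs = [
    {"id": "a", "text": "first", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "text": "second", "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "text": "third", "embedding": [0.9, 0.1, 0.0]},
]

matches = exhaustive_match([1.0, 0.0, 0.0], stored_docs, limit=2)
print([doc["id"] for doc in matches])  # ['a', 'c']
```

Because every query touches every stored embedding, the cost of a query grows with the number of stored Documents, which is the trade-off behind the "under 100K" guidance.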

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

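Slow query time beyond 100K stored Documents: because the search is exhaustive, query time grows roughly linearly with the number of stored Documents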

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds; query times are per batch of the given size):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
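The query columns in the table are times per batch, so throughput in queries per second (QPS) is the batch size divided by the batch time. The short sketch below does that arithmetic on the figures copied from the table; the variable and function names are illustrative only.

```python
# Query times in seconds, copied from the performance table:
# stored Documents -> {query batch size: seconds per batch}
query_seconds = {
    10_000: {1: 0.019, 8: 0.029, 64: 0.086},
    50_000: {1: 0.147, 8: 0.177, 64: 0.314},
    100_000: {1: 0.297, 8: 0.332, 64: 0.536},
    200_000: {1: 0.656, 8: 0.744, 64: 1.050},
    400_000: {1: 1.289, 8: 1.536, 64: 2.793},
}


def queries_per_second(batch_size: int, seconds: float) -> float:
    """Throughput for one batch: number of queries divided by elapsed time."""
    return batch_size / seconds


for stored, timings in query_seconds.items():
    qps = {
        batch: round(queries_per_second(batch, secs), 1)
        for batch, secs in timings.items()
    }
    print(f"{stored:>7} stored: {qps}")
```

At 100K stored Documents this gives about 3.4 QPS for single queries, 24.1 QPS at batch size 8 and 119.4 QPS at batch size 64, which matches the 20 to 120 QPS range quoted for batched queries.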

Tips

To change the workspace, pass metas to the constructor:

U100KIndexer(metas={'workspace': './my'})

Or pass .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.


Owner

Jina AI
