A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL

Overview

🌟 HNSW + PostgreSQL Indexer

HNSWPostgresIndexer is a production-ready, scalable Indexer for the Jina neural search framework.

It combines the reliability of PostgreSQL with the speed and efficiency of the HNSWlib nearest neighbor library.

It thus provides all the CRUD operations expected of a database system, while also offering fast and reliable vector lookup.

The Indexer requires a running PostgreSQL database service. For quick testing, you can run a containerized version locally with:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

Syncing between PSQL and HNSW

By default, all data is stored in a PSQL database (as defined in the arguments). To add data to, or build, an HNSW index from that data, you need to call the /sync endpoint manually. This iterates through the stored data and adds it to the HNSW index. By default, syncing is incremental: it builds on top of whatever data the HNSW index already holds. To rebuild the index completely, pass the rebuild parameter:

flow.post(on='/sync', parameters={'rebuild': True})
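
The difference between an incremental sync and a full rebuild can be illustrated with a toy model. This is only a sketch under stated assumptions: plain dicts stand in for the PSQL table and the HNSW index, the sync function and its comparison logic are hypothetical, and timestamps and deletions are not modeled. It is not the executor's implementation.

```python
# Toy model only: plain dicts stand in for the PSQL table and the HNSW index.
from typing import Dict, List


def sync(psql_rows: Dict[str, List[float]],
         hnsw_index: Dict[str, List[float]],
         rebuild: bool = False) -> int:
    """Copy rows from the PSQL stand-in into the HNSW stand-in.

    Returns the number of entries written to the index.
    """
    if rebuild:
        # A full rebuild discards whatever the index held before.
        hnsw_index.clear()
    written = 0
    for doc_id, embedding in psql_rows.items():
        # Incremental mode: skip entries already indexed with the same vector.
        if hnsw_index.get(doc_id) == embedding:
            continue
        hnsw_index[doc_id] = list(embedding)
        written += 1
    return written


psql = {'a': [0.1, 0.2], 'b': [0.3, 0.4]}
hnsw = {'a': [0.1, 0.2], 'stale': [9.9, 9.9]}

incremental = sync(psql, hnsw)            # only 'b' is written
after_incremental = sorted(hnsw)          # earlier index content is kept
rebuilt = sync(psql, hnsw, rebuild=True)  # index is emptied, then refilled
after_rebuild = sorted(hnsw)
```

In this toy, the incremental pass adds to whatever the index already holds, while the rebuild leaves exactly the entries found in the table.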

At start-up time, the data from PSQL is synced into HNSW automatically. You can disable this with:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'startup_sync': False}
)

Automatic background syncing

WARNING: Experimental feature

Optionally, you can enable automatic background syncing of the data into HNSW. This creates a background thread, alongside the main operations, that performs the synchronization at regular intervals. Enable it with the sync_interval constructor argument:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'sync_interval': 5}
)

The sync_interval argument accepts an integer: the number of seconds to wait between synchronization attempts. Adjust it to the volume of your data. For the duration of a background sync, the HNSW index is locked to avoid an invalid state, so searches are queued. When sync_interval is enabled, the index is also locked while a search is running, so syncing is queued.
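
The locking behavior described above can be sketched with the standard library alone. The names below (index_lock, background_sync, search) and the dict stand-ins are hypothetical and chosen for illustration; the executor's actual threading code may differ.

```python
import threading
import time

index_lock = threading.Lock()   # guards the stand-in HNSW index
index = {}                      # hypothetical stand-in for the HNSW index
source = {'a': 1, 'b': 2}       # hypothetical stand-in for the PSQL table
stop = threading.Event()


def background_sync(sync_interval: float) -> None:
    """Periodically copy `source` into `index` while holding the lock."""
    while not stop.is_set():
        with index_lock:          # searches wait while this block runs
            index.update(source)
        stop.wait(sync_interval)  # sleep, but wake up early on shutdown


def search(key):
    with index_lock:              # a sync attempt waits while this block runs
        return index.get(key)


worker = threading.Thread(target=background_sync, args=(0.05,), daemon=True)
worker.start()
time.sleep(0.2)                   # give the worker time for at least one pass
found = search('a')
stop.set()
worker.join(timeout=1)
```

Because both paths take the same lock, a search never observes a half-updated index, at the cost of waiting for a sync that is in progress.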

CRUD operations

You can perform all the usual operations on the respective endpoints:

  • /index. Add new data to PostgreSQL.
  • /search. Query the HNSW index with your Documents.
  • /update. Update Documents in PostgreSQL.
  • /delete. Delete Documents in PostgreSQL.

Note: /delete only performs soft-deletion by default, so that looking up a Document id after a search does not break. For a hard delete, add 'soft_delete': False to parameters. You can also perform a cleanup after a full rebuild of the HNSW index by calling /cleanup.
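
The following toy model shows why soft-deletion keeps id look-ups working. It is a sketch only: the dict-based table, the is_deleted flag, and the delete, lookup and cleanup helpers are hypothetical stand-ins, not the executor's schema or API.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the PSQL table: doc id -> row.
table: Dict[str, dict] = {
    'a': {'text': 'first', 'is_deleted': False},
    'b': {'text': 'second', 'is_deleted': False},
}


def delete(doc_id: str, soft_delete: bool = True) -> None:
    if soft_delete:
        table[doc_id]['is_deleted'] = True   # row stays, only flagged
    else:
        table.pop(doc_id, None)              # row is removed outright


def lookup(doc_id: str) -> Optional[dict]:
    """Resolve an id returned by a search; flagged rows still resolve."""
    return table.get(doc_id)


def cleanup() -> int:
    """Drop every flagged row and report how many were removed."""
    flagged = [key for key, row in table.items() if row['is_deleted']]
    for key in flagged:
        del table[key]
    return len(flagged)


delete('a')                        # soft delete
still_there = lookup('a') is not None
delete('b', soft_delete=False)     # hard delete
gone = lookup('b') is None
removed = cleanup()                # physically removes the flagged row
```

In this model, an index that still holds the id 'a' can resolve it until cleanup runs, which is why cleanup is best paired with a full rebuild.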

Status endpoint

You can also get information about the status of your data via the /status endpoint. It returns a Document whose tags contain the relevant information under the following keys:

  • 'psql_docs': number of Documents stored in the PSQL database (includes entries that have been "soft-deleted")
  • 'hnsw_docs': the number of Documents indexed in the HNSW index
  • 'last_sync': the time of the last synchronization of PSQL into HNSW
  • 'pea_id': the shard number

In a sharded environment (parallel>1) you will get one Document from each shard. Each shard has its own 'hnsw_docs', 'last_sync' and 'pea_id', but all shards report the same 'psql_docs', because the PSQL database is shared by all of them. To get the total, sum 'hnsw_docs' across these Documents:

result = f.post('/status', None, return_results=True)
result_docs = result[0].docs
total_hnsw_docs = sum(d.tags['hnsw_docs'] for d in result_docs)
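
The same aggregation can be tried without a running Flow. In this sketch, plain dicts with made-up numbers stand in for the tags of the per-shard Documents; the gap variable is an illustrative addition, not a value reported by the endpoint.

```python
# Hypothetical per-shard tags, shaped like the keys listed above.
shard_tags = [
    {'psql_docs': 1000, 'hnsw_docs': 480, 'pea_id': 0},
    {'psql_docs': 1000, 'hnsw_docs': 510, 'pea_id': 1},
]

# Each shard indexes its own part of the data, so these counts add up.
total_hnsw_docs = sum(tags['hnsw_docs'] for tags in shard_tags)

# Every shard reports the same shared table, so read it once; do not sum it.
psql_docs = shard_tags[0]['psql_docs']

# Rows in PSQL that are not in any HNSW index. Since 'psql_docs' includes
# soft-deleted entries, this gap is not necessarily a sync backlog.
gap = psql_docs - total_hnsw_docs
```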
Comments
  • Changing how /status method returns its values to try and merge with …

    …any pre-existing tags from previous executors if any.

    A shot at addressing the issue mentioned in https://github.com/jina-ai/executor-hnsw-postgres/issues/23

    opened by louisconcentricsky 6
  • feat: performance improvements

    Closes https://github.com/jina-ai/executor-hnsw-postgres/issues/6

    Results before this PR:

    indexing 1000 takes 0 seconds (0.22s)
    rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
    search with 10 takes 0 seconds (0.23s)
    
    indexing 10000 takes 0 seconds (0.75s)
    rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
    search with 10 takes 0 seconds (0.22s)
    
    indexing 100000 takes 7 seconds (7.59s)
    rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
    search with 10 takes 0 seconds (0.22s)
    
    

    RESULTS NOW

    indexing 1000 takes 0 seconds (0.44s)                                                                                   
    rolling update 3 replicas x 2 shards takes 0 seconds (0.81s)
    
    indexing 10000 takes 1 second (1.01s)                                                                                   
    rolling update 3 replicas x 2 shards takes 2 seconds (2.63s)
    
    indexing 100000 takes 8 seconds (8.10s)                                                                                 
    rolling update 3 replicas x 2 shards takes 3 minutes and 27 seconds (207.14s)
    
    

    MORE BENCHMARKING

    indexing 500000 takes 30 seconds (30.07s)    
    rolling update 3 replicas x 2 shards takes 26 minutes and 57 seconds (1617.99s)
    search with 10 takes 0 seconds (0.21s)
    
    opened by cristianmtr 3
  • Status endpoint does not allow for compositing data with other executors

    If another executor also wants to report status information through the same status endpoint, the return value of HNSWPostgresIndexer removes it.

    It seems that updating the existing tags, or just placing the status under a particular key, would be friendlier.

    https://github.com/jina-ai/executor-hnsw-postgres/blob/79754090665e8bb86e85ab5693fa9b8be80977ce/executor/hnswpsql.py#L322

    opened by louisconcentricsky 1
  • feat: background sync (with threads)

    Closes https://github.com/jina-ai/internal-tasks/issues/293

    Issues

    • [x] timestamp timezone difference
    • [x] psql connection pool gets exhausted
    • [x] locking resources in threaded access

    NOTE: Even if we don't merge this, the refactoring of PSQL Handler still needs to be merged, as the previous usage of Conn Pool had issues.

    opened by cristianmtr 1
  • fail to connect to PostgreSQL with docker-compose

    • start a PostgreSQL service with docker:

    docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

    • build a flow with one executor:HNSWPostgresIndexer

    • run the flow locally, it works well

    • expose the flow to a docker-compose YAML file and run it with docker-compose; this produces an error:

    image

    jina version info:

    
    - jina 3.3.19
    - docarray 0.12.2
    - jina-proto 0.1.8
    - jina-vcs-tag (unset)
    - protobuf 3.20.0
    - proto-backend cpp
    - grpcio 1.43.0
    - pyyaml 6.0
    - python 3.10.2
    - platform Linux
    - platform-release 4.4.0-186-generic
    - platform-version #216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020
    - architecture x86_64
    - processor x86_64
    - uid 48710637999860
    - session-id 906abcd2-c797-11ec-b1df-2c4d544656f4
    - uptime 2022-04-29T16:37:11.758133
    - ci-vendor (unset)
    * JINA_DEFAULT_HOST (unset)
    * JINA_DEFAULT_TIMEOUT_CTRL (unset)
    * JINA_DEFAULT_WORKSPACE_BASE /home/chenhao/.jina/executor-workspace
    * JINA_DEPLOYMENT_NAME (unset)
    * JINA_DISABLE_UVLOOP (unset)
    * JINA_FULL_CLI (unset)
    * JINA_GATEWAY_IMAGE (unset)
    * JINA_GRPC_RECV_BYTES (unset)
    * JINA_GRPC_SEND_BYTES (unset)
    * JINA_HUBBLE_REGISTRY (unset)
    * JINA_HUB_CACHE_DIR (unset)
    * JINA_HUB_NO_IMAGE_REBUILD (unset)
    * JINA_HUB_ROOT (unset)
    * JINA_LOG_CONFIG (unset)
    * JINA_LOG_LEVEL (unset)
    * JINA_LOG_NO_COLOR (unset)
    * JINA_MP_START_METHOD (unset)
    * JINA_RANDOM_PORT_MAX (unset)
    * JINA_RANDOM_PORT_MIN (unset)
    * JINA_VCS_VERSION (unset)
    * JINA_CHECK_VERSION True
    
    opened by jerrychen1990 0
  • test: bug rolling update clear

    if you remove from tests/integration/test_hnsw_psql.py

    L:180

            if benchmark:
                f.post('/clear')
    

    the test test_benchmark_basic fails when it runs the second case

    even though clear is called at the beginning of the flow.

    Why?

    yes, /clear only hits one replica. but when we restart the flow there should be completely new replicas anyway

    opened by cristianmtr 0
  • performance(HNSWPSQL): syncing is slow

    Right now sync will be slow

    • [ ] we are iterating and doing individual updates (should batch somehow, per sync operation type - index, update, delete)
    • [x] if rebuild, the operations will always be index. We should optimize for this. Done in #5

    Numbers before any perf refactoring

    Performance

    indexing 1000 ...       indexing 1000 takes 0 seconds (0.22s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
    search with 10 ...      search with 10 takes 0 seconds (0.23s)
    
    indexing 10000 ...      indexing 10000 takes 0 seconds (0.75s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
    search with 10 ...      search with 10 takes 0 seconds (0.22s)
    
    indexing 100000 ...     indexing 100000 takes 7 seconds (7.59s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
    search with 10 ...      search with 10 takes 0 seconds (0.22s)
    
    
    priority/important-soon type/maintenance 
    opened by cristianmtr 0
Releases(v0.9)
  • v0.8(Mar 8, 2022)

  • v0.7(Feb 11, 2022)

  • v0.6(Jan 3, 2022)

    What's Changed

    • docs: fix typo in delete endpoint and clarify by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/14

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.5...v0.6

  • v0.5(Dec 14, 2021)

    What's Changed

    • fix: type of trav paths by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/13

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.4...v0.5

  • v0.4(Dec 9, 2021)

    What's Changed

    • fix: allow using Executor in local mode by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/12

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.3...v0.4

  • v0.3(Nov 26, 2021)

    What's Changed

    • feat: background sync (with threads) by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/9
    • docs: add docs on bg sync by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/11

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.2...v0.3

  • v0.2(Nov 22, 2021)

  • v0.1(Nov 18, 2021)

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.