A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL

Overview

🌟 HNSW + PostgreSQL Indexer

HNSWPostgreSQLIndexer Jina is a production-ready, scalable Indexer for the Jina neural search framework.

It combines the reliability of PostgreSQL with the speed and efficiency of the HNSWlib nearest neighbor library.

It thus provides all the CRUD operations expected of a database system, while also offering fast and reliable vector lookup.

Requires a running PostgreSQL database service. For quick testing, you can run a containerized version locally with:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

Syncing between PSQL and HNSW

By default, all data is stored in a PSQL database (as defined in the arguments). In order to add data to / build a HNSW index with your data, you need to manually call the /sync endpoint. This iterates through the data you have stored, and adds it to the HNSW index. By default, this is done incrementally, on top of whatever data the HNSW index already has. If you want to completely rebuild the index, use the parameter rebuild, like so:

flow.post(on='/sync', parameters={'rebuild': True})

At start-up time, the data from PSQL is synced into HNSW automatically. You can disable this with:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'startup_sync': False}
)

Automatic background syncing

WARNING: Experimental feature

Optionally, you can enable the option for automatic background syncing of the data into HNSW. This creates a thread in the background of the main operations, that will regularly perform the synchronization. This can be done with the sync_interval constructor argument, like so:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'sync_interval': 5}
)

sync_interval argument accepts an integer that represents the amount of seconds to wait between synchronization attempts. This should be adjusted based on your specific data amounts. For the duration of the background sync, the HNSW index will be locked to avoid invalid state, so searching will be queued. When sync_interval is enabled, the index will also be locked during search mode, so that syncing will be queued.

CRUD operations

You can perform all the usual operations on the respective endpoints

  • /index. Add new data to PostgreSQL
  • /search. Query the HNSW index with your Documents.
  • /update. Update documents in PostgreSQL
  • /delete. Delete documents in PostgreSQL.

Note. This only performs soft-deletion by default. This is done in order to not break the look-up of the document id after doing a search. For a hard delete, add 'soft_delete': False' to parameters. You might also perform a cleanup after a full rebuild of the HNSW index, by calling /cleanup.

Status endpoint

You can also get the information about the status of your data via the /status endpoint. This returns a Document whose tags contain the relevant information. The information can be returned via the following keys:

  • 'psql_docs': number of Documents stored in the PSQL database (includes entries that have been "soft-deleted")
  • 'hnsw_docs': the number of Documents indexed in the HNSW index
  • 'last_sync': the time of the last synchronization of PSQL into HNSW
  • 'pea_id': the shard number

In a sharded environment (parallel>1) you will get one Document from each shard. Each shard will have its own 'hnsw_docs', 'last_sync', 'pea_id', but they will all report the same 'psql_docs' (The PSQL database is available to all your shards). You need to sum the 'hnsw_docs' across these Documents, like so

result = f.post('/status', None, return_results=True)
result_docs = result[0].docs
total_hnsw_docs = sum(d.tags['hnsw_docs'] for d in result_docs)
Comments
  • Changing how /status method returns its values to try and merge with …

    Changing how /status method returns its values to try and merge with …

    …any pre-existing tags from previous executors if any.

    A shot at addressing the issue mentioned in https://github.com/jina-ai/executor-hnsw-postgres/issues/23

    opened by louisconcentricsky 6
  • feat: performance improvements

    feat: performance improvements

    Closes https://github.com/jina-ai/executor-hnsw-postgres/issues/6

    Results before this PR:

    indexing 1000 takes 0 seconds (0.22s)
    rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
    search with 10 takes 0 seconds (0.23s)
    
    indexing 10000 takes 0 seconds (0.75s)
    rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
    search with 10 takes 0 seconds (0.22s)
    
    indexing 100000 takes 7 seconds (7.59s)
    rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
    search with 10 takes 0 seconds (0.22s)
    
    

    RESULTS NOW

    indexing 1000 takes 0 seconds (0.44s)                                                                                   
    rolling update 3 replicas x 2 shards takes 0 seconds (0.81s)
    
    indexing 10000 takes 1 second (1.01s)                                                                                   
    rolling update 3 replicas x 2 shards takes 2 seconds (2.63s)
    
    indexing 100000 takes 8 seconds (8.10s)                                                                                 
    rolling update 3 replicas x 2 shards takes 3 minutes and 27 seconds (207.14s)
    
    

    MORE BENCHMARKING

    indexing 500000 takes 30 seconds (30.07s)    
    rolling update 3 replicas x 2 shards takes 26 minutes and 57 seconds (1617.99s)
    search with 10 takes 0 seconds (0.21s)
    
    opened by cristianmtr 3
  • Status endpoint does not allow for compositing data with other executors

    Status endpoint does not allow for compositing data with other executors

    If another executor would also like to report some status information using the same status endpoint the return of the HNSQPostgresIndexer will remove it.

    It seems some manner of using object update on the tags or just placing the status under a particular key would be more friendlier.

    https://github.com/jina-ai/executor-hnsw-postgres/blob/79754090665e8bb86e85ab5693fa9b8be80977ce/executor/hnswpsql.py#L322

    opened by louisconcentricsky 1
  • feat: background sync (with threads)

    feat: background sync (with threads)

    Closes https://github.com/jina-ai/internal-tasks/issues/293

    Issues

    • [x] timestamp timezone difference
    • [x] psql connection pool gets exhausted
    • [x] locking resources in threaded access

    NOTE: Even if we don't merge this, the refactoring of PSQL Handler still needs to be merged, as the previous usage of Conn Pool had issues.

    opened by cristianmtr 1
  • fail to connect to PostgreSQL with docker-compose

    fail to connect to PostgreSQL with docker-compose

    • start a PostgreSQL service with docker:

    docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

    • build a flow with one executor:HNSWPostgresIndexer

    • run the flow locally, it works well

    • expose the flow to docker-compose yaml, and run the flow with docker-compose ,get an error:

    image

    jina version info:

    
    - jina 3.3.19
    - docarray 0.12.2
    - jina-proto 0.1.8
    - jina-vcs-tag (unset)
    - protobuf 3.20.0
    - proto-backend cpp
    - grpcio 1.43.0
    - pyyaml 6.0
    - python 3.10.2
    - platform Linux
    - platform-release 4.4.0-186-generic
    - platform-version #216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020
    - architecture x86_64
    - processor x86_64
    - uid 48710637999860
    - session-id 906abcd2-c797-11ec-b1df-2c4d544656f4
    - uptime 2022-04-29T16:37:11.758133
    - ci-vendor (unset)
    * JINA_DEFAULT_HOST (unset)
    * JINA_DEFAULT_TIMEOUT_CTRL (unset)
    * JINA_DEFAULT_WORKSPACE_BASE /home/chenhao/.jina/executor-workspace
    * JINA_DEPLOYMENT_NAME (unset)
    * JINA_DISABLE_UVLOOP (unset)
    * JINA_FULL_CLI (unset)
    * JINA_GATEWAY_IMAGE (unset)
    * JINA_GRPC_RECV_BYTES (unset)
    * JINA_GRPC_SEND_BYTES (unset)
    * JINA_HUBBLE_REGISTRY (unset)
    * JINA_HUB_CACHE_DIR (unset)
    * JINA_HUB_NO_IMAGE_REBUILD (unset)
    * JINA_HUB_ROOT (unset)
    * JINA_LOG_CONFIG (unset)
    * JINA_LOG_LEVEL (unset)
    * JINA_LOG_NO_COLOR (unset)
    * JINA_MP_START_METHOD (unset)
    * JINA_RANDOM_PORT_MAX (unset)
    * JINA_RANDOM_PORT_MIN (unset)
    * JINA_VCS_VERSION (unset)
    * JINA_CHECK_VERSION True
    
    opened by jerrychen1990 0
  • test: bug rolling update clear

    test: bug rolling update clear

    if you remove from tests/integration/test_hnsw_psql.py

    L:180

            if benchmark:
                f.post('/clear')
    

    the test test_benchmark_basic fails when it runs the second case

    even though clear is called at the beginning of the flow.

    Why?

    yes, /clear only hits one replica. but when we restart the flow there should be completely new replicas anyway

    opened by cristianmtr 0
  • performance(HNSWPSQL): syncing is slow

    performance(HNSWPSQL): syncing is slow

    Right now sync will be slow

    • [ ] we are iterating and doing individual updates (should batch somehow, per sync operation type - index, update, delete)
    • [x] if rebuild, the operations will always be index. We should optimize for this. Done in #5

    Numbers before any perf refactoring

    Performance

    indexing 1000 ...       indexing 1000 takes 0 seconds (0.22s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
    search with 10 ...      search with 10 takes 0 seconds (0.23s)
    
    indexing 10000 ...      indexing 10000 takes 0 seconds (0.75s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
    search with 10 ...      search with 10 takes 0 seconds (0.22s)
    
    indexing 100000 ...     indexing 100000 takes 7 seconds (7.59s)
    rolling update 3 replicas x 2 shards ...            [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
        [email protected][I]:Using existing table
    rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
    search with 10 ...      search with 10 takes 0 seconds (0.22s)
    
    
    priority/important-soon type/maintenance 
    opened by cristianmtr 0
Releases(v0.9)
  • v0.8(Mar 8, 2022)

  • v0.7(Feb 11, 2022)

  • v0.6(Jan 3, 2022)

    What's Changed

    • docs: fix typo in delete endpoint and clarify by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/14

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.5...v0.6

    Source code(tar.gz)
    Source code(zip)
  • v0.5(Dec 14, 2021)

    What's Changed

    • fix: type of trav paths by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/13

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.4...v0.5

    Source code(tar.gz)
    Source code(zip)
  • v0.4(Dec 9, 2021)

    What's Changed

    • fix: allow using Executor in local mode by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/12

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.3...v0.4

    Source code(tar.gz)
    Source code(zip)
  • v0.3(Nov 26, 2021)

    What's Changed

    • feat: background sync (with threads) by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/9
    • docs: add docs on bg sync by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/11

    Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.2...v0.3

    Source code(tar.gz)
    Source code(zip)
  • v0.2(Nov 22, 2021)

  • v0.1(Nov 18, 2021)

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
Voice Conversion by CycleGAN (语音克隆/语音转换):CycleGAN-VC3

CycleGAN-VC3-PyTorch 中文说明 | English This code is a PyTorch implementation for paper: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectr

Kun Ma 110 Dec 24, 2022
Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"

Zero-Shot Information Extraction as a Unified Text-to-Triple Translation Source code repo for paper Zero-Shot Information Extraction as a Unified Text

cgraywang 88 Dec 31, 2022
Project page for the paper Semi-Supervised Raw-to-Raw Mapping 2021.

Project page for the paper Semi-Supervised Raw-to-Raw Mapping 2021.

Mahmoud Afifi 22 Nov 08, 2022
The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines.

The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines. It includes tools for downloading pipelines and their dependencies and tools for measuring their performace

8 Dec 04, 2022
Serve TensorFlow ML models with TF-Serving and then create a Streamlit UI to use them

TensorFlow Serving + Streamlit! ✨ 🖼️ Serve TensorFlow ML models with TF-Serving and then create a Streamlit UI to use them! This is a pretty simple S

Álvaro Bartolomé 18 Jan 07, 2023
This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper

Deep Continuous Clustering Introduction This is a Pytorch implementation of the DCC algorithms presented in the following paper (paper): Sohil Atul Sh

Sohil Shah 197 Nov 29, 2022
It is modified Tensorflow 2.x version of Mask R-CNN

[TF 2.X] Mask R-CNN for Object Detection and Segmentation [Notice] : The original mask-rcnn uses the tensorflow 1.X version. I modified it for tensorf

Milner 34 Nov 09, 2022
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP Andreas Fürst* 1, Elisabeth Rumetshofer* 1, Viet Tran1, Hubert Ramsauer1, Fei Tang3, Joh

Institute for Machine Learning, Johannes Kepler University Linz 133 Jan 04, 2023
COCO Style Dataset Generator GUI

A simple GUI-based COCO-style JSON Polygon masks' annotation tool to facilitate quick and efficient crowd-sourced generation of annotation masks and bounding boxes. Optionally, one could choose to us

Hans Krupakar 142 Dec 09, 2022
Implementation of CVPR'2022:Surface Reconstruction from Point Clouds by Learning Predictive Context Priors

Surface Reconstruction from Point Clouds by Learning Predictive Context Priors (CVPR 2022) Personal Web Pages | Paper | Project Page This repository c

136 Dec 12, 2022
Code for "The Intrinsic Dimension of Images and Its Impact on Learning" - ICLR 2021 Spotlight

dimensions Estimating the instrinsic dimensionality of image datasets Code for: The Intrinsic Dimensionaity of Images and Its Impact On Learning - Phi

Phil Pope 41 Dec 10, 2022
Node Dependent Local Smoothing for Scalable Graph Learning

Node Dependent Local Smoothing for Scalable Graph Learning Requirements Environments: Xeon Gold 5120 (CPU), 384GB(RAM), TITAN RTX (GPU), Ubuntu 16.04

Wentao Zhang 15 Nov 28, 2022
Code for the paper "There is no Double-Descent in Random Forests"

Code for the paper "There is no Double-Descent in Random Forests" This repository contains the code to run the experiments for our paper called "There

2 Jan 14, 2022
Extracting knowledge graphs from language models as a diagnostic benchmark of model performance.

Interpreting Language Models Through Knowledge Graph Extraction Idea: How do we interpret what a language model learns at various stages of training?

EPFL Machine Learning and Optimization Laboratory 9 Oct 25, 2022
This is the source code of the 1st place solution for segmentation task (with Dice 90.32%) in 2021 CCF BDCI challenge.

1st place solution in CCF BDCI 2021 ULSEG challenge This is the source code of the 1st place solution for ultrasound image angioma segmentation task (

Chenxu Peng 30 Nov 22, 2022
In this tutorial, you will perform inference across 10 well-known pre-trained object detectors and fine-tune on a custom dataset. Design and train your own object detector.

Object Detection Object detection is a computer vision task for locating instances of predefined objects in images or videos. In this tutorial, you wi

Ibrahim Sobh 62 Dec 25, 2022
A pytorch implementation of faster RCNN detection framework (Use detectron2, it's a masterpiece)

Notice(2019.11.2) This repo was built back two years ago when there were no pytorch detection implementation that can achieve reasonable performance.

Ruotian(RT) Luo 1.8k Jan 01, 2023
Why Are You Weird? Infusing Interpretability in Isolation Forest for Anomaly Detection

Why, hello there! This is the supporting notebook for the research paper — Why Are You Weird? Infusing Interpretability in Isolation Forest for Anomal

2 Dec 14, 2021
Semi-Supervised Learning, Object Detection, ICCV2021

End-to-End Semi-Supervised Object Detection with Soft Teacher By Mengde Xu*, Zheng Zhang*, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai,

Microsoft 789 Dec 27, 2022
Attempt at implementation of a simple GAN using Keras

Simple GAN This is my attempt to make a wrapper class for a GAN in keras which can be used to abstract the whole architecture process. Simple GAN Over

Deven96 7 May 23, 2019