Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings.

NEWS:

version 0.6

  • Thanks to (@dyashuni) hnswlib now uses github actions for CI, and there is a search speedup in some scenarios with deletions. unmark_deleted(label) is now also a part of the python interface (note that it now throws an exception for double deletions).
  • Thanks to (@slice4e) we now support AVX512; thanks to (@LTLA) the cmake interface for the lib is now updated.
  • Thanks to (@alonre24) we now have python bindings for brute-force search (and examples for recall tuning: TESTING_RECALL.md); see the sketch after this list.
  • Thanks to (@dorosy-yeong) a bug was fixed in the handling of large quantities of deleted elements and large K.
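
A minimal sketch of those brute-force bindings (assuming the hnswlib.BFIndex interface exercised in TESTING_RECALL.md); exact search is handy as ground truth when tuning recall:

import hnswlib
import numpy as np

dim = 16
num_elements = 1000
data = np.float32(np.random.random((num_elements, dim)))

# Brute-force index: exact nearest neighbor search, no graph parameters
bf = hnswlib.BFIndex(space='l2', dim=dim)
bf.init_index(max_elements=num_elements)
bf.add_items(data)

# Exact k nearest neighbors for the first 10 vectors
labels, distances = bf.knn_query(data[:10], k=5)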

Highlights:

  1. Lightweight, header-only, no dependencies other than C++ 11
  2. Interfaces for C++, Java, Python and R (https://github.com/jlmelville/rcpphnsw).
  3. Full support for incremental index construction. Support for element deletions (by marking them in the index). The index is picklable.
  4. Can work with custom user-defined distances (C++).
  5. Significantly smaller memory footprint and faster build time compared to current nmslib's implementation.

Description of the algorithm parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

Distance          Parameter  Equation
Squared L2        'l2'       d = sum((Ai-Bi)^2)
Inner product     'ip'       d = 1.0 - sum(Ai*Bi)
Cosine similarity 'cosine'   d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))
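
The formulas above can be checked directly in numpy (a sketch for illustration, independent of the library API):

import numpy as np

a = np.random.random(8).astype(np.float32)
b = np.random.random(8).astype(np.float32)

d_l2 = np.sum((a - b) ** 2)                                        # 'l2'
d_ip = 1.0 - np.dot(a, b)                                          # 'ip'
d_cos = 1.0 - np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))  # 'cosine'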

Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.
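
A hedged sketch of that pruning trick (not a library feature; it assumes an 'ip' index built as in the examples below, and that the approximate search finds each element when queried with itself):

import hnswlib
import numpy as np

dim = 16
num_elements = 1000
data = np.float32(np.random.random((num_elements, dim)))

p = hnswlib.Index(space='ip', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=100, M=16)
p.add_items(data, np.arange(num_elements))

# Find each stored element's nearest stored neighbor (approximate)
ids = np.array(p.get_ids_list())
labels, _ = p.knn_query(p.get_items(ids), k=1)

# Under 'ip' an element may not be the closest to itself; removing such
# elements is the speedup mentioned in the note above
for own, nearest in zip(ids, labels.reshape(-1)):
    if nearest != own:
        p.mark_deleted(int(own))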

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

Short API description

  • hnswlib.Index(space, dim) creates a non-initialized HNSW index in space space with integer dimension dim.

hnswlib.Index methods:

  • init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100) initializes the index with no elements.

    • max_elements defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
    • ef_construction defines a construction time/accuracy trade-off (see ALGO_PARAMS.md).
    • M defines the maximum number of outgoing connections in the graph (ALGO_PARAMS.md).
  • add_items(data, ids, num_threads = -1) - inserts the data (numpy array of vectors, shape: N*dim) into the structure.

    • num_threads sets the number of cpu threads to use (-1 means use default).
    • ids is an optional N-sized numpy array of integer labels for all elements in data.
      • If the index already has elements with the same labels, their features will be updated. Note that the update procedure is slower than insertion of a new element, but more memory- and query-efficient.
    • Thread-safe with other add_items calls, but not with knn_query.
  • mark_deleted(label) - marks the element as deleted, so it will be omitted from search results. Throws an exception if it is already deleted (see the sketch after this list).

  • unmark_deleted(label) - unmarks the element as deleted, so it will not be omitted from search results.

  • resize_index(new_size) - changes the maximum capacity of the index. Not thread safe with add_items and knn_query.

  • set_ef(ef) - sets the query time accuracy/speed trade-off, defined by the ef parameter ( ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.

  • knn_query(data, k = 1, num_threads = -1) makes a batch query for the k closest elements for each element of the data (shape: N*dim). Returns two numpy arrays of shape N*k: labels and distances.

    • num_threads sets the number of cpu threads to use (-1 means use default).
    • Thread-safe with other knn_query calls, but not with add_items.
  • load_index(path_to_index, max_elements = 0) loads the index from persistence into an uninitialized index.

    • max_elements (optional) resets the maximum number of elements in the structure.
  • save_index(path_to_index) saves the index to persistence.

  • set_num_threads(num_threads) sets the default number of cpu threads used during data insertion/querying.

  • get_items(ids) - returns a numpy array (shape:N*dim) of vectors that have integer identifiers specified in ids numpy vector (shape:N). Note that for cosine similarity it currently returns normalized vectors.

  • get_ids_list() - returns a list of all elements' ids.

  • get_max_elements() - returns the current capacity of the index.

  • get_current_count() - returns the current number of elements stored in the index.
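
A short sketch of the deletion, update, and resize calls listed above (a minimal example written against the API as described here; the longer examples below do not exercise these):

import hnswlib
import numpy as np

dim = 16
p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=100, ef_construction=100, M=16)

data = np.float32(np.random.random((100, dim)))
p.add_items(data, np.arange(100))

p.mark_deleted(0)      # label 0 is now omitted from search results
p.unmark_deleted(0)    # label 0 is searchable again

# Re-adding an existing label updates that element's vector in place
p.add_items(np.float32(np.random.random((1, dim))), np.array([5]))

p.resize_index(200)    # grow capacity; not thread-safe with add_items/knn_query
print(p.get_max_elements(), p.get_current_count())  # 200 100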

Read-only properties of hnswlib.Index class:

  • space - name of the space (can be one of "l2", "ip", or "cosine").

  • dim - dimensionality of the space.

  • M - parameter that defines the maximum number of outgoing connections in the graph.

  • ef_construction - parameter that controls speed/accuracy trade-off during the index construction.

  • max_elements - current capacity of the index. Equivalent to p.get_max_elements().

  • element_count - number of items in the index. Equivalent to p.get_current_count().

Properties of hnswlib.Index that support reading and writing:

  • ef - parameter controlling query time/accuracy trade-off.

  • num_threads - default number of threads to use in add_items or knn_query. Note that calling p.set_num_threads(3) is equivalent to p.num_threads=3.

Python bindings examples

import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

### Index parameters are exposed as class properties:
print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}") 
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")

An example with updates after serialization/deserialization:

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)


print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Bindings installation

You can install from sources:

apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

or you can install via pip: pip install hnswlib

For developers

When making changes please run tests (and please add a test to python_bindings/tests in case there is new functionality):

python -m unittest discover --start-directory python_bindings/tests --pattern "*_test*.py"

Other implementations

Contributing to the repository

Contributions are highly welcome!

Please make pull requests against the develop branch.

200M SIFT test reproduction

To download and extract the bigann dataset (from root directory):

python3 download_bigann.py

To compile:

mkdir build
cd build
cmake ..
make all

To run the test on 200M SIFT subset:

./main

The size of the BigANN subset (in millions) is controlled by the variable subset_size_millions hardcoded in sift_1b.cpp.

Updates test

To generate testing data (from root directory):

cd examples
python update_gen_data.py

To compile (from root directory):

mkdir build
cd build
cmake ..
make 

To run test without updates (from build directory)

./test_updates

To run test with updates (from build directory)

./test_updates update

HNSW example demos

References

@article{malkov2018efficient,
  title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs},
  author={Malkov, Yu A and Yashunin, Dmitry A},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={42},
  number={4},
  pages={824--836},
  year={2018},
  publisher={IEEE}
}

Comments
  • Handling missing values

    Hi,

    at first thanks for this great library. It works great.

    My question refers to missing values in feature vectors. Say we have used a set of N d-dimensional vectors to create the classifier. Is it possible to query neighbours if our query vector has fewer than d dimensions, e.g. missing values in one or more dimensions?

    Thanks in advance & Best, Max

    opened by maxstrobel 33
  • module 'hnswlib' has no attribute 'Index'

    I downloaded the hnswlib package to my env but I am constantly getting this error:

    AttributeError Traceback (most recent call last)
    in
    1 # Declaring index
    ----> 2 p = hnswlib.Index(space = 'cosine', dim = EMBEDDING_SIZE) # possible options are l2, cosine or ip

    AttributeError: module 'hnswlib' has no attribute 'Index'

    opened by AnnaTumanova 19
  • Add pickle support to python bindings `Index` class

    Changes

    • Index class

      • Static factory methods createFromIndex and createFromParams for Index construction from another Index object or Index parameter tuple, respectively
      • Method getIndexParams serializes Index object. Returns a tuple with Index parameters.
      • Method getAnnData returns a tuple with hnsw-specific parameters. Copy of appr_alg->data_level0_memory_ and appr_alg->linkLists_ are returned as py::array_t<char> objects to avoid additional copying from python side
    • Python bindings for Index class

      • Index serialization is implemented with py::pickle definition from pybind11
      • Bind Index.__init__ to static factory methods createFromIndex and createFromParams
      • Expose parameters of the hnsw index as read-only properties in python:
        • space_name, dim, max_elements, element_count, ef_construction, M, num_threads, ef
      • And, two properties that support read and write:
        • ef, num_threads
    • Updated API documentation and the first python example in README.md

    • New test in python_bindings/tests/bindings_test_pickle.py.

      • Verifies that results of knn_query match results from copies of the same index. Index objects are copied using round-trip pickling.
      • Verifies that knn_query gives recall of (almost) 100% for k=25 using three spaces. Brute-force search is used to return ground-truth labels for the randomly generated items.
      • Creates a separate test for each space
      • Sample test output:
    > python3 -m unittest  tests/bindings_test_pickle.py -k Inner
    Running pickle tests for <hnswlib.Index(space='ip', dim=48)>
    Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
    Warning: 1 labels are missing from ann results (k=25, err_thresh=5)
     ... 
    Warning: 16 ann distance values are different from brute-force values (total # of values=5000, dists_thresh=50)
    
    .
    ----------------------------------------------------------------------
    Ran 1 test in 19.057s
    
    OK
    
    opened by dbespalov 18
  • PEP-517 and PEP-518 support (add pyproject.toml)

    Closes: https://github.com/nmslib/hnswlib/issues/269 https://github.com/nmslib/hnswlib/issues/177

    hnswlib does not use the recommended packaging approach for pybind11. https://pybind11.readthedocs.io/en/stable/compiling.html#setup-helpers-pep518

    The pyproject.toml file can specify the requirements necessary for the build backend (setuptools in this case) that are installed before the actual build takes place.

    This should also make it easier to start packaging wheels etc if this project moves to a more modern packaging approach.

    opened by groodt 16
  • core dumped when import hnswlib

    The error also occurs when running the example code of hnswlib. It seems to be correlated with the system environment, but I cannot figure it out. The same code runs well on another machine. The error occurs when running this code line:

    p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
    

    I reinstalled the Ubuntu system and hnswlib, and it still runs into a core dump when importing hnswlib ...

    Any hint or suggestions here?

    opened by universewill 16
  • Filter elements with an optional filtering function

    Fixes https://github.com/nmslib/hnswlib/issues/366

    Summary of the change

    An optional filtering function can be specified as a template parameter that determines if a given ID should be picked. The default is a function that allows all labels via a return true -- this will be optimised away by the compiler, so there will be no extra cost for those who don't provide a filtering function.

    The use of a templated filter function also ensures that the filtering logic can be entirely inlined by the compiler, and is an implementation detail, instead of forcing a std::set<> of allowed IDs as discussed in #366.

    I've added a test that asserts on both the brute force and hnsw implementations. Verified that existing search knn tests pass.

    opened by kishorenc 14
  • Accuracy Problems with Larger Datasets

    I have created a matrix with 300 dimensions x 70,000 documents (L2 normalized Latent Semantic Vectors). With this data set using M of 300 and efConstruction of 400, I get 99.671% accuracy on retrieving the top 300 items with efSearch of 400. efSearch = 100, I get 82%.

    NOTE: using HNSW and "cosinesimil" when building the index.

    With an M value of 96, and all other parameters the same, I get an accuracy of 80.03% for efSearch of 400 and 80.8% for an efSearch value of 1600.

    ISSUES:

    1. When I increase the # of items to 6,000,000 my results are terrible, even with large values of M (300 or more). Exactly the same kind of data, just a lot more of it.
    2. The guidance for setting M and efConstruction size (or just overriding maxM0) is very minimal but seems to suggest that the M value should not need to be so large. I am guessing that the issue is due to having a very large number of similar documents.

    Is there better guidance on setting these values, or do I need to know the intrinsic amount of similarity in the data collection to put together "optimal" settings? I was thinking of doing a random sample of data, getting some real measurements, and then using that to predict the best parameters.

    Any help would be greatly appreciated.

    Mike

    opened by ghost 13
  • Question about the recall performance

    Hi yurymalkov,

    I'm sorry to disturb you again. I have a question about the recall performance evaluation in sift_1b.cpp. In sift_1b.cpp, the number of returned samples for each query is 1; then, for the samples at top K (recall@K) in the ground truth, if the returned sample is among those top-K samples, it counts 1. I think there is another method which is more appropriate than this. This method is as follows:

    The number of retrieved samples for each query is K (recall@K), and for the sample at top 1 (top@1) in the ground truth, we check whether it (the sample of ground truth at top 1) is among the K retrieved samples. If true, it counts 1.

    I think this evaluation is better than the method adopted in sift_1b.cpp. Taking face retrieval as an example, we want to get a high recall rate. If the same face is not recalled in the top 10, we can set the number of retrieved samples to be large, such as recall@100.

    How do you think about the evaluation method in the sift_1b.cpp and the method I show it on above?

    Looking forward to your reply.

    opened by willard-yuan 13
  • Perf improvement for dimension not of factor 4 and 16

    Currently SIMD (SSE or AVX) is used when the dimension is a multiple of 4 or 16, while when the dimension is not a multiple of 4 or 16 a slower non-vectorized method is used.

    To improve performance for these cases new methods are added: L2SqrSIMD(4|16)ExtResidual - they rely on the existing L2SqrSIMD(4|16)Ext to compute up to *4 and *16 dimensions and finish the residual computation with the non-vectorized method L2Sqr.

    opened by 2ooom 12
  • Python: filter elements with an optional filtering function

    Summary of the change

    Expose filtering functionality introduced in #402 in Python API. Changes are kept to a minimum, only HNSW index is implemented and not brute force.

    For a discussion of the interface design see here.

    Preliminary performance characteristics for filtering (not strictly related to the changes):

    | filter | queries per second |
    | ------ | ------------------ |
    | 0.1    | 4241.43   |
    | 0.2    | 5993.89   |
    | 0.3    | 8232.54   |
    | 0.4    | 10688.06  |
    | 0.5    | 11677.54  |
    | 0.6    | 13242.56  |
    | 0.7    | 14520.50  |
    | 0.8    | 14659.25  |
    | 0.9    | 14932.67  |
    | none   | 503578.34 |

    Here filter denotes the fraction of elements the query was restricted to. none denotes that no filtering has been applied.

    As per the above table, there is a threshold below which exact ANN is probably preferable.
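
    For illustration, a minimal sketch of the filtering described in this change (hedged: it assumes a filter keyword argument on knn_query that receives a label and returns whether it may be picked, as discussed in the PR):

    import hnswlib
    import numpy as np

    dim = 16
    num_elements = 1000
    data = np.float32(np.random.random((num_elements, dim)))

    p = hnswlib.Index(space='l2', dim=dim)
    p.init_index(max_elements=num_elements, ef_construction=100, M=16)
    p.add_items(data, np.arange(num_elements))

    # Restrict results to even labels; only filtered-in labels are returned
    labels, distances = p.knn_query(data[:5], k=3, filter=lambda label: label % 2 == 0)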

    opened by gtsoukas 10
  • fix Mac install as clang no longer has that option

    The following error occurs when trying to pip install hnswlib: clang: error: the clang compiler does not support '-march=native'

    The command: clang -cc1 --help | grep march Returns nothing.

    Mac no longer has "-march=native" option.

    This PR modifies setup.py to not include that option for mac.

    opened by srajabi 10
  • Wrong direction of inequality in getNeighborsByHeuristic2 (Flipping it increases the accuracy)

    void getNeighborsByHeuristic2(
                std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst> &top_candidates,
                const size_t M)
            {
                if (top_candidates.size() < M)
                {
                    return;
                }
    
                std::priority_queue<std::pair<dist_t, tableint>> queue_closest;
                std::vector<std::pair<dist_t, tableint>> return_list;
                while (top_candidates.size() > 0)
                {
                    queue_closest.emplace(-top_candidates.top().first, top_candidates.top().second);
                    top_candidates.pop();
                }
    
                while (queue_closest.size())
                {
                    if (return_list.size() >= M)
                        break;
                    std::pair<dist_t, tableint> curent_pair = queue_closest.top();
                    dist_t dist_to_query = -curent_pair.first; 
                    queue_closest.pop();
                    bool good = true;
    
                    for (std::pair<dist_t, tableint> second_pair : return_list)
                    {
                        dist_t curdist =
                            fstdistfunc_(getDataByInternalId(second_pair.second),
                                         getDataByInternalId(curent_pair.second),
                                         dist_func_param_);
                        if (curdist < dist_to_query) // FLIP THE INEQUALITY HERE
                        {
                            good = false;
                            break;
                        }
                    }
                    if (good)
                    {
                        return_list.push_back(curent_pair);
                    }
                }
    

    Say the distance function is inner product. curdist will be a similarity between the neighbours. dist_to_query is a similarity between the current node and the query.

    In the code below,

    if (curdist < dist_to_query) // FLIP THE INEQUALITY HERE
    {
        good = false;
        break;
    }
    

    When heuristically choosing the neighbours, you should only keep the representative nodes. However, this code is removing the nodes that are more similar to the query than their neighbours are similar to each other, when it should be the other way around. After several tests, I could confirm that flipping the inequality significantly boosts the accuracy of this algorithm.

    opened by lukeleeai 0
  • c++ use multiple threads but much slower than python

    I use ParallelFor as https://github.com/nmslib/hnswlib/blob/7cc0ecbd43723418f43b8e73a46debbbc3940346/python_bindings/bindings.cpp#L239

    // c++ code, add 10000 points, addPoint cost 7580ms
    // compile flags:  -std=c++11 -g -pipe -W -Wall -fPIC -pthread -Ofast -fwrapv
    int d = 256;
    hnswlib::labeltype n = 10000;
       
    std::vector<float> data(n * d);
    std::mt19937 rng;
    rng.seed(47);
    std::uniform_real_distribution<> distrib;
    
    for (hnswlib::labeltype i = 0; i < n * d; ++i) {
        data[i] = distrib(rng);
    }
    
    hnswlib::L2Space space(d);
    hnswlib::AlgorithmInterface<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, 2 * n);
    int num_threads = std::thread::hardware_concurrency();
    ParallelFor(0, n, num_threads, [&](size_t row, size_t threadId) {
                        alg_hnsw->addPoint((void *)(data.data() + d * row), (size_t) row);
    });
    
    # python code, add 10000 points, add_items cost 310ms
    import numpy as np
    import hnswlib
    import time
    dim = 256
    num_elements = 10000
    data = np.float32(np.random.random((num_elements, dim)))
    ids = np.arange(num_elements)
    p = hnswlib.Index(space = 'ip', dim = dim)
    p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
    start=time.time();p.add_items(data, ids);end=time.time();
    

    Is there anything I missed in the C++ code? Why is C++ much slower?

    opened by bluemandora 1
  • has_deletions == false

    https://github.com/nmslib/hnswlib/blob/443d667478fddf1e13f2e06b1da4e1ec3a9fe716/hnswlib/hnswalg.h#L267

    “ || has_deletions == false”

    There is a problem with this judgment condition, which will cause the loop to exit early.

    opened by lockeliu 5
  • deadlock when add duplicated labels by multithreads

    Hello, everyone,

    Some labels are duplicated in our case. We add labels from multiple threads. I found that a thread may lock the same mutex two times in one addPoint, which leads to deadlock. The first time it locks the mutex is here: https://github.com/nmslib/hnswlib/blob/master/hnswlib/hnswalg.h#L1021 The second time it locks the same mutex is here: https://github.com/nmslib/hnswlib/blob/master/hnswlib/hnswalg.h#L185

    opened by cdjingit 2
Releases (v0.6.2)
  • v0.6.2(Feb 14, 2022)

    • Fixed a bug in saving of large pickles. Pickles larger than 4GB could have been corrupted. Thanks Kai Wohlfahrt for reporting.
    • Thanks to (@GuyAv46) hnswlib inner product is now more consistent across architectures (SSE, AVX, etc).
    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Feb 6, 2022)

    • Thanks to (@tony-kuo) hnswlib AVX512 and AVX builds are now backwards-compatible with older SSE and non-AVX512 architectures.
    • Thanks to (@psobot) there is now a sensible message instead of a segfault when passing a scalar to get_items.
    • Thanks to (@urigoren) hnswlib has a lazy index creation python wrapper.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Dec 9, 2021)

    • Thanks to (@dyashuni) hnswlib now uses github actions for CI, and there is a search speedup in some scenarios with deletions. unmark_deleted(label) is now also a part of the python interface (note that it now throws an exception for double deletions).
    • Thanks to (@slice4e) we now support AVX512; thanks to (@LTLA) the cmake interface for the lib is now updated.
    • Thanks to (@alonre24) we now have python bindings for brute-force search (and examples for recall tuning: TESTING_RECALL.md).
    • Thanks to (@dorosy-yeong) a bug was fixed in the handling of large quantities of deleted elements and large K.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.2(Jun 30, 2021)

    Bugfixes and improvements. Many thanks to: @marekhanus for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving the python wrapper and fixing typos/code style; @apoorv-sharma for fixing the bug in the insertion/deletion logic; @shengjun1985 for simplifying the memory reallocation logic; @TakaakiFuruse for the improved description of add_items; @psobot for improving error handling; @ShuAiii for reporting the bug in the python interface.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Jan 29, 2021)

    Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bugfixes and documentation improvements. The setup.py file now resides in the root.

    Many thanks to @dbespalov, @dyashuni, @groodt, @uestc-lfs, @vinnitu, @fabiencastan, @JinHai-CN, @js1010!

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Jun 27, 2020)

    Thanks to Apoorv Sharma @apoorv-sharma, hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).

    Thanks to Dmitry @2ooom, hnswlib got a boost in performance for vector dimensions that are not a multiple of 4.

    Bugfixes and other updates (@xiejianqiao, @Shujian2015, @mohamed-ali, @hussamaa).

    Many thanks to all contributors!

    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Dec 16, 2019)

    Fixed many bugs and error messages. Added support for marking elements as deleted. Stopped support for old indices.

    A huge thanks to all of the contributors!

    Source code(tar.gz)
    Source code(zip)
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Korean extractive summarization. Code from team 화성갈끄니까 for the 2021 AI Text Summarization online hackathon

Korean extractive summarization: code from team 화성갈끄니까 for the 2021 AI Text Summarization online hackathon. Leaderboard Notice The bertsumext model from "Text Summarization with Pretrained Encoders" (ext

3 Aug 10, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
My Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks using Tensorflow

Easy Data Augmentation Implementation This repository contains my Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Per

Aflah 9 Oct 31, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

BADER ALABDAN 2 Oct 22, 2022
Exploration of BERT-based models on twitter sentiment classifications

twitter-sentiment-analysis Explore the relationship between twitter sentiment of Tesla and its stock price/return. Explore the effect of different BER

Sammy Cui 2 Oct 02, 2022
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
Repository for the introduction-to-NLP course assignment

Assignment for the BI NLP course. Repository for the assignment of the Introduction to Natural Language Processing course of the BI-Master graduate program at PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
All the code I wrote for Overwatch-related projects that I still own the rights to.

overwatch_shit.zip This is (eventually) going to contain all the software I wrote during my five-year imprisonment stay playing Overwatch. I'll be add

zkxjzmswkwl 2 Dec 31, 2021
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
Quick insights from Zoom meeting transcripts using Graph + NLP

Transcript Analysis - Graph + NLP This program extracts insights from Zoom Meeting Transcripts (.vtt) using TigerGraph and NLTK. In order to run this

Advit Deepak 7 Sep 17, 2022
BERT, LDA, and TFIDF based keyword extraction in Python

BERT, LDA, and TFIDF based keyword extraction in Python kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichl

Andrew Tavis McAllister 41 Dec 27, 2022
Build a multi-source (WeChat Official Accounts, RSS), clean, personalized reading environment

2C: Build a multi-source (WeChat Official Accounts, RSS), clean, personalized reading environment. As a heavy user of WeChat Official Accounts, they have always been where I go to absorb knowledge. As usage grows, most people sooner or later hit one rather annoying problem: ads. Suppose you follow a dozen or so accounts; if each account runs an ad once every two weeks, in theory you will face more than twenty ads, and in practice even more. 运

howie.hu 678 Dec 28, 2022
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members – advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022
Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Yu Zhang 50 Nov 08, 2022
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022