pyprobables is a pure-python library for probabilistic data structures

Overview

PyProbables


pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.

To achieve better raw performance, it is recommended to supply an alternative hashing algorithm that has been compiled in C. This could mean using the provided md5 and sha512 algorithms or installing a third-party package and writing your own hashing strategy. Some options include the murmur hash mmh3 or those from the pyhash library. Each data object in pyprobables makes it easy to pass in a custom hashing function.

Read more below about supplying a pre-defined, alternative hashing strategy or defining a hashing function using the provided decorators.

Installation

Pip Installation:

$ pip install pyprobables

To install from source:

Clone the repository from GitHub, then run the following from the project folder:

$ python setup.py install

pyprobables supports Python 3.5 - 3.9+.

For Python 2.7 support, install release 0.3.2:

$ pip install pyprobables==0.3.2

API Documentation

The documentation is hosted on readthedocs.io

You can build the documentation locally by running:

$ pip install sphinx
$ cd docs/
$ make html

Automated Tests

To run the automated tests, run the following command from the project folder:

$ python setup.py test

Quickstart

Import pyprobables and set up a Bloom Filter

from probables import (BloomFilter)
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True

Import pyprobables and set up a Count-Min Sketch

from probables import (CountMinSketch)
cms = CountMinSketch(width=1000, depth=5)
cms.add('google.com')  # should return 1
cms.add('facebook.com', 25)  # insert 25 at once; should return 25
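
Counts can then be queried with check() (a short usage sketch; the expected values follow from the inserts above):

cms.check('google.com')  # should return 1
cms.check('facebook.com')  # should return 25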

Import pyprobables and set up a Cuckoo Filter

from probables import (CuckooFilter)
cko = CuckooFilter(capacity=100, max_swaps=10)
cko.add('google.com')
cko.check('facebook.com')  # should return False
cko.check('google.com')  # should return True

Supplying a pre-defined, alternative hashing strategy

from probables import (BloomFilter)
from probables.hashes import (default_sha256)
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
                  hash_function=default_sha256)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True

Defining a hashing function using the provided decorators

import mmh3  # murmur hash 3 implementation (pip install mmh3)
from probables.hashes import (hash_with_depth_bytes)
from probables import (BloomFilter)

@hash_with_depth_bytes
def my_hash(key, depth):
    return mmh3.hash_bytes(key, seed=depth)

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

import hashlib
from probables.hashes import (hash_with_depth_int)
from probables.constants import (UINT64_T_MAX)
from probables import (BloomFilter)

@hash_with_depth_int
def my_hash(key, depth, encoding='utf-8'):
    max64mod = UINT64_T_MAX + 1
    val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
    return val % max64mod

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)
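
As with the earlier examples, the filter built with the custom hash is then used the same way:

blm.add('google.com')
blm.check('google.com')  # should return True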

See the API documentation for the other available data structures and the quickstart page for more examples!

Changelog

Please see the changelog for a list of all changes.

Comments
  • Math domain error


    Hello,

    I'm getting the following error when using print(bloom_filter).

    File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
        self.estimate_elements(),
      File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
        log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
    ValueError: math domain error
    

    I'm running the latest version, downloaded from PyPI only the other day, and I'm using Python version 3.8.6.
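
    For context, the failing expression is log(1 - setbits/number_bits), which becomes log(0) when every bit in the filter is set; a minimal illustration of the error (variable names taken from the traceback above):

    import math

    setbits, number_bits = 1000, 1000  # a fully saturated filter: every bit is set
    math.log(1 - (float(setbits) / float(number_bits)))  # math.log(0.0) -> ValueError: math domain error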

    opened by Glfrey 31
  • Wrong result with large filter


    I expect that if I ask the filter to check for membership and it tells me FALSE, then it's definitely NOT a member. I did the following:

    def verifyMembership(key):
        global bloom
        if key in bloom:
            print('Its possibly in')
        else:
            print('Definitly not in')
    
    key = 'some'
    filterFile = 'index.dat'
    bloom = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
    verifyMembership(key)
    bloom.add(key)
    verifyMembership(key)
    bloom.export(filterFile)
    

    I called my script twice and the output is:

    Definitly not in
    Its possibly in
    Definitly not in
    Its possibly in
    

    But I would expect:

    Definitly not in
    Its possibly in
    Its possibly in
    Its possibly in
    

    If I reduce est_elements to, let's say, 10000, then it's fine.

    opened by mrqc 7
  • Use black to format code, add support for poetry and pre-commit


    This is mainly a cosmetic update for the codebase. Black is now the de-facto standard code formatter.

    I added a minimal config for pre-commit.

    I also took the liberty of adding poetry support, since it uses the pyproject.toml file already shared by black and isort, and it's the best venv solution on the market.

    opened by dekoza 6
  • Hotfix/export-c-header


    @dnanto this is a minor tweak to the export_c_header function. It allows the bloom exported as chars to be compatible with the hex export, which is also how it can be tested.

    Thoughts?

    opened by barrust 4
  • Several problems with the default hash


    Hi, I found some problems with the default FNV hash used. Even though it is recommended to provide custom hashes, some users may expect the defaults to work properly.

    First off, the results differ from standardized implementations:

    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
    3411864951044856955 # should be 15902901984413996407
    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
    1047262940628067782 # should be 16101355973854746
    

    This is caused by a wrong hval value here: https://github.com/barrust/pyprobables/blob/beb73f2f6c2ab9d8b8b477381e84271c88b25e8f/probables/hashes.py#L85 (it should be 14695981039346656037 instead of 14695981039346656073). Changing this constant helps:

    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
    15902901984413996407
    $ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
    16101355973854746
    
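
    For reference, here is a standalone 64-bit FNV-1a using the standard constants (a verification sketch, not the library's code):

    def fnv_1a_64(data):
        hval = 14695981039346656037  # the standard FNV-64 offset basis
        fnv_64_prime = 1099511628211
        for byte in bytearray(data):  # iterate over raw byte values
            hval ^= byte
            hval = (hval * fnv_64_prime) % (2 ** 64)
        return hval

    fnv_1a_64('foo'.encode('utf-8'))  # 15902901984413996407, matching the standardized value above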

    The second problem is in the @hash_with_depth_int wrapper when more than one hash is computed. Because the value of the first hash is used as a seed for the subsequent hashes, once we get a collision in the first hash, all subsequent hashes are identical:

    $ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("gMPflVXtwGDXbIhP73TX", 3))'
    [10362567113185002004, 14351534809307984379, 3092021042139682764]
    $ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("LtHf1prlU1bCeYZEdqWf", 3))'
    [10362567113185002004, 14351534809307984379, 3092021042139682764]
    

    This makes all Count*Sketch data structures much less accurate, since they rely on small probabilities of collision in all hash functions involved.
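
    One common way to avoid the cascade (an illustrative sketch, not a proposed patch) is to derive each depth's hash from the original key plus the depth index, rather than from the previous hash value:

    import hashlib

    def independent_depth_hashes(key, depth=5, encoding='utf-8'):
        results = []
        for idx in range(depth):
            # mixing the depth index into the input keeps the hashes independent
            data = '{}:{}'.format(idx, key).encode(encoding)
            results.append(int(hashlib.sha512(data).hexdigest(), 16) % (2 ** 64))
        return results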

    opened by simonmandlik 4
  • Missing method to aggregate count-min sketches


    In theory, the count-min sketch has the property that two tables can be summed together, which allows building count-min sketches in parallel, but I don't see it implemented here. Should I make a pull request that implements it?
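
    For reference, the property described amounts to elementwise addition of the counter tables of two sketches built with identical width, depth, and hash family (an illustrative sketch; per the release notes below, join/merge functionality was later added in v0.5.1):

    def merge_counts(counts_a, counts_b):
        # both tables must come from sketches with the same width, depth, and hashes
        if len(counts_a) != len(counts_b):
            raise ValueError('sketches are not compatible')
        return [a + b for a, b in zip(counts_a, counts_b)]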

    opened by racinmat 4
  • add `frombytes` to all probabilistic data structures


    • added ability to load data structures using the exported bytes
    • added tests to verify the frombytes() functionality
    • minor changes to MMap class for possible use in the future (with tests)

    Resolves #88 @KOLANICH

    opened by barrust 3
  • Several fixes and performance improvement


    Fixes #60, fixes #61, and fixes part of #62. Uses the correct seed for the default hash. Modified tests to reflect these changes. Changed the order of arguments in assertions to follow the assert(expected, actual) paradigm; when not followed, unittest yields misleading error messages. The problem with propagating collisions to larger depths still needs to be addressed, though. Should I do it in this PR, or in another?

    opened by racinmat 3
  • bloom filter intersection failure


    Tried an intersection of 2 bloom filters, both with est_elements=16000000, and got a list index out of range error.

    Works fine if both have est_elements=16000001.

    If one is 160000000 and the other is 16000001, I get a None return on the intersection rather than an error explaining what the problem is.
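
    A minimal reproduction sketch based on the report (parameter values taken from the issue):

    from probables import BloomFilter

    blm1 = BloomFilter(est_elements=16000000, false_positive_rate=0.03)
    blm2 = BloomFilter(est_elements=16000000, false_positive_rate=0.03)
    blm1.intersection(blm2)  # reportedly raises 'list index out of range'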

    opened by sfletc 3
  • How to cPickle count min sketch instance


    I encounter this error when using cPickle to save a count-min sketch instance:

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        pkl.dump(cms, f)
      File "/usr/local/Cellar/python@2.7/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 77, in _reduce_ex
        raise TypeError("a class that defines __slots__ without "
    TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
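
    A general workaround for pickling classes that define __slots__ (a plain-Python sketch, independent of pyprobables) is to implement __getstate__ and __setstate__:

    class SlottedExample(object):
        __slots__ = ('width', 'depth')

        def __getstate__(self):
            # map each slot to its current value so pickle gets a plain dict
            return {slot: getattr(self, slot) for slot in self.__slots__}

        def __setstate__(self, state):
            for slot, value in state.items():
                setattr(self, slot, value)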

    opened by huuthonguyen76 3
  • Moved the metadata into PEP 621-compliant sections.


    Since poetry has not yet implemented PEP 621, I have temporarily switched the build backend to flit. To use setuptools, one has to comment out the lines choosing flit and uncomment the lines choosing setuptools. setup.py and setup.cfg have been removed; their content has been integrated into pyproject.toml.

    PEP 621 support in poetry is tracked here: https://github.com/python-poetry/roadmap/issues/3

    opened by KOLANICH 2
Releases (v0.5.6)
  • v0.5.6(Mar 10, 2022)

  • v0.5.5(Jan 15, 2022)

    • Bloom Filters:
      • Re-implemented the entire Bloom Filter data structure to reduce complexity and code duplication
    • Removed unused imports
    • Removed unnecessary casts
    • Pylint Requested Style Changes:
      • Use python 3 super()
      • Use python 3 classes
    • Removed the use of temporary variables where possible while keeping the code clear
  • v0.5.4(Jan 8, 2022)

    • All Probabilistic Data Structures:
      • Added ability to load each frombytes()
      • Updated underlying data structures of number based lists to be more space and time efficient; see Issue #60
    • Cuckoo Filters:
      • Added fingerprint_size_bits property
      • Added error_rate property
      • Added ability to initialize based on error rate
    • Simplified typing
    • Ensure all filepaths can be str or Path
  • v0.5.3(Dec 29, 2021)

    • Additional type hinting
    • Improved format parsing and serialization; see PR#81. Thanks @KOLANICH
    • Bloom Filters
      • Added export_to_hex functionality for Bloom Filters on Disk
      • Export as C header (*.h) for Bloom Filters on Disk and Counting Bloom Filters
    • Added support for more input types for exporting and loading of saved files
  • v0.5.2(Dec 13, 2021)

  • v0.5.1(Nov 19, 2021)

    • Bloom Filter:
      • Export as a C header (*.h)
    • Count-Min Sketch
      • Add join/merge functionality
    • Moved testing to use NamedTemporaryFile for file based tests
  • v0.5.0(Oct 19, 2021)

    • BACKWARD INCOMPATIBLE CHANGES
      • NOTE: Breaks backwards compatibility with previously exported blooms, counting-blooms, cuckoo filter, or count-min-sketch files using the default hash!
      • Update to the FNV_1a hash function
      • Simplified the default hash to use a seed value
    • Ensure passing of depth to hashing function when using hash_with_depth_int or hash_with_depth_bytes
  • v0.4.1(Apr 30, 2021)

  • v0.4.0(Dec 31, 2020)

  • v0.3.2(Aug 9, 2020)

  • v0.3.1(Mar 20, 2020)

  • v0.3.0(Nov 21, 2018)

  • v0.2.6(Nov 12, 2018)

  • v0.2.5(Nov 10, 2018)

  • v0.2.0(Nov 7, 2018)

  • v0.1.4(May 25, 2018)

  • v0.1.3(Jan 2, 2018)

  • v0.1.2(Oct 9, 2017)

  • v0.1.1(Oct 4, 2017)

  • v0.1.0(Sep 29, 2017)

  • v0.0.8(Aug 22, 2017)

  • v0.0.7(Aug 12, 2017)

    • Counting Bloom Filter
      • Fix counting bloom hex export / import
      • Fix for overflow issue in counting bloom export
      • Added ability to remove from counting bloom
    • Count-Min Sketch
      • Fix for not recording large numbers of inserts and deletions correctly
  • v0.0.6(Aug 5, 2017)

  • v0.0.5(Jul 21, 2017)

  • v0.0.4(Jul 15, 2017)

    • Initial probabilistic data-structures
      • Bloom Filter
      • Bloom Filter On Disk
      • Count-Min Sketch
      • Count-Mean Sketch
      • Count-Mean-Min Sketch
      • Heavy Hitters
      • Stream Threshold
    • Import / Export of each type
Owner
Tyler Barrus