Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools for similarity retrieval scenarios: a convenient wrapper around encoding strategies and nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

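import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister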


# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
# Only the 'text' column is passed to the encoder.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints.
# The keyword argument refers to the 'text' column.
service_clinc.query(text="give me directions", n_neighbors=20)

# Save the entire system
service_clinc.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
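
As a rough sketch of the client side, the snippet below builds that POST request with only the Python standard library. The URL and the payload shape come from the comment above. The helper name `build_query_request` is made up for illustration, and sending the body as JSON with a `Content-Type: application/json` header is an assumption. The response format is not described here, so the example stops short of parsing it.

```python
import json
import urllib.request


def build_query_request(text, n_neighbors=20, url="http://0.0.0.0:8080/query"):
    """Build (but do not send) a POST request for the /query endpoint.

    Hypothetical helper: the JSON content type is an assumption.
    """
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("hello there")

# Sending it requires the service from the quickstart to be running:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before a server is available.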
Comments
  • Add support for pretrained encoders and transformed data

    First of all, this project looks great! I've taken an initial stab at #12 and also tried to add support for querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to run encoder.transform again. In this case you want to index the transformed data and query it directly.
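
A minimal sketch of the idea, assuming the vectors are already embedded: a plain-Python brute-force search stands in for a real indexer, and the query skips any encoder step. The function name `query_transformed` and the toy vectors are invented for illustration and are not part of simsity.

```python
import math


def query_transformed(vectors, query_vector, n_neighbors=2):
    """Return (row_index, distance) pairs for the closest stored vectors."""
    scored = sorted(
        (math.dist(vector, query_vector), row) for row, vector in enumerate(vectors)
    )
    return [(row, distance) for distance, row in scored[:n_neighbors]]


# Stand-in for data that was embedded earlier (for example with UMAP).
embedded = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]

# The query is already in the embedded space, so no encoder is involved.
neighbors = query_transformed(embedded, (0.9, 0.1), n_neighbors=2)
print(neighbors)
```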

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit "where is my phoone" and ask for similar items, you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is. The encoder, however, may be very brittle when it comes to spelling errors.

    To put it more generally: the similarity in an embedded space is, in our case, very much "general". I'm using "general" here, as opposed to "specific", to indicate that these similarities have been constructed without a task in mind.
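
A small illustration of that brittleness, not tied to any encoder that simsity ships. Counting shared words treats "phoone" as unrelated to "phone", so both candidates above look equally close to the query. Character trigrams tolerate the typo and favor the "phone" sentence. Both measures are still "general": neither knows which part of the sentence matters for the task. All function names here are made up for the example.

```python
def word_overlap(a, b):
    """Count the distinct whitespace-separated words two strings share."""
    return len(set(a.split()) & set(b.split()))


def char_trigrams(text):
    """Return the set of character trigrams in a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_jaccard(a, b):
    """Jaccard similarity between the character-trigram sets of two strings."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    return len(ta & tb) / len(ta | tb)


query = "where is my phoone"
candidates = ["where is my phone", "where is my credit card"]

for candidate in candidates:
    print(
        candidate,
        word_overlap(query, candidate),
        round(trigram_jaccard(query, candidate), 2),
    )
```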

    Similar Issue

    Suppose that we are deduplicating records with a zipcode, city, first name, and last name. How would our encoding understand that sharing a city is not a strong signal while sharing a first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
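
To make the concern concrete, here is a toy sketch in which each field gets a hand-picked weight. The weights, field values, and function name are all invented for illustration. Picking such weights well is exactly the part that would need labels, and a standard encoder has no access to them.

```python
# Hypothetical, hand-picked weights: sharing a city says little,
# sharing a first or last name says a lot.
FIELD_WEIGHTS = {"zipcode": 0.2, "city": 0.05, "first_name": 0.4, "last_name": 0.35}


def weighted_match(a, b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields on which two records agree exactly."""
    return sum(weight for field, weight in weights.items() if a[field] == b[field])


record = {"zipcode": "1000", "city": "Springfield", "first_name": "Alice", "last_name": "Smith"}
same_city = {"zipcode": "2000", "city": "Springfield", "first_name": "Bob", "last_name": "Jones"}
same_name = {"zipcode": "3000", "city": "Shelbyville", "first_name": "Alice", "last_name": "Smith"}

print(weighted_match(record, same_city))
print(weighted_match(record, same_name))
```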

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transform() call is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.
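
A minimal sketch of what such a pass-through transformer could look like, following scikit-learn's fit/transform convention. This is an illustration only and not necessarily how `simsity.preprocessing.Identity` is implemented.

```python
class Identity:
    """Pass-through transformer with a scikit-learn style fit/transform interface."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Hand the data back unchanged, e.g. rows that are already embeddings.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


embedded = [[0.1, 0.2], [0.3, 0.4]]
unchanged = Identity().fit_transform(embedded)
```

With an encoder like this as the default, already-transformed data flows straight to the indexer without a separate code path.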

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Hi! Are there any plans to support updating the indexer, i.e. adding new documents without retraining the entire pipeline? That would be a very useful feature.

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    I think the original design was flawed and this project should stick more closely to the scikit-learn API.

    from sklearn.pipeline import make_pipeline, make_union

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PyNNDescentIndexer, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeaviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases (0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].