Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
GUI Pancakeswap V2 and Uniswap V3 trading client (and bot)MOST ADVANCE TRADING BOT SUPPORT WINDOWS LINUX MAC

GUI Pancakeswap 2 and Uniswap 3 trading client (and bot) (MOST ADVANCE TRADING BOT SUPPORT WINDOWS LINUX MAC) UPDATE: MUTI TRADE TOKEN ENABLE ,TRADE 1

2 Dec 27, 2021
A Python library for miHoYo bbs and HoYoLAB Community

A Python library for miHoYo bbs and HoYoLAB Community. genshin 原神签到小助手

384 Jan 05, 2023
Collect links to profiles by username through search engines

Marple Summary Collect links to profiles by username through search engines (currently Google and DuckDuckGo). Quick Start ./marple.py soxoj Results:

125 Dec 19, 2022
Fun telegram bot =)

Recolor Bot About Fun telegram bot, that can change your hair color. Preparations Update package lists sudo apt-get update; Make sure Git and docker-c

Just Koala 4 Jul 09, 2022
[Multithreading] [Proxy - auto & infile]

Discord-Token-Generator-AutoCheck [Multithreading] [Proxy - auto & infile] How to install? pip install -r requirements.txt run generator.py with pytho

Chakeaw__ 3 Oct 17, 2021
Backend for Indipe client

Betsushi Betsu (別), the japanese word meaning "another" and Shiharai (支払い) meaning "payment". Hence the name Betsushi was derived. Introduction This i

Sudodevs 3 Feb 09, 2022
A Python Discord bot project generator

Heater Heat up a Discord bot in a blink What is Heater? Heater is a Command Line Interface tool which allows you to generate a barebones Python Discor

DevGuyAhnaf 5 Jan 14, 2022
An API which returns random AOT quote everytime it's invoked

An API which returns random AOT quote everytime it's invoked

Nishant Sapkota 1 Feb 07, 2022
Automate coin farming for dankmemer. Unlimited accounts at once. Uses a proxy

dankmemer-farm Simple script to farm Dankmemer coins with multiple accounts at once. Requires: Proxies, Discord Tokens Disclaimer I don't take respons

Scobra 12 Dec 20, 2022
Discord raid tool!

GANG Multi Tool Menu: -- YOUTUBE TUTORIAL! Features: Most Advanced Multi Tool! Spammer DM Spammer Friend Spammer Reaction Spam WebhookSpammer Typing

1 Feb 13, 2022
Dados Públicos de CNPJ disponibilizados pela Receita Federal do Brasil

Dados Públicos CNPJ Fonte oficial da Receita Federal do Brasil, aqui. Layout dos arquivos, aqui. A Receita Federal do Brasil disponibiliza bases com o

Aphonso Henrique do Amaral Rafael 102 Dec 28, 2022
As Slack no longer provides an API to invite people, this is a Selenium Python script to do so

As Slack no longer provides an API to invite people, this is a Selenium Python script to do so

Trading bot rienforcement with python

Trading_bot_rienforcement System: Ubuntu 16.04 GPU (GeForce GTX 1080 Ti) Instructions: In order to run the code: Make sure to clone the stable baselin

1 Oct 22, 2021
Change Discord HypeSquad in few seconds!

a simple python script that change your hypesquad to what house you choose

Ho3ein 5 Nov 16, 2022
Simple contact bot for telegram, written in python.

🔗 | Install Install the requirements with pip install -r requirements.txt 📋 | Setup Get your token from BotFather Get your UserId with Nicegram or w

Stehack 3 Dec 10, 2022
Instagram GiftShop Scam Killer

Instagram GiftShop Scam Killer A basic tool for Windows which kills acess to any giftshop scam from Instagram. Protect your Instagram account from the

1 Mar 31, 2022
A Pythonic wrapper for the Wikipedia API

Wikipedia Wikipedia is a Python library that makes it easy to access and parse data from Wikipedia. Search Wikipedia, get article summaries, get data

Jonathan Goldsmith 2.5k Dec 28, 2022
Hello i am TELEGRAM GROUP MANAGEMENT BOT MY NAME IS Evil-Inside ⚡ i have both amazing modules

Evil-Inside DEMO BOT - Evil-Inside Hello i am TELEGRAM GROUP MANAGEMENT BOT MY NAME IS Evil-Inside ⚡ i have both amazing modules ℂ𝕆ℕ𝕋𝔸ℂ𝕋 𝕄𝔼 𝕆ℕ

PANDITHAN 52 Nov 20, 2022
Project for QVault Hackathon which plays sounds based on the letters of a user's name

virtual_instrument Project for QVault Hackathon which plays sounds based on the letters of a user's name I created a virtual instrument using Python a

Paolo Sidera 2 Feb 11, 2022
FTX auto lending bot with python

FTX auto lending bot Get the API key Check my article for step by step + screenshots Setup & Run Install python 3 Install dependency pip install -r re

Patompong Manprasatkul 1 Dec 24, 2021