Simple Similarities Service

Last update: Dec 25, 2022

Related tags

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}

Comments

Add support for pretrained encoders and transformed data

First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

This is just a first crack so happy to incorporate any feedback you might have!

opened by gclen 10
embetter: better embeddings
This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

Problem Statement

When you submit where is my phoone and you get similarities you may get things like:

where is my phone

where is my credit card

Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

Similar Issue

Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
opened by koaning 3
Add `Identity` as default encoder for Service.

As mentioned in https://github.com/koaning/simsity/pull/13:

I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

Made this issue to track progress and to discuss the approach.

opened by koaning 2
Codecalm tutorial on simsity

Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

opened by FrancyJGLisboa 2

Update indexer

Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

from simsity.service import Service

service = Service(
    indexer=indexer,
    encoder=encoder
)

service.train_from_dataf(df, features=["text"])

....

service.update(new_docs, features=["text"])  # <- this

opened by nthomsencph 1

New API

I think the original design was flawed and this project should stick to the scikit-learn API more.

from simsity.preprocessing import Grab
from simsity.service import Service
from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                             PineconeIndexer, QdrantIndexer, WeviateIndexer)


encoder = make_pipeline(
    make_union(
        make_pipeline(Grab("text"), SentenceEncoder()),
        make_pipeline(Grab("title"), SentenceEncoder())
    )
)

service = Service(encoder, indexer, batch_size=50)
service.index(X)
items, dists = service.query(X, n=10)

opened by koaning 0

Education Day Goals
[x] add typing + type checker

[x] add tests for the minhash tools

[ ] collect more useful datasets

[x] automate the benchmarking

[x] write getting started guides

[ ] record a quick demo for colleagues

[ ] add github actions stash
opened by koaning 0
added-components
Adding the MinHash components. This is also an amazing opportunity to:

[ ] add types and a type checker

[ ] add some standard tests for indexers

[ ] add a script to run some benchmarks on the clinc dataset
opened by koaning 0

Releases(0.1.1)

0.1.1(Nov 4, 2021)

Thanks to @gclen you can now re-use scikit-learn pipelines without refitting them internally.
Source code(tar.gz)
Source code(zip)

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].

GitHub Repository

Subtitle Translater

2 Nov 29, 2021

OGE-2022-na-Python - Solving problems in python for the OGE 2022

OGE-2022-na-Python Решение задачек на питоне для ОГЭ 2022 Тут разобраны разные в

0 Oct 14, 2022

AWS-serverless-starter - AWS Lambda serverless stack via Serverless framework

Serverless app via AWS Lambda, ApiGateway and Serverless framework Configuration

3 Feb 02, 2022

A Discord bot for viewing any currency you want comfortably.

Dost Dost is a Discord bot for viewing currencies. Getting Started These instructions will get you a copy of the project up and running on your local

2 Jan 18, 2022

Easy-apply-bot - A LinkedIn Easy Apply bot to help with my job search.

easy-apply-bot A LinkedIn Easy Apply bot to help with my job search. Getting Started First, clone the repository somewhere onto your computer, or down

5 Dec 09, 2022

A simple Facebook Account generator, written in python (needs different Email so Accounts do not get banned)

FacebookAccountGenerator FAB is a Facebook-Account generating script, written in python Installation Use the package manager pip to install selenium p

7 Jan 05, 2023

Python Client for Yandex Cloud Logging

Python Client for Yandex Cloud Logging Installation pip3 install python-yandex-cloud-logging Creating a Yandex Cloud Logging Group yc logging group c

0 Dec 08, 2021

🐲 Powerfull Discord Token Stealer made in python

🐲 Follow me here 🐲 Discord | YouTube | Github ☕ Usage 💻 Downloading git clone https://github.com/KanekiWeb/Powerfull-Token-Stealer

61 Dec 19, 2022

Repositorio que contiene el material mostrado en la primera PyCON de Chile

Buenas prácticas de desarrollo en Python Repositorio que contiene el material mostrado en la primera PyCON de Chile, realizada del 5 al 7 de Noviembre

5 Feb 01, 2022

Telegram bot to host python bots

Host-Bot Setup the api Upload the flask api on your host #its not important to do #i used it just for simple captcha system + save ids on your host!

15 Feb 11, 2022

Bot to notify when vaccine appointments are available

Vaccine Watch Bot to notify when vaccine appointments are available. Supports checking Hy-Vee, Walgreens, CVS, Walmart, Cosentino's stores (KC), and B

37 Aug 13, 2022

Library for working with QIWI API.

2 Apr 26, 2022

A Discord webhook spammer made in Python.

A Python made Discord webhook spammer usually used for token loggers to spam them/delete them original by cattyn I only made it so u can change the avatar to whatever u want instead of it being hardc

15 Dec 15, 2021

Connect your Nintendo Switch playing status to Discord!

Disclaimer: Unfortunately, it appears that Nintendo has removed returning self-Presence in their API as of recently, making this project near obsolete

145 Dec 30, 2022

An API wrapper library for opensea api.

Opensea API An API wrapper library for opensea api. Installation pip3 install opensea Usage Retrieving assets: from opensea import get_assets # This

38 Jul 17, 2022

Telegram bot using python

1 Oct 11, 2021

Shows VRML team stats of all players in your pubs

VRML Team Stat Searcher Displays Team Name, Team Rank (Worldwide), and tier of all the players in your pubs. GUI WIP: Username search works & pub name

2 Dec 22, 2022

A simple discord tool that translates english to either spanish, german or french and sends it. Free to rework but please give me credit.

discord-translator A simple discord tool that translates english to either spanish, german or french and sends it. Free to rework but please give me c

2 Oct 04, 2021

A Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot

Adv Auto Filter Bot This Is Just An Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot.. Just Sent Any Text As Query It Will

4 Dec 10, 2021

File-sharing-Bot: Telegram Bot to store Posts and Documents and it can Access by Special Links.

Bromélia HSS bromelia-hss is the second official implementation of a Diameter-based protocol application by using the Python micro framework Bromélia.

1 Dec 17, 2021