Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Robot to convert files to direct links, hosting files on Telegram servers, unlimited and without restrictions

stream-cloud demo : downloader_star_bot Run : Docker : install docker , docker-compose set Environment or edit Config/init.py docker-compose up Heroku

53 Dec 21, 2022
Discord Selfbot, 90+ commands

Setting the bot up. STEP 1: copy the directory yook.club selfbot was downloaded and extracted into, open cmd and type "cd " then paste. STEP 2: python

yook 1 Dec 12, 2021
Easy to use phishing tool with 63 website templates. Author is not responsible for any misuse.

PyPhisher [+] Created By KasRoudra [+] Description : Ultimate phishing tool in python. Includes popular websites like facebook, twitter, instagram, gi

KasRoudra 1.1k Jan 01, 2023
VideocompBot - This is TG Video Compress BoT. Prouduct By BINARY Tech 💫

VideocompBot - This is TG Video Compress BoT. Prouduct By BINARY Tech 💫

1 Jan 04, 2022
This is a walkthrough about understanding the #BoF machine present in the #OSCP exam.

Buffer Overflow methodology Introduction These are 7 simple python scripts and a methodology to ease (not automate !) the exploitation. Each script ta

3isenHeiM 53 Dec 08, 2022
A liblary whre you can find helpful functions for your discord bot

DBotUtils A liblary whre you can find helpful functions for your discord bot Easy setup Setup is easily and flexible. Change anytime. After setup just

Kondek286 1 Nov 02, 2021
A Telegram AntiChannel bot to ban users who using channel to send message in group

Anti-Channel-bot A Telegram AntiChannel bot to ban users who using channel to send message in group. BOT LINK: Features: Automatic ban Whitelist Unban

Jigar varma 36 Oct 21, 2022
An all-in-one financial analytics and smart portfolio creator as a Discord bot!

An all-in-one financial analytics bot to help you gain quantitative financial insights. Finn is a Discord Bot that lets you explore the stock market like you've never before!

6 Jan 12, 2022
Python: Asynchronous client for the Tailscale API

Python: Asynchronous client for the Tailscale API Asynchronous client for the Tailscale API. About This package allows you to control and monitor Tail

Franck Nijhof 9 Nov 22, 2022
A Telegram Video Watermark Adder Bot in Pyrogram by @AbirHasan2005

Watermark-Bot A Telegram Video Watermark Adder Bot by @AbirHasan2005 Features: Save Custom Watermark Image. Auto Resize Watermark According to Video q

Abir Hasan 95 Nov 20, 2022
A Telelgram Bot to Extract Text from an Image

Text-Scanner-OCR A Telelgram Bot to Extract Text from an Image Configs Vars API_KEY: Your API_KEY from OCR Space GROUP: Your Group Username without '@

ALBY 8 Feb 20, 2022
Fetch tracking numbers of Amazon orders, for the ease of the logistics.

Amazon-Tracking-Number Fetch tracking numbers of Amazon orders, for the ease of the logistics. Read Me First (How to use this code): Get Amazon "Items

Tony Yao 1 Nov 02, 2021
Automatically pick a winner who Retweeted, Commented, and Followed your Twitter account!

AutomaticTwitterGiveaways automates selecting winners for "Retweet, Comment, Follow" type Twitter giveaways.

1 Jan 13, 2022
Repository to access information of stocks in Bombay Stock Exchange.

BSE Repository to access information of stocks in Bombay Stock Exchange. The code in this repository uses BSE API and conclusions made using the code

1 Nov 13, 2021
discord token grabber using python

Discord Token Grabber A Discord token grabber written in Python 3. This version of the grabber only supports Windows. Features No local caching Transf

1 Oct 28, 2021
Fetch information about a public Google document.

xeuledoc Fetch information about any public Google document. It's working on : Google Docs Google Spreadsheets Google Slides Google Drawning Google My

Malfrats Industries 655 Jan 03, 2023
Python library for the eWarehousing Solutions API.

eWarehousing Solutions Python Library This library provides convenient access to the eWarehousing Solutions API from applications written in the Pytho

eWarehousing Solutions 2 Nov 09, 2022
HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions 💻🌐💡

aws-iot-shadow-rest-api HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions 💻 🌐 💡 This simple script implements the following aw

AIIIXIII 3 Jun 06, 2022
Leveraged grid-trading bot using CCXT/CCXT Pro library in FTX exchange.

Leveraged-grid-trading-bot The code is designed to perform infinity grid trading strategy in FTX exchange. The basic trader named Gridtrader.py contro

Hao-Liang Wen 25 Oct 07, 2021
Inline Телеграм бот для отправки GIF-изображений из ВКонтакте

VK GIFS Bot VKGIFSBot - удобный бот для отправки GIF-изображений из ВКонтакте в Телеграмe. Работает это очень просто: бот получает токен ВКонтакте API

Sergievsky Nikita 5 Dec 10, 2022