Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Indian Space Research Organisation API With Python

ISRO Indian Space Research Organisation API Installation pip install ISRO Usage import isro isro.spacecrafts() # returns spacecrafts data isro.lau

Fayas Noushad 5 Aug 11, 2022
Python wrapper library for World Weather Online API

pywwo Python wrapper library for World Weather Online API using lxml.objectify How to use from pywwo import * setKey('your_key', 'free') w=LocalWeat

World Weather Online 20 Dec 19, 2022
domhttpx is a google search engine dorker with HTTP toolkit built with python, can make it easier for you to find many URLs/IPs at once with fast time.

domhttpx is a google search engine dorker with HTTP toolkit built with python, can make it easier for you to find many URLs/IPs at once with fast time

Naufal Ardhani 59 Dec 04, 2022
Wrapper for shh/rsync for use with OpenFOAM and blue bear

bbsync wrapper for shh/rsync for use with OpenFOAM and blue bear About The Project bbsync is a wrapper for shh/rsync for use with OpenFOAM and blue be

1 Dec 10, 2021
Bitcoin tracker hecho con python.

Bitcoin Tracker Precio del Bitcoin en tiempo real. Script simple hecho con python. Rollercoin RollerCoin es un juego en el que puedes ganar bitcoin (y

biyivi 3 Jan 04, 2022
A bot to share Facebook posts.

bot_share_facebook a bot to share Facebook posts. install & clone untuk menjalankan anda bisa melalui terminal contohnya termux, cmd, dan terminal lai

Muhammad Latif Harkat 7 Dec 07, 2022
A telegram bot script for generating session string using pyrogram and telethon on Telegram bot

String-session-Bot Telegram Bot to generate Pyrogram and Telethon String Session. A star ⭐ from you means a lot to us! Usage Deploy to Heroku Tap on a

Wahyusaputra 8 Oct 28, 2022
A PowerPacked Version Of Telegram Leech Bot With Modern Easy-To-Use Interface & UI !

FuZionX Leech Bot A Powerful Telegram Leech Bot Modded by MysterySD to directly Leech to Telegram, with Multi Direct Links Support for Enhanced Leechi

MysterySD 28 Oct 09, 2022
A tool for extracting plain text from Wikipedia dumps

WikiExtractor WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python and requ

Giuseppe Attardi 3.2k Dec 31, 2022
A plugin for modmail-bot for stealing,making ,etc emojis

EmojiPlugin for the Modmail-bot My first plugin .. its very Basic I will make more and better too Only 3 commands for now emojiadd-supports .jpg, .png

1 Dec 28, 2021
Working TikTok Username Auto-Claimer/Sniper/Swapper which will autoclaim username if it´s available

TikTok-AutoClaimer Working TikTok Username Auto-Claimer/Sniper/Swapper which will autoclaim username if it´s available Usage Python 3.6 or above is re

Kevin 18 Dec 08, 2022
Python script to delete old / embarrassing tweets.

Delete Tweets Do you have hundreds of embarrassing tweets on your Twitter profile, that you tweeted over a decade ago as an innocent high schooler, th

Linda Zheng 9 Nov 26, 2022
Powerful Telegram Maintained UserBot in Telethon

Fire-X UserBot The Awaited Bot Fire-X userbot The Most Powerful Telegram Userbot. This Userbot is Safe to use in Your Telegram Account. It is not like

22 Oct 21, 2022
A simple API Wrapper for Guilded.

Guildr A simple API Wrapper for Guilded. Frequently updated! I am not a user of Guilded, meaning I do not keep track of new Guilded updates or patches

2 Mar 07, 2022
A Discord Bot that tracks and displays cryptocurrencies using the CoinMarketCap API

PyBo - A Crypto Inspired Discord Bot Pybo (paɪ boʊ) is a Discord bot that utilizes the discord.py API wrapper to run the bot. Pybo also integrates the

0 Nov 17, 2022
SEMID - OSINT module with lots of discord functions

SEMID Framework About Semid is a framework with different Discord functions and

Hima 20 Sep 23, 2022
Protect Discord server invite link

DiscordOauth2Join Protect discord server invite links! Setup I will not help setting up the discord application, but just python. First, install the r

ZEEE 4 Aug 12, 2021
If you are in allot of groups or channel and you would like to leave them at once use this

Telegram-auto-leave-groups If you are in allot of groups or channel and you would like to leave them at once use this USER GUIDE 👣 Insert your telegr

Julius Njoroge 4 Jan 03, 2023
A Rich renderable for viewing Multiple Sequence Alignments in the terminal.

rich-msa A simple module to render colorful Multiple Sequence Alignment with rich in the terminal. 🔧 Installing Install the rich-msa package directly

Martin Larralde 64 Dec 04, 2022
Um script simples para consultar dados, com API's simples.

Info sobre o Script Esta é uma das mais simples ferramentas para consultar dados. Daqui um tempo eu farei um UPGRADE no painel, irei adicionar um banc

Crowley 6 Apr 11, 2022