Improving Representations via Similarities

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there's an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
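
A minimal sketch of that NaN idea, assuming a multi-output y where a missing label is stored as NaN. The helper name `drop_missing_targets` and the triple layout are hypothetical and only illustrate "translate some values, ignore the rest"; they are not part of the planned API.

```python
import math


def drop_missing_targets(X, y):
    """Flatten a multi-output y into (features, output_index, label) triples.

    Labels that are NaN are treated as unknown and skipped.
    """
    triples = []
    for features, targets in zip(X, y):
        for output_index, label in enumerate(targets):
            if isinstance(label, float) and math.isnan(label):
                continue  # unknown label: ignore it instead of translating it
            triples.append((features, output_index, label))
    return triples


X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [[1, float("nan")], [float("nan"), 0], [0, 1]]
triples = drop_missing_targets(X, y)
```

Each row then contributes only the outputs it actually has a label for, so partially labelled rows are still usable.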

Comments
  • [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.
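
The mechanism behind this error can be sketched without scikit-learn. The `get_params` function below is a simplified stand-in, not scikit-learn's actual code: it reads the argument names from the `__init__` signature and looks up an attribute of the same name for each one. `BrokenEncoder` and `FixedEncoder` are hypothetical classes used only for illustration.

```python
import inspect


def get_params(estimator):
    """Simplified stand-in for scikit-learn's parameter lookup."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    # Each __init__ argument is read back from an attribute of the same name.
    return {name: getattr(estimator, name) for name in names}


class BrokenEncoder:
    """Hypothetical encoder that accepts `device` but never stores it."""

    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name


class FixedEncoder:
    """Same encoder, with every __init__ argument kept as an attribute."""

    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device


try:
    get_params(BrokenEncoder())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed_params = get_params(FixedEncoder(device="cpu"))
```

Printing a Pipeline goes through the same kind of lookup, which is why the missing attribute only surfaces once the encoder is used inside scikit-learn tooling.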

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    from embetter.text import SentenceEncoder

    se = SentenceEncoder()
    repr(se)
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        ...
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
  • Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.
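
One possible shape for the three parameters above, as an illustrative sketch rather than the proposed implementation. The function names `parse_vectors` and `embed_text`, the lowercasing step, and the zero vector for texts without known words are all assumptions. The default regex is the default `token_pattern` of scikit-learn's TfidfVectorizer, and the file format assumed is the GloVe text layout (a word followed by space-separated floats).

```python
import re

# Pooling functions applied per embedding dimension.
POOLERS = {
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "sum": sum,
}


def parse_vectors(lines):
    """Parse GloVe-style text lines: a word followed by space-separated floats."""
    vectors = {}
    for line in lines:
        word, *numbers = line.rstrip().split(" ")
        vectors[word] = [float(number) for number in numbers]
    return vectors


def embed_text(text, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Pool the vectors of all known tokens in `text` into a single vector."""
    dim = len(next(iter(vectors.values())))
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return [0.0] * dim  # assumption: unknown-only text maps to zeros
    pool = POOLERS[pooling]
    return [pool(column) for column in zip(*found)]


toy_lines = ["good 1.0 0.0", "movie 0.0 1.0", "bad -1.0 0.0"]
toy_vectors = parse_vectors(toy_lines)
mean_vector = embed_text("A good movie", toy_vectors)
```

With a real embedding file, `parse_vectors` could be fed an open file handle instead of the toy list, since it only iterates over lines.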

    Stéphan

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    I think it would be a nice addition to add an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.

    spaCy docs on vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).
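
A rough sketch of what such a method could look like when added directly to a class. `ToyEncoder` and its `n_features_out` attribute are hypothetical stand-ins, not Embetter code. The naming scheme (lowercased class name plus column index) mirrors the names scikit-learn generates for transformers such as PCA; scikit-learn's own transformers return a NumPy array of strings, while a plain list keeps this sketch dependency-free.

```python
class ToyEncoder:
    """Hypothetical stand-in for an encoder with a known output width."""

    def __init__(self, n_features_out=3):
        self.n_features_out = n_features_out

    def transform(self, X):
        # Placeholder embedding: one zero vector per input item.
        return [[0.0] * self.n_features_out for _ in X]

    def get_feature_names_out(self, input_features=None):
        # Embedding dimensions have no natural names, so number them
        # and use the lowercased class name as a prefix.
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self.n_features_out)]


feature_names = ToyEncoder().get_feature_names_out()
```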

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

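    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    from embetter.grab import ColumnGrabber
    from embetter.text import SentenceEncoder

    text_emb_pipeline = make_pipeline(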
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases (0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].