skift

scikit-learn wrappers for Python fastText.

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

1   Installation

Dependencies:

  • numpy
  • scipy
  • scikit-learn
  • The fasttext Python package

pip install skift

2   Configuration

Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this folder is created in the system's temporary storage location (e.g. /tmp on *nix systems). To override this default location, set the SKIFT_TEMP_DIR environment variable:

export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
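
The same override can also be applied from within Python. The sketch below sets SKIFT_TEMP_DIR before skift is imported. When exactly skift reads the variable is an assumption here, so setting it first is simply the cautious order. The folder name skift_tmp_example is purely illustrative.

```python
import os
import tempfile

# Hypothetical location; any writable path would do.
temp_dir = os.path.join(tempfile.gettempdir(), "skift_tmp_example")

# Set the variable before importing skift (assumption: doing so early
# ensures skift sees it whenever it resolves its temporary folder).
os.environ["SKIFT_TEMP_DIR"] = temp_dir

# import skift  # import only after the variable has been set
print(os.environ["SKIFT_TEMP_DIR"])
```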

3   Features

4   Wrappers

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
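
The forwarding behaviour described in this notice can be pictured with a toy stand-in. The sketch below is not skift's code: ForwardingClassifier and fake_train_supervised are hypothetical names used only to show how a constructor can keep the arguments it needs and pass the rest on to a training function on every call to fit.

```python
class ForwardingClassifier:
    """Toy stand-in (not skift code) showing keyword-argument forwarding."""

    def __init__(self, input_ix=0, **kwargs):
        self.input_ix = input_ix  # argument consumed by the wrapper itself
        self.kwargs = kwargs      # everything else is kept for the trainer

    def fit(self, train_fn, path):
        # Each call to fit passes the stored keyword arguments along.
        return train_fn(input=path, **self.kwargs)


def fake_train_supervised(input, **params):
    """Hypothetical trainer that only reports what it received."""
    return {"input": input, **params}


clf = ForwardingClassifier(input_ix=1, lr=0.4, epoch=6)
result = clf.fit(fake_train_supervised, "train.txt")
print(result)
```

Here lr and epoch never touch the wrapper's own logic; they travel untouched to the training function, which is the pattern the notice describes.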

4.1   Standard wrappers

These wrappers make no assumptions about the input beyond those commonly made by scikit-learn classifiers, i.e. that the input is a 2D ndarray object and so on.

  • FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

  • IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. The index is set on object construction by providing the input_ix parameter to the constructor.

>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([[5, 'woof']])
[0]
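
To make the difference between the two wrappers concrete, the following sketch shows column selection on plain row data. It is an illustration only: select_column is a hypothetical helper, not part of skift, and it works on nested lists rather than ndarray objects.

```python
def select_column(rows, input_ix=0):
    """Return the value at position input_ix of every row."""
    return [row[input_ix] for row in rows]


rows = [[5, "woof"], [83, "meow"]]

# A first-column wrapper would look at position 0 of each row.
first_col = select_column(rows)

# An index-based wrapper with input_ix=1 would pick the text column.
text_col = select_column(rows, input_ix=1)

print(first_col, text_col)
```

Note that rows passed at prediction time need the same column layout as the training data, otherwise the chosen index points at the wrong value or at nothing.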

4.2   pandas-dependent wrappers

These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:

  • FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.

>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

  • ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. The label is set on object construction by providing the input_col_lbl parameter to the constructor.

>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

5   Contributing

Package author and current maintainer is Shay Palachy ([email protected]). You are more than welcome to approach him for help. Contributions are very welcome.

5.1   Installing for development

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift
pip install -e '.[test]'

To also install fasttext, see instructions in the Installation section.

5.2   Running the tests

To run the tests use:

cd skift
pytest

5.3   Adding documentation

The project is documented using the numpy docstring conventions, chosen because they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and produce human-readable docstrings. When documenting code you add to this project, follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate that it compiles.

6   Credits

Created by Shay Palachy ([email protected]).

Fixes: uniaz, crouffer, amirzamli and sgt.

Comments
  • Fix temp dir permission docker error

    • Remove dependence on the user home directory for temporary storage. User directories ("~/") are not always created for Unix service accounts.
    • Create the temporary directory using tempfile.mkdtemp()
    • Store the directory path in a singleton-like structure accessed via a function call

    This fixes issue https://github.com/shaypal5/skift/issues/6 by creating the tempdir in an OS/environment agnostic way, and does not rely on the users' home directory being writeable.

    opened by crouffer 12
  • Installing fasttext with skift doesn't work

    Tried running this from the README:

    pip install skift[fasttext] --process-dependency-links
    

    Got this error:

    Collecting fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext])
      Could not find a version that satisfies the requirement fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext]) (from versions: 0.2.0, 0.2.1, 0.3.0, 0.3.1, 0.4.0, 0.5.0, 0.5.1, 0.5.12, 0.5.13, 0.5.14, 0.5.15, 0.5.16, 0.5.17, 0.5.18, 0.5.19, 0.6.0, 0.6.1, 0.6.2, 0.6.4, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4, 0.7.5, 0.7.6, 0.8.0, 0.8.1, 0.8.2, 0.8.3)
     No matching distribution found for fasttext==0.1.0+git.3b5fd29; extra == "fasttext" (from skift[fasttext])
    

    Tried with Python 3.6.4 in and out of a virtualenv. Seems skift expects to find a version of fasttext that's not available in pypi?

    bug 
    opened by polm 10
  • error returned during training due to wrong default encoder on Windows 10

    Hello!

    I am trying to train a supervised text classification model on some text that contains also non-alphanumeric characters

    from skift import FirstColFtClassifier
    sk_clf = FirstColFtClassifier(lr=0.25, dim=100, epoch=100, minCount=5, 
                                  minn=3, maxn=6, wordNgrams=3, loss='softmax')
    sk_clf.fit(X_train, y_train)
    

    As soon as the first non alphanumeric character occurs during training I get the following error

    UnicodeEncodeError                        Traceback (most recent call last)
    <ipython-input-8-05c208efc7be> in <module>()
          4                               minn=3, maxn=6, wordNgrams=3, loss='softmax')
          5 # Train fastText classifier
    ----> 6 sk_clf.fit(X_train, y_train)
    
    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\skift\core.py in fit(self, X, y)
        117         temp_trainset_fpath = temp_dataset_fpath()
        118         input_col = self._input_col(X)
    --> 119         dump_xy_to_fasttext_format(input_col, y, temp_trainset_fpath)
        120         # train
        121         self.model = train_supervised(
    
    ~\AppData\Local\Continuum\anaconda3\lib\site-packages\skift\util.py in dump_xy_to_fasttext_format(X, y, filepath)
         68     with open(filepath, 'w+') as wfile:
         69         for text, label in zip(X, y):
    ---> 70             wfile.write('__label__{} {}\n'.format(label, text))
         71 
         72 
    
    ~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
         17 class IncrementalEncoder(codecs.IncrementalEncoder):
         18     def encode(self, input, final=False):
    ---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
         20 
         21 class IncrementalDecoder(codecs.IncrementalDecoder):
    
    UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 493: character maps to <undefined>
    

    As the error clearly shows, this is due to the fact that cp1252.py is the default encoder used by skift. Even though I am on a Windows OS, I am using Python 3.7 installed with Anaconda 5.3.0, and the standard encoding as far as I know should be UTF-8. (I have already verified that, by simply renaming the utf_8.py encoder as cp1252.py, the model training completes without any error. This is a dirty hack I would like to avoid though, because I plan to operationalize the model in production on Azure ML Studio).

    Is there a way to enforce skift to use as default the utf_8.py encoder?

    Any help appreciated!

    Kind regards

    bug good first issue 
    opened by 86mm86 9
  • Adding model tuning.

    The cli interface to fasttext to do parameter tuning and model quantization:

    fasttext supervised -input model_train.train -output model_tune -autotune-validation model_train.valid -autotune-modelsize 100M -autotune-duration 1200 -loss one-vs-all
    

    Do you plan to implement it in your package at some point? If so, I can make a PR with a piece of code that does the job.

    enhancement help wanted good first issue 
    opened by robinicole 7
  • WIP: core: support autotune

    Hi, added support for auto-tuning. Please LMK if you support this direction, and I'll add documentation and more tests to make it a mergeable PR.

    Signed-off-by: Dimid Duchovny [email protected]

    opened by dimidd 4
  • Return ndarrays instead of lists while predicting

    The functions predict, predict_proba return lists instead of numpy arrays which makes them unusable with classifiers like sklearn.multiclass.OneVsRestClassifier. GridSearch and other similar functionality also don't work.

    This is a quick fix.

    bug good first issue 
    opened by uniaz 4
  • Support for string labels

    skift seems to expect integer labels and will fail when using string labels.

    For instance, when running

    from skift import FirstColFtClassifier
    import pandas as pd
    df = pd.DataFrame(
        data=[
            ['woof', 'a'],
            ['meow', 'b'],
            ['squick', 'c'],
        ],
        columns=['txt', 'lbl'],
    )
    sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
    sk_clf.fit(df[['txt']], df['lbl'])
    sk_clf.predict([['squick']])
    

    I get

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-32-52a73258e761> in <module>
    ----> 1 sk_clf.predict([['squick']])
    
    /usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in predict(self, X)
        165         return np.array([
        166             self._clean_label(res[0][0])
    --> 167             for res in self._predict(X)
        168         ], dtype=np.float_)
        169 
    
    /usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in <listcomp>(.0)
        165         return np.array([
        166             self._clean_label(res[0][0])
    --> 167             for res in self._predict(X)
        168         ], dtype=np.float_)
        169 
    
    /usr/local/Caskroom/miniconda/base/envs/base/lib/python3.7/site-packages/skift/core.py in _clean_label(ft_label)
        135     @staticmethod
        136     def _clean_label(ft_label):
    --> 137         return int(ft_label[9:])
        138 
        139     def _predict_on_str_arr(self, str_arr, k=1):
    
    ValueError: invalid literal for int() with base 10: 'c'
    

    This is a bit unexpected since neither sklearn nor fasttext require integer labels.

    I guess skift could handle that either by:

    • passing the string labels directly to fasttext (caveat: might require some cleaning)
    • automatically calling LabelEncoder (e.g. as in sklearn's code for LR)
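
The second option can be sketched without any dependencies. The code below is a minimal stand-in for what a label encoder does (sklearn's LabelEncoder offers the same idea); encode_labels and decode_labels are hypothetical helpers, not part of skift, and wiring them into fit and predict is left open.

```python
def encode_labels(labels):
    """Map arbitrary labels to integer codes; return codes and class list."""
    classes = sorted(set(labels))
    to_int = {label: code for code, label in enumerate(classes)}
    return [to_int[label] for label in labels], classes


def decode_labels(codes, classes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


codes, classes = encode_labels(["a", "b", "c", "a"])
restored = decode_labels(codes, classes)
print(codes, classes, restored)
```

With such a mapping, integer codes would be handed to the classifier during training and predictions would be decoded back to strings afterwards.
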
    enhancement help wanted good first issue 
    opened by michelole 3
  • utf-8 encoding for xy input file

    fastText assumes UTF-8 encoded text (see fastText Python README).

    Without the encoding flag, the xy input file is written using the system's locale, which is problematic, especially on Windows. Attempting to train a model with text which uses utf-8 symbols results in an exception.

    Passing the flag to open when writing the input file solves this issue.
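
The kind of change described here can be sketched as follows. This is not the code of the pull request: dump_xy_utf8 is a hypothetical helper modelled on the write loop visible in the traceback of the related Windows issue, with an explicit encoding argument added to open.

```python
import os
import tempfile


def dump_xy_utf8(X, y, filepath):
    """Write text/label pairs in fastText format, always as UTF-8."""
    # An explicit encoding avoids falling back to the system locale
    # (for example cp1252 on Windows).
    with open(filepath, "w", encoding="utf-8") as wfile:
        for text, label in zip(X, y):
            wfile.write("__label__{} {}\n".format(label, text))


path = os.path.join(tempfile.mkdtemp(), "train.txt")
dump_xy_utf8(["\u010daj", "woof"], [1, 0], path)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8") as rfile:
    lines = rfile.read().splitlines()
print(lines)
```

The character '\u010d' is the one reported in the UnicodeEncodeError above; with the encoding fixed to UTF-8 it is written regardless of the platform default.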

    opened by sgt 3
  • 1D array input for training

    Hi,

    I'm very sorry for asking such a basic question but can't work this one out! Usually, I see other text classifiers taking one of three forms;

    1. (1D) List of strings, if it performs tokenisation and vectorisation itself
    2. (2D) List of tokens if it performs vectorisation itself
    3. (2D) List of vectors if it is just a classifier

    I'm a little confused as the readme does not have a case where multiple tokens are inputted into the model. However, in the tests it appears that it is trained on a pd.DataFrame for X and a pd.Series for y. I believe fasttext does the tokenisation and vectorisation itself, so why do we need a two-dimensional input instead of a 1D list of strings? Is there a benefit to doing it that way over something like this;

    FtClassifier().fit(
        ['Input 1', 'Input 2'],
        [1, 0]
    )
    

    or the equivalent but with 1D numpy arrays?

    Many thanks! Dom

    question 
    opened by DomHudson 3
  • os.makedirs(TEMP_DIR, exist_ok=True) causes PermissionError in docker container

    Running skift in a docker container results in permission errors when trying to load previously generated models.

    File "/usr/local/lib/python3.5/dist-packages/skift/util.py", line 10, in PermissionError: [Errno 13] Permission denied: '/root/.temp'

    The problem is the docker container is running as user 'root', but the /root/ folder is not writable.

    I have a fix, and will open a pull request shortly

    bug 
    opened by crouffer 2
  • hyperparameter tuning

    How can we tune parameters? https://fasttext.cc/docs/en/autotune.html uses autotuneValidationFile to feed a validation set to the model. How can we set this parameter?

    question 
    opened by Alihjt 1
  • Add multi-label support

    Add support for providing multi-label labels in a scikit-learn-compliant format, utilizing (under the hood) fasttext's support for multi-label scenarios.

    enhancement help wanted 
    opened by shaypal5 4
Releases(v0.0.23)