Library for fast text representation and classification.

Overview

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Resources

Models

Supplementary data

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We continuously build and test the library, CLI and Python bindings under various Docker images using CircleCI.

Generally, fastText builds on modern macOS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support, such as:

  • (g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

  • Python 2.6 or newer
  • NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

  • Python 2.7, or 3.4 or newer
  • NumPy & SciPy
  • pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release in the usual place.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.

Example use cases

This library has two main use cases: word representation learning and text classification. These are described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default, the word vectors take into account character n-grams of 3 to 6 characters. At the end of optimization, the program saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
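To make the role of character n-grams concrete, here is a minimal Python sketch of subword extraction. It assumes, following the description in [1], that a word is wrapped in the boundary markers < and > before its n-grams are taken. The function name char_ngrams is hypothetical, and the real implementation may differ in details such as the handling of multi-byte UTF-8 characters.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Assumption: the word is wrapped in '<' and '>' boundary markers,
    as described in [1]. Illustrative sketch only, not fastText's code.
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


if __name__ == "__main__":
    # Trigrams only, to keep the output short.
    print(char_ngrams("where", minn=3, maxn=3))
```

Because unseen words still share n-grams with words seen during training, a representation built from such pieces can be computed for them as well.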

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
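The printed vectors can be consumed by a small script. The Python sketch below assumes each output line has the form "word v1 v2 ... vn", separated by whitespace; this layout is an assumption rather than something stated here, and the helper names and sample lines are made up for illustration.

```python
def parse_vector_line(line):
    """Split one output line into (word, vector).

    Assumption: the line looks like 'word v1 v2 ... vn'.
    """
    parts = line.split()
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector


def read_vectors(lines):
    """Build a {word: vector} dict from an iterable of output lines."""
    vectors = {}
    for line in lines:
        if line.strip():  # skip blank lines
            word, vector = parse_vector_line(line)
            vectors[word] = vector
    return vectors


if __name__ == "__main__":
    # Made-up sample lines standing in for real output.
    sample = ["cat 0.1 -0.2 0.3", "dog 0.0 0.5 -1.0"]
    print(read_vectors(sample))
```

To use it at the end of a pipe, pass sys.stdin to read_vectors instead of the sample list.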

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
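As an illustration of what precision and recall at k measure, the following Python sketch computes both from gold labels and ranked predictions. It assumes the definitions P@k = (correct labels among the top k predictions) / (number of predictions made) and R@k = (correct labels among the top k predictions) / (number of gold labels), aggregated over all examples. The helper names and the sample data are made up, and this is not the code that the test command runs.

```python
LABEL_PREFIX = "__label__"  # default label prefix


def split_labels(line, prefix=LABEL_PREFIX):
    """Separate the labels from the text tokens of one line."""
    labels, tokens = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token)
        else:
            tokens.append(token)
    return labels, tokens


def precision_recall_at_k(gold, ranked_predictions, k=1):
    """Return (P@k, R@k) aggregated over all examples.

    gold: one list of gold labels per example.
    ranked_predictions: one list of predicted labels per example, best first.
    """
    correct = predicted = relevant = 0
    for gold_labels, preds in zip(gold, ranked_predictions):
        top_k = preds[:k]
        correct += len(set(top_k) & set(gold_labels))
        predicted += len(top_k)
        relevant += len(gold_labels)
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up test lines and predictions.
    lines = [
        "__label__sports __label__news the match ended in a draw",
        "__label__food a simple pasta recipe",
    ]
    gold = [split_labels(line)[0] for line in lines]
    preds = [["__label__sports", "__label__food"],
             ["__label__news", "__label__food"]]
    print(precision_recall_at_k(gold, preds, k=1))
    print(precision_recall_at_k(gold, preds, k=2))
```

Raising k typically trades precision for recall, since more predictions are made per example.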

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains one piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce the results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.
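One common use of such vectors is comparing two paragraphs. The Python sketch below computes the cosine similarity between two vectors, assuming each output line holds only whitespace-separated floats. That layout, the helper names and the sample values are assumptions for illustration; nothing here is part of fastText itself.

```python
import math


def parse_floats(line):
    """Turn one line of whitespace-separated numbers into a list of floats."""
    return [float(x) for x in line.split()]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up vectors standing in for two output lines.
    first = parse_floats("0.1 0.3 -0.2")
    second = parse_floats("0.2 0.6 -0.4")
    print(cosine_similarity(first, second))
```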

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
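The -bucket option bounds the number of slots available for n-grams. A common way to achieve such a bound is the hashing trick: each n-gram is hashed and the hash is reduced modulo the bucket count, so unrelated n-grams may end up sharing a slot. The Python sketch below uses the 32-bit FNV-1a hash as an assumed hash function; neither the choice of hash nor the function names come from this README.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(data):
    """Compute the 32-bit FNV-1a hash of a bytes object."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket_index(ngram, bucket=2000000):
    """Map an n-gram string to a slot in the range [0, bucket)."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket


if __name__ == "__main__":
    for ngram in ["<wh", "whe", "her", "ere", "re>"]:
        print(ngram, bucket_index(ngram))
```

A smaller bucket count saves memory at the cost of more collisions between n-grams.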

References

Please cite [1] if using this code for learning word representations or [2] if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.

Comments
  • fasttext installed but import fails

    Hi, I have successfully installed fasttext on python3.5. However, when I try to import it I get the following error:

    Using /usr/local/lib/python3.5/dist-packages
    Finished processing dependencies for fasttext==0.8.22
    ~/GitHub/fastText$ python3.5
    Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fasttext
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named 'fasttext'
    >>> 
    

    I have tried installing with both pip install . and python setup.py install, with no luck.

    opened by ahmedahmedov 25
  • Assertion failed on ./fasttext predict

    predict command failed!

    ./fasttext predict model.bin test.txt

    Assertion failed: (counts.size() == osz_), function setTargetCounts, file src/model.cc, line 188.
    Abort trap: 6
    

    model train command was:

    ./fasttext supervised -input train.txt -output model -wordNgrams 4 -bucket 1000000 -thread 16

    Read 4223M words
    Number of words:  16577869
    Number of labels: 25
    Progress: 100.0%  words/sec/thread: 375706  lr: 0.000000  loss: 0.169518  eta: 0h0m 
    
    opened by spate141 25
  • How can we get the vector of a paragraph?

    I have previously tried doc2vec (from gensim, based on word2vec), with which I can extract a fixed-length vector for variable-length paragraphs. Can I do the same with fastText?

    Thank you!

    opened by xchangcheng 22
  • OS X install problem

    When I install fasttext using "pip install .", I get errors like the following:

    Failed to build fasttext
    Installing collected packages: fasttext
      Running setup.py install for fasttext ... error
        Complete output from command /miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        creating build
        creating build/lib.macosx-10.7-x86_64-3.6
        creating build/lib.macosx-10.7-x86_64-3.6/fastText
        copying python/fastText/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
        copying python/fastText/FastText.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
        creating build/lib.macosx-10.7-x86_64-3.6/fastText/util
        copying python/fastText/util/util.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
        copying python/fastText/util/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
        creating build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/test_script.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/test_configurations.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        running build_ext
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.o -stdlib=libc++
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.o -std=c++14
        warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.o -fvisibility=hidden
        warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
        1 warning generated.
        building 'fasttext_pybind' extension
        creating build/temp.macosx-10.7-x86_64-3.6
        creating build/temp.macosx-10.7-x86_64-3.6/python
        creating build/temp.macosx-10.7-x86_64-3.6/python/fastText
        creating build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind
        creating build/temp.macosx-10.7-x86_64-3.6/src
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c python/fastText/pybind/fasttext_pybind.cc -o build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        python/fastText/pybind/fasttext_pybind.cc:219:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                    for (int32_t i = 0; i < vocab_freq.size(); i++) {
                                        ~ ^ ~~~~~~~~~~~~~~~~~
        python/fastText/pybind/fasttext_pybind.cc:233:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                    for (int32_t i = 0; i < labels_freq.size(); i++) {
                                        ~ ^ ~~~~~~~~~~~~~~~~~~
        2 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/dictionary.cc -o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/dictionary.cc:181:52: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
            for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
                                                         ~ ^  ~~~~~~~~~~~
        src/dictionary.cc:186:13: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
                  ~ ^  ~~~~~~~~~~~
        src/dictionary.cc:198:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t i = 0; i < size_; i++) {
                             ~ ^ ~~~~~
        src/dictionary.cc:296:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t i = 0; i < size_; i++) {
                             ~ ^ ~~~~~
        src/dictionary.cc:316:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < hashes.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:318:31: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
                                    ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:515:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<fasttext::entry, std::__1::allocator<fasttext::entry> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < words_.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:517:12: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                (j < words.size() && words[j] == i)) {
                 ~ ^ ~~~~~~~~~~~~
        8 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/main.cc -o build/temp.macosx-10.7-x86_64-3.6/src/main.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/main.cc:348:3: warning: code will never be executed [-Wunreachable-code]
          exit(0);
          ^~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/fasttext.cc -o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/fasttext.cc:92:21: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int i = 0; i < ngrams.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~
        src/fasttext.cc:302:18: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
            return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                   ~~~~~ ^  ~~
        src/fasttext.cc:302:34: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
            return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                                   ~~~~~ ^  ~~
        src/fasttext.cc:323:16: warning: 'selectEmbeddings' is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
            auto idx = selectEmbeddings(qargs.cutoff);
                       ^
        src/fasttext.h:165:3: note: 'selectEmbeddings' has been explicitly marked deprecated here
          FASTTEXT_DEPRECATED("selectEmbeddings is being deprecated.")
          ^
        src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
        #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                        ^
        src/fasttext.cc:322:40: warning: comparison of integers of different signs: 'const size_t' (aka 'const unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          if (qargs.cutoff > 0 && qargs.cutoff < input->size(0)) {
                                  ~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
        src/fasttext.cc:327:24: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (auto i = 0; i < idx.size(); i++) {
                             ~ ^ ~~~~~~~~~~
        src/fasttext.cc:380:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t w = 0; w < line.size(); w++) {
                              ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:384:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
              if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                          ~~~~~ ^ ~~~~~~~~~~~
        src/fasttext.cc:398:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t w = 0; w < line.size(); w++) {
                              ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:402:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
              if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                          ~~~~~ ^ ~~~~~~~~~~~
        src/fasttext.cc:479:27: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (int32_t i = 0; i < line.size(); i++) {
                                ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:514:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < ngrams.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/fasttext.cc:551:5: warning: 'precomputeWordVectors' is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
            precomputeWordVectors(*wordVectors_);
            ^
        src/fasttext.h:180:3: note: 'precomputeWordVectors' has been explicitly marked deprecated here
          FASTTEXT_DEPRECATED("precomputeWordVectors is being deprecated.")
          ^
        src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
        #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                        ^
        src/fasttext.cc:585:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
              if (heap.size() == k && similarity < heap.front().first) {
                  ~~~~~~~~~~~ ^  ~
        src/fasttext.cc:590:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
              if (heap.size() > k) {
                  ~~~~~~~~~~~ ^ ~
        src/fasttext.cc:701:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          for (size_t i = 0; i < n; i++) {
                             ~ ^ ~
        src/fasttext.cc:706:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
            for (size_t j = 0; j < dim; j++) {
                               ~ ^ ~~~
        src/fasttext.cc:718:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          for (size_t i = 0; i < n; i++) {
                             ~ ^ ~
        src/fasttext.cc:723:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
            for (size_t j = 0; j < dim; j++) {
                               ~ ^ ~~~
        19 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/utils.cc -o build/temp.macosx-10.7-x86_64-3.6/src/utils.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/model.cc -o build/temp.macosx-10.7-x86_64-3.6/src/model.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/loss.cc -o build/temp.macosx-10.7-x86_64-3.6/src/loss.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/loss.cc:83:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() == k && std_log(output[i]) < heap.front().first) {
                ~~~~~~~~~~~ ^  ~
        src/loss.cc:88:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() > k) {
                ~~~~~~~~~~~ ^ ~
        src/loss.cc:257:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < pathToRoot.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~~~~~
        src/loss.cc:282:19: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          if (heap.size() == k && score < heap.front().first) {
              ~~~~~~~~~~~ ^  ~
        src/loss.cc:289:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() > k) {
                ~~~~~~~~~~~ ^ ~
        5 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/productquantizer.cc -o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/productquantizer.cc:246:22: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<float, std::__1::allocator<float> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (auto i = 0; i < centroids_.size(); i++) {
                           ~ ^ ~~~~~~~~~~~~~~~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/args.cc -o build/temp.macosx-10.7-x86_64-3.6/src/args.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/args.cc:93:23: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<std::__1::basic_string<char>, std::__1::allocator<std::__1::basic_string<char> > >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int ai = 2; ai < args.size(); ai += 2) {
                           ~~ ^ ~~~~~~~~~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/quantmatrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/matrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/meter.cc -o build/temp.macosx-10.7-x86_64-3.6/src/meter.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/vector.cc -o build/temp.macosx-10.7-x86_64-3.6/src/vector.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/densematrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        g++ -bundle -undefined dynamic_lookup -L/miniconda3/lib -arch x86_64 -L/miniconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o build/temp.macosx-10.7-x86_64-3.6/src/main.o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o build/temp.macosx-10.7-x86_64-3.6/src/utils.o build/temp.macosx-10.7-x86_64-3.6/src/model.o build/temp.macosx-10.7-x86_64-3.6/src/loss.o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o build/temp.macosx-10.7-x86_64-3.6/src/args.o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o build/temp.macosx-10.7-x86_64-3.6/src/meter.o build/temp.macosx-10.7-x86_64-3.6/src/vector.o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -o build/lib.macosx-10.7-x86_64-3.6/fasttext_pybind.cpython-36m-darwin.so
        clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
        ld: library not found for -lstdc++
        clang: error: linker command failed with exit code 1 (use -v to see invocation)
        error: command 'g++' failed with exit status 1
    
        ----------------------------------------
    Command "/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/
    

    And my environment is

    Apple LLVM version 10.0.0 (clang-1000.10.44.4)
    Target: x86_64-apple-darwin18.2.0
    Thread model: posix
    InstalledDir: /Library/Developer/CommandLineTools/usr/bin
     "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.14.0 -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -fno-strict-return -masm-verbose -munwind-tables -target-cpu penryn -dwarf-column-info -debugger-tuning=lldb -target-linker-version 409.12 -v -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/usr/local/include -stdlib=libc++ -fdeprecated-macro -fdebug-compilation-dir /Users/ruanxiaoyi/Downloads/fastText-master -ferror-limit 19 -fmessage-length 204 -stack-protector 1 -fblocks -fencode-extended-block-signature -fobjc-runtime=macosx-10.14.0 -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -o - -x c++ -
    clang -cc1 version 10.0.0 (clang-1000.10.44.4) default target x86_64-apple-darwin18.2.0
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/v1"
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/local/include"
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/Library/Frameworks"
    #include "..." search starts here:
    #include <...> search starts here:
     /usr/local/include
     /Library/Developer/CommandLineTools/usr/include/c++/v1
     /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0/include
     /Library/Developer/CommandLineTools/usr/include
     /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include
     /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks (framework directory)
    

    Any suggestions for this problem?

    Python Build 
    opened by rxy1212 21
  • Any plan to support different weight for each class in loss function?

    Any plan to support different weight for each class in loss function?

    Looking at the current code, it seems that the loss function is evaluated with the same weight for every class, which is fine for balanced data. For highly imbalanced data, is there any plan to support a different weight for each class in the loss function? I am thinking of something like this on the command line:

    fasttext -input XXX -output XXX -weight_class1 10 -weight_class2 1 -weight_class3 3 
    

    or simply

    fasttext -weight_balanced 
    

    if the weight should be inversely proportional to the number of instances in each class?
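
    A minimal sketch of the "balanced" weighting idea described above, assuming the common heuristic weight = n_samples / (n_classes * count). The function name balanced_class_weights and the sample labels are hypothetical, and nothing here is a fastText option; it only shows what such weights would look like.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Hypothetical helper: uses weight = n_samples / (n_classes * count),
    so a perfectly balanced dataset gets a weight of 1.0 for every class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical, highly imbalanced label list.
labels = ["class1"] * 10 + ["class2"] * 100 + ["class3"] * 30
for label, weight in sorted(balanced_class_weights(labels).items()):
    print(label, round(weight, 3))
```

    Rare classes receive larger weights, so their errors would count more in a weighted loss.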

    opened by kuangchen 18
  • Interpreting Multilabel output

    Interpreting Multilabel output

    I trained a model with multilabel targets, but when I use the predict_prob function, the output looks like conditional probabilities rather than multilabel output.

    I assumed that each label would get its own value between 0 and 1, but instead the values across all labels add up to 1.

    Can someone help me understand this output?
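
    One plausible explanation, offered as an assumption rather than a diagnosis, is that the model was trained with a softmax loss, which normalizes scores across labels. The sketch below uses made-up scores to contrast softmax (outputs sum to 1) with independent per-label sigmoids (each output lies between 0 and 1 on its own). The v0.2.0 release notes further down this page mention a one-vs-all loss intended for multi-label classification.

```python
import math


def softmax(scores):
    """Normalize scores so they form a distribution over labels (sums to 1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_each(scores):
    """Score every label independently; values need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [2.0, 1.5, -1.0]  # hypothetical raw scores for three labels
softmax_probs = softmax(scores)
sigmoid_probs = sigmoid_each(scores)
print("softmax:", softmax_probs, "sum =", sum(softmax_probs))
print("sigmoid:", sigmoid_probs, "sum =", sum(sigmoid_probs))
```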

    opened by iymitchell 17
  • The memory error when loading the pre-trained model

    The memory error when loading the pre-trained model

    I get a memory error when I try to load the pre-trained model, e.g., model = fasttext.load_model('D:/download/wiki.en/wiki.en.bin').

    The bin file is almost 9 GB and my machine has only 4 GB of memory, so I am looking for a memory-friendly way to load the model. Can anyone give me a clue?
    Thanks a lot!

    opened by zhouchichun 16
  • Quantize error

    Quantize error

    I have already trained model_1.bin with the supervised option, and when I try to quantize that model I get the following error:

    /opt/fastText/fasttext quantize -input data.txt -output models/model_1 -verbose 3 -wordNgrams 3 -bucket 1000000 -minn 3 -maxn 6 -lr 0.010 -dim 100 -loss ns -thread 8 -epoch 10 -qnorm -retrain -cutoff 100000
    
    fasttext: src/vector.cc:71: void fasttext::Vector::addRow(const fasttext::Matrix&, int64_t): Assertion `i < A.m_' failed.
    Aborted (core dumped)
    

    Edit: If I don't use -cutoff, it runs without any error.

    opened by spate141 16
  • Loss - OVA model - Not predicting sigmoid output in Ubuntu 16.04

    Loss - OVA model - Not predicting sigmoid output in Ubuntu 16.04

    Install Log:

    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/args.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/matrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/dictionary.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/loss.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/productquantizer.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/densematrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/quantmatrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/vector.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/model.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/utils.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/meter.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/fasttext.cc
    src/fasttext.cc: In member function ‘void fasttext::FastText::quantize(const fasttext::Args&)’:
    src/fasttext.cc:323:16: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
      auto idx = selectEmbeddings(qargs.cutoff);
      ^
    src/fasttext.cc:293:22: note: declared here
      std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
      ^
    src/fasttext.cc:323:45: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
      auto idx = selectEmbeddings(qargs.cutoff);
      ^
    src/fasttext.cc:293:22: note: declared here
      std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
      ^
    src/fasttext.cc: In member function ‘void fasttext::FastText::lazyComputeWordVectors()’:
    src/fasttext.cc:551:5: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
      precomputeWordVectors(*wordVectors_);
      ^
    src/fasttext.cc:534:6: note: declared here
      void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
      ^
    src/fasttext.cc:551:40: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
      precomputeWordVectors(*wordVectors_);
      ^
    src/fasttext.cc:534:6: note: declared here
      void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
      ^
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops args.o matrix.o dictionary.o loss.o productquantizer.o densematrix.o quantmatrix.o vector.o model.o utils.o meter.o fasttext.o src/main.cc -o fasttext

    The output is not sigmoid; it is still the same as with softmax. Args: dim 100 ws 5 epoch 1 minCount 1 neg 5 wordNgrams 3 loss one-vs-all model sup bucket 1000000 minn 3 maxn 3 lrUpdateRate 100 t 0.0001

    bug 
    opened by giriannamalai 15
  • Binary model that was trained on Common crawl

    Binary model that was trained on Common crawl

    Hello! I enjoy using your library and pretrained vectors. I see that for vectors that were trained on wiki you provide both binary model and pretrained vectors. However, for vectors that were trained on Common crawl, you only provide pretrained vectors. Is it possible for you to publish binary model for them?

    Thanks, Alexander.

    opened by MrBoor 15
  • Running on PowerPC64LE (ppc64le)

    Running on PowerPC64LE (ppc64le)

    I am able to compile the stable (0.1.0) version of the code on a powerpc64le (IBM Minsky) without any errors or warnings. However, when I run it on any dataset (e.g., stackexchange cooking) using just the defaults, ./fasttext supervised -input ... -output ..., the program hangs after displaying Reading ... words. I tried make debug as well, with the same result. (Details: make 4.1, Ubuntu 16.04.3 LTS.) Any ideas?

    opened by ironv 15
  • What's the status of this project?

    What's the status of this project?

    The last release was in 2020-04, I see a lot of unsolved installation issues, and I miss pre-built wheels on https://pypi.org/project/fasttext/#files

    What is the future of this project or is it just dead?

    opened by return42 1
  • dependency errors

    dependency errors

    Hi,

    We recently conducted a study to detect build dependency errors, focusing on missing dependencies and redundant dependencies. A missing dependency (MS) is a dependency that is not declared in the build script, and a redundant dependency (RD) is a dependency that is declared in the build script but not actually used. We have detected the following dependency errors in your public projects. Could you please help us check them? The data format is dependency --- target.

    MS
    0['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/densematrix.h---fasttext']
    1['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/vector.h---fasttext']
    2['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/model.h---fasttext']
    3['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/args.h---fasttext']
    4['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/meter.h---fasttext']
    5['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/fasttext.h---fasttext']
    6['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/real.h---fasttext']
    7['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/main.cc---fasttext']
    8['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/matrix.h---fasttext']
    9['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/utils.h---fasttext']
    10['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/dictionary.h---fasttext']

    RD
    0['src/utils.h---productquantizer.o']
    1['src/utils.h---quantmatrix.o']
    2['src/fasttext.cc---fasttext']
    3['src/utils.h---vector.o']
    4['src/args.h---model.o']

    opened by Meiye-lj 0
  • Program running results are abnormal

    Program running results are abnormal

    anaconda3/bin/python3.8

    import fasttext.util
    ft = fasttext.load_model('cc.zh.300.bin')
    sentence_w1 = ft.get_sentence_vector('色诫')
    print(sentence_w1)

    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

    opened by Chu-J 0
  • Language names of Languages supported by Fasttext

    Language names of Languages supported by Fasttext

    I am trying to find out the names of languages supported by Fasttext's LID tool, given these language codes listed here:

    af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
    

    I tried to map ISO codes to each language, but the set seems non-standard, using either ISO-639-1 or ISO-639-3 codes depending on the language. Does anyone have a list of language names for these codes, or know how to find them?
    Wikipedia's list does not cover all of them either, so manual mapping did not help.
    Thanks!

    opened by AetherPrior 1
  • Lookup tables for language labels - Python module

    Lookup tables for language labels - Python module

    I'm using the prebuilt lid.176.ftz model to do simple language ID on short texts (160 chars or fewer) using the Python module.

    Is there a lookup table (dictionary) for the labels?

    e.g.

    {
        "en": "English", 
        "fr": "French",
         ...
    }
    

    Some of the labels fastText returns are quite obscure languages & I've had to trawl a lot of ISO-639 docs to establish what they refer to in order to build my own lookup table.

    Or have I simply missed something in the docs /API that tells me how to get these?
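
    A minimal sketch of the kind of lookup described above, assuming labels arrive with fastText's default __label__ prefix (for example __label__en). The table LANGUAGE_NAMES and the helper language_name are hypothetical and deliberately incomplete; the full mapping would still have to be built from an ISO 639 source.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix

# Hypothetical, deliberately incomplete table; extend it from an ISO 639 source.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "zh": "Chinese",
}


def language_name(label, table=LANGUAGE_NAMES):
    """Strip the label prefix and return a readable name, or the bare code if unknown."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return table.get(code, code)


print(language_name("__label__en"))
print(language_name("__label__xmf"))  # not in the table, so the code is returned unchanged
```

    Falling back to the bare code keeps unknown labels visible instead of failing, which makes gaps in the table easy to spot.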

    opened by RedactedCode 0
Releases(v0.9.2)
  • v0.9.2(Apr 28, 2020)

    We are happy to announce the release of version 0.9.2.

    WebAssembly

    We are excited to release fastText bindings for WebAssembly. Classification tasks are widely used in web applications and we believe giving access to the complete fastText API from the browser will notably help our community to build nice tools. See our documentation to learn more.

    Autotune: automatic hyperparameter optimization

    Finding the best hyperparameters is crucial for building efficient models. However, searching for the best hyperparameters manually is difficult. This release includes the autotune feature, which automatically finds the best hyperparameters for your dataset. You can find more information on how to use it here.

    Python

    fastText loves Python. In this release, we have:

    • several bug fixes for prediction functions
    • nearest neighbors and analogies for Python
    • a memory leak fix
    • website tutorials with Python examples

    The autotune feature is fully integrated with our Python API. This allows us to have a more stable autotune optimization loop from Python and to synchronize the best hyper-parameters with the _FastText model object.

    Pre-trained models tool

    We release two helper scripts:

    They can also be used directly from our Python API.

    More metrics

    When you test a trained model, you can now have more detailed results for the precision/recall metrics of a specific label or all labels.

    Paper source code

    This release contains the source code of the unsupervised multilingual alignment paper.

    Community feedback and contributions

    We want to thank our community for giving us feedback on Facebook and on GitHub.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Jul 4, 2019)

    We are happy to announce the release of version 0.9.1.

    New release of python module

    The main goal of this release is to merge two existing python modules: the official fastText module which was available on our github repository and the unofficial fasttext module which was available on pypi.org.

    You can find an overview of the new API here, and more insight in our blog post.

    Refactoring

    This version includes a massive rewrite of internal classes. Training and testing are now split across three classes: Model, which takes care of the computational aspect; Loss, which handles the loss and applies gradients to the output matrix; and State, which is responsible for holding the model's state inside each thread.

    That makes the code more straightforward to read and also gives a smaller memory footprint, because the data needed for loss computation is now held only once, whereas before there was one copy per thread.

    Misc

    • Compilation issues fix for recent versions of Mac OS X.
    • Better unicode handling :
      • on_unicode_error argument that helps to handle unicode issues one can face with some datasets
      • bug fix related to different behaviour of pybind11's py::str class between python2 and python3
    • script for unsupervised alignment
    • public file hosting changed from aws to fbaipublicfiles
    • we added a Code of Conduct file.

    Thank you !

    As always, we want to thank you for your help and your precious feedback, which helps make this project better.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Dec 19, 2018)

    We are happy to announce the change of the license from BSD+patents to MIT and the release of fastText 0.2.0.

    The main purpose of this release is to set a beta C++ API of the FastText class. The class now behaves as a computational library: we moved the display and some usage error handling outside of it (mainly to main.cc and fasttext_pybind.cc). It is still compatible with older versions of the class, but some methods are now marked as deprecated and will probably be removed in the next release.

    In this respect, we also introduce official support for Python. The Python binding of fastText is a client of the FastText class.

    Here is a short summary of the 104 commits since 0.1.0 :

    New :

    • Introduction of the “OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy computed independently for each label. This new loss can be used with the -loss ova or -loss one-vs-all command line option ( 8850c51b972ed68642a15c17fbcd4dd58766291d ).
    • Computation of the precision and recall metrics for each label ( be1e597cb67c069ba9940ff241d9aad38ccd37da ).
    • Removed printing functions from FastText class ( 256032b87522cdebc4850c99b204b81b3255cb2a ).
    • Better default for number of threads ( 501b9b1e4543fd2de55e4a621a9924ce7d2b5b17 ).
    • Python support ( f10ec1faea1605d40fdb79fe472cc2204f3d584c ).
    • More tests for circleci/python ( eb9703a4a7ed0f7559d6f341cc8e5d166d5e4d88, 97fcde80ea107ca52d3d778a083564619175039c, 1de0624bfaff02d91fd265f331c07a4a0a7bb857 ).
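
    To illustrate the "OneVsAll" item above, here is a minimal sketch of a sum of binary cross-entropy terms computed independently for each label. It is not fastText's implementation; the function names, scores, and targets are hypothetical, and no numerical safeguards for extreme scores are included.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_vs_all_loss(scores, targets):
    """Sum of binary cross-entropy terms, one per label.

    scores:  raw (pre-sigmoid) score for each label
    targets: True if the label applies to the example, False otherwise
    """
    loss = 0.0
    for score, target in zip(scores, targets):
        p = sigmoid(score)
        # Positive labels are penalized by -log(p), negative ones by -log(1 - p).
        loss -= math.log(p) if target else math.log(1.0 - p)
    return loss


# Hypothetical example with three labels, of which only the first applies.
print(one_vs_all_loss([2.0, -1.0, 0.0], [True, False, False]))
```

    Because every label contributes its own term, several labels can score close to 1 at the same time, unlike with a softmax over all labels.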

    Bug fixes :

    • Normalize buffer vector in analogy queries.
    • Typo fixes and clarifications on website.
    • Improvements on python install issues : setup.py OS X compiler flags, pybind11 include.
    • Fix: getSubwords for EOS.
    • Fix: ETA time.
    • Fix: division by 0 in word analogy evaluation.
    • Fix for the infinite loop on ARM cpu.

    Operations :

    • We released more pre-trained vectors (92bc7d230959e2a94125fbe7d3b05257effb1111, 5bf8b4c615b6308d76ad39a5a50fa6c4174113ea ).

    Worth noting :

    • We added circleci build badges to the README.md
    • We modified the style to be in compliance with Facebook C++ style.
    • We added coverage option for Makefile and setup.py in order to build for measuring the coverage.

    Thank you fastText community!

    We want to thank you all for being a part of this community and sharing your passion with us. Some of these improvements would not have been possible without your help.

    Source code(tar.gz)
    Source code(zip)
Owner
Facebook Research