Library for fast text representation and classification.

Last update: Jan 05, 2023

Overview

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Resources
Requirements
Building fastText
Example use cases
Full documentation
References
Join the fastText community
License

Resources

Models

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

Supplementary data

The preprocessed YFCC100M data used in [2].

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We are continuously building and testing our library, CLI and Python bindings under various Docker images using CircleCI.

Generally, fastText builds on modern macOS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support, such as:

(g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need a working make. If you want to use cmake, you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

Python 2.6 or newer
NumPy & SciPy

For the Python bindings (see the subdirectory python) you will need:

Python version 2.7 or >=3.4
NumPy & SciPy
pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release on the project's GitHub releases page.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
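To make the default of 3 to 6 character n-grams concrete, here is a minimal Python sketch of subword extraction in the spirit of [1]. It assumes the word is wrapped in the boundary markers < and > before slicing, as in the paper. The function name char_ngrams is hypothetical, and the sketch omits the hashing of n-grams into a fixed number of buckets that the library performs.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of length minn..maxn for one word."""
    # Wrap the word in boundary markers so prefixes and suffixes
    # can be told apart from n-grams in the middle of the word.
    token = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams


if __name__ == "__main__":
    # Trigrams only, to keep the output short.
    print(char_ngrams("where", minn=3, maxn=3))
    # -> ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares many of these n-grams with words seen during training, a vector can be assembled for it from its subwords.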

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
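If you want to consume this output from another program, a small parser is enough. The sketch below assumes each output line consists of the word followed by its space-separated vector components. The helper names parse_vector_line and parse_vectors and the sample values are made up for illustration.

```python
def parse_vector_line(line):
    """Split one 'word v1 v2 ... vd' line into (word, [floats])."""
    parts = line.rstrip().split(" ")
    word = parts[0]
    values = [float(x) for x in parts[1:] if x]
    return word, values


def parse_vectors(lines):
    """Build a dict mapping each word to its vector, skipping blank lines."""
    return dict(parse_vector_line(line) for line in lines if line.strip())


if __name__ == "__main__":
    # Made-up sample; in practice, iterate over sys.stdin or an open file.
    sample = ["hello 0.1 -0.2 0.3", "wrld 0.0 0.5 -1.0"]
    vectors = parse_vectors(sample)
    print(vectors["wrld"])
```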

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
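For reference, precision and recall at k can be computed as in the following sketch. It assumes micro-averaging over the whole test set: precision divides the number of correct predictions by the number of predictions made, and recall divides it by the number of gold labels. The function names and the toy data are hypothetical, and the sketch is not claimed to reproduce the test command exactly.

```python
LABEL_PREFIX = "__label__"


def split_labels(line, prefix=LABEL_PREFIX):
    """Separate the label tokens from the text tokens of one line."""
    tokens = line.split()
    labels = {t for t in tokens if t.startswith(prefix)}
    text = " ".join(t for t in tokens if not t.startswith(prefix))
    return labels, text


def precision_recall_at_k(gold, ranked_predictions, k=1):
    """gold: list of label sets; ranked_predictions: label lists, best first."""
    correct = predicted = total_gold = 0
    for gold_labels, ranked in zip(gold, ranked_predictions):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in gold_labels)
        predicted += len(top_k)
        total_gold += len(gold_labels)
    precision = correct / predicted if predicted else 0.0
    recall = correct / total_gold if total_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy test lines in the __label__ format described above.
    lines = [
        "__label__positive __label__food great pizza and friendly staff",
        "__label__negative the delivery was late",
    ]
    gold = [split_labels(line)[0] for line in lines]
    # Made-up ranked predictions, most likely label first.
    predictions = [
        ["__label__positive", "__label__service"],
        ["__label__positive", "__label__negative"],
    ]
    for k in (1, 2):
        print(k, precision_recall_at_k(gold, predictions, k=k))
```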

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. To reproduce the results from [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
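The output of predict-prob is easy to post-process. The sketch below assumes each output line alternates label and probability tokens separated by spaces, for example "__label__positive 0.93 __label__negative 0.07". That layout, the helper name parse_predict_prob and the sample line are assumptions for illustration, so check them against the output of your own build.

```python
def parse_predict_prob(line):
    """Turn 'label prob label prob ...' into a list of (label, probability)."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating label/probability tokens: %r" % line)
    return [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]


if __name__ == "__main__":
    sample = "__label__positive 0.93 __label__negative 0.07"  # made-up line
    for label, prob in parse_predict_prob(sample):
        print(label, prob)
```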

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.
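Conceptually, a sentence vector can be built by combining the vectors of the words it contains. The sketch below shows one simple scheme, averaging L2-normalized word vectors. It is an illustrative assumption rather than a statement of exactly what print-sentence-vectors computes, and the function name and toy vectors are hypothetical.

```python
import math


def sentence_vector(words, word_vectors):
    """Average the L2-normalized vectors of the words found in word_vectors."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # this toy lookup table has no vector for the word
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            # Normalize first so long vectors do not dominate the average.
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


if __name__ == "__main__":
    toy = {"good": [3.0, 4.0], "movie": [0.0, 2.0]}  # made-up 2-d vectors
    print(sentence_vector("a good movie".split(), toy))
```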

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, such as test or predict, works the same way on quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.
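To get a feel for why quantization helps, the following back-of-the-envelope sketch compares the size of a dense float32 embedding matrix with a product-quantized one, the compression technique discussed in [3]. It assumes 4 bytes per float and one byte per sub-vector code (256 centroids per sub-quantizer), and it ignores the codebooks and the dictionary. These are simplifying assumptions for illustration, not measurements of real .ftz files, and the function names are hypothetical.

```python
import math


def dense_bytes(rows, dim, bytes_per_float=4):
    """Size of a dense matrix stored as 32-bit floats."""
    return rows * dim * bytes_per_float


def pq_code_bytes(rows, dim, dsub=2):
    """Size of the codes when each sub-vector of length dsub becomes one byte."""
    return rows * math.ceil(dim / dsub)


if __name__ == "__main__":
    # Row count and dimension chosen to match the -bucket and -dim defaults.
    rows, dim = 2_000_000, 100
    dense = dense_bytes(rows, dim)
    codes = pq_code_bytes(rows, dim, dsub=2)
    print(f"dense: {dense / 1e6:.0f} MB, PQ codes: {codes / 1e6:.0f} MB")
```

Under these assumptions a sub-vector size of 2 shrinks the matrix by a factor of 8; larger sub-vectors compress more at the cost of a coarser approximation.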

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
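When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below builds an argv list from a dictionary of options, using only flag names shown in the help output above. build_command is a hypothetical helper, the binary is assumed to be ./fasttext in the current directory, and the chosen values are arbitrary examples rather than recommended settings.

```python
import shlex


def build_command(mode, input_path, output_path, options=None, binary="./fasttext"):
    """Return the argv list for a training command such as 'supervised'."""
    argv = [binary, mode, "-input", input_path, "-output", output_path]
    for flag, value in (options or {}).items():
        # Every option is passed as '-name value', as in the help output.
        argv += ["-" + flag, str(value)]
    return argv


if __name__ == "__main__":
    argv = build_command(
        "supervised", "train.txt", "model",
        {"lr": 0.5, "epoch": 25, "wordNgrams": 2},  # arbitrary example values
    )
    print(" ".join(shlex.quote(a) for a in argv))
    # To actually run it: import subprocess; subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when file paths contain spaces.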

References

Please cite [1] if using this code for learning word representations or [2] if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Google group: https://groups.google.com/forum/#!forum/fasttext-library
Contact: [email protected], [email protected], [email protected], [email protected]

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.

Comments

fasttext installed but import fails

Hi, I have successfully installed fasttext on python3.5. However, when I try to import it I get the following error:

Using /usr/local/lib/python3.5/dist-packages
Finished processing dependencies for fasttext==0.8.22
[email protected]:~/GitHub/fastText$ python3.5
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fasttext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'fasttext'
>>>

I have tried installing both with pip install . and python setup.py install, with no luck.

opened by ahmedahmedov 25

Assertion failed on ./fasttext predict

predict command failed!

./fasttext predict model.bin test.txt

Assertion failed: (counts.size() == osz_), function setTargetCounts, file src/model.cc, line 188.
Abort trap: 6

model train command was:

./fasttext supervised -input train.txt -output model -wordNgrams 4 -bucket 1000000 -thread 16
Read 4223M words
Number of words:  16577869
Number of labels: 25
Progress: 100.0%  words/sec/thread: 375706  lr: 0.000000  loss: 0.169518  eta: 0h0m 

opened by spate141 25

How can we get the vector of a paragraph?

I have previously tried doc2vec (from gensim, based on word2vec), with which I can extract fixed-length vectors for variable-length paragraphs. Can I do the same with fastText?

Thank you!

opened by xchangcheng 22

OS X install problem

When I install fasttext using "pip install .", I get errors like the following:

Failed to build fasttext
Installing collected packages: fasttext
  Running setup.py install for fasttext ... error
    Complete output from command /miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.7-x86_64-3.6
    creating build/lib.macosx-10.7-x86_64-3.6/fastText
    copying python/fastText/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
    copying python/fastText/FastText.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
    creating build/lib.macosx-10.7-x86_64-3.6/fastText/util
    copying python/fastText/util/util.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
    copying python/fastText/util/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
    creating build/lib.macosx-10.7-x86_64-3.6/fastText/tests
    copying python/fastText/tests/test_script.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
    copying python/fastText/tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
    copying python/fastText/tests/test_configurations.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
    running build_ext
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.o -stdlib=libc++
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.o -std=c++14
    warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
    1 warning generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.o -fvisibility=hidden
    warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
    1 warning generated.
    building 'fasttext_pybind' extension
    creating build/temp.macosx-10.7-x86_64-3.6
    creating build/temp.macosx-10.7-x86_64-3.6/python
    creating build/temp.macosx-10.7-x86_64-3.6/python/fastText
    creating build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind
    creating build/temp.macosx-10.7-x86_64-3.6/src
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c python/fastText/pybind/fasttext_pybind.cc -o build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    python/fastText/pybind/fasttext_pybind.cc:219:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                for (int32_t i = 0; i < vocab_freq.size(); i++) {
                                    ~ ^ ~~~~~~~~~~~~~~~~~
    python/fastText/pybind/fasttext_pybind.cc:233:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                for (int32_t i = 0; i < labels_freq.size(); i++) {
                                    ~ ^ ~~~~~~~~~~~~~~~~~~
    2 warnings generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/dictionary.cc -o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/dictionary.cc:181:52: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
        for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
                                                     ~ ^  ~~~~~~~~~~~
    src/dictionary.cc:186:13: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
          if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
              ~ ^  ~~~~~~~~~~~
    src/dictionary.cc:198:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
      for (size_t i = 0; i < size_; i++) {
                         ~ ^ ~~~~~
    src/dictionary.cc:296:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
      for (size_t i = 0; i < size_; i++) {
                         ~ ^ ~~~~~
    src/dictionary.cc:316:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t i = 0; i < hashes.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~
    src/dictionary.cc:318:31: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
        for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
                                ~ ^ ~~~~~~~~~~~~~
    src/dictionary.cc:515:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<fasttext::entry, std::__1::allocator<fasttext::entry> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t i = 0; i < words_.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~
    src/dictionary.cc:517:12: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            (j < words.size() && words[j] == i)) {
             ~ ^ ~~~~~~~~~~~~
    8 warnings generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/main.cc -o build/temp.macosx-10.7-x86_64-3.6/src/main.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/main.cc:348:3: warning: code will never be executed [-Wunreachable-code]
      exit(0);
      ^~~~
    1 warning generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/fasttext.cc -o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/fasttext.cc:92:21: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int i = 0; i < ngrams.size(); i++) {
                      ~ ^ ~~~~~~~~~~~~~
    src/fasttext.cc:302:18: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
        return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
               ~~~~~ ^  ~~
    src/fasttext.cc:302:34: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
        return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                               ~~~~~ ^  ~~
    src/fasttext.cc:323:16: warning: 'selectEmbeddings' is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
        auto idx = selectEmbeddings(qargs.cutoff);
                   ^
    src/fasttext.h:165:3: note: 'selectEmbeddings' has been explicitly marked deprecated here
      FASTTEXT_DEPRECATED("selectEmbeddings is being deprecated.")
      ^
    src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
    #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                    ^
    src/fasttext.cc:322:40: warning: comparison of integers of different signs: 'const size_t' (aka 'const unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
      if (qargs.cutoff > 0 && qargs.cutoff < input->size(0)) {
                              ~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
    src/fasttext.cc:327:24: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
        for (auto i = 0; i < idx.size(); i++) {
                         ~ ^ ~~~~~~~~~~
    src/fasttext.cc:380:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t w = 0; w < line.size(); w++) {
                          ~ ^ ~~~~~~~~~~~
    src/fasttext.cc:384:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                      ~~~~~ ^ ~~~~~~~~~~~
    src/fasttext.cc:398:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t w = 0; w < line.size(); w++) {
                          ~ ^ ~~~~~~~~~~~
    src/fasttext.cc:402:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                      ~~~~~ ^ ~~~~~~~~~~~
    src/fasttext.cc:479:27: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
        for (int32_t i = 0; i < line.size(); i++) {
                            ~ ^ ~~~~~~~~~~~
    src/fasttext.cc:514:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t i = 0; i < ngrams.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~
    src/fasttext.cc:551:5: warning: 'precomputeWordVectors' is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
        precomputeWordVectors(*wordVectors_);
        ^
    src/fasttext.h:180:3: note: 'precomputeWordVectors' has been explicitly marked deprecated here
      FASTTEXT_DEPRECATED("precomputeWordVectors is being deprecated.")
      ^
    src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
    #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                    ^
    src/fasttext.cc:585:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          if (heap.size() == k && similarity < heap.front().first) {
              ~~~~~~~~~~~ ^  ~
    src/fasttext.cc:590:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          if (heap.size() > k) {
              ~~~~~~~~~~~ ^ ~
    src/fasttext.cc:701:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
      for (size_t i = 0; i < n; i++) {
                         ~ ^ ~
    src/fasttext.cc:706:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
        for (size_t j = 0; j < dim; j++) {
                           ~ ^ ~~~
    src/fasttext.cc:718:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
      for (size_t i = 0; i < n; i++) {
                         ~ ^ ~
    src/fasttext.cc:723:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
        for (size_t j = 0; j < dim; j++) {
                           ~ ^ ~~~
    19 warnings generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/utils.cc -o build/temp.macosx-10.7-x86_64-3.6/src/utils.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/model.cc -o build/temp.macosx-10.7-x86_64-3.6/src/model.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/loss.cc -o build/temp.macosx-10.7-x86_64-3.6/src/loss.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/loss.cc:83:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        if (heap.size() == k && std_log(output[i]) < heap.front().first) {
            ~~~~~~~~~~~ ^  ~
    src/loss.cc:88:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        if (heap.size() > k) {
            ~~~~~~~~~~~ ^ ~
    src/loss.cc:257:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int32_t i = 0; i < pathToRoot.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~~~~~
    src/loss.cc:282:19: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
      if (heap.size() == k && score < heap.front().first) {
          ~~~~~~~~~~~ ^  ~
    src/loss.cc:289:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
        if (heap.size() > k) {
            ~~~~~~~~~~~ ^ ~
    5 warnings generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/productquantizer.cc -o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/productquantizer.cc:246:22: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<float, std::__1::allocator<float> >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (auto i = 0; i < centroids_.size(); i++) {
                       ~ ^ ~~~~~~~~~~~~~~~~~
    1 warning generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/args.cc -o build/temp.macosx-10.7-x86_64-3.6/src/args.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    src/args.cc:93:23: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<std::__1::basic_string<char>, std::__1::allocator<std::__1::basic_string<char> > >::size_type' (aka 'unsigned long') [-Wsign-compare]
      for (int ai = 2; ai < args.size(); ai += 2) {
                       ~~ ^ ~~~~~~~~~~~
    1 warning generated.
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/quantmatrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/matrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/meter.cc -o build/temp.macosx-10.7-x86_64-3.6/src/meter.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/vector.cc -o build/temp.macosx-10.7-x86_64-3.6/src/vector.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/densematrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
    g++ -bundle -undefined dynamic_lookup -L/miniconda3/lib -arch x86_64 -L/miniconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o build/temp.macosx-10.7-x86_64-3.6/src/main.o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o build/temp.macosx-10.7-x86_64-3.6/src/utils.o build/temp.macosx-10.7-x86_64-3.6/src/model.o build/temp.macosx-10.7-x86_64-3.6/src/loss.o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o build/temp.macosx-10.7-x86_64-3.6/src/args.o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o build/temp.macosx-10.7-x86_64-3.6/src/meter.o build/temp.macosx-10.7-x86_64-3.6/src/vector.o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -o build/lib.macosx-10.7-x86_64-3.6/fasttext_pybind.cpython-36m-darwin.so
    clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
    ld: library not found for -lstdc++
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command 'g++' failed with exit status 1

    ----------------------------------------
Command "/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/

And my environment is

Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.14.0 -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -fno-strict-return -masm-verbose -munwind-tables -target-cpu penryn -dwarf-column-info -debugger-tuning=lldb -target-linker-version 409.12 -v -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/usr/local/include -stdlib=libc++ -fdeprecated-macro -fdebug-compilation-dir /Users/ruanxiaoyi/Downloads/fastText-master -ferror-limit 19 -fmessage-length 204 -stack-protector 1 -fblocks -fencode-extended-block-signature -fobjc-runtime=macosx-10.14.0 -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -o - -x c++ -
clang -cc1 version 10.0.0 (clang-1000.10.44.4) default target x86_64-apple-darwin18.2.0
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/v1"
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/local/include"
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/Library/Frameworks"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /Library/Developer/CommandLineTools/usr/include/c++/v1
 /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0/include
 /Library/Developer/CommandLineTools/usr/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks (framework directory)

Any suggestions for this problem?

Python Build

opened by rxy1212 21

Any plan to support a different weight for each class in the loss function?
Looking at the current code, it seems to me that the loss function is evaluated with the same weight for each class, which is fine for balanced data. For highly imbalanced data, is there any plan to support a different weight for each class in the loss function? I am thinking of something like this on the command line:

fasttext -input XXX -output XXX -weight_class1 10 -weight_class2 1 -weight_class3 3

or simply

fasttext -weight_balanced

if the weight should be inversely proportional to the number of instances in each class?
opened by kuangchen 18
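
The "balanced" weighting asked about above can be sketched in a few lines. This is only an illustration of the inverse-frequency idea, not a fastText feature: the flags shown in the issue are the author's proposal, and the helper name balanced_class_weights and the sample lines below are hypothetical. It assumes training lines in the usual fastText format, with labels prefixed by __label__.

```python
from collections import Counter

def balanced_class_weights(lines, prefix="__label__"):
    """Return {label: weight} with weight = total / (n_classes * count).

    Rare labels get weights above 1, frequent labels below 1.
    """
    counts = Counter(
        token
        for line in lines
        for token in line.split()
        if token.startswith(prefix)
    )
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * c) for label, c in counts.items()}

# Hypothetical, tiny training set: one "spam" line and three "ham" lines.
sample = [
    "__label__spam buy now",
    "__label__ham hello there",
    "__label__ham see you tomorrow",
    "__label__ham lunch at noon",
]
weights = balanced_class_weights(sample)
print(weights)
```

Weights computed this way could, for example, guide how often lines of a rare class are repeated in the training file, which approximates class weighting without changing the loss itself.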
Interpreting Multilabel output

So I loaded multilabel values for my targets. But when I use the predict_prob function, the output looks more like a conditional probability than a multilabel output.

I was assuming that each label would have its own value between 0 and 1, but instead I am seeing that the values of all the labels add up to 1.

Can someone help me understand this output.

opened by iymitchell 17
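
The behaviour described above is what a softmax output layer produces: the scores are normalised so that they sum to 1 across labels. With independent sigmoids, as in the one-vs-all loss mentioned in the v0.2.0 release notes, each label gets its own value between 0 and 1. The following sketch contrasts the two on made-up scores; it is plain Python for illustration and does not call fastText.

```python
import math

def softmax(scores):
    """Normalise scores into a distribution that sums to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Squash each score independently into the interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [2.0, 1.5, -1.0]  # hypothetical raw scores for three labels
soft = softmax(scores)
ova = sigmoid(scores)

print(sum(soft))  # sums to 1 (up to rounding): labels compete
print(sum(ova))   # not constrained to 1: labels are independent
```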
Memory error when loading the pre-trained model

There is a memory error when I try to load the pre-trained model, e.g., model = fasttext.load_model('D:/download/wiki.en/wiki.en.bin').

The bin file is almost 9 GB and my machine has only 4 GB of memory, so I am trying to find a memory-friendly way to load the model. Can anyone give me a clue?
Thanks a lot!

opened by zhouchichun 16
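
One possible workaround, sketched under assumptions: if a text-format .vec file is available next to the .bin file, only its first lines can be read, which keeps memory use bounded. The sketch assumes the usual .vec layout (a header line with the word count and dimension, then one word and its values per line). The helper name load_top_vectors is hypothetical, and a plain dictionary like this cannot produce vectors for out-of-vocabulary words the way the binary model can.

```python
import os
import tempfile

def load_top_vectors(path, limit):
    """Read at most `limit` word vectors from a text .vec file."""
    vectors = {}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()
        dim = int(header[1])  # header is "<word count> <dimension>"
        for i, line in enumerate(f):
            if i >= limit:
                break
            tokens = line.rstrip().split(" ")
            vectors[tokens[0]] = [float(x) for x in tokens[1:dim + 1]]
    return vectors

# Demo on a tiny hand-written file in the same layout.
with tempfile.NamedTemporaryFile(
    "w", suffix=".vec", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n")
demo = load_top_vectors(tmp.name, limit=2)
os.remove(tmp.name)
print(demo)
```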

Quantize error

I have already trained model_1.bin with the supervised option, and when I try to quantize that model, I get the following error:

/opt/fastText/fasttext quantize -input data.txt -output models/model_1 -verbose 3 -wordNgrams 3 -bucket 1000000 -minn 3 -maxn 6 -lr 0.010 -dim 100 -loss ns -thread 8 -epoch 10 -qnorm -retrain -cutoff 100000

fasttext: src/vector.cc:71: void fasttext::Vector::addRow(const fasttext::Matrix&, int64_t): Assertion `i < A.m_' failed.
Aborted (core dumped)

Edit: If I don't use -cutoff, it runs without any error.

opened by spate141 16

Loss - OVA model - Not predicting sigmoid output in Ubuntu 16.04

Install Log:

c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/loss.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/densematrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/quantmatrix.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/meter.cc
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/fasttext.cc
src/fasttext.cc: In member function ‘void fasttext::FastText::quantize(const fasttext::Args&)’:
src/fasttext.cc:323:16: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
 auto idx = selectEmbeddings(qargs.cutoff);
 ^
src/fasttext.cc:293:22: note: declared here
 std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
 ^
src/fasttext.cc:323:45: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
 auto idx = selectEmbeddings(qargs.cutoff);
 ^
src/fasttext.cc:293:22: note: declared here
 std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
 ^
src/fasttext.cc: In member function ‘void fasttext::FastText::lazyComputeWordVectors()’:
src/fasttext.cc:551:5: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
 precomputeWordVectors(*wordVectors_);
 ^
src/fasttext.cc:534:6: note: declared here
 void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
 ^
src/fasttext.cc:551:40: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
 precomputeWordVectors(*wordVectors_);
 ^
src/fasttext.cc:534:6: note: declared here
 void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
 ^
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops args.o matrix.o dictionary.o loss.o productquantizer.o densematrix.o quantmatrix.o vector.o model.o utils.o meter.o fasttext.o src/main.cc -o fasttext

The output is not sigmoid. It is still the same as with softmax. Args: dim 100 ws 5 epoch 1 minCount 1 neg 5 wordNgrams 3 loss one-vs-all model sup bucket 1000000 minn 3 maxn 3 lrUpdateRate 100 t 0.0001
bug

opened by giriannamalai 15
Binary model that was trained on Common crawl

Hello! I enjoy using your library and pretrained vectors. I see that for the vectors trained on Wikipedia you provide both the binary model and the pretrained vectors. However, for the vectors trained on Common Crawl, you only provide the pretrained vectors. Would it be possible to publish the binary models for them as well?

Thanks, Alexander.

opened by MrBoor 15
Running on PowerPC64LE (ppc64le)

I am able to compile the stable (0.1.0) version of the code on a powerpc64le (IBM Minsky) without any errors or warnings. However, when I run it on any dataset (e.g. stackexchange cooking) using just the defaults, ./fasttext supervised -input ... -output ..., the program hangs after displaying Reading ... words. I tried make debug as well, with the same result. (Details: make 4.1, Ubuntu 16.04.3 LTS.) Any ideas?

opened by ironv 15
What's the status of this project?

The last release was in 2020-04. I see a lot of unsolved installation issues, and I miss pre-built wheels on https://pypi.org/project/fasttext/#files

What is the future of this project, or is it just dead?

opened by return42 1
Dependency errors

Hi,

We recently conducted a study to detect build dependency errors, focusing on missing dependencies and redundant dependencies. A missing dependency (MS) is a dependency that is not declared in the build script, and a redundant dependency (RD) is a dependency that is declared in the build script but is not actually used. We have detected the following dependency errors in your public projects. Could you please help us check them? The data format is dependency --- target.

MS
0['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/densematrix.h---fasttext']
1['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/vector.h---fasttext']
2['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/model.h---fasttext']
3['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/args.h---fasttext']
4['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/meter.h---fasttext']
5['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/fasttext.h---fasttext']
6['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/real.h---fasttext']
7['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/main.cc---fasttext']
8['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/matrix.h---fasttext']
9['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/utils.h---fasttext']
10['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/dictionary.h---fasttext']

RD
0['src/utils.h---productquantizer.o']
1['src/utils.h---quantmatrix.o']
2['src/fasttext.cc---fasttext']
3['src/utils.h---vector.o']
4['src/args.h---model.o']

opened by Meiye-lj 0
Program running results are abnormal

anaconda3/bin/python3.8

import fasttext.util
ft = fasttext.load_model('cc.zh.300.bin')
sentence_w1 = ft.get_sentence_vector('色诫')
print(sentence_w1)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

opened by Chu-J 0

Language names of Languages supported by Fasttext

I am trying to find the names of the languages supported by fastText's language identification (LID) tool, given the language codes listed here:

af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh

I tried to map the codes to language names, but they seem non-standard: they do not consistently follow either ISO 639-1 or ISO 639-3. Does anyone have a list of language names for these codes, or know how to find them?
Wikipedia's list does not cover all of them, so manual mapping did not help either.
Thanks!

opened by AetherPrior 1

Lookup tables for language labels - Python module
I'm using the prebuilt lid.176.ftz model to do simple language ID on short texts (160 chars or fewer) using the Python module.

Is there a lookup table (dictionary) for the labels?

eg

{ "en": "English", "fr": "French", ... }

Some of the labels fastText returns are quite obscure languages & I've had to trawl a lot of ISO-639 docs to establish what they refer to in order to build my own lookup table.

Or have I simply missed something in the docs /API that tells me how to get these?
opened by RedactedCode 0
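
A hand-built table of the kind described above can be sketched as follows. Nothing on this page indicates that the Python module ships such a table, so everything here is an assumption for illustration: the dictionary is deliberately partial, the helper name label_to_name is hypothetical, and the __label__ prefix is assumed to be how the predicted labels are returned.

```python
# Partial, hand-filled table; extend it with the codes you actually encounter.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ru": "Russian",
    "ja": "Japanese",
    "zh": "Chinese",
}

def label_to_name(label, prefix="__label__"):
    """Map a predicted label such as '__label__fr' to a language name.

    Unknown codes are returned unchanged so that nothing is silently lost.
    """
    code = label[len(prefix):] if label.startswith(prefix) else label
    return LANGUAGE_NAMES.get(code, code)

print(label_to_name("__label__fr"))
print(label_to_name("__label__xmf"))  # not in the table: falls back to the code
```

Returning the raw code for unknown entries makes gaps in the table visible in the output, which helps when filling it in incrementally.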

Releases(v0.9.2)

v0.9.2(Apr 28, 2020)
We are happy to announce the release of version 0.9.2.

WebAssembly

We are excited to release fastText bindings for WebAssembly. Classification tasks are widely used in web applications and we believe giving access to the complete fastText API from the browser will notably help our community to build nice tools. See our documentation to learn more.

Autotune: automatic hyperparameter optimization

Finding the best hyperparameters is crucial for building efficient models. However, searching for the best hyperparameters manually is difficult. This release includes the autotune feature, which automatically finds the best hyperparameters for your dataset. You can find more information on how to use it here.

Python

fastText loves Python. In this release, we have:

several bug fixes for prediction functions

nearest neighbors and analogies for Python

a memory leak fix

website tutorials with Python examples

The autotune feature is fully integrated with our Python API. This allows us to have a more stable autotune optimization loop from Python and to synchronize the best hyper-parameters with the _FastText model object.

Pre-trained models tool

We release two helper scripts:

download_model.py to automatically download pre-trained vectors from our website

reduce_model.py to reduce the word-vectors' size using PCA.

They can also be used directly from our Python API.

More metrics

When you test a trained model, you can now have more detailed results for the precision/recall metrics of a specific label or all labels.
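
For readers unfamiliar with per-label metrics, the sketch below shows how precision and recall are conventionally computed for each label. It uses the generic definitions and made-up data; it is not fastText's own implementation, and the helper name per_label_metrics is hypothetical.

```python
from collections import Counter

def per_label_metrics(gold, predicted):
    """Return {label: (precision, recall)}.

    gold and predicted are lists of label sets, one set per example.
    """
    tp, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        for label in p:
            n_pred[label] += 1
            if label in g:
                tp[label] += 1  # predicted and actually present
        for label in g:
            n_gold[label] += 1
    metrics = {}
    for label in set(n_pred) | set(n_gold):
        precision = tp[label] / n_pred[label] if n_pred[label] else float("nan")
        recall = tp[label] / n_gold[label] if n_gold[label] else float("nan")
        metrics[label] = (precision, recall)
    return metrics

# Hypothetical gold labels and predictions for four examples.
gold = [{"a"}, {"b"}, {"a", "b"}, {"a"}]
pred = [{"a"}, {"a"}, {"b"}, {"a"}]
metrics = per_label_metrics(gold, pred)
print(metrics)
```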

Paper source code

This release contains the source code of the unsupervised multilingual alignment paper.

Community feedback and contributions

We want to thank our community for giving us feedback on Facebook and on GitHub.
v0.9.1(Jul 4, 2019)
We are happy to announce the release of version 0.9.1.

New release of python module

The main goal of this release is to merge two existing python modules: the official fastText module which was available on our github repository and the unofficial fasttext module which was available on pypi.org.

You can find an overview of the new API here, and more insight in our blog post.

Refactoring

This version includes a massive rewrite of internal classes. Training and testing are now split across three classes: Model, which takes care of the computational aspect; Loss, which handles the loss and applies gradients to the output matrix; and State, which is responsible for holding the model's state inside each thread.

That makes the code more straightforward to read and also gives a smaller memory footprint, because the data needed for loss computation is now held only once, whereas before there was one copy for each thread.

Misc

Compilation issues fix for recent versions of Mac OS X.

Better unicode handling :

on_unicode_error argument that helps to handle unicode issues one can face with some datasets

bug fix related to different behaviour of pybind11's py::str class between python2 and python3

script for unsupervised alignment

public file hosting changed from aws to fbaipublicfiles

we added a Code of Conduct file.

Thank you !

As always, we want to thank you for your help and your precious feedback, which help make this project better.
v0.2.0(Dec 19, 2018)
We are happy to announce the change of the license from BSD+patents to MIT and the release of fastText 0.2.0.

The main purpose of this release is to set a beta C++ API of the FastText class. The class now behaves as a computational library: we moved the display and some usage error handling outside of it (mainly to main.cc and fasttext_pybind.cc). It is still compatible with older versions of the class, but some methods are now marked as deprecated and will probably be removed in the next release.

In this respect, we also introduce the official support for python. The python binding of fastText is a client of the FastText class.

Here is a short summary of the 104 commits since 0.1.0 :

New :

Introduction of the “OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy computed independently for each label. This new loss can be used with the -loss ova or -loss one-vs-all command line option ( 8850c51b972ed68642a15c17fbcd4dd58766291d ).

Computation of the precision and recall metrics for each label ( be1e597cb67c069ba9940ff241d9aad38ccd37da ).

Removed printing functions from FastText class ( 256032b87522cdebc4850c99b204b81b3255cb2a ).

Better default for number of threads ( 501b9b1e4543fd2de55e4a621a9924ce7d2b5b17 ).

Python support ( f10ec1faea1605d40fdb79fe472cc2204f3d584c ).

More tests for circleci/python ( eb9703a4a7ed0f7559d6f341cc8e5d166d5e4d88, 97fcde80ea107ca52d3d778a083564619175039c, 1de0624bfaff02d91fd265f331c07a4a0a7bb857 ).
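
The "OneVsAll" loss in the list above is described as a sum of binary cross-entropy terms computed independently for each label. A minimal sketch of that definition follows. It is written from the description only and is not taken from the fastText source; the function name one_vs_all_loss and the example scores are hypothetical.

```python
import math

def one_vs_all_loss(scores, targets):
    """Sum of binary cross-entropy terms, one per label.

    scores are raw (pre-sigmoid) values; targets are booleans saying whether
    each label applies. Each term uses the numerically stable form
    max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    loss = 0.0
    for x, target in zip(scores, targets):
        t = 1.0 if target else 0.0
        loss += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return loss

# With a score of 0 the sigmoid is 0.5, so each label contributes log(2).
loss = one_vs_all_loss([0.0, 0.0], [True, False])
print(loss)
```

Because each label is scored on its own, the per-label probabilities obtained from the sigmoid are not forced to sum to 1, unlike with softmax.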

Bug fixes :

Normalize buffer vector in analogy queries.

Typo fixes and clarifications on website.

Improvements on python install issues : setup.py OS X compiler flags, pybind11 include.

Fix: getSubwords for EOS.

Fix: ETA time.

Fix: division by 0 in word analogy evaluation.

Fix for the infinite loop on ARM cpu.

Operations :

We released more pre-trained vectors (92bc7d230959e2a94125fbe7d3b05257effb1111, 5bf8b4c615b6308d76ad39a5a50fa6c4174113ea ).

Worth noting :

We added circleci build badges to the README.md

We modified the style to be in compliance with Facebook C++ style.

We added coverage option for Makefile and setup.py in order to build for measuring the coverage.

Thank you fastText community!

We want to thank you all for being a part of this community and sharing your passion with us. Some of these improvements would not have been possible without your help.
v0.1.0(Dec 2, 2017)



Owner

Facebook Research
