Unsupervised text tokenizer for Neural Network-based text generation.


SentencePiece


SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

  • Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
  • Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
  • Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo.] are supported.
  • Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout which help to improve the robustness and accuracy of NMT models.
  • Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
  • Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
  • Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

Comparisons with other implementations

| Feature | SentencePiece | subword-nmt | WordPiece |
|---|---|---|---|
| Supported algorithm | BPE, unigram, char, word | BPE | BPE* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | Yes | No | No |
| Python Library (pip) | Yes | No | N/A |
| C++ Library | Yes | No | N/A |
| Pre-segmentation required? | No | Yes | Yes |
| Customizable normalization (e.g., NFKC) | Yes | No | N/A |
| Direct id generation | Yes | No | N/A |

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello World." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since e.g., Tokenize(“World.”) == Tokenize(“World .”)

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

  • (en) Hello world. → [Hello] [World] [.] (A space between Hello and World)
  • (ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)
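
The escaping trick itself can be illustrated without a trained model. Below is a minimal sketch in plain Python; the piece boundaries are made up for illustration, since a real model would choose them:

  text = "Hello World."
  escaped = text.replace(" ", "▁")            # "Hello▁World."

  # Hypothetical segmentation into pieces; a trained model decides the boundaries.
  pieces = ["Hello", "▁Wor", "ld", "."]
  assert "".join(pieces) == escaped

  # Detokenization is a pure string operation, independent of the language.
  detokenized = "".join(pieces).replace("▁", " ")
  assert detokenized == text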

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, you would like to integrate the SentencePiece library (C++/Python) into the NMT system so that one segmentation is sampled for each parameter update, which is different from the standard off-line data preparation. Here is an example with the Python library. You can see that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
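
As a rough sketch of what "sampling one segmentation per parameter update" looks like in practice (the corpus list and the train_step call below are hypothetical placeholders, and spm.model is assumed to be an already trained model):

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='spm.model')
  corpus = ['New York is big.', 'I saw a girl with a telescope.']  # placeholder data

  for epoch in range(3):
      for sentence in corpus:
          # A new segmentation is sampled every time, so the model sees
          # different subword sequences for the same sentence across epochs.
          ids = sp.encode(sentence, out_type=int, enable_sampling=True,
                          alpha=0.1, nbest_size=-1)
          # train_step(ids)  # feed the sampled ids to one parameter update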

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

% pip install sentencepiece

For more detail, see the Python module documentation.
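
Training is also available from the Python module. A minimal sketch, assuming a one-sentence-per-line corpus file named corpus.txt (the filename is a placeholder):

  import sentencepiece as spm

  # Train a model from a raw corpus (roughly equivalent to spm_train).
  spm.SentencePieceTrainer.train(
      input='corpus.txt', model_prefix='m', vocab_size=8000, model_type='unigram')

  # Load the model and segment text into pieces or ids.
  sp = spm.SentencePieceProcessor(model_file='m.model')
  print(sp.encode('This is a test', out_type=str))
  print(sp.encode('This is a test', out_type=int))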

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

  • cmake
  • C++11 compiler
  • gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

Use the --help flag to display all parameters for training, or see here for an overview.

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use the --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
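
When using the Python API instead of spm_encode, roughly the same effect should be achievable with the add_bos, add_eos, and reverse options of encode. This is a sketch; spm.model is assumed to be a trained model, and these option names are the Python counterparts of the flags above:

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='spm.model')
  # add <s> and </s>
  print(sp.encode('Hello world.', out_type=int, add_bos=True, add_eos=True))
  # reverse the sequence and add <s> and </s>
  print(sp.encode('Hello world.', out_type=int, add_bos=True, add_eos=True, reverse=True))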

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use the --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can find that the original input sentence is restored from the vocabulary id sequence.
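
The same round trip can be reproduced from Python; a minimal sketch using the m.model trained above:

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='m.model')
  ids = sp.encode('I saw a girl with a telescope.', out_type=int)
  print(ids)               # a vocabulary id sequence such as the one above
  print(sp.decode(ids))    # 'I saw a girl with a telescope.'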

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
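
Since the id is simply the line number, the exported file can be loaded with a few lines of Python. A minimal sketch, assuming each line holds a piece and its log probability separated by a tab (vocab.txt is a placeholder for <output file>):

  # Build an id -> (piece, score) mapping from the exported vocabulary file.
  id_to_piece = {}
  with open('vocab.txt', encoding='utf-8') as f:
      for idx, line in enumerate(f):
          piece, score = line.rstrip('\n').split('\t')
          id_to_piece[idx] = (piece, float(score))

  print(len(id_to_piece), id_to_piece.get(0))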

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g., bos_id=-1, the corresponding special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) as --pad_id=3.
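
As a sketch, the same remapping can be requested through the Python trainer by passing the corresponding keyword arguments (corpus.txt and the vocabulary size are placeholders):

  import sentencepiece as spm

  spm.SentencePieceTrainer.train(
      input='corpus.txt', model_prefix='m', vocab_size=8000,
      bos_id=0, eos_id=1, unk_id=5, pad_id=3)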

If you want to assign other special tokens, please see Use custom symbols.

Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get the resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
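
For reference, the vocabulary file produced by --generate_vocabulary is expected to list each piece with its frequency, one per line and tab-separated. A small sketch for checking which pieces pass a given threshold (filenames are placeholders):

  threshold = 50
  kept = []
  with open('vocab_file.L1', encoding='utf-8') as f:
      for line in f:
          piece, freq = line.rstrip('\n').split('\t')
          if int(freq) >= threshold:
              kept.append(piece)

  print(len(kept), 'pieces meet the threshold')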

Advanced topics

Comments
  • Pip install sentencepiece failure


    Hi, pip install sentencepiece fails. This is the log I get:

    pip install sentencepiece 7.4.0
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/fd/45/6d0eb609d5cd81df094aab71a867b2ab6b315ffd592e78fb94a625c4d6aa/sentencepiece-0.1.3.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: /bin/sh: 1: pkg-config: not found
    Failed to find sentencepiece pkgconfig
    ----------------------------------------
    ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-463tj_x8/sentencepiece/

    opened by saareliad 32
  • Compatibility with Tensorflow Serving


    Any idea how to best integrate the tensorflow op with tensorflow serving?

    Currently, if this is used to train, when the TensorFlow graph is exported to a servable and run with TensorFlow Serving, a runtime error will obviously occur.

    For example a model trained with this op trying to be loaded into tensorflow serving will result in:

    Loading servable: {name: xling } failed: Not Found: Op tyope not registered `SentencepieceEncodeSparse' in binary...
    
    opened by r-wheeler 31
  • pip install failed on linux cluster


    System Info: Linux version 4.14.0-115.7.1.el7a.ppc64le ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC))

    I tried both installing from PyPI and installing from source file, but neither of them worked.

    When installing from PyPI:

    $ pip install sentencepiece
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/1b/87/c3c2fa8cbec61fffe031ca9f0da512747520bec9be7f886f748457daac31/sentencepiece-0.1.83.tar.gz
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-t33o0yz4/sentencepiece/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-t33o0yz4/sentencepiece/
    

    I then manually downloaded the tar.gz source file, uncompressed it, changed the directory to "./python", and tried to install directly from the setup.py:

    $ python setup.py install
    Package sentencepiece was not found in the pkg-config search path.
    Perhaps you should add the directory containing `sentencepiece.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'sentencepiece' found
    Failed to find sentencepiece pkgconfig
    

    However pip install . gives a different error message:

    $ pip install .
    Processing <...>/sentencepiece-0.1.83/python
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-req-build-209jgy5x/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-209jgy5x/
    

    Does anyone know what might be wrong and how to fix it? Thank you!

    execution environment 
    opened by wendywangwwt 24
  • undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs


    Hi , When I am trying to import "tf_sentencepiece" . I am getting following error:

    NotFoundError Traceback (most recent call last) in import tf_sentencepiece as tfs

    ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/init.py in from future import print_function from tf_sentencepiece.sentencepiece_processor_ops import * ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/sentencepiece_processor_ops.py in _gen_sentencepiece_processor_op = tf.load_op_library(so_file) ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename) RuntimeError: when unable to load the library or get the python wrappers. """ lib_handle = py_tf.TF_LoadLibrary(library_filename) op_list_str = py_tf.TF_GetOpList(lib_handle) NotFoundError: /home/user/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/_sentencepiece_processor_ops.so.1.12.0: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

    Help me out in resolving this issue. Thanks in advance.

    opened by ramreddyyasa 21
  • Add Mac M1 Compatibility


    Hi,

    Like most Python libraries, SentencePiece won't install on the Mac M1 architecture... "A revolution in data science" they said... what a joke, every data science library is a real pain to install! Do you plan to make a compatible version of SentencePiece?

    Thank you!

    opened by pierreia 19
  • Issue in installing.


    Python 3.7.3 OS: Redhat

    I am getting following error message while installing:

    I already tried installing wheel but getting message:

    (tanveer) [[email protected] tanveer]$ pip install sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl
    ERROR: sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl is not a supported wheel on this platform.
    
    > Using cached sentencepiece-0.1.83.tar.gz (497 kB)
    >   ERROR: Command errored out with exit status 1:
    >    command: /power8nfs/home/ai_u/.conda/envs/tanveer/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"'; __file__='"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-6kz16kgn/sentencepiece/pip-egg-info
    >        cwd: /tmp/pip-install-6kz16kgn/sentencepiece/
    >   Complete output (7 lines):
    >   Traceback (most recent call last):
    >     File "<string>", line 1, in <module>
    >     File "/tmp/pip-install-6kz16kgn/sentencepiece/setup.py", line 29, in <module>
    >       with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
    >     File "/power8nfs/home/ai_u/.conda/envs/tanveer/lib/python3.7/codecs.py", line 904, in open
    >       file = builtins.open(filename, mode, buffering)
    >   FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    >   ----------------------------------------
    > ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    > 
    
    execution environment 
    opened by tkhan3 19
  • `sentencepiece==0.1.92` seems breaking something


    with newly released sentencepiece==0.1.92

    Python 3.6.9 (default, Nov  7 2019, 10:44:02)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import transformers, torch
    >>> transformers.__version__
    '2.9.1'
    >>> torch.__version__
    '1.4.0'
    >>> torch.rand(3)
    Segmentation fault (core dumped)
    

    However, downgrade to sentencepiece==0.1.91 solves this issue

    opened by boy2000-007man 16
  • terminate called after throwing an instance of 'std::bad_alloc'


    I'm running a sentencepiece model and getting an std::bad_alloc error when I increase the training size from 5M to 10M sentences. (it works fine for 5M sentences). Here's how I'm calling the function:

    spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
        --model_type=unigram --input_sentence_size=10000000 --num_threads=32
    

    here's the specific error:

    trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
    trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
    trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
    trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
    trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
    trainer_interface.cc(402) LOG(INFO) Alphabet size=25
    trainer_interface.cc(403) LOG(INFO) Final character coverage=1
    trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    

    I've tried compiling SentencePiece with and without gperftools, and get the same error message. Compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: also tried a more recent gcc 8.2.0 with the same results.) I doubt that it's a RAM limitation, I'm running this on a pretty beefy compute node with 768 GB of memory, and watching memory utilization as the program is running (even at 5M input sentences) I never get close to maxing out. Any thoughts why I might be getting this error message?

    opened by pstjohn 15
  • FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'


    Hi,

    I opened an issue relating to the pytorch-transformers library but was redirected here. For the sake of clarity here's all the relevant info:

    OS: Windows10 Python: 3.5.2. Error when trying pip install sentencepiece:

        ERROR: Command errored out with exit status 1:
         command: 'c:\users\pawel.lonca\appdata\local\programs\python\python35\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"'; __file__='"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
             cwd: C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\
        Complete output (7 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "c:\users\pawel.lonca\appdata\local\programs\python\python35\lib\codecs.py", line 895, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    execution environment 
    opened by balkon16 14
  • Subword regularization on BPE models


    As described by @eric-haibin-lin in https://github.com/google/sentencepiece/issues/335, it is currently not possible to use SampleEncodeAs{Pieces,Ids} on a BPE model (it displays a model_interface.h(85) LOG(ERROR) Not implemented. error and returns an empty list).

    Do you plan to support it in the near future?

    (and thank you for this great tool BTW!)

    opened by nicolaspanel 13
  • Cannot install sentencepiece with Python 3.9 on Windows


    Currently adding Python 3.9 support for pytorch/text and ran into an issue installing sentencepiece for Python 3.9 on windows. (CircleCI logs)

      ERROR: Failed building wheel for sentencepiece
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\circleci\project\env\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"'; __file__='"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\circleci\AppData\Local\Temp\pip-record-xi27zjv8\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\circleci\project\env\Include\sentencepiece'
             cwd: C:\Users\circleci\AppData\Local\Temp\pip-install-trvw9qva\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\
        Complete output (20 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.9
        creating build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/__init__.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        running build_ext
        building 'sentencepiece._sentencepiece' extension
        creating build\temp.win-amd64-3.9
        creating build\temp.win-amd64-3.9\Release
        creating build\temp.win-amd64-3.9\Release\src
        creating build\temp.win-amd64-3.9\Release\src\sentencepiece
        C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\circleci\project\env\include -IC:\Users\circleci\project\env\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include -IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-3.9\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
        cl : Command line warning D9025 : overriding '/MD' with '/MT'
        sentencepiece_wrap.cxx
        src/sentencepiece/sentencepiece_wrap.cxx(2777): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
        error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
    

    This is a duplicate of #452, but no real solution to building from source seems to have come from that so I have opened a new issue

    Is there a workaround for getting this dependency?

    cc @taku910

    opened by seemethere 12
  • Training a BPE model w/ "identity" normalization rule doesn't add "\n" to the vocab


    Training a BPE model w/ the identity normalization rule doesn't add the newline character to the vocab:

    #!/bin/bash
    
    ../sentencepiece_upstream/build/src/spm_train \
      --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
      --input_sentence_size 9999 \
      --model_prefix "bpe.joint" \
      --model_type "bpe" \
      --pad_id 3 \
      --pad_piece "<pad>" \
      --normalization_rule_name "identity" \
      --remove_extra_whitespaces 0
    

    This causes unks when encoding strings w/ \n:

    >>> import sentencepiece
    >>> x=sentencepiece.SentencePieceProcessor("bpe.joint.model")
    >>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
    [1, 174, 7930, 7936, 0, 41, 7930, 7936, 0, 2]
    

    Without the identity normalization, newlines just get replaced with whitespace, for example:

    ../sentencepiece_upstream/build/src/spm_train \
      --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
      --input_sentence_size 9999 \
      --model_prefix "bpe.joint" \
      --model_type "bpe" \
      --pad_id 3 \
      --pad_piece "<pad>" \
      --remove_extra_whitespaces 0
    [...]
    >>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
    [1, 174, 7931, 7937, 174, 7931, 7937, 7921, 2]
    
    opened by pks 0
  • Not able to install sentencepiece on s390x machine


    Hi Team, I'm not able to install sentencepiece on my s390x machine. Below is the error. Please help me out with this.

    pip install sentencepiece
    Collecting sentencepiece
      Downloading sentencepiece-0.1.97.tar.gz (524 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 524.7/524.7 kB 2.8 MB/s eta 0:00:00
      Preparing metadata (setup.py) ... done
    Building wheels for collected packages: sentencepiece
      Building wheel for sentencepiece (setup.py) ... error
      error: subprocess-exited-with-error

    × python setup.py bdist_wheel did not run successfully.
    │ exit code: 1
    ╰─> [161 lines of output]
        running bdist_wheel
        running build
        running build_py
        creating build
        creating build/lib.linux-s390x-3.8
        creating build/lib.linux-s390x-3.8/sentencepiece
        copying src/sentencepiece/init.py -> build/lib.linux-s390x-3.8/sentencepiece
        copying src/sentencepiece/_version.py -> build/lib.linux-s390x-3.8/sentencepiece
        copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
        copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
        running build_ext
        Package sentencepiece was not found in the pkg-config search path.
        Perhaps you should add the directory containing `sentencepiece.pc'
        to the PKG_CONFIG_PATH environment variable
        Package 'sentencepiece', required by 'virtual:world', not found
        Cloning into 'sentencepiece'...
        Note: switching to '58f256cf6f01bb86e6fa634a5cc560de5bd1667d'.

      You are in 'detached HEAD' state. You can look around, make experimental
      changes and commit them, and you can discard any commits you make in this
      state without impacting any branches by switching back to a branch.
      
      If you want to create a new branch to retain commits you create, you may
      do so (now or later) by using -c with the switch command. Example:
      
        git switch -c <new-branch-name>
      
      Or undo this operation with:
      
        git switch -
      
      Turn off this advice by setting config variable advice.detachedHead to false
      
      -- VERSION: 0.1.97
      -- The C compiler identification is GNU 8.5.0
      -- The CXX compiler identification is GNU 8.5.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Looking for pthread.h
      -- Looking for pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
      -- Looking for pthread_create in pthreads
      -- Looking for pthread_create in pthreads - not found
      -- Looking for pthread_create in pthread
      -- Looking for pthread_create in pthread - found
      -- Found Threads: TRUE
      -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled
      [  1%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/builder.cc.o
      [  3%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_interface.cc.o
      [  4%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unicode_script.cc.o
      [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unigram_model_trainer.cc.o
      [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/word_model_trainer.cc.o
      [  9%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/char_model_trainer.cc.o
      [ 11%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_factory.cc.o
      [ 12%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arena.cc.o
      [ 14%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arenastring.cc.o
      [ 16%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/bpe_model_trainer.cc.o
      [ 17%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/bytestream.cc.o
      [ 19%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/sentencepiece_trainer.cc.o
      [ 20%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/pretokenizer_for_training.cc.o
      [ 22%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/coded_stream.cc.o
      [ 24%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/common.cc.o
      [ 25%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/extension_set.cc.o
      [ 27%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_enum_util.cc.o
      [ 29%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_table_driven_lite.cc.o
      [ 30%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_util.cc.o
      [ 32%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/implicit_weak_message.cc.o
      [ 33%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/int128.cc.o
      [ 35%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/io_win32.cc.o
      [ 37%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/message_lite.cc.o
      [ 38%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/parse_context.cc.o
      [ 40%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/repeated_field.cc.o
      [ 41%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/status.cc.o
      [ 43%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/statusor.cc.o
      [ 45%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringpiece.cc.o
      [ 46%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringprintf.cc.o
      [ 48%] Linking CXX static library libsentencepiece_train.a
      [ 50%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/structurally_valid.cc.o
      [ 50%] Built target sentencepiece_train-static
      [ 51%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/strutil.cc.o
      [ 53%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/time.cc.o
      [ 54%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/wire_format_lite.cc.o
      [ 56%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream.cc.o
      [ 58%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl.cc.o
      [ 59%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl_lite.cc.o
      [ 61%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece.pb.cc.o
      [ 62%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece_model.pb.cc.o
      [ 64%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/bpe_model.cc.o
      [ 66%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/char_model.cc.o
      [ 67%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/error.cc.o
      [ 69%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/filesystem.cc.o
      [ 70%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_factory.cc.o
      [ 72%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_interface.cc.o
      [ 74%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o
      [ 75%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/sentencepiece_processor.cc.o
      [ 77%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/unigram_model.cc.o
      [ 79%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/util.cc.o
      [ 80%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/word_model.cc.o
      /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’:
      /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer_’ was not declared in this scope
                                               &precompiled_charsmap_buffer_);
                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [ 82%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/absl/flags/flag.cc.o
      gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1
      gmake[2]: *** Waiting for unfinished jobs....
      gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2
      gmake: *** [Makefile:156: all] Error 2
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in <module>
          setup(
        File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 290, in run
          self.run_command('build')
        File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
          _build_ext.build_ext.build_extensions(self)
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
          self._build_extensions_serial()
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
          self.build_extension(ext)
        File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension
          subprocess.check_call(['./build_bundled.sh', __version__])
        File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2.
      [end of output]
    

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for sentencepiece
    Running setup.py clean for sentencepiece
    Failed to build sentencepiece
    Installing collected packages: sentencepiece
      Running setup.py install for sentencepiece ... error
      error: subprocess-exited-with-error

    × Running setup.py install for sentencepiece did not run successfully. │ exit code: 1 ╰─> [77 lines of output] running install running build running build_py creating build creating build/lib.linux-s390x-3.8 creating build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/init.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/version.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece running build_ext Package sentencepiece was not found in the pkg-config search path. Perhaps you should add the directory containing `sentencepiece.pc' to the PKG_CONFIG_PATH environment variable Package 'sentencepiece', required by 'virtual:world', not found fatal: destination path 'sentencepiece' already exists and is not an empty directory. fatal: destination path 'sentencepiece' already exists and is not an empty directory. -- VERSION: 0.1.97 -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND -- Configuring done -- Generating done -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled Consolidate compiler generated dependencies of target sentencepiece_train-static [ 17%] Built target sentencepiece_train-static Consolidate compiler generated dependencies of target sentencepiece-static [ 19%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer’ was not declared in this scope &precompiled_charsmap_buffer_); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1 gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2 gmake: *** [Makefile:156: all] Error 2 Traceback (most recent call last): File "", line 2, in File "", line 34, in File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in setup( File "/usr/lib/python3.8/site-packages/setuptools/init.py", line 145, in setup return distutils.core.setup(**attrs) File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands self.run_command(cmd) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File "/usr/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run return orig.install.run(self) File "/usr/lib64/python3.8/distutils/command/install.py", line 556, in run self.run_command('build') File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run self.run_command(cmd_name) File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File 
"/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run _build_ext.run(self) File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run _build_ext.build_ext.run(self) File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run self.build_extensions() File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions _build_ext.build_ext.build_extensions(self) File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions self._build_extensions_serial() File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial self.build_extension(ext) File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension subprocess.check_call(['./build_bundled.sh', version]) File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2. [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip. error: legacy-install-failure

    × Encountered error while trying to install package. ╰─> sentencepiece

    note: This is an issue with the package mentioned above, not pip. hint: See above for output from the failure.

    opened by swagaths1 0
  • Is it allowed to rearrange index/id of each vocabulary?


    Thank you for reading my question. I need to rearrange vocabulary ids and assign scores freely to any token. Here is the background:

    Background:

    Firstly, I want to manually add some tokens to a vocabulary that was trained with unigram model type. These tokens should allow other pieces to contain these tokens, so they are not user_defined_symbols. I want to manually assign them a score, so they can be sampled according to probability.

    Secondly, I want to align the trained vocabulary with another vocabulary. The other vocabulary defines indexes for the tokens I mentioned before. I would like the indexes of the common tokens in both vocabularies to have the same values, and the remaining tokens to be assigned numbers after the last common index.

    Could you please give me some advice about how to achieve this goal? Thank you

    opened by lsy641 0
  • tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type


    When using model_type="word" as an argument to spm.SentencePieceTrainer.train, it seems that tokens listed in user_defined_symbols, for example user_defined_symbols=["<s>", "</s>", "."], are still encoded to the unk_id. Using BPE and char works.

    Is this intended for word models?

    opened by lintangsutawika 0
  • Cannot install sentencepiece with Python 3.11 on Windows


    Error alive again, Windows 10, Python 3.10.7

     Attempting uninstall: sentencepiece
        Found existing installation: sentencepiece 0.1.97
        Uninstalling sentencepiece-0.1.97:
          Successfully uninstalled sentencepiece-0.1.97
      Running setup.py install for sentencepiece ... error
      error: subprocess-exited-with-error
    
      × Running setup.py install for sentencepiece did not run successfully.
      │ exit code: 1
      ╰─> [24 lines of output]
          C:\Python310\lib\site-packages\setuptools\dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
            warnings.warn(
          running install
          C:\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
            warnings.warn(
          running build
          running build_py
          creating build
          creating build\lib.win-amd64-cpython-310
          creating build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/__init__.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          running build_ext
          building 'sentencepiece._sentencepiece' extension
          creating build\temp.win-amd64-cpython-310
          creating build\temp.win-amd64-cpython-310\Release
          creating build\temp.win-amd64-cpython-310\Release\src
          creating build\temp.win-amd64-cpython-310\Release\src\sentencepiece
          "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Python310\include -IC:\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-cpython-310\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
          cl : L¡nea de comandos warning D9025 : invalidando '/MD' con '/MT'
          sentencepiece_wrap.cxx
          src/sentencepiece/sentencepiece_wrap.cxx(2809): fatal error C1083: No se puede abrir el archivo incluir: 'sentencepiece_processor.h': No such file or directory
          error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30037\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      Rolling back uninstall of sentencepiece
      Moving to c:\python310\lib\site-packages\sentencepiece-0.1.97.dist-info\
       from C:\Python310\Lib\site-packages\~entencepiece-0.1.97.dist-info
      Moving to c:\python310\lib\site-packages\sentencepiece\
       from C:\Python310\Lib\site-packages\~entencepiece
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> sentencepiece
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure`
    
    Edit:
    This path: "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\" exists, and cl.exe is there too.
    

    Originally posted by @cibernicola in https://github.com/google/sentencepiece/issues/591#issuecomment-1250851548

    opened by kbatsuren 1
  • Build with protobuf in system


    While using the protobuf library installed on the system (i.e., SPM_USE_BUILTIN_PROTOBUF=OFF instead of the bundled third_party/protobuf-lite), a hard-coded header file inclusion causes an error.

    in init.h:21:

    #include "third_party/protobuf-lite/google/protobuf/message_lite.h"
    

    it should be

    #include "google/protobuf/message_lite.h"
    
    opened by acane77 1