Simple text to phones converter for multiple languages

Overview

Phonemizer -- foʊnmaɪzɚ

  • The phonemizer allows simple phonemization of words and texts in many languages.

  • Provides both the phonemize command-line tool and the Python function phonemizer.phonemize.

  • It uses four backends: espeak, espeak-mbrola, festival and segments.

    • espeak-ng supports a lot of languages and IPA (International Phonetic Alphabet) output.

    • espeak-ng-mbrola uses the SAMPA phonetic alphabet instead of IPA but does not preserve word boundaries.

    • festival currently supports only American English. It uses a custom phoneset, but it allows tokenization at the syllable level.

    • segments is a Unicode tokenizer that builds a phonemization from a grapheme-to-phoneme mapping provided as a file by the user.

Installation

You need python>=3.6. If you really need to use python2, use an older version of the phonemizer.

Dependencies

  • You need to install festival, espeak-ng and mbrola on your system. On Debian/Ubuntu simply run:

      $ sudo apt-get install festival espeak-ng mbrola
    
  • When using the espeak-mbrola backend, additional mbrola voices must be installed (see here). On Debian/Ubuntu, list the possible voices with apt search mbrola.

Phonemizer

  • The simplest way is using pip:

      $ pip install phonemizer
    
  • OR install it from sources with:

      $ git clone https://github.com/bootphon/phonemizer
      $ cd phonemizer
      $ [sudo] python setup.py install
    

    If you encounter an error such as ImportError: No module named setuptools during installation, refer to issue 11.

Docker image

Alternatively you can run the phonemizer within docker, using the provided Dockerfile. To build the docker image, run:

$ git clone https://github.com/bootphon/phonemizer
$ cd phonemizer
$ sudo docker build -t phonemizer .

Then run an interactive session with:

$ sudo docker run -it phonemizer /bin/bash

Testing

When installed from sources or within a Docker image, you can run the test suite from the root phonemizer folder (once you have installed pytest):

$ pip install pytest
$ pytest

Python usage

In Python import the phonemize function with from phonemizer import phonemize. See here for function documentation.

Command-line examples

The examples below can also be run from Python using the phonemize function.

For a complete list of available options, run:

$ phonemize --help

See the installed backends with the --version option:

$ phonemize --version
phonemizer-2.2
available backends: espeak-ng-1.49.3, espeak-mbrola, festival-2.5.0, segments-2.0.1

Input/output examples

  • from stdin to stdout:

      $ echo "hello world" | phonemize
      həloʊ wɜːld
    
  • from file to stdout

      $ echo "hello world" > hello.txt
      $ phonemize hello.txt
      həloʊ wɜːld
    
  • from file to file

      $ phonemize hello.txt -o hello.phon --strip
      $ cat hello.phon
      həloʊ wɜːld
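
The invocations above can also be scripted through the command-line tool. The following is a minimal sketch, assuming the phonemize executable is on the PATH; the helper names build_command and run_phonemize are hypothetical and not part of the package, and only flags shown in this README are used.

```python
import subprocess


def build_command(language="en-us", backend="espeak", phone_sep=None,
                  word_sep=None, strip=False, output=None):
    """Build an argv list for the `phonemize` command-line tool."""
    argv = ["phonemize", "-l", language, "-b", backend]
    if phone_sep is not None:
        argv += ["-p", phone_sep]
    if word_sep is not None:
        argv += ["-w", word_sep]
    if strip:
        argv.append("--strip")
    if output is not None:
        argv += ["-o", output]
    return argv


def run_phonemize(text, **options):
    """Pipe `text` to phonemize on stdin and return what it writes to stdout."""
    result = subprocess.run(build_command(**options), input=text,
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout
```

With the tool installed, run_phonemize("hello world", strip=True) would correspond to echo "hello world" | phonemize -l en-us -b espeak --strip.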
    

Backends

  • The default is to use espeak us-english:

      $ echo "hello world" | phonemize
      həloʊ wɜːld
      $ echo "hello world" | phonemize -l en-us -b espeak
      həloʊ wɜːld
    
  • Use festival US English instead

      $ echo "hello world" | phonemize -l en-us -b festival
      hhaxlow werld
    
  • In French, using espeak and espeak-mbrola, with custom token separators (see below). espeak-mbrola does not support word separation.

      $ echo "bonjour le monde" | phonemize -b espeak -l fr-fr -p ' ' -w '/w '
      b ɔ̃ ʒ u ʁ /w l ə /w m ɔ̃ d /w
      $ echo "bonjour le monde" | phonemize -b espeak-mbrola -l mb-fr1 -p ' ' -w '/w '
      b o~ Z u R l @ m o~ d
    
  • In Japanese, using segments

      $ echo 'konnichiwa' | phonemize -b segments -l japanese
      konnitʃiwa
      $ echo 'konnichiwa' | phonemize -b segments -l ./phonemizer/share/japanese.g2p
      konnitʃiwa
    

Supported languages

The exhaustive list of supported languages is available with the command phonemize --list-languages [--backend <backend>].

  • Languages supported by espeak are available here.

  • Languages supported by espeak-mbrola are available here. Please note that the mbrola voices are not bundled with the phonemizer and must be installed separately.

  • Languages supported by festival are:

      en-us -> english-us
    
  • Languages supported by the segments backend are:

      chintang  -> ./phonemizer/share/segments/chintang.g2p
      cree      -> ./phonemizer/share/segments/cree.g2p
      inuktitut -> ./phonemizer/share/segments/inuktitut.g2p
      japanese  -> ./phonemizer/share/segments/japanese.g2p
      sesotho   -> ./phonemizer/share/segments/sesotho.g2p
      yucatec   -> ./phonemizer/share/segments/yucatec.g2p
    

    Instead of a language you can also provide a file specifying a grapheme to phone mapping (see the files above for examples).
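
To give an idea of what a grapheme to phone mapping does, here is a minimal sketch of a greedy longest-match rewrite. It is an illustration only, not the algorithm used by the segments library; the function name apply_g2p and the toy mapping are hypothetical, and the mapping covers just the example word used earlier.

```python
def apply_g2p(text, mapping):
    """Rewrite `text` by greedy longest-match lookup in `mapping`."""
    longest = max(len(grapheme) for grapheme in mapping)
    phones = []
    position = 0
    while position < len(text):
        # Try the longest candidate first so that multi-letter
        # graphemes such as "ch" are matched as a whole.
        for size in range(min(longest, len(text) - position), 0, -1):
            chunk = text[position:position + size]
            if chunk in mapping:
                phones.append(mapping[chunk])
                position += size
                break
        else:
            raise ValueError(f"no mapping for {text[position]!r}")
    return "".join(phones)


# Hypothetical mapping, just large enough for the example word.
TOY_MAPPING = {"k": "k", "o": "o", "n": "n", "i": "i",
               "ch": "tʃ", "w": "w", "a": "a"}

print(apply_g2p("konnichiwa", TOY_MAPPING))  # konnitʃiwa
```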

Token separators

You can specify separators for phones, syllables (festival only) and words (except with espeak-mbrola).

$ echo "hello world" | phonemize -b festival -w ' ' -p ''
hhaxlow werld

$ echo "hello world" | phonemize -b festival -p ' ' -w ''
hh ax l ow w er l d

$ echo "hello world" | phonemize -b festival -p '-' -s '|'
hh-ax-l-|ow-| w-er-l-d-|

$ echo "hello world" | phonemize -b festival -p '-' -s '|' --strip
hh-ax-l|ow w-er-l-d

$ echo "hello world" | phonemize -b festival -p ' ' -s ';esyll ' -w ';eword '
hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword
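
The outputs above are consistent with a simple rule: without --strip every token is followed by its separator, and with --strip separators appear only between tokens. The sketch below reproduces that rule on an already syllabified input. It is an illustration, not the package's implementation; the function render and the nested-list input format are hypothetical, and the trailing word separator it emits without strip is an assumption.

```python
def render(words, phone="", syllable="", word=" ", strip=False):
    """Join words -> syllables -> phones using the given separators."""
    if strip:
        # Separators only appear between tokens.
        return word.join(
            syllable.join(phone.join(syll) for syll in w) for w in words)
    # Otherwise every token is followed by its own separator.
    return "".join(
        "".join("".join(p + phone for p in syll) + syllable for syll in w)
        + word
        for w in words)


# "hello world" as festival syllabifies it in the examples above.
HELLO_WORLD = [[["hh", "ax", "l"], ["ow"]], [["w", "er", "l", "d"]]]

print(render(HELLO_WORLD, phone="-", syllable="|", strip=True))
# hh-ax-l|ow w-er-l-d
```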

You cannot specify the same separator for several tokens (for instance a space for both phones and words):

$ echo "hello world" | phonemize -b festival -p ' ' -w ' '
fatal error: illegal separator with word=" ", syllable="" and phone=" ",
must be all differents if not empty
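
The constraint behind this error can be sketched as a small check: all non-empty separators must be distinct. This is a hypothetical helper written for illustration (check_separators is not a function of the package), with a message modelled on the one above.

```python
def check_separators(phone="", syllable="", word=" "):
    """Raise ValueError when two non-empty separators are identical."""
    used = [sep for sep in (phone, syllable, word) if sep]
    if len(used) != len(set(used)):
        raise ValueError(
            f'illegal separator with word="{word}", syllable="{syllable}" '
            f'and phone="{phone}", must be all different if not empty')


check_separators(phone=" ", word="")      # accepted: word separator is empty
check_separators(phone="-", syllable="|")  # accepted: all distinct
```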

Punctuation

By default the punctuation is removed in the phonemized output. You can preserve it using the --preserve-punctuation option (not supported by the espeak-mbrola backend):

$ echo "hello, world!" | phonemize --strip
həloʊ wɜːld

$ echo "hello, world!" | phonemize --preserve-punctuation --strip
həloʊ, wɜːld!
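
The two behaviours can be pictured as follows: the text is split on runs of punctuation, only the chunks between them are processed, and the marks are either dropped or put back in place. This is a simplified sketch, not the package's code; the function name, the mark set ;:,.!? and the use of str.upper as a stand-in for phonemization are all assumptions made for the example.

```python
import re

MARKS = ";:,.!?"  # placeholder set, not necessarily the tool's default


def apply_outside_punctuation(text, transform, preserve_punctuation=False,
                              marks=MARKS):
    """Apply `transform` to the text between punctuation marks."""
    pattern = re.compile("([" + re.escape(marks) + "]+)")
    pieces = pattern.split(text)
    output = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd indices hold the captured punctuation runs.
            output.append(piece if preserve_punctuation else " ")
        else:
            output.append(transform(piece))
    # Collapse the whitespace left behind by removed marks.
    return re.sub(r"\s+", " ", "".join(output)).strip()


print(apply_outside_punctuation("hello, world!", str.upper))
# HELLO WORLD
print(apply_outside_punctuation("hello, world!", str.upper,
                                preserve_punctuation=True))
# HELLO, WORLD!
```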

Espeak specific options

  • The espeak backend can output the stresses on phones:

      $ echo "hello world" | phonemize -l en-us -b espeak --with-stress
      həlˈoʊ wˈɜːld
    
  • The espeak backend can switch languages during phonemization (below, from French to English). Use the --language-switch option to choose how such switches are handled:

      $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch keep-flags
      [WARNING] fount 1 utterances containing language switches on lines 1
      [WARNING] extra phones may appear in the "fr-fr" phoneset
      [WARNING] language switch flags have been kept (applying "keep-flags" policy)
      ʒɛm lə- (en)fʊtbɔːl(fr)
    
      $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch remove-flags
      [WARNING] fount 1 utterances containing language switches on lines 1
      [WARNING] extra phones may appear in the "fr-fr" phoneset
      [WARNING] language switch flags have been removed (applying "remove-flags" policy)
      ʒɛm lə- fʊtbɔːl
    
      $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch remove-utterance
      [WARNING] removed 1 utterances containing language switches (applying "remove-utterance" policy)
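
The three policies can be summarised with a short sketch operating on an already phonemized utterance. It is an illustration rather than the backend's code: the function apply_language_switch is hypothetical, and the flag pattern is inferred from the outputs shown above, such as (en) and (fr).

```python
import re

# Assumed flag format: a lowercase language code in parentheses.
FLAG = re.compile(r"\([a-z][a-z0-9-]*\)")
POLICIES = ("keep-flags", "remove-flags", "remove-utterance")


def apply_language_switch(utterance, policy="keep-flags"):
    """Return the utterance after applying `policy`, or None if it is dropped."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy {policy!r}")
    if policy == "keep-flags" or not FLAG.search(utterance):
        return utterance
    if policy == "remove-flags":
        return FLAG.sub("", utterance)
    return None  # remove-utterance


print(apply_language_switch("ʒɛm lə- (en)fʊtbɔːl(fr)", "remove-flags"))
# ʒɛm lə- fʊtbɔːl
```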
    

Licence

Copyright 2015-2021 Mathieu Bernard

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Comments
  • add direct access to punctuation regex

    Fixes #99

    First pass at adding this feature. Not extensively tested yet.

    I took the approach @mmmaat suggested, which made the changes pretty minimal. But I'd be happy to rewrite it the way @hadware proposed if it's decided that's better.

    opened by jncasey 14
  • [espeak][korean] end of espeak output discarded by phonemizer

    echo "하늘은 파랗게 구름은 하얗게 실바람도 불어와 부풀은 내 마음 나뭇잎 푸르게 강물도 푸르게 아름다운 이곳에 내가 있고 네가 있네 우리는 이 땅 위에 우리는 태어나고 아름다운 이곳에 자랑스런 이곳에 살리라 찬란하게 빛나는 붉은 태양이 비추고 하얀 물결 넘치는 저 바다와 함께 있네 그 얼마나 좋은가 우리 사는 이곳에 사랑하는 그대와 노래하리 빰빠밤빠밤 빠바밤 빠바밤 빰빠 빠바바바바밤 오늘도 너를 만나러 가야지 말해야지 먼 훗날에 너와 나 살고 지고 영원한 이곳에 우리의 새 꿈을 만들어 보고파 찬란하게 빛나는
    붉은 태양이 비추고 하얀 물결 넘치는 저 바다와 함께 있네 그 얼마나 좋은가 우리 사는 이곳에 사랑하는 그대와 사랑하며 노래하리 빰빠밤빠밤 빠바밤 빠바밤 빰빠 빠바바바바밤 빰빠밤빠밤 빠바밤 빠바밤 빰빠 빠바바바바밤 빰빠밤빠밤 빠바밤 빠바밤 빰빠 빠바바바바밤 빰빠밤빠밤 빠바밤 빠바밤 빰빠 빠바바바바밤 오오오오 봄여름이 지나면 가을 겨울이 온다네 아름다운 강산 너의 마음은 나의 마음 나의 마음은 너의 마음 너와 나는 한마음 너와 나 우리 영원히 영원히 사랑 영원히 영원히 우리 모두 다 모두 다 끝없이 다정해 end of the sentence" | phonemize
    [WARNING] 1 utterances containing language switches on lines 1
    [WARNING] extra phones may appear in the "en-us" phoneset
    [WARNING] language switch flags have been kept (applying "keep-flags" policy)
    (ko)hɐnɯɾɯn phɐɾɐtkhe ɡuɾɯmɯn hɐjɐtkhe siɫbɐɾɐmdo puɾʌwɐ puphuɾɯnnɛmɐɯmnɐmunnip phuɾɯqe ɡɐŋmuɫdo phuɾɯqe ɐɾɯmdɐun iqosenɛqɐ itkoneqɐ inne uɾinɯn i tɐŋ wie uɾinɯn thɛʌnɐqo ɐɾɯmdɐun iqose tɕɐɾɐŋsɯɾʌn iqose sɐliɾɐ tʃhɐnɾɐnhɐqe pinnɐnɯn pulɡɯn thɛjɐŋi pitʃhuqo hɐjɐnmuɫqjʌɫnʌmtʃhinɯn tɕʌ pɐdɐwɐ hɐmqe inne ɡɯ ʌɫmɐnɐ tɕot(enus) (ko)hɐjɐnmuɫqjʌɫnʌmtʃhinɯn tɕʌ pɐdɐwɐ hɐmqe inne ɡɯ ʌɫmɐnɐ tɕoɯnqɐ uɾi sɐnɯn iqose sɐɾɐŋhɐnɯn ɡɯdɛwɐ sɐɾɐŋhɐmjʌnoɾɛhɐɾi pɐmpɐbɐmpɐbɐm pɐbɐbɐm pɐbɐbɐm pɐmpɐ pɐbɐbɐbɐbɐbɐm pɐmpɐbɐmpɐbɐm pɐbɐbɐm pɐbɐbɐm pɐmpɐ pɐbɐbɐbɐbɐbɐm pɐmpɐbɐmpɐbɐm pɐbɐbɐm pɐbɐbɐm pɐmpɐ pɐbɐbɐbɐbɐbɐm pɐmpɐbɐmpɐbɐm pɐbɐbɐm pɐbɐ(enus)
    

    I'm using WSL to convert Korean to IPA. For some reason the phonemizer takes only part of the sentence as input and does not process the characters after that. I tried using cat, echo, and the phonemizer file I/O (using option -o), but the results are all the same.

    bug 
    opened by Ldoun 10
  • Request: more flexibility around punctuation definitions

    Is your feature request related to a problem? Please describe. I'd like more flexibility in defining punctuation, ideally by having access directly to the regex.

    Specifically, instead of defining the characters to be counted as punctuation, I think it'd be more useful to me to define which characters are words to be phonemized, and treat everything else as punctuation.

    Describe the solution you'd like Something as broad as [^\p{L}\p{M}0-9'] could work as a default, which from what I understand would capture everything that's not a number, unicode letter or its diacritics.

    That may be overly broad, though, because I've run into trouble with espeak and characters from Cyrillic and Korean sets already, and I'd imagine characters from other less-supported languages could also be problematic.

    Describe alternatives you've considered In my local copy of phonemizer, I've played with hard-coding the punctuation regex like so:

    marks = "[^a-zA-ZÀ-ÖØ-öø-ÿ0-9',.$@&+\\-=/\\\]"
    self._marks_re = re.compile(fr'(\s*{marks}+\s*|\s*(?<!\d),\s*|\s*(?<!\d)\.(?!\d)\s*)+')
    

    which captures everything that's not a latin character or a set of marks that the backends can pronounce, like "æt" for "@".

    The back half of this is also an attempt to handle the problem raised in #87, though I haven't tested it much, and there may be some cases where it breaks.

    feature request 
    opened by jncasey 9
  • fixes to --preserve-punctuation

    This reworks a few things related to preserving punctuation.

    • Fixes the inconsistency described in #106. Now word separators appear in the same locations whether preserve_punctuation is True or False.
    • Addresses part of the problem described in #104 (but doesn't return the output to what it was in 0.3, which I believe was incorrect – they'd also need to use a non-None word separator, maybe, to get what they're after?)
    • Fixes #108 by refactoring the restore method to be iterative instead of recursive

    A number of the tests had to be updated due to that first bullet point.

    opened by jncasey 8
  • Adding preserve_empty_lines option

    Adding the feature I requested in #95.

    Not stripping out the empty lines was causing problems with the festival backend and preserve_punctuation, so I took the approach of stripping out the empty lines pre-phonemization and then reinserting them afterward.

    I want to flag that I changed the conversion of the input text to a list from a generator to list comprehension here since I needed to run through the list a second time to preserve the empty lines, and I thought this made the code easier to read than making another generator. I'm assuming that this won't make a big difference on performance given how I think phonemizer is generally used.
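
The strip-then-reinsert approach described here can be sketched as below. This is an illustration and not the code of the pull request; the function name and the phonemize_lines callable (any function mapping a list of non-empty lines to a list of results of the same length) are hypothetical.

```python
def phonemize_keeping_empty_lines(lines, phonemize_lines):
    """Process only non-empty lines, then put empty lines back in place."""
    # Remember where each non-empty line came from.
    kept = [(index, line) for index, line in enumerate(lines) if line.strip()]
    results = phonemize_lines([line for _, line in kept])
    output = [""] * len(lines)
    for (index, _), result in zip(kept, results):
        output[index] = result
    return output


# str.upper stands in for the real phonemization step.
print(phonemize_keeping_empty_lines(
    ["hello", "", "world"], lambda batch: [line.upper() for line in batch]))
# ['HELLO', '', 'WORLD']
```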

    opened by jncasey 8
  • Windows issue with NamedTemporaryFile

    When I run phonemize on Windows 10 with Python 3.6, I still have an issue: it looks like the temp file is not created at all (I can see that the backslash issue is fixed).

    >>> ph=phonemize('Hello World',strip=False,njobs=1,backend='espeak')
    Failed to read file 'C:\Users\cinetec\AppData\Local\Temp\tmp5sigu2vf'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\cinetec\AppData\Local\Programs\Python\Python36\lib\site-packages\phonemizer-1.0.1-py3.6.egg\phonemizer\phonemize.py", line 94, in phonemize
        text, separator=separator, strip=strip, njobs=njobs)
      File "C:\Users\cinetec\AppData\Local\Programs\Python\Python36\lib\site-packages\phonemizer-1.0.1-py3.6.egg\phonemizer\backend.py", line 130, in phonemize
        out = self._phonemize_aux(self._list2str(text), separator, strip)
      File "C:\Users\cinetec\AppData\Local\Programs\Python\Python36\lib\site-packages\phonemizer-1.0.1-py3.6.egg\phonemizer\backend.py", line 235, in _phonemize_aux
        shlex.split(command, posix=False)).decode('utf8')
      File "C:\Users\cinetec\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 356, in check_output
        **kwargs).stdout
      File "C:\Users\cinetec\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 438, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['espeak', '-ven-us', '--ipa=3', '-q', '-f', 'C:\\Users\\cinetec\\AppData\\Local\\Temp\\tmp5sigu2vf']' returned non-zero exit status 1.
    
    opened by snowzhangy 8
  • State of the field and force alignment literature

    Part of a review at openjournals/joss-reviews#3958

    • [ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?

    The manuscript identifies a number of related software packages with which this program could interface, but there is a gap in its references to the force alignment literature. Within linguistics, especially phonetic analysis, force alignment is an important part of the research pipeline whereby an acoustic signal is segmented and aligned with a text transcript. This then allows corpus queries and phonetic analysis of segments. The paper briefly touched on this when discussing Kaldi (Povey, et al. 2011), but the state of the field is broader and this program has important implications for that field. I believe the paper would be improved by further review of that literature.

    The most impactful piece of software in that field is the force Alignment and Vowel Extraction (FAVE) toolkit (Rosenfelder, et al. 2014) which converts orthographic transcriptions to phonetic transcriptions through dictionary lookups using the CMU pronunciation dictionary. This has the downside of not being able to handle out-of-dictionary words requiring experimenter transcription or data exclusion.

    Other researchers have been trying to improve coverage of force alignment to underdocumented languages and a major problem is the lack of grapheme-to-phoneme mappings (Barth, et al. 2020) or comprehensive pronunciation dictionaries (Johnson, Di Paolo, and Bell 2018). These can be substantial work and the language, orthographic system, or researcher time can limit the utility of these approaches.

    These programs require a task similar to the one performed by this package, but do it in a seemingly different way. Comparing this package to the methods used in those packages will improve the paper by connecting it to a wider body of literature and identifying new potential areas of impact.

    This is a really interesting project, and I'm excited to look further into the code!

    joss 
    opened by chrisbrickhouse 7
  • instructions for docker users

    How does one access one's files from inside the interactive session? That is, if I sudo docker run -it phonemizer /bin/bash I get transported inside a different space, where my files are not available. Right? And this is the only way to call phonemizer in mac?

    Minor suggestion: In the instructions for docker users, add the explicit instruction to do git clone https://github.com/bootphon/phonemizer.git.

    opened by alecristia 7
  • espeak-ng

    Hi, I heard about espeak-ng and am considering installing it: https://github.com/espeak-ng/espeak-ng Do you know if phonemizer will work with espeak-ng? regards, Andrew

    feature request help wanted 
    opened by cainesap 7
  • When phonemizing a text which has more than 100k utterances, it always gives a "RuntimeError"

    Describe the bug

    When phonemizing a text which has more than 100k utterances, it always raises a "RuntimeError", with messages including "espeak not installed on your system", "failed to find espeak library" and "invalid voice code 'cmn'", at around 900 utterances.

    Phonemizer version phonemizer-3.0 available backends: espeak-ng-1.49.2, espeak-mbrola, festival-2.5.0, segments-2.2.0

    System cat /proc/version: Linux version 4.15.0-106-generic ([email protected]) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04))

    python: Python 3.9.1 (default, Dec 11 2020, 14:32:07) [GCC 7.3.0] :: Anaconda, Inc. on linux

    To reproduce

    txtdict = txt2dict(text_path)
    
    with open(scp_path) as f:
        for line in f.readlines():
            txt = txtdict.get(line[0])
            phone = phonemize(txt, backend='espeak', language='cmn', 
                            separator=Separator(word='/', phone=' ', syllable="-"))
            rows.append([wav, new_wav, txt, phone, new_phone])
    

    Expected behavior

    opened by YoungKang1222 6
  • Problems about Mandarin phoneme

    When running phonemize with -l cmn or zh, the output uses the International Phonetic Alphabet instead of Mandarin phonemes.

    Is this a code problem or a bug? How can i convert Chinese to Mandarin phoneme? Look forward to your reply, thanks!

    Expected behavior I try to convert Chinese to Mandarin phoneme with this code:

    cat chinese.txt | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o train_out.phn -p ' ' -w '' -l zh -j 70 --language-switch remove-flags

    Result: fatal error: language "zh" is not supported by the espeak backend. I then checked: espeak --voices includes Mandarin, and phonemize --list-languages reports cmn -> Chinese (Mandarin).

    Phonemizer version phonemizer-3.0 espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.0

    System ubuntu 20.04

    bug 
    opened by mynah15 6
  • Disparity between backends with punctuation

    Describe the bug When using the default preserve_punctuation=False, the Festival backend ignores text that only contains punctuation, whereas the Espeak backend returns the empty string.

    Phonemizer version

    phonemizer-3.2.1
    available backends: espeak-ng-1.50, espeak-mbrola, festival-2.5.0, segments-2.2.1
    

    System Ubuntu 20.04.4 Linux kernel 5.15.0 Python 3.8.10

    To reproduce

    from phonemizer import phonemize
    
    print(phonemize([".", "."], language="en-us", backend="festival"))
    print(phonemize([".", "."], language="en-us", backend="espeak"))
    print(phonemize([".", "."], language="mb-us1", backend="espeak-mbrola"))
    

    Yields output

    []
    ['', '']
    ['', '']
    

    Expected behavior Should output:

    ['', '']
    ['', '']
    ['', '']
    
    bug 
    opened by agkphysics 1
  • EspeakBackend enters a corrupted state upon seeing some characters

    Describe the bug When calling phonemize on an instance of EspeakBackend with the character "ꪁ", the backend enters a corrupted state where all succeeding phonemization (including in the sentence with "ꪁ") is incorrect.

    Phonemizer version Phonemizer 3.2.1 Espeak NG 1.50

    System Reproduced the bug both on Win10 and Ubuntu

    To reproduce

    from phonemizer.backend import EspeakBackend
    
    texts = [
        "a, b, c, d, e, f, p, w, y, z",
        "ꪁ",
        "a, b, c, d, e, f, p, w, y, z"
    ]
    
    backend = EspeakBackend(
        language="en-us", preserve_punctuation=True, with_stress=True,
        language_switch="remove-flags", words_mismatch="ignore"
    )
    
    for text in texts:
        print(backend.phonemize([text])[0])
    

    Expected behavior Expected output:

    ˈeɪ , bˈiː , sˈiː , dˈiː , ˈiː , ˈɛf , pˈiː , dˈʌbəljˌuː , wˈaɪ , zˈiː 
    
    ˈeɪ , bˈiː , sˈiː , dˈiː , ˈiː , ˈɛf , pˈiː , dˈʌbəljˌuː , wˈaɪ , zˈiː
    

    Actual output:

    ˈeɪ , bˈiː , sˈiː , dˈiː , ˈiː , ˈɛf , pˈiː , dˈʌbəljˌuː , wˈaɪ , zˈiː 
    
    ˈʌ , bˈʌ , sˈʌ , dˈʌ , ˈʌ , ˈʌf , pˈʌ , dˈʌbd-jʌ , wˈʌ , zˈʌ 
    
    bug espeak 
    opened by CorentinJ 1
  • Can't use multiple EspeakBackend objects with njobs=1

    Describe the bug It seems that instantiation of multiple EspeakBackend objects is not correctly handled. All the objects start operating with the language used to instantiate the last object. Please refer to the example below.

    Phonemizer version 3.0.1

    System macOS 11.6.4 python 3.8.9 [Clang 13.0.0 (clang-1300.0.29.30)] on darwin

    To reproduce

    from phonemizer.backend import EspeakBackend
    
    en_backend = EspeakBackend(
        "en-us",
        preserve_punctuation=True,
        with_stress=True,
        language_switch="remove-flags",
        words_mismatch="ignore",
    )
    en_sentence = ["I love to eat pizza everyday"]
    print(en_backend.phonemize(en_sentence, njobs=1, strip=True)) # ['aɪ lˈʌv tʊ ˈiːt pˈiːtsə ˈɛvɹɪdˌeɪ']
    
    de_backend = EspeakBackend(
        "de",
        preserve_punctuation=True,
        with_stress=True,
        language_switch="remove-flags",
        words_mismatch="ignore",
    )
    de_sentence = ["ich esse jeden tag gerne pizza."]
    print(de_backend.phonemize(de_sentence, njobs=1, strip=True)) # ['ɪç ˈɛsə jˈeːdən tˈɑːk ɡˈɛɾnə pˈɪtsɑː.']
    
    incorrect_en = en_backend.phonemize(en_sentence, njobs=1, strip=True)
    en_with_de = de_backend.phonemize(en_sentence, njobs=1, strip=True)
    
    assert en_with_de == incorrect_en
    print(incorrect_en, en_with_de) 
    # ['ˈiː lˈoːvə tˈoː eːˈɑːt pˈɪtsɑː ˈeːveːrˌyːdɛɪ'] 
    # ['ˈiː lˈoːvə tˈoː eːˈɑːt pˈɪtsɑː ˈeːveːrˌyːdɛɪ']
    

    Expected behavior Notice that incorrect_en is equal to en_with_de and not equal to en_sentence.

    Additional context This problem happens only with njobs=1 and doesn't appear with njobs>1

    bug espeak 
    opened by eeishaan 0
  • Use espeak phone X-SAMPA to language-specific SAMPA foldings.

    Currently, to get the language-specific SAMPA form of each phoneme, a working (and espeak-friendly) installation of mbrola is required. This is problematic for several reasons:

    • it requires an additional system-wide package install on linux platform (even though mbrola is available on most distributions)
    • it requires the installation of a corresponding mbrola voice, which is either impractical and/or quite heavy. Moreover, the entirety of the voice's speech data isn't actually used by espeak for the phonemization.
    • the OSX/windows support for mbrola is very bad.

    The foldings are all here: https://github.com/espeak-ng/espeak-ng/tree/master/phsource/mbrola

    espeak 
    opened by hadware 1
  • fatal error: language "mb-fr4" is not supported by the espeak-mbrola backend

    Running the command echo "bonjour le monde" | phonemize -b espeak-mbrola -l mb-fr1 -p ' ' -w '/w ' gives this error:

    fatal error: language "mb-fr4" is not supported by the espeak-mbrola backend

    opened by sravani40 1
Releases(v3.2.1)
  • v3.2.1(Jun 9, 2022)

  • v3.2.0(May 23, 2022)

    • bug fixes

      • Fixed a bug when trying to restore punctuation on very long text. See #108
    • improvements

      • Improved consistency with the handling of word separators when preserving punctuation, and when using a word separator that is not a literal space character. See #106
    • new features

      • Added the option to define punctuation with a regular expression. Previously only strings were accepted. See #120

        • In the python API, the punctuation_marks parameter can now be passed to phonemize (or a backend constructor) as a re.Pattern that defines which characters will be matched as punctuation. Passing punctuation_marks as a str will continue to function as before, treating each character in the string as a punctuation mark.

        • Added the optional parameter --punctuation_marks_is_regex to the CLI interface. When used, the CLI will attempt to compile a re.Pattern from the value passed to --punctuation-marks.
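
The difference between the two accepted forms can be sketched like this: a str is read as a set of individual marks, while a compiled pattern is used as given. The helper as_pattern is hypothetical and only illustrates the behaviour described above; it is not the package's implementation.

```python
import re


def as_pattern(punctuation_marks):
    """Normalize a str of marks or a compiled regex into a re.Pattern."""
    if isinstance(punctuation_marks, re.Pattern):
        # Already a regular expression: use it unchanged.
        return punctuation_marks
    if isinstance(punctuation_marks, str):
        # Each character of the string counts as one punctuation mark.
        return re.compile("[" + re.escape(punctuation_marks) + "]")
    raise TypeError("punctuation_marks must be a str or a re.Pattern")


print(as_pattern(";:,.!?").sub("", "hello, world!"))            # hello world
print(as_pattern(re.compile(r"[^\w\s]")).sub("", "hello, world!"))  # hello world
```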

  • v3.1.1(Mar 31, 2022)

    ChangeLog

    • improvements

      • Preserve empty lines in texts when using --preserve-empty-lines. Without this option, empty lines used to be automatically dropped. See PR #103
    • new features

      • Type hinted most of phonemizer's API. This makes the usage of our API a bit clearer, and can be easily leveraged by IDEs and type checkers to prevent typing issues.
  • v3.0.1(Dec 18, 2021)

    ChangeLog

    • improvements in README after JOSS reviews

    • bug fixes

      • The method BaseBackend.phonemize now raises a RuntimeError if the input text is a str instead of a list of str (it was only logging an error message).

      • Preserve punctuation alignment when using --preserve-punctuation (a space was previously inserted before each punctuation token), see issue #97.

  • v3.0(Oct 25, 2021)

    phonemizer-3.0

    breaking change

    • Do not remove empty lines from output. For example:

      # this is now
      phonemize(["hello", "!??"]) == ['həloʊ ', '']
      # this was
      phonemize(["hello", "!??"]) == ['həloʊ ']
      
    • Default backend in the phonemize function is now espeak (was festival).

    • espeak-mbrola backend now requires espeak>=1.49.

    • --espeak-path option renamed as --espeak-library and PHONEMIZER_ESPEAK_PATH environment variable renamed as PHONEMIZER_ESPEAK_LIBRARY.

    • --festival-path option renamed as --festival-executable and PHONEMIZER_FESTIVAL_PATH environment variable renamed as PHONEMIZER_FESTIVAL_EXECUTABLE.

    • The methods backend.phonemize() from the backend classes take only a list of str as input text (was either a str or a list of str).

    • The methods backend.version() from the backend classes return a tuple of int instead of a str.

    improvements

    • espeak and mbrola backends now rely on the espeak shared library using the ctypes Python module, instead of relying on the espeak executable through subprocesses. This brings drastic speed improvements, up to 40 times faster.

    new features

    • New option --prepend-text to prepend the input text to phonemized utterances, so as to have both orthographic and phonemized available at output.

    • New option --tie for the espeak backend to display a tie character within multi-letter phonemes. (see issue #74).

    • New option --words-mismatch for the espeak backend. This detects when espeak merges consecutive words or drops a word from the orthographic text. Possible actions are to ignore those mismatches, to issue a warning for each line where a mismatch is detected, or to remove those lines from the output.
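
As an illustration of the idea behind --words-mismatch, the sketch below compares the number of words in each input line with the number of words in its phonemized counterpart. It is not the backend's implementation: the function is hypothetical, and while "ignore" appears elsewhere on this page, the policy names "warn" and "remove" are assumptions based on the description above.

```python
import warnings


def handle_words_mismatch(texts, phonemized, policy="ignore",
                          word_separator=" "):
    """Return the phonemized lines, filtered or flagged according to `policy`."""
    kept = []
    pairs = zip(texts, phonemized)
    for number, (text, phones) in enumerate(pairs, start=1):
        expected = len(text.split())
        found = len([w for w in phones.split(word_separator) if w])
        mismatch = expected != found
        if mismatch and policy == "warn":
            warnings.warn(f"words count mismatch on line {number} "
                          f"(expected {expected}, got {found})")
        if mismatch and policy == "remove":
            continue  # drop the offending line from the output
        kept.append(phones)
    return kept


# Stand-in outputs: the second line has lost a word boundary.
texts = ["hello world", "a b c"]
outputs = ["həloʊ wɜːld", "ab c"]
print(handle_words_mismatch(texts, outputs, policy="remove"))
# ['həloʊ wɜːld']
```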

    bugfixes

    • phonemizer's logger no longer conflicts with other loggers when imported from Python (see PR #61).
  • v2.2.2(Jan 6, 2021)

    phonemizer-2.2.2

    • bugfixes

      • Fixed installation from source (bug introduced in 2.2.1, see issue #52).

      • Fixed a bug when trying to restore punctuation on an empty text (see issue #54).

      • Fixed an edge case bug when using custom punctuation marks (see issue #55).

      • Fixed regex issue that causes digits to be considered punctuation (see issue #60).

  • v2.2.1(Jul 24, 2020)

    Changelog for phonemizer-2.2.1

    • improvements

      From Python import the phonemize function using from phonemizer import phonemize instead of from phonemizer.phonemize import phonemize. The second import is still available for compatibility.

    • bugfixes

      • Fixed a minor bug in utils.chunks.

      • Fixed warnings on language switching for espeak backend when using parallel jobs (see issue #50).

      • Save file in utf-8 explicitly for Windows compat (see issue #43).

      • Fixed build and tests in Dockerfile (see issue #45).

  • v2.2(Feb 27, 2020)

    ChangeLog

    • new features

      • New option --list-languages to list the available languages for a given backend from the command line.

      • The --sampa option of the espeak backend has been replaced by a new backend espeak-mbrola.

        • The former --sampa option (introduced in phonemizer-2.0) outputs phones that are not standard SAMPA but are adapted to the espeak TTS front-end.

        • On the other hand the espeak-mbrola backend allows espeak to output phones in standard SAMPA (adapted to the mbrola TTS front-end). This backend requires mbrola to be installed, as well as additional mbrola voices to support needed languages. This backend does not support word separation nor punctuation preservation.

    • bugfixes

      • Fixed issues with punctuation processing on some corner cases, see issues #39 and #40.

      • Improvements and updates in the documentation (Readme, phonemize --help and Python code).

      • Fixed a test when using espeak>=1.50.

      • Empty lines are correctly ignored when reading text from a file.

  • v2.1(Jan 29, 2020)

    ChangeLog for phonemizer-2.1

    • new features

      • Possibility to preserve the punctuation (ignored and silently removed by default) in the phonemized output with the new option --preserve-punctuation from command line (or the equivalent preserve_punctuation from Python API). With the punctuation-marks option, one can override the default marks considered as punctuation.

      • It is now possible to specify the path to a custom espeak or festival executable (for instance to use a local installation or to test different versions). Either specify the PHONEMIZER_ESPEAK_PATH environment variable, the --espeak-path option from command line or use the EspeakBackend.set_espeak_path method from the Python API. Similarly for festival use PHONEMIZER_FESTIVAL_PATH, --festival-path or FestivalBackend.set_festival_path.

      • The --sampa option is now available for espeak (was available only for espeak-ng).

      • When using espeak with SAMPA output, some SAMPA phones are corrected to correspond to the normalized SAMPA alphabet (espeak seems not to respect it). The corrections are language specific. A correction file must be placed in phonemizer/share/espeak. This has been implemented only for French so far.

    • bugfixes

      • correctly parses the version of espeak-ng, even for dev versions (e.g. 1.51-dev).

      • fixed an issue with espeak backend, where multiple phone separators can be present at the end of a word, see #31.

      • added an additional stress symbol - for espeak.
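
The lookup of a custom espeak or festival executable described in the new features above can be pictured roughly as follows. This is a minimal sketch, not phonemizer's actual implementation: the helper resolve_executable and its precedence order (explicit path, then environment variable, then a search on the PATH) are assumptions made for illustration. Only the PHONEMIZER_ESPEAK_PATH variable and the --espeak-path option come from the notes above.

```python
import os
import shutil


def resolve_executable(env_var, default_names, explicit_path=None):
    """Return the path of the executable to use, or None if none is found.

    Hypothetical helper (not part of phonemizer). Assumed precedence:
      1. an explicitly given path (as with --espeak-path),
      2. the environment variable (as with PHONEMIZER_ESPEAK_PATH),
      3. the first default name found on the PATH.
    """
    candidate = explicit_path or os.environ.get(env_var)
    if candidate:
        # A custom path must point to an existing executable file.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
        raise ValueError(f'{candidate} is not an executable file')

    # No custom path requested: fall back to a regular PATH lookup.
    for name in default_names:
        found = shutil.which(name)
        if found:
            return found
    return None


if __name__ == '__main__':
    # Prints the espeak binary that would be used, or None if not installed.
    print(resolve_executable('PHONEMIZER_ESPEAK_PATH', ['espeak-ng', 'espeak']))
```

Raising an error on an invalid custom path, instead of silently falling back to the system installation, is a design choice of this sketch: it avoids testing a different espeak version than the one requested.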

  • v2.0.1(Nov 7, 2019)

    phonemizer-2.0.1

    • bugfixes

      • keep-flags was not the default argument for language_switch in the class EspeakBackend.

      • fixed an issue with punctuation processing in the espeak backend, see #26

    • improvements

      • log a warning if using python2.
  • v2.0(Oct 10, 2019)

    ChangeLog

    • incompatible change

      Starting with phonemizer-2.0, only python3 is supported. Compatibility with python2 is no longer ensured or tested (see https://pythonclock.org).

    • bugfixes

      • new --language-switch option to use with the espeak backend to deal with language switching in the phonemized output. In previous versions there was a bug in the detection of the language switching flags (sometimes removed, sometimes not). Now you can choose to keep the flags, to remove them, or to delete the whole utterance.

      • bugfix in a test with espeak>=1.49.3.

      • bugfix using NamedTemporaryFile on windows, see #21.

      • bugfix when calling festival or espeak subprocesses on Windows, see #17.

      • bugfix in detecting recent versions of espeak-ng, see #18.

      • bugfix when using utf8 input on espeak backend (python2), see #19.
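
The three behaviours of the --language-switch option listed above (keep the flags, remove them, or delete the whole utterance) can be sketched in a few lines. This is an illustration under stated assumptions, not the code of the espeak backend: the function process_language_switch is hypothetical, the flags are assumed to appear as a language code in parentheses such as (en) or (fr), and the mode names other than keep-flags are assumed by analogy with it. The sample strings are made up for the example.

```python
import re

# Assumption: a language switch flag is a short parenthesized code, e.g. "(en)".
_FLAG = re.compile(r'\(.+?\)')


def process_language_switch(utterances, mode='keep-flags'):
    """Apply a language switch policy to a list of phonemized utterances.

    Hypothetical helper, for illustration only.
    """
    if mode == 'keep-flags':
        # Leave the output untouched.
        return list(utterances)
    if mode == 'remove-flags':
        # Strip the flags but keep the phones of every utterance.
        return [_FLAG.sub('', utt) for utt in utterances]
    if mode == 'remove-utterance':
        # Drop any utterance in which a language switch occurred.
        return [utt for utt in utterances if not _FLAG.search(utt)]
    raise ValueError(f'unknown language switch mode: {mode}')


if __name__ == '__main__':
    sample = ['sa va', 'la (en)wiːkɛnd(fr)']
    for mode in ('keep-flags', 'remove-flags', 'remove-utterance'):
        print(mode, process_language_switch(sample, mode))
```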

    • new features and improvements

      • new --sampa option to output phonemes in SAMPA alphabet instead of IPA, available for espeak-ng only.

      • new --with-stress option to use with the espeak backend to keep the stress marks in the phonemized output (they are removed by default). For instance:

        $ echo "hello world" | phonemize
        həloʊ wɜːld
        $ echo "hello world" | phonemize --with-stress
        həlˈoʊ wˈɜːld
        
      • improved logging: by default only warnings are displayed, use the new --quiet option to suppress all log messages or --verbose to see all of them. Log messages now display the level name (debug/info/warning).

      • improved code organization:

        • backends are now implemented in the backend submodule as separated source files.

        • improved version string (displays uninstalled backends, moved outside of main for use from Python).

        • improved logger implemented in its own module, so that a call to phonemizer from the CLI or the API yields the same log messages.
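
The difference between the two outputs in the --with-stress example above is only the presence of stress marks. As a rough sketch of what removing them amounts to, assuming that stress is encoded with the IPA primary (ˈ) and secondary (ˌ) stress symbols, the hypothetical function remove_stress below deletes those characters from a phonemized string. It is not the backend's own code, and the set of marks is an assumption that can be extended through the marks argument.

```python
# Assumption: stress is marked with the IPA symbols
# U+02C8 (primary stress) and U+02CC (secondary stress).
STRESS_MARKS = '\u02c8\u02cc'


def remove_stress(phonemized, marks=STRESS_MARKS):
    """Return the phonemized string with the given stress marks deleted.

    Hypothetical helper, for illustration only.
    """
    # str.translate with a None mapping deletes the characters.
    return phonemized.translate({ord(mark): None for mark in marks})


if __name__ == '__main__':
    # Output of `phonemize --with-stress` taken from the example above.
    print(remove_stress('həlˈoʊ wˈɜːld'))
```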

  • v1.0(Dec 18, 2018)

    • incompatible changes

      The following changes break the compatibility with previous versions of phonemizer (0.X.Y):

      • command-line phonemize program: new --backend <espeak|festival|segments> option, default language is now espeak en-us (was festival en-us),

      • it is now illegal to have the same separator at different levels (for instance a space for both word and phone),

      • from Python, the phonemize function must now be imported with from phonemizer.phonemize import phonemize (it was from phonemizer import phonemize).

    • New backend segments for phonemization based on grapheme-to-phoneme mappings.

    • Major refactoring of the backends implementation and separators (as Python classes).

    • Input to phonemizer now supports utf8.

    • Better handling of errors (display of a meaningful message).

    • Fixed a bug in fetching espeak version on macos, see #14.
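
The rule that the same separator cannot be used at different levels (for instance a space for both word and phone) can be illustrated with a small validation sketch. The class LevelSeparators below is hypothetical and is not phonemizer's own separator class. It assumes three levels (word, syllable, phone) and treats empty or missing separators as unused.

```python
class LevelSeparators:
    """Hold the word, syllable and phone separators.

    Hypothetical sketch: rejects configurations where two levels
    share the same non-empty separator.
    """

    def __init__(self, word=' ', syllable=None, phone=None):
        levels = {'word': word, 'syllable': syllable, 'phone': phone}

        # Only separators that are actually set can conflict.
        used = [sep for sep in levels.values() if sep]
        if len(used) != len(set(used)):
            raise ValueError(
                f'illegal separators, each level must be unique: {levels}')

        self.word = word
        self.syllable = syllable
        self.phone = phone


if __name__ == '__main__':
    LevelSeparators(word=' ', phone='-')  # accepted
    try:
        LevelSeparators(word=' ', phone=' ')  # same separator at two levels
    except ValueError as err:
        print('rejected:', err)
```

Rejecting ambiguous configurations up front matters because an output where words and phones are delimited by the same character cannot be split back into levels.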

  • v0.3.3(Aug 29, 2018)

  • v0.3.2(Jul 26, 2018)

    ChangeLog

    • Continuous integration with travis-ci
    • Support for docker
    • Better support for different versions of espeak/festival
    • Minor bugfixes and improved tests
  • v0.3.1(Nov 13, 2017)

    ChangeLog

    • New espeak or espeak-ng backend with more than 100 languages
    • Support for Python 2.7 and 3.5
    • Integration with zenodo for citation
    • Various bugfixes and minor improvements