WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

Overview

WikiPron

WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228. [bibtex]

Command-line tool

Installation

WikiPron requires Python 3.8+ (support for Python 3.6 and 3.7 was dropped in v1.3.0). It is available from PyPI:

pip install wikipron

Usage

Quick Start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes grapheme-to-phoneme (G2P) data for French:

wikipron fra

Specifying the Language

The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. To find out which languages can be scraped, see the complete list of languages on Wiktionary that have pronunciation entries.

Specifying the Dialect

One can optionally specify dialects to target using the --dialect flag. The dialect name can be found together with the transcription on Wiktionary. For example, in "(UK, US) IPA: /təˈmɑːtəʊ/", the dialect names are UK and US. To select the union of several dialects, separate them with the pipe character '|', e.g., --dialect='General American | US'. Transcriptions which lack a dialect specification are selected regardless of the value of this flag.
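
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated semantics, not WikiPron's actual implementation; the function names parse_dialect_flag and is_selected are hypothetical.

```python
from typing import Iterable, Set


def parse_dialect_flag(value: str) -> Set[str]:
    """Split a --dialect value such as 'General American | US' into names."""
    return {name.strip() for name in value.split("|") if name.strip()}


def is_selected(labels: Iterable[str], dialect_flag: str) -> bool:
    """Decide whether a transcription with the given dialect labels is kept.

    Hypothetical helper: transcriptions without any dialect label are always
    kept; labeled ones are kept if at least one label was requested.
    """
    labels = list(labels)
    if not labels:
        return True
    wanted = parse_dialect_flag(dialect_flag)
    return any(label in wanted for label in labels)
```

With --dialect='General American | US', a transcription labeled "(UK, US)" would be kept because one of its labels matches, and so would a transcription with no label at all.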

Segmentation

By default, the segments library is used to split the transcription into whitespace-separated segments. The segmentation tends to keep IPA diacritics and modifiers attached to their "parent" symbol. For instance, [kʰæt] is rendered kʰ æ t. This can be disabled using the --no-segment flag.
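
To make the idea of attaching diacritics to a "parent" symbol concrete, here is a simplified standard-library sketch. It is not the algorithm used by the segments library: it assumes that every Unicode combining mark (category Mn) and modifier letter (category Lm) belongs to the preceding symbol, and that a tie bar joins the following symbol as well. Stress marks, which precede their syllable, are deliberately not handled. The function name segment is hypothetical.

```python
import unicodedata

# Tie bars (above and below) link two symbols into one segment, e.g. t͡s.
TIE_BARS = {"\u0361", "\u035c"}


def segment(pron: str) -> str:
    """Return the transcription with segments separated by single spaces."""
    segments = []
    join_next = False  # True right after a tie bar
    for char in pron:
        attaches = unicodedata.category(char) in ("Mn", "Lm")
        if segments and (attaches or join_next):
            # Diacritic, modifier, or second half of a tied pair.
            segments[-1] += char
        else:
            # A new base symbol (or a leading modifier with no parent).
            segments.append(char)
        join_next = char in TIE_BARS
    return " ".join(segments)
```

Under these assumptions, segment("kʰæt") gives kʰ æ t, matching the example above. A transcription that starts with a modifier simply begins a segment of its own instead of raising an error.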

Parentheses

Some transcriptions contain parentheses to indicate alternative pronunciations. The parentheses (but not the content) are discarded in the scrape unless the --no-skip-parens flag is used.

Output

The scraped data is organized with each pair on its own line, where the word and pronunciation are separated by a tab. Note that the pronunciation is in International Phonetic Alphabet (IPA), segmented by spaces that correctly handle the combining and modifier diacritics for modeling purposes, e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv
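
A saved file can then be read back with a few lines of Python. The sketch below assumes only the output format described above (one pair per line, a tab between word and pronunciation, spaces between segments); the function name read_pairs is hypothetical and not part of WikiPron.

```python
from typing import Iterable, Iterator, List, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word, segments) for each non-empty tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, then split the pronunciation on spaces.
        word, pron = line.split("\t", 1)
        yield word, pron.split()


# In-memory stand-in for two lines of a scraped TSV file.
sample = ["accrescent\ta k ʁ ɛ s ɑ̃\n", "accrétion\ta k ʁ e s j ɔ̃\n"]
pairs = list(read_pairs(sample))
```

To process a real file, pass an open file object instead of the sample list, e.g. open("fra.tsv", encoding="utf-8").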

Advanced Options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
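
As one possible loop body, the pairs can be written to a TSV stream in the same layout the command-line tool prints. This is a sketch: write_tsv is a hypothetical helper, and the demonstration uses a hand-made list instead of wikipron.scrape so that it runs without network access.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_tsv(pairs: Iterable[Tuple[str, str]], sink: TextIO) -> int:
    """Write word/pronunciation pairs, one per line, and return the count."""
    count = 0
    for word, pron in pairs:
        print(f"{word}\t{pron}", file=sink)
        count += 1
    return count


# Hand-made stand-in for the (word, pron) pairs a scrape would yield.
fake_pairs = [("accrescent", "a k ʁ ɛ s ɑ̃"), ("accrétion", "a k ʁ e s j ɔ̃")]
buffer = io.StringIO()
written = write_tsv(fake_pairs, buffer)
```

With real data, the same helper could be called as write_tsv(wikipron.scrape(config), sink), where sink is a file opened for writing with UTF-8 encoding.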

Data

We also make available a database of over 3 million word/pronunciation pairs mined using WikiPron.

Models

We host grapheme-to-phoneme models and modeling software in a separate repository.

Development

Repository

The source code of WikiPron is hosted on GitHub at https://github.com/CUNY-CL/wikipron, where development also happens.

To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:

  1. Create a fork of the wikipron repo on your GitHub account.

  2. Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

  3. Download and install the library in the "editable" mode together with the core and dev dependencies within the virtual environment:

    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    pip install -U pip setuptools
    pip install -r requirements.txt
    pip install --no-deps -e .

We keep track of notable changes in CHANGELOG.md.

Contribution

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.

Please note that Wiktionary data in the data/ directory has its own licensing terms.

Comments
  • Potential problem in _parse_combining_modifiers()

    I started the second big scrape and while scraping for phonetic data from Albanian, Wikipron threw an error, the last line of which I'll reproduce below:

    File ".../wikipron/config.py", line 73, in _parse_combining_modifiers
      last_char = chars.pop()
    IndexError: pop from empty list

    The final line in the Albanian phonetic tsv is herë h ɛː ɾ meaning the scrape likely failed on this entry which contains what looks like word initial aspiration.

    I guess for words like the one that caused this error we would want to combine with next char ʰi d r ɔ ɟ ɛ n?

    bug 
    opened by lfashby 22
  • TSV files for all Wiktionary languages with over 1000 entries

    - Adds tsv files in wikipron/languages/wikipron/tsv_files
    - Adds a readme in wikipron/languages/wikipron

    The tsv file names are formatted as such: iso639-2(B)code_phonetic/phonemic (If the language only has an iso639-3 code then that code is used. If a language doesn't have any phonetic entries on Wiktionary, then it will not have a phonetic file. Same goes for phonemic.)

    The readme tsv file link links to the file (phonetic or phonemic) with more entries. I tried to determine whether or not to apply case-folding for each language, but may have gotten it wrong for a few languages. If you see any instances in the readme where I incorrectly applied or failed to apply case-folding then let me know and I can rerun those languages if need be.

    I will add Russian once I have it, but it may take quite a long time to get it. I can also add all languages with more than 100 but less than 1000 entries in the same pull request as Russian if you'd like those files as well. I can submit a pull request for the code that generated all these files after that.

    opened by lfashby 21
  • [arm] can't use wikipron because of potential readtimeout? Can we use a wiktionary dump?

    Hello

    I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing; no errors are even thrown. I suspect the code is hitting a readtimeout error and then skipping it.

    I suspect there's a readtimeout error because in the past I used another Wiktionary extractor, Wiktextract, and it took 9-12 hours to scrape the Armenian words (just 17k words). I suspect that the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it eventually managed to work. Can WikiPron work over a Wiktionary dump, or does it need an active internet connection?

    enhancement 
    opened by jhdeov 16
  • [arm] issues in the phones list

    Wonderful resource.

    There are some errors in the phone list for the Armenian dialects. A lot of the errors got cleaned up since you last scraped them.

    The way that the Wiktionary contributors for Armenian work is that they do the following:

    1. They take the orthographic form like գրպան (transliteration is <grban>, pronunciation is [gərbɑn])
    2. They manually take the orthographic form and rewrite it with Armenian letters in order to apply any phonological rules like schwa insertion: գրպան to գըրպան <gərban>
    3. They then use this script to convert from the rewritten form (2) into the IPA for Eastern and Western Armenian
    4. This means that the Western entries are almost all redundant and automatically derived from the Eastern entries. A lot of words actually include both IPA entries together.

    For Eastern (EA)

    • You're right that the <ա> grapheme is /ɑ/. The script automatically converts the rewritten grapheme to [ɑ]. Some people had manually written the pronunciation entries and used non-IPA symbols like [a]. But they've mostly gotten cleaned up. A lot got cleaned before August. Then I personally scraped Wiktionary with another python package, found some overlooked [a]'s, and I cleaned them up too.

    • Same issue with <ո> actually being [ɔ] but sometimes incorrectly written as [o].

    • Armenian doesn't have phonemic geminates. What you see is that the orthography has a sequence of identical consonants like փթթել <pttel> [pəttel]. The script automatically converts a sequence of identical segments into a geminate with the [ː] diacritic. So if the rewritten form is փըթթել <pəttel>, then the outputted pronunciation is [pət:el]. If the rewritten form is փըթել <pətel>, then the automatic output is [pətel]. IMO, it would make more sense if the automatic transcription was with doubled segments [pəttel] because Armenian lacks phonemic length. So all the consonants have a possible geminated/lengthened/doubled form. It might just be an accidental gap in the Wiktionary data that you don't have geminate t͡s: and others.

    • The back fricatives are in free variation between velar and uvular, with more tendency towards uvular. The automatic transcription uses χ, so any [x] that you see is just an error (which I think also got cleaned up in around summer).

    • What's the issue with missing tie-bars?

      • TODO: for future cleanup, missing tie-bars on segments like: <ց> [tʃʰ] <չ> [tsʰ]

    For Western (WA)

    • Any [r] that you see is an error. The orthography has two rhotic graphemes ր ռ that are both pronounced the same, as a flap. The trill is just someone manually writing it. That counts as an error and I think most of them got cleaned up on the Wiktionary entries.
    • Yup you're right about the mergers.
      • TODO: supposedly the following mergers have occurred in this dialect: aspirated > aspirated; voiced > aspirated; voiceless unaspirated > voiced
    • The W dialect is also developing a devoicing rule for clusters of orthographic voiced+voiceless consonants. The automatic transcription covers that though.
    language support 
    opened by jhdeov 16
  • Negative flags are renamed to positive statements (#141)

    This pull request is for #141. Negative flags in cli.py are renamed to positive statements. To accommodate this change, wikipron/config.py and tests/test_wikipron/test_config.py are also edited accordingly.

    • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
    opened by yeonju123 15
  • [mdf] Scrape Moksha + slightly more flexible default pron selector.

    Strangely enough some (but not all) Moksha pages are not standard. The pronunciations don't reside under the regular list item elements ("li"), but under the paragraph elements ("p"). For an example, see the page for ала

    Please let me know whether you'd prefer a custom extractor for this rather than changing the default template.

    • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
    opened by agutkin 14
  • [geo] inconsistencies

    The Wikipedia transcription guidelines say that the high front phoneme is /i/ but many transcriptions have /ɪ/ instead. We should fix this upstream (e.g., on Wiktionary itself) and rescrape.

    See issue for more context.

    language support 
    opened by kylebgorman 14
  • Add config options to big scrape, separate scraping and writing within big scrape.

    (Sorry for the wall of text...) These changes are meant to address issues #66, #67, #68 as well as a few suggestions made in the comments of pull request #61.

    Here are the larger changes introduced by this pull request:

    • Separated the scraping and writing part of scrape.py (formerly scrape_and_write.py).

      • write.py now generates the README table by inspecting the contents of the tsv/ directory. In the process it also creates a tsv readme_tsv.tsv with similar information as is in the README table.
    • Added no_stress, no_syllable_boundaries, and cut_off_date options to languages in languages.json.

      • Modified codes.py to specify default values for these options when adding new languages to languages.json and to copy over previously set values for these options.
      • cut_off_date should now be set in codes.py prior to running codes.py, I’ve updated the README in languages/wikipron with those instructions.
    • Added dialect config option (and require_dialect_label option) to English, Spanish and Portuguese.

      • Restructured scrape.py to handle when one or more dialects are specified for a language. (Ran this new code on Portuguese, because it is a smaller language, to generate some sample data.)
      • README table now includes dialect information in Wiktionary language name column

    The only small changes worth noting are:

    • Logging in scrape.py will now also output to scraping.log which I’ve added to .gitignore. This way finding the languages that failed to be scraped is a bit easier (don’t need to scroll through the console). It also outputs the language dict from languages.json in the error message for languages that failed to be scraped within our set amount of retries, so it is a bit easier to build a temporary languages.json with the failed languages.
    • scrape.py will now remove files with less than 100 entries. (TSVs with less than 100 entries have been removed.)

    I have a few questions regarding dialects that I'd like your thoughts on:

    • How dialects are handled in languages.json.

      • I added dialects to languages.json in the following way:
        "por": {
            ...
            "require_dialect_label": true,
            "dialect": {
                "_bz": "Brazil",
                "_po": "Portugal"
            },
            ...
        },
        
      • The keys (_bz, _po) in dialect serve as a sort of extension when naming the dialect tsv files (por_bz_phonetic.tsv, for example) and help with easy access to the dialect strings ("Brazil") in write.py. If you'd like me to change any of the keys because you'd like different extensions for certain dialects let me know. They can be longer than two letters. I'll provide links to the English, Spanish, and Portuguese entries in languages.json as a separate comment so you can review the keys and dialect strings I'm using.
      • Is there a process for finding which dialects are frequently used within a given Wiktionary language category? (Aside from just checking entries and seeing whether dialect information is specified.)
    • How dialects are handled in scrape.py.

      • As written scrape.py will first scrape a language entirely and then scrape for dialects if any are specified. This means it will scrape por (Portuguese) with no dialect, then por with "Brazil" as the dialect and then por with "Portugal" as the dialect. Is there any reason to scrape for por with no dialect when we are specifying a dialect? Do we want to keep the tsvs generated from previously scraping por (or eng/spa) with no dialect?
      • Within scrape.py, I moved a lot of what was in main() to a separate function in order to handle dialects. There may well be a better way of handling dialects than the way I've tried to do it and I'm open to suggestions on how to improve it.
    opened by lfashby 14
  • Create generate_phone_summary.py

    • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

    This script automatically generates phone_summary, which has a structure similar to language_summary. It is based on generate_summary.py.

    The output of generate_phone_summary.py is a TSV file instead of README.md, since README.md is already used. I will change the output path after we discuss what would be the good path for the output.

    opened by yeonju123 13
  • Rename files from "phonemic"/"phonetic" to "broad"/"narrow"

    • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

    Closes #389

    I renamed the files with this script: link

    I only changed the filenames, so I'm sure a lot of stuff is broken at the moment…

    opened by ajmalanoski 12
  • adds unimorph data repo and download routine

    adds the json file with all unimorph files and wikipron lg names. Additionally uses download routine to grab data and logs statements to the console

    • [ ] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
    opened by reubenraff 12
  • [pam] Can't parse both types of transcriptions from the same line?

    For Kapampangan(pam) the format of all pronunciation entries looks as follows:

    Hyphenation: ba‧tia‧uan
    IPA(key): /bəˈtjawən/, [bəˈtjäː.wən]
    

    I suspect we can't parse this when both transcriptions are under the same heading. May be a duplicate.

    opened by agutkin 4
  • remove default casefolding

    Removed the statement casefold:true from the languages.json list. I rescraped hye and apw to confirm that the languages were still scraped, but now with the original case marking from Wiktionary.

    • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
    opened by jhdeov 3
  • [arm] finding IPA transcriptions outside of the Pronunciation block

    For the word կարկանդակ, wikipron finds the correct pronunciation of [kɑɾkɑndɑk] but it also finds the IPA transcriptions of other words in the Usage Notes section like [pɛrɑʃˈki]. I'm not sure if this is an unavoidable glitch from Wikipron's side, or if it's a glitch that could be fixed from the Wiktionary side.

    It seems that what's going on is that WikiPron is just finding any IPA transcription that's inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv you get a handful of IPA transcriptions that aren't associated with the pre-defined dialects. These are either a) IPA transcriptions in the Usage notes or etymology, or b) IPA transcriptions for non-standard dialects. This isn't a problem for using Wikipron on a specific language (because the person can just filter those out manually). But I wonder if this glitch causes any other funny business for the other languages.

    Side note: I wonder if there's been enough situations where people had to fix Wiktionary entries in order to optimize Wikipron's scraper (like on the various closed issues). If so, perhaps a tips and tricks page would be helpful down the line?

    opened by jhdeov 5
  • Undoing casefolding?

    The commandline lets the user choose to apply casefolding so that entries like English can be changed to either English or english. But for the scraped data on the repo, it seems you apply casefolding by default. Would it be more useful if the online data didn't do casefolding? That way,

    • If the user wanted to get the original data (with the correct cases), then they can just use the scraped data online instead of running WikiPron on the terminal
    • If the user wanted to get the casefolded data, then they can take the un-casefolded data from the repo and then apply casefolding on their own machine (a simple fast Excel function).

    Right now, if the user wants to get the original cases, then they have to run the terminal option (which takes a while).

    enhancement good first issue 
    opened by jhdeov 5
  • scraping audio files?

    Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.

    enhancement 
    opened by jhdeov 5
Releases(v1.3.0)
  • v1.3.0(Nov 28, 2022)

    [1.3.0] - 2022-11-28

    Under data/

    Added

    • Big scrape for 2022. (#464)
    • Added the --fresh flag to data/scrape/scrape.py to facilitate running the big scrape in batches. (#464)
    • Added the --exclude flag for excluding one or more languages in data/scrape/scrape.py. (#460)
    • Added data/src/normalize.py. (#356)
    • Updated README.md. (#360)
    • Added data/cg/tsv/geo.tsv. (#367)
    • Added data/morphology. (#369)
    • Added SIGMORPHON 2021 morphology data. (#375)
    • Added data/cg/tsv/jpn_hira.tsv. (#384)
    • Enforced final newlines. (#387)
    • Adds all UniMorph languages to morphology. (#393)
    • Added data/covering_grammar/tsv/fre_latn_phonemic.tsv (#398)
    • Added data/covering_grammar/lib/make_test_file.py (#396, #399)
    • Added Komi-Zyrian (kpv). (#400)
    • Added Makasar (mak). (#415, #419)
    • Added Zou (zom). (#421)
    • Added Wiyot (wiy). (#422)
    • Added Sidamo (sid). (#423)
    • Added Central Atlas Tamazight (tzm). (#429)
    • Added Chibcha (chb). (#430)
    • Added Kashmiri (kas). (#431)
    • Added Malayalam (mal). (#434)
    • Added Dhivehi (div). (#437)
    • Added Akkadian (akk). (#441)
    • Added Central Nahuatl (nhn). (#443)
    • Added Etruscan (ett). (#444)
    • Added Gujarati (guj). (#445)
    • Added Kannada (kan). (#446)
    • Added Karelian (krl). (#447)
    • Added Romagnol (rgn). (#448)
    • Added Southern Yukaghir (yux). (#449)
    • Added Urak Lawoi' (urk). (#451)
    • Added Hausa (ha). (#452)
    • Added Kashubian (csb). (#453)
    • Added Tabaru (tby). (#455)
    • Added West Makian (mqs). (#457)
    • Added Amharic (amh). (#458)
    • Added Livvi (olo). (#459)
    • Added Kalmyk (xal). (#472)
    • Added Ternate (tft). (#473)
    • Added Abkhaz (abk). (#474)
    • Added Farefare (gur). (#475)
    • Added Iban (iba). (#476)
    • Added Laz (lzz). (#477)

    Changed

    • Switched to ISO 639-3 language codes. (#468)
    • Updated scraped data in preparation for the SIGMORPHON 2022 shared task: swe nno ger dut ita rum ukr bel tgl ceb ben asm per pus tha lwl. (#461)
    • Made scripts under data/frequencies/ and data/morphology/ more flexible, especially for the purposes of preparing data for a shared task. (#461)
    • Fixed the --restriction flag for specifying multiple languages in data/scrape/scrape.py. (#460)
    • Added covering grammar coverage error log and specified error_type in error_analysis.py. (#424)
    • Added error log writing in error_analysis.py. (#420)
    • Added new columns in summary tables. (#365)
    • Fixed broken paths in data/src/generate_phones_summary.py and in data/phones/HOWTO.md. (#352)
    • Added Atong (India) (aot). (#353)
    • Added Egyptian Arabic (arz). (#354)
    • Added Lolopo (ycl). (#355)
    • Fixed Unicode normalization in data/phones/slv_phonemic.phones and re-scraped Slovenian data. (#356)
    • Updated data/phones/HOWTO.md to include instructions on applying the NFC Unicode normalization (#357)
    • Updated data/src/normalize.py to be more efficient. (#358)
    • Fixed inaccuracies in data/phones/geo_phonemic.phones. (#367)
    • Fixed typo in data/cg/tsv/geo.tsv and added missing character. (#370)
    • Morphology URLs are now provided as a list. (#376)
    • Configured and scraped Yamphu (ybi). (#380)
    • Configured and scraped Khumi Chin (cnk). (#381)
    • Made summary generation in common_characters.py optional. (#382)
    • Fixed phone counting in data/src/generate_phones_summary.py (#390, #392)
    • Reorganizes scraping scripts under data/scrape (#394)
    • Reorganizes .phones files and related scripts under data/phones (#395)
    • Reorganizes CG files and related scripts under data/covering_grammar (#395)
    • Reorganized data/phones/phones/fre_phonemic.phones (#398)
    • Removed data/src/ (#401)
    • Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead of "phonemic"/"phonetic" (#389, #402, #405)
    • Fixed typo in README.md (#407)
    • Fixed column ordering of the test file read by the script in data/covering_grammar/lib/error_analysis.py (#411)
    • Fixed Common character collection in common_characters.py (#419)
    • Scraping test fixed for blt. (#436)
    • Changed URLs to point at CUNY-CL repo, where applicable. (#438)

    Under wikipron/ and elsewhere

    Added

    • Added ckb in languagecodes.py. (#464)
    • Added support for Python 3.10. (#462)
    • Added test of phones list generation in test_data/test_summary.py (#363)
    • Added Min Nan extraction function. (#397)
    • Added Tai Dam extraction function, configuration and initial scrape. (#435)
    • Added test of casefold value for languages in data/scrape/lib/languages.json (#442)
    • Added support for Python 3.11. (#479)
    • Added checks for the Python source distribution and wheel on CI. (#479)
    • Turned on tests for Windows on CI. (#479)

    Removed

    • Dropped support for Python 3.6. (#462)
    • Dropped support for Python 3.7. (#479)

    Changed

    • Switched to ISO 639-3 language codes. (#468)
    • Converted setup.py to pyproject.toml. (#479)
  • v1.2.0(Jan 30, 2021)

    Under data/

    Added

    • Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (#311)
    • Added German whitelists and filtered TSV file. (#285)
    • Added whitelisting capabilities to postprocess. (#152)
    • Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)
    • Logged dialect configuration if specified. (#133)
    • Added typing to big scrape code. (#140)
    • Added argparse to allow limiting 'big scrape' to individual languages with --restriction flag. (#154)
    • Added Manchu (mnc). (#185)
    • Added Polabian (pox). (#186)
    • Added aar, bdq, jje, and lsi. (#202)
    • Added tyv to languagecodes.py (#203, #205)
    • Added bcl, egl, izh, ltg, azg, kir and mga to languagecodes.py. (#205)
    • Added nep to languagecodes.py. (#206)
    • Added Ingrian (izh). (#215)
    • Added French phoneme list and filtered TSV file. (#213, #217)
    • Added Corsican (cos). (#222)
    • Added Middle Korean (okm). (#223)
    • Added Middle Irish (mga). (#224)
    • Added Old Portuguese (opt). (#225)
    • Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
    • Added Tuvan (tyv). (#228)
    • Added Shan (shn) with custom extraction. (#229)
    • Added Northern Kurdish (kmr). (#243)
    • Added a script to facilitate the creation of a .phones file. (#246)
    • Added IPA validity checks for phonemes. (#248)
    • Split multiple pronunciations joined by tilde in eng_us_phonetic.
    • Added Italian phoneme list and filtered TSV file. (#260, #261)
    • Added Adyghe phone list and filtered TSV file. (#262, #263)
    • Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
    • Added Icelandic phoneme list and filtered TSV file. (#269, #270)
    • Added Slovenian phoneme list and filtered TSV file. (#271, #273)
    • Added normalization to list_phones.py. Corrected errors relating to ipapy (#275)
    • Added Welsh .phones lists and filtered TSV files. (#274, #276)
    • Added draft of covering grammar script. (#297)
    • Updated data/phones/README.md with instructions to re-scrape. (#279, #281)
    • Added Vietnamese .phones files and re-scraped and filtered .tsv files. (#278, #283)
    • Added Hindi .phones files and the re-scraped and filtered .tsv files. (#282, #284)
    • Added Old Frisian (ofs). (#294)
    • Added Dungan (dng). (#293)
    • Added Latgalian (ltg). (#296)
    • Added draft of covering grammar script. (#297)
    • Added Portuguese .phones files and re-scraped data. (#290, #304)
    • Added Japanese .phones files and re-scraped data. (#230, #307)
    • Added Moksha (mdf). (#295)
    • Added Azerbaijani .phones files and re-scraped data. (#306, #312)
    • Added Turkish .phones file and re-scraped data. (#313, #314)
    • Added Maltese .phones file and re-scraped data. (#317, #318)
    • Added Latvian .phones file and re-scraped data. (#321, #322)
    • Added Khmer .phones file and re-scraped data. (#324, #327)
    • Added Østnorsk (Bokmål) .phones file and re-scraped data. (#324, #327)
    • Several languages added to languagecodes.py. (#334)

    Changed

    • Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (#298)
    • Improved printing in the README table. (#145)
    • Renamed data directory data. (#147)
    • Split may into Latin and Arabic files. (#164)
    • Split pan into Gurmukhi and Shahmukhī. (#169)
    • Split uig into Perso-Arabic and Cyrillic. (#173)
    • Only allowed Latin spellings in Maltese lexicon. (#166).
    • Split mon into Cyrillic and Mongol Bichig (#179).
    • Merged whitelist.py into 'big scrape' script. src scrape.py now checks for existence of whitelist file during scrape to create second filtered TSV. New TSV placed under tsv/*_filtered.tsv. (#154).
    • Updated generate_summary.py to reflect presence of 'filtered' tsv. (#154)
    • Imperial Aramaic (arc) split into three scripts properly. (#187)
    • Flattened data directory structure. (#194)
    • Updated Georgian (geo) to take advantage of upstream bot-based consistency fixes. (#138)
    • Split arm into Eastern and Western dialects. (#197)
    • Rescraped files with new whitelists. (#199)
    • Updated logging statements for consistency. (#196)
    • Renamed .whitelist file extension name as .phones. (#207)
    • Split ban into Latin and Balinese scripts. (#214)
    • Split kir into Cyrillic and Arabic. (#216)
    • Split Latin (lat) into its dialects. (#233)
    • Added MyPy coverage for wikipron, tests and data directories. (#247)
    • Modified paths in codes.py, scrape.py, and split.py. (#251, #256)
    • Modified config flags in languages.json and scrape.py. (#258)
    • Edited Serbo-Croatian .phones file to list all vowel/pitch accent combinations. Re-scraped Serbo-Croatian data. (#288)
    • Moved list_phones.py to parent directory. (#265, #266)
    • Moved list_phones.py to src directory. (#297)
    • Frequencies code no longer overwrites TSV files. (#320)
    • Updated data/phones/README.md to specify that .phones files should be in NFC normalization form. (#333)
    • Kurdish (kur) and Opata (opt) removed from languages.json. (#334)
    • Re-scraped Armenian data. Fixed an error in West Armenian phone list. (#338)

    Fixed

    • Fixed path issue with phonetic whitelisted files. (#195)

    Under wikipron/ and Elsewhere

    Added

    • Added positive flags for stress, syllable boundaries, tones, segment to cli.py. (#141)
    • Added positive flags for space skipping to cli.py. (#257)
    • Added two Vietnamese dialects to languages.json. (#139)
    • Handled additional language codes. (#132, #148)
    • Added --no-skip-spaces-word and --no-skip-spaces-pron flags. (#135)
    • Allowed ASCII apostrophes (0x27) in spellings. (#172)
    • Added Vietnamese extraction function. (#181)
    • Modified pron selector in Latin extraction function. (#183)
    • Added --no-tone flag. (#188)
    • Customized extractor and new scraped prons for khb. (#219)
    • Added tests/test_data directory containing two tests. (#226, #251)
    • Added HTTP User-Agent header to API calls to Wiktionary. (#234)
    • Added support for Python 3.9. (#240)
    • Added black style formatting to .circleci/config.yml. (#242)
    • Added logging for scraping a language with --dialect specified that requires its custom extraction logic. (#245)
    • Improved CircleCI workflow with orbs. (#249)
    • Added test_split.py to tests/test_data. (#256)
    • Handled Cantonese for scraping. (#277)
    • Added exclusion for reconstructions. (#302)
    • Added Vietnamese contour tone grouping test in tests/test_config.py. (#308)
    • Added restart functionality. (#340)

    Changed

    • Renamed arguments to positive statements in wikipron/config.py and edited _get_process_pron function accordingly. (#141, #257)
    • Changed testing values used in tests/test_config.py in order to accommodate the addition of positive flags. (#141)
    • Specified UTF-8 encoding in handling text files. (#221)
    • Moved previous contents of tests into tests/test_wikipron. (#226)
    • Updated the package version numbers in requirements.txt to the latest available on PyPI. (#239)
    • Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (#295)
    • Updated segments package version to 2.2.0. (#308)
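
As an illustration of the UTF-8 change above (#221), the sketch below passes `encoding="utf-8"` explicitly when writing and reading a two-column TSV of spellings and pronunciations, so the result does not depend on the platform's default encoding. The function names, file name, and sample rows are hypothetical and are not taken from the WikiPron source.

```python
import os
import tempfile


def write_tsv(path, rows):
    """Write (word, pron) pairs as tab-separated lines, always as UTF-8."""
    with open(path, "w", encoding="utf-8") as sink:
        for word, pron in rows:
            print(f"{word}\t{pron}", file=sink)


def read_tsv(path):
    """Read tab-separated lines back into (word, pron) tuples, as UTF-8."""
    with open(path, "r", encoding="utf-8") as source:
        return [tuple(line.rstrip("\n").split("\t")) for line in source]


# Illustrative rows containing non-ASCII characters in both columns.
rows = [("chat", "ʃ a"), ("été", "e t e")]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.tsv")
    write_tsv(path, rows)
    round_tripped = read_tsv(path)
```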

    Removed

    • Moved Wiktionary querying functions from test_languagecodes.py to codes.py. (#205)
  • v1.1.0 (Mar 3, 2020)

    [1.1.0] - 2020-03-03

    Added

    • Added the extraction function for Mandarin Chinese and its scraped data. (#124)
    • Integrated Wortschatz frequencies. (#122)

    Changed

    • Updated the Japanese extraction function and Japanese data. (#129)
    • Updated all scraped Wiktionary data and frequency data. (#127, #128)
    • Generalized the splitting script in the big scrape. (#123)
    • Moved small file removal to generate_summary.py. (#119)
    • Updated Russian data. (#115)

    Fixed

    • Avoided and logged error in case of pron processing failure. (#130)