Yaspeller Dictionary (Auto)builder

Usage

# this sample command generates `./yaspeller_report.json`
# yaspeller --report json --ignore-digits --ignore-text "'.*" --ignore-latin --only-errors --file-extensions ".md" --lang ru

python -m venv env
source env/bin/activate
pip install 
python src/dictionary.py yaspeller_report.json

Why

Yaspeller is nice, but typical documentation contains too many anglicisms. Normally you just want to ignore them, but the only way to do that is to supply an array of regular expressions for the words to ignore.

This generates an array of dictionary words, including every inflected form (lexeme) for every grammatical case, like

[
    "[бБ]аг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[дД]ифф(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[кК]оммит(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[пП]атчинг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[рР]убист(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[сС]амоорганизованн(ого|ом|ому|ую|ые|ый|ым|ыми|ых)",
    "[тТ]икет(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "коммитить"
]

from yaspeller errors, which look like this in text format:

Spelling check:
✗ www.ruby-lang.org/ru/community/ruby-core/index.md 130 ms
-----
Typos: 9
1. патчингом (36:27)
2. коммитить (68:32, suggest: комитет)
3. багах (75:15, suggest: богах, баках, бегах)
4. баги (89:24, suggest: багги)
5. баг (96:25)
6. тикет (107:14, suggest: этикет)
7. дифф (115:18)
8. коммиту (147:24, suggest: комету, комнату)
9. коммита (148:58, suggest: комета)
-----
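
The step from report lines to a pattern such as `[бБ]аг(а|ам|...)?` can be sketched roughly as follows. This is a minimal illustration, not the code in `src/dictionary.py`: the function names `parse_typos` and `build_pattern` are hypothetical, the sketch reads the text report rather than the JSON one, and it only collapses the forms it is given. How the real tool obtains the full set of case forms is not shown here, and the sketch assumes all forms passed in belong to one word and contain letters only.

```python
import os
import re

# Matches numbered typo lines such as "3. багах (75:15, suggest: ...)".
TYPO_LINE = re.compile(r"^\d+\.\s+(\S+)\s+\(", re.MULTILINE)


def parse_typos(report_text):
    """Return the misspelled words listed in a yaspeller text report."""
    return TYPO_LINE.findall(report_text)


def build_pattern(forms):
    """Collapse inflected forms of one word into a single regex string.

    Hypothetical helper: assumes every form belongs to the same word.
    """
    forms = sorted({form.lower() for form in forms})
    if len(forms) == 1:
        # Nothing to collapse, keep the word as a literal entry.
        return forms[0]
    stem = os.path.commonprefix(forms)
    if not stem:
        raise ValueError("forms share no common stem")
    endings = sorted(form[len(stem):] for form in forms if form != stem)
    # Accept both a lowercase and an uppercase first letter.
    head = f"[{stem[0]}{stem[0].upper()}]{stem[1:]}"
    # The bare stem is itself a valid form, so the ending is optional.
    optional = "?" if stem in forms else ""
    return f"{head}({'|'.join(endings)}){optional}"


report = """Typos: 3
1. багах (75:15, suggest: богах, баках, бегах)
2. баги (89:24, suggest: багги)
3. баг (96:25)
"""
print(build_pattern(parse_typos(report)))  # [бБ]аг(ах|и)?
```

Only the forms that actually occur in the report end up in this pattern; the arrays shown above contain more endings because they list all case forms of each word.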

Live example

Initially created for spell-checking the www.ruby-lang.org translations.

🤕 spelling exceptions builder for lazy people

Owner

Vlad Bokov
