πŸ€• spelling exceptions builder for lazy people

Overview

Yaspeller Dictionary (Auto)builder

Usage

# this sample command generates `./yaspeller_report.json`
# yaspeller --report json --ignore-digits --ignore-text "'.*" --ignore-latin --only-errors --file-extensions ".md" --lang ru

python -m venv env
source env/bin/activate
pip install 
python src/dictionary.py yaspeller_report.json
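
The generated array can then be fed back to yaspeller. A minimal sketch, assuming the builder's output is saved to ./dictionary.json (the file name here is hypothetical) and using yaspeller's --dictionary option (check yaspeller --help for the exact flag):

# re-run the check with the generated exceptions
# yaspeller --dictionary ./dictionary.json --report json --ignore-digits --ignore-latin --only-errors --file-extensions ".md" --lang ru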

Why

Yaspeller is nice, but typical documentation is full of anglicisms. Usually you just want to ignore them, and the only way yaspeller offers is an array of regexps for words to ignore.

The tool generates such an array of dictionary regexps, covering every inflected form (all grammatical cases) of each flagged word, like

[
    "[Π±Π‘]Π°Π³(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΈ|ΠΎΠ²|ΠΎΠΌ|Ρƒ)?",
    "[Π΄Π”]ΠΈΡ„Ρ„(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΎΠ²|ΠΎΠΌ|Ρƒ|Ρ‹)?",
    "[кК]ΠΎΠΌΠΌΠΈΡ‚(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΎΠ²|ΠΎΠΌ|Ρƒ|Ρ‹)?",
    "[пП]Π°Ρ‚Ρ‡ΠΈΠ½Π³(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΈ|ΠΎΠ²|ΠΎΠΌ|Ρƒ)?",
    "[Ρ€Π ]убист(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΎΠ²|ΠΎΠΌ|Ρƒ|Ρ‹)?",
    "[сБ]Π°ΠΌΠΎΠΎΡ€Π³Π°Π½ΠΈΠ·ΠΎΠ²Π°Π½Π½(ΠΎΠ³ΠΎ|ΠΎΠΌ|ΠΎΠΌΡƒ|ΡƒΡŽ|Ρ‹Π΅|Ρ‹ΠΉ|Ρ‹ΠΌ|Ρ‹ΠΌΠΈ|Ρ‹Ρ…)",
    "[Ρ‚Π’]ΠΈΠΊΠ΅Ρ‚(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΎΠ²|ΠΎΠΌ|Ρƒ|Ρ‹)?",
    "ΠΊΠΎΠΌΠΌΠΈΡ‚ΠΈΡ‚ΡŒ"
]
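
Such an array can be dropped straight into the yaspeller configuration. A hedged example, assuming the dictionary option of .yaspellerrc (see the yaspeller docs for the exact key), with two entries from the array above:

{
    "lang": "ru",
    "dictionary": [
        "[Π±Π‘]Π°Π³(Π°|Π°ΠΌ|Π°ΠΌΠΈ|Π°Ρ…|Π΅|ΠΈ|ΠΎΠ²|ΠΎΠΌ|Ρƒ)?",
        "ΠΊΠΎΠΌΠΌΠΈΡ‚ΠΈΡ‚ΡŒ"
    ]
}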

These patterns are derived from the yaspeller errors; the text-format report looks like this:

Spelling check:
βœ— www.ruby-lang.org/ru/community/ruby-core/index.md 130 ms
-----
Typos: 9
1. ΠΏΠ°Ρ‚Ρ‡ΠΈΠ½Π³ΠΎΠΌ (36:27)
2. ΠΊΠΎΠΌΠΌΠΈΡ‚ΠΈΡ‚ΡŒ (68:32, suggest: ΠΊΠΎΠΌΠΈΡ‚Π΅Ρ‚)
3. Π±Π°Π³Π°Ρ… (75:15, suggest: Π±ΠΎΠ³Π°Ρ…, Π±Π°ΠΊΠ°Ρ…, Π±Π΅Π³Π°Ρ…)
4. Π±Π°Π³ΠΈ (89:24, suggest: Π±Π°Π³Π³ΠΈ)
5. Π±Π°Π³ (96:25)
6. Ρ‚ΠΈΠΊΠ΅Ρ‚ (107:14, suggest: этикСт)
7. Π΄ΠΈΡ„Ρ„ (115:18)
8. ΠΊΠΎΠΌΠΌΠΈΡ‚Ρƒ (147:24, suggest: ΠΊΠΎΠΌΠ΅Ρ‚Ρƒ, ΠΊΠΎΠΌΠ½Π°Ρ‚Ρƒ)
9. ΠΊΠΎΠΌΠΌΠΈΡ‚Π° (148:58, suggest: ΠΊΠΎΠΌΠ΅Ρ‚Π°)
-----
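
How could such patterns be produced from these errors? A minimal sketch of the idea, not the repository's actual src/dictionary.py: it assumes pymorphy2 for Russian morphology (an assumption about the backend) and collapses each word's lexeme into a stem plus an alternation of endings:

import os.path
import pymorphy2  # assumption: pip install pymorphy2; the real script may differ

morph = pymorphy2.MorphAnalyzer()

def word_pattern(word):
    """Collapse every inflected form of `word` into a single regexp."""
    forms = sorted({p.word for p in morph.parse(word)[0].lexeme})
    stem = os.path.commonprefix(forms) or word  # sketch: assumes a shared prefix
    endings = sorted({f[len(stem):] for f in forms if f != stem})
    pattern = "[{}{}]{}".format(stem[0], stem[0].upper(), stem[1:])
    if endings:
        pattern += "(" + "|".join(endings) + ")"
        if stem in forms:  # the bare stem is itself a valid form
            pattern += "?"
    return pattern

# typo words taken from the report above, reduced to their normal forms first
for typo in ["Π±Π°Π³ΠΈ", "ΠΊΠΎΠΌΠΌΠΈΡ‚Ρƒ", "Ρ‚ΠΈΠΊΠ΅Ρ‚"]:
    print(word_pattern(morph.parse(typo)[0].normal_form))

The actual builder reads the words from yaspeller_report.json (see Usage) rather than from a hard-coded list.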

Live example

Initially created for spellchecking the www.ruby-lang.org translations.

Owner
Vlad Bokov