🤕 spelling exceptions builder for lazy people

Overview

Yaspeller Dictionary (Auto)builder

Usage

# this sample command generates `./yaspeller_report.json`
# yaspeller --report json --ignore-digits --ignore-text "'.*" --ignore-latin --only-errors --file-extensions ".md" --lang ru

python -m venv env
source env/bin/activate
pip install 
python src/dictionary.py yaspeller_report.json

Why

Yaspeller is nice, but typical documentation contains too many anglicisms. Normally you just want to ignore them, but the only way to do that is to supply an array of regexps for the words to ignore.

This tool generates an array of dictionary entries covering all word forms (lexemes) in all grammatical cases, like

[
    "[бБ]аг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[дД]ифф(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[кК]оммит(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[пП]атчинг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[рР]убист(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[сС]амоорганизованн(ого|ом|ому|ую|ые|ый|ым|ыми|ых)",
    "[тТ]икет(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "коммитить"
]

from yaspeller errors, which look like this in text format:

Spelling check:
✗ www.ruby-lang.org/ru/community/ruby-core/index.md 130 ms
-----
Typos: 9
1. патчингом (36:27)
2. коммитить (68:32, suggest: комитет)
3. багах (75:15, suggest: богах, баках, бегах)
4. баги (89:24, suggest: багги)
5. баг (96:25)
6. тикет (107:14, suggest: этикет)
7. дифф (115:18)
8. коммиту (147:24, suggest: комету, комнату)
9. коммита (148:58, suggest: комета)
-----
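
To illustrate the shape of the generated entries, here is a minimal sketch of how the word forms of one lexeme could be collapsed into a single regexp. It is not the code from `src/dictionary.py`: the function name `build_pattern` is hypothetical, and it assumes the full list of word forms has already been obtained elsewhere (for example from a morphological analyzer).

```python
import os
import re


def build_pattern(forms):
    """Collapse the word forms of one lexeme into a single regexp string.

    Hypothetical helper for illustration only; `forms` is assumed to be
    the complete list of inflected forms of a single word.
    """
    forms = sorted({form.lower() for form in forms})
    if len(forms) == 1:
        # A word with a single form is emitted as-is.
        return forms[0]

    # The longest common prefix of all forms serves as the stem.
    stem = os.path.commonprefix(forms)
    if not stem:
        # No shared prefix: fall back to a plain alternation.
        return "(" + "|".join(forms) + ")"

    # Everything after the stem is an ending; the empty ending is
    # expressed by making the whole group optional.
    endings = sorted({form[len(stem):] for form in forms} - {""})
    optional = "?" if stem in forms else ""

    # Accept both a lowercase and an uppercase first letter.
    head = "[{}{}]{}".format(stem[0], stem[0].upper(), stem[1:])
    return "{}({}){}".format(head, "|".join(endings), optional)


bug_forms = ["баг", "бага", "багу", "багом", "баге",
             "баги", "багов", "багам", "багами", "багах"]
pattern = build_pattern(bug_forms)
print(pattern)

# Every listed form, capitalized or not, should match the pattern.
assert all(re.fullmatch(pattern, form) for form in bug_forms)
assert re.fullmatch(pattern, "Багами")
```

Sorting the endings keeps the output stable between runs, so regenerating the dictionary produces small, reviewable diffs.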

Live example

Initially created for spellchecking the www.ruby-lang.org translations.

Owner
Vlad Bokov