KaziText is a tool for modelling common human errors.

Related tags

Deep Learningkazitext
Overview

KaziText

KaziText is a tool for modelling common human errors. It estimates probabilities of individual error types (so called aspects) from grammatical error correction corpora in M2 format.

The tool was introduced in Understanding Model Robustness to User-generated Noisy Texts.

Requirements

A set of requirements is listed in requirements.txt. Moreover, UDPipe model has to be downloaded for used languages (see http://hdl.handle.net/11234/1-3131) and linked in udpipe_tokenizer.py.

Overview

KaziText defines a set of aspects located in aspects. These model following phenomena:

  • Casing Errors
  • Common Other Errors (for most common phrases)
  • Errors in Diacritics
  • Punctuation Errors
  • Spelling Errors
  • Errors in wrongly used suffix/prefix
  • Whitespace Errors
  • Word-Order Errors

Each aspect has a set of internal probabilities (e.g. the probability of a user typing first letter of a starting word in lower-case instead of upper-case) that are estimated from M2 GEC corpora.

A complete set of aspects with their internal probabilities is called profile. We provide precomputed profiles for Czech, English, Russian and German in profiles as json files. The profiles are additionally split into dev and test. Also there are 4 profiles for Czech and 2 profiles for English differing in the underlying user domain (e.g. natives vs second learners).

To noise a text using a profile, use:

python introduce_errors.py $infile $outfile $profile $lang 

introduce_errors.py script offers a variety of switches (run python introduce_errors.py --help to display them). One noteworthy is --alpha that serves for regulating final text error rate (set it to value lower than 1 to reduce number of errors; set to to value bigger than 1 to have more noisy texts). Apart for profiles themselves, we also precomputed set of alphas that are stored as .csv files in respective profiles folders and store values for alphas to reach 5-30 final text word error rates as well as so called reference-alpha word error rate that corresponds to the same error rate as the original M2 files the profile was estimated from had. To have for example noisy text at circa 5% word error rate noised by Romani profile, use --profile dev/cs_romi.json --alpha 0.2.

Moreover, we provide several scripts (noise*.py) for noising specific data formats.

To estimate a profile for given M2 file, run:

python estimate_all_ratios.py $m2_pattern outfile

To estimate normalization alphas file, see estimate_alpha.sh that describes iterative process of noising clean texts with an alpha, measuring text's noisiness and changing alpha respectively.

Other notes

  • Russian RULEC-GEC was normalized using normalize_russian_m2.py
Owner
ÚFAL
Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
ÚFAL
Official implementation of Rich Semantics Improve Few-Shot Learning (BMVC, 2021)

Rich Semantics Improve Few-Shot Learning Paper Link Abstract : Human learning benefits from multi-modal inputs that often appear as rich semantics (e.

Mohamed Afham 11 Jul 26, 2022
True Few-Shot Learning with Language Models

This codebase supports using language models (LMs) for true few-shot learning: learning to perform a task using a limited number of examples from a single task distribution.

Ethan Perez 124 Jan 04, 2023
Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples

Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples This repository is the official implementation of paper [Qimera: Data-free Q

Kanghyun Choi 21 Nov 03, 2022
Official implementation for paper Render In-between: Motion Guided Video Synthesis for Action Interpolation

Render In-between: Motion Guided Video Synthesis for Action Interpolation [Paper] [Supp] [arXiv] [4min Video] This is the official Pytorch implementat

8 Oct 27, 2022
DiffQ performs differentiable quantization using pseudo quantization noise. It can automatically tune the number of bits used per weight or group of weights, in order to achieve a given trade-off between model size and accuracy.

Differentiable Model Compression via Pseudo Quantization Noise DiffQ performs differentiable quantization using pseudo quantization noise. It can auto

Facebook Research 145 Dec 30, 2022
Character Grounding and Re-Identification in Story of Videos and Text Descriptions

Character in Story Identification Network (CiSIN) This project hosts the code for our paper. Youngjae Yu, Jongseok Kim, Heeseung Yun, Jiwan Chung and

8 Dec 09, 2022
Multi-Stage Episodic Control for Strategic Exploration in Text Games

XTX: eXploit - Then - eXplore Requirements First clone this repo using git clone https://github.com/princeton-nlp/XTX.git Please create two conda envi

Princeton Natural Language Processing 9 May 24, 2022
UnsupervisedR&R: Unsupervised Pointcloud Registration via Differentiable Rendering

UnsupervisedR&R: Unsupervised Pointcloud Registration via Differentiable Rendering This repository holds all the code and data for our recent work on

Mohamed El Banani 118 Dec 06, 2022
AITUS - An atomatic notr maker for CYTUS

AITUS an automatic note maker for CYTUS. 利用AI根据指定乐曲生成CYTUS游戏谱面。 效果展示:https://www

GradiusTwinbee 6 Feb 24, 2022
Histocartography is a framework bringing together AI and Digital Pathology

Documentation | Paper Welcome to the histocartography repository! histocartography is a python-based library designed to facilitate the development of

155 Nov 23, 2022
Gray Zone Assessment

Gray Zone Assessment Get started Clone github repository git clone https://github.com/andreanne-lemay/gray_zone_assessment.git Build docker image dock

1 Jan 08, 2022
Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

ademxapp Visual applications by the University of Adelaide In designing our Model A, we did not over-optimize its structure for efficiency unless it w

Zifeng Wu 338 Dec 12, 2022
A set of tools for converting a darknet dataset to COCO format working with YOLOX

darknet格式数据→COCO darknet训练数据目录结构(详情参见dataset/darknet): darknet ├── class.names ├── gen_config.data ├── gen_train.txt ├── gen_valid.txt └── images

RapidAI-NG 148 Jan 03, 2023
CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

CLIP-Indonesian CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder joi

Galuh 17 Mar 10, 2022
Dynamical movement primitives (DMPs), probabilistic movement primitives (ProMPs), spatially coupled bimanual DMPs.

Movement Primitives Movement primitives are a common group of policy representations in robotics. There are many different types and variations. This

DFKI Robotics Innovation Center 63 Jan 06, 2023
Yas CRNN model training - Yet Another Genshin Impact Scanner

Yas-Train Yet Another Genshin Impact Scanner 又一个原神圣遗物导出器 介绍 该仓库为 Yas 的模型训练程序 相关资料 MobileNetV3 CRNN 使用 假设你会设置基本的pytorch环境。 生成数据集 python main.py gen 训练

wormtql 18 Jan 08, 2023
Stroke-predictions-ml-model - Machine learning model to predict individuals chances of having a stroke

stroke-predictions-ml-model machine learning model to predict individuals chance

Alex Volchek 1 Jan 03, 2022
A commany has recently introduced a new type of bidding, the average bidding, as an alternative to the bid given to the current maximum bidding

Business Problem A commany has recently introduced a new type of bidding, the average bidding, as an alternative to the bid given to the current maxim

Kübra Bilinmiş 1 Jan 15, 2022
Official code for: A Probabilistic Hard Attention Model For Sequentially Observed Scenes

"A Probabilistic Hard Attention Model For Sequentially Observed Scenes" Authors: Samrudhdhi Rangrej, James Clark Accepted to: BMVC'21 A recurrent atte

5 Nov 19, 2022
Pytorch implementation of set transformer

set_transformer Official PyTorch implementation of the paper Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks .

Juho Lee 410 Jan 06, 2023