KaziText is a tool for modelling common human errors.

Related tags

Deep Learningkazitext
Overview

KaziText

KaziText is a tool for modelling common human errors. It estimates probabilities of individual error types (so called aspects) from grammatical error correction corpora in M2 format.

The tool was introduced in Understanding Model Robustness to User-generated Noisy Texts.

Requirements

A set of requirements is listed in requirements.txt. Moreover, UDPipe model has to be downloaded for used languages (see http://hdl.handle.net/11234/1-3131) and linked in udpipe_tokenizer.py.

Overview

KaziText defines a set of aspects located in aspects. These model following phenomena:

  • Casing Errors
  • Common Other Errors (for most common phrases)
  • Errors in Diacritics
  • Punctuation Errors
  • Spelling Errors
  • Errors in wrongly used suffix/prefix
  • Whitespace Errors
  • Word-Order Errors

Each aspect has a set of internal probabilities (e.g. the probability of a user typing first letter of a starting word in lower-case instead of upper-case) that are estimated from M2 GEC corpora.

A complete set of aspects with their internal probabilities is called profile. We provide precomputed profiles for Czech, English, Russian and German in profiles as json files. The profiles are additionally split into dev and test. Also there are 4 profiles for Czech and 2 profiles for English differing in the underlying user domain (e.g. natives vs second learners).

To noise a text using a profile, use:

python introduce_errors.py $infile $outfile $profile $lang 

introduce_errors.py script offers a variety of switches (run python introduce_errors.py --help to display them). One noteworthy is --alpha that serves for regulating final text error rate (set it to value lower than 1 to reduce number of errors; set to to value bigger than 1 to have more noisy texts). Apart for profiles themselves, we also precomputed set of alphas that are stored as .csv files in respective profiles folders and store values for alphas to reach 5-30 final text word error rates as well as so called reference-alpha word error rate that corresponds to the same error rate as the original M2 files the profile was estimated from had. To have for example noisy text at circa 5% word error rate noised by Romani profile, use --profile dev/cs_romi.json --alpha 0.2.

Moreover, we provide several scripts (noise*.py) for noising specific data formats.

To estimate a profile for given M2 file, run:

python estimate_all_ratios.py $m2_pattern outfile

To estimate normalization alphas file, see estimate_alpha.sh that describes iterative process of noising clean texts with an alpha, measuring text's noisiness and changing alpha respectively.

Other notes

  • Russian RULEC-GEC was normalized using normalize_russian_m2.py
Owner
ÚFAL
Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
ÚFAL
Datasets, Transforms and Models specific to Computer Vision

vision Datasets, Transforms and Models specific to Computer Vision Installation First install the nightly version of OneFlow python3 -m pip install on

OneFlow 68 Dec 07, 2022
This is the latest version of the PULP SDK

PULP-SDK This is the latest version of the PULP SDK, which is under active development. The previous (now legacy) version, which is no longer supporte

78 Dec 07, 2022
ContourletNet: A Generalized Rain Removal Architecture Using Multi-Direction Hierarchical Representation

ContourletNet: A Generalized Rain Removal Architecture Using Multi-Direction Hierarchical Representation (Accepted by BMVC'21) Abstract: Images acquir

10 Dec 08, 2022
The Official Implementation of the ICCV-2021 Paper: Semantically Coherent Out-of-Distribution Detection.

SCOOD-UDG (ICCV 2021) This repository is the official implementation of the paper: Semantically Coherent Out-of-Distribution Detection Jingkang Yang,

Jake YANG 62 Nov 21, 2022
3D Human Pose Machines with Self-supervised Learning

3D Human Pose Machines with Self-supervised Learning Keze Wang, Liang Lin, Chenhan Jiang, Chen Qian, and Pengxu Wei, “3D Human Pose Machines with Self

Chenhan Jiang 398 Dec 20, 2022
MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python

MNE-Python MNE-Python software is an open-source Python package for exploring, visualizing, and analyzing human neurophysiological data such as MEG, E

MNE tools for MEG and EEG data analysis 2.1k Dec 28, 2022
A general 3D Object Detection codebase in PyTorch.

Det3D is the first 3D Object Detection toolbox which provides off the box implementations of many 3D object detection algorithms such as PointPillars, SECOND, PIXOR, etc, as well as state-of-the-art

Benjin Zhu 1.4k Jan 05, 2023
A solution to the 2D Ising model of ferromagnetism, implemented using the Metropolis algorithm

Solving the Ising model on a 2D lattice using the Metropolis Algorithm Introduction The Ising model is a simplified model of ferromagnetism, the pheno

Rohit Prabhu 5 Nov 13, 2022
Source code of our work: "Benchmarking Deep Models for Salient Object Detection"

SALOD Source code of our work: "Benchmarking Deep Models for Salient Object Detection". In this works, we propose a new benchmark for SALient Object D

22 Dec 30, 2022
使用深度学习框架提取视频硬字幕;docker容器免安装深度学习库,使用本地api接口使得界面和后端识别分离;

extract-video-subtittle 使用深度学习框架提取视频硬字幕; 本地识别无需联网; CPU识别速度可观; 容器提供API接口; 运行环境 本项目运行环境非常好搭建,我做好了docker容器免安装各种深度学习包; 提供windows界面操作; 容器为CPU版本; 视频演示 https

歌者 16 Aug 06, 2022
Stochastic Scene-Aware Motion Prediction

Stochastic Scene-Aware Motion Prediction [Project Page] [Paper] Description This repository contains the training code for MotionNet and GoalNet of SA

Mohamed Hassan 31 Dec 09, 2022
A framework for joint super-resolution and image synthesis, without requiring real training data

SynthSR This repository contains code to train a Convolutional Neural Network (CNN) for Super-resolution (SR), or joint SR and data synthesis. The met

83 Jan 01, 2023
李云龙二次元风格化!打滚卖萌,使用了animeGANv2进行了视频的风格迁移

李云龙二次元风格化!一键star、fork,你也可以生成这样的团长! 打滚卖萌求star求fork! 0.效果展示 视频效果前往B站观看效果最佳:李云龙二次元风格化: github开源repo:李云龙二次元风格化 百度AIstudio开源地址,一键fork即可运行: 李云龙二次元风格化!一键fork

oukohou 44 Dec 04, 2022
Deep learning with dynamic computation graphs in TensorFlow

TensorFlow Fold TensorFlow Fold is a library for creating TensorFlow models that consume structured data, where the structure of the computation graph

1.8k Dec 28, 2022
Semantic Scholar's Author Disambiguation Algorithm & Evaluation Suite

S2AND This repository provides access to the S2AND dataset and S2AND reference model described in the paper S2AND: A Benchmark and Evaluation System f

AI2 54 Nov 28, 2022
Fashion Landmark Estimation with HRNet

HRNet for Fashion Landmark Estimation (Modified from deep-high-resolution-net.pytorch) Introduction This code applies the HRNet (Deep High-Resolution

SVIP Lab 91 Dec 26, 2022
Contrastive Learning for Metagenomic Binning

CLMB A simple framework for CLMB - a novel deep Contrastive Learningfor Metagenomic Binning Created by Pengfei Zhang, senior of Department of Computer

1 Sep 14, 2022
Sandbox for training deep learning networks

Deep learning networks This repo is used to research convolutional networks primarily for computer vision tasks. For this purpose, the repo contains (

Oleg Sémery 2.7k Jan 01, 2023
Evolving neural network parameters in JAX.

Evolving Neural Networks in JAX This repository holds code displaying techniques for applying evolutionary network training strategies in JAX. Each sc

Trevor Thackston 6 Feb 12, 2022
DetCo: Unsupervised Contrastive Learning for Object Detection

DetCo: Unsupervised Contrastive Learning for Object Detection arxiv link News Sparse RCNN+DetCo improves from 45.0 AP to 46.5 AP(+1.5) with 3x+ms trai

Enze Xie 234 Dec 18, 2022