A collection of pre-commit hooks for handling text files.

Overview

texthooks

A collection of pre-commit hooks for handling text files.

In particular, hooks for handling unicode characters which may be undesirable in a repository.

Usage with pre-commit

To use with pre-commit, include this repo and the desired hooks in .pre-commit-config.yaml:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
    - id: fix-ligatures

Standalone Usage

Each hook is usable as a CLI script. Simply

pip install texthooks

and then invoke, e.g.

fix-smartquotes FILENAME

Supported Hooks

fix-smartquotes

This fixes copy-paste from some applications which replace double-quotes with curly quotes. It does not convert corner brackets, braile quotation marks, or angle quotation marks. Those characters are not typically the result of copy-paste errors, so they are allowed.

Low quotation marks vary in usage and meaning by language, and some languages use quotation marks which are facing "outwards" (opposite facing from english). For the most part, these and exotic characters (double-prime quotes) are ignored.

In files with the offending marks, they are replaced and the run is marked as failed.

Overriding Quotation Characters

Two options are available for specifying exactly which characters will be replaced. For ease of use, they are specified as hex-encoded unicode codepoints.

Suppose you wanted to avoid replacing the "Heavy single comma quotation mark ornament" (275C) and the "Heavy single turned comma quotation mark ornament" (275B) characters. You could override the single quote codepoints as follows:

- repo: https://github.com/sirosen/texthooks
  rev: 0.1.0
  hooks:
    - id: fix-smartquotes
      # replace default single quote chars with this set:
      # apostrophe, fullwidth apostrophe, left single quote, single high
      # reversed-9 quote, right single quote
      args: ["--single-quote-codepoints", "0027,FF07,2018,201B,2019"]

fix-ligatures

Automatically find and replace ligature characters with their ascii equivalents.

This replaces liguatures which may be created by programs like LaTeX for presentation with their strictly-equivalent ASCII counterparts. For example, fi and ff may be ligature-ized.

This hook converts these back into ASCII so that tools like grep will behave as expected.

forbid-bidi-controls

This is checker which forbids the use of unicode bidirectional text control characters.

These are directional formatting characters which can be used to construct text with unexpected or unclear semantics. For example, in programming languages which allow bidirectional text in statements, "X" = ייִדיש can be written with right-to-left reversal to mean that the variable ייִדיש is assigned a value of "X".

CHANGELOG

0.2.2

  • Fix a bug in CLI argument handling for all hooks

0.2.1

  • Fix a typo in forbid-bidi-controls entrypoint

0.2.0

  • Add the forbid-bidi-controls hook
  • Adjust the handling of file encodings. Files will be read with UTF-8 encoding by default in most cases.

0.1.0

  • Initial release with fix-ligatures and fix-smartquotes hooks
Owner
Stephen Rosen
Stephen Rosen
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 08, 2023
A Python3 script that simulates the user typing a text on their keyboard.

A Python3 script that simulates the user typing a text on their keyboard. (control the speed, randomness, rate of typos and more!)

Jose Gracia Berenguer 3 Feb 22, 2022
Utility for Text Normalisation or Inverse Normalisation

Text Processor Text Normalisation or Inverse Normalisation for Indonesian, e.g. measurements "123 kg" - "seratus dua puluh tiga kilogram" Currency/Mo

Cahya Wirawan 2 Aug 11, 2022
An anthology of a variety of tools for the Persian language in Python

An anthology of a variety of tools for the Persian language in Python

Persian Tools 106 Nov 08, 2022
Vector space based Information Retrieval System for Text Processing - Information retrieval

Information Retrieval: Text Processing Group 13 Sequence of operations Install Requirements Add given wikipedia files to the corpus directory. Downloa

1 Jan 01, 2022
Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their native language.

Find a Doc - Localization Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their n

Our Japan Life 18 Dec 19, 2022
Maiden & Spell community player ranking based on tournament data.

MnSRank Maiden & Spell community player ranking based on tournament data. Why? 2021 just ended and this seemed like a cool idea. Elo doesn't work well

Jonathan Lee 1 Apr 20, 2022
A Python app which can convert normal text to Handwritten text.

Text to HandWritten Text ✍️ Converter Watch Tutorial for this project Usage:- Clone my repository. Open CMD in working directory. Run following comman

Kushal Bhavsar 5 Dec 11, 2022
PyNews 📰 Simple newsletter made with python 🐍🗞️

PyNews 📰 Simple newsletter made with python Install dependencies This project has some dependencies (see requirements.txt) that are not included in t

Luciano Felix 4 Aug 21, 2022
The bot creates hashtags for user's texts in Russian and English.

telegram_bot_hashtags The bot creates hashtags for user's texts in Russian and English. It is a simple bot for creating hashtags. NOTE file config.py

Yana Davydovich 2 Feb 12, 2022
Production First and Production Ready End-to-End Keyword Spotting Toolkit

WeKws Production First and Production Ready End-to-End Keyword Spotting Toolkit. The goal of this toolkit it to... Small footprint keyword spotting (K

222 Dec 30, 2022
utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresses and hashtags.

utoken utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresse

Ulf Hermjakob 11 Jan 05, 2023
A query extract python package

A query extract python package

Fayas Noushad 4 Nov 28, 2021
Split large XML files into smaller ones for easy upload

Split large XML files into smaller ones for easy upload. Works for WordPress Posts Import and other XML files.

Joseph Adediji 1 Jan 30, 2022
An online markdown resume template project, based on pywebio

An online markdown resume template project, based on pywebio

极简XksA 5 Nov 10, 2022
Repositori untuk belajar pemrograman Python dalam bahasa Indonesia

Python Repositori ini berisi kumpulan dari berbagai macam contoh struktur data, algoritma dan komputasi matematika yang diimplementasikan dengan mengg

Bellshade 111 Dec 19, 2022
Shows twitch pay for any streamer from Twitch leaked CSV files.

twitch_leak_csv_reader Shows twitch pay for any streamer from Twitch leaked CSV files. Requirements: You need python3 (you can install python 3 from o

5 Nov 11, 2022
Text to ASCII and ASCII to text

Text2ASCII Description This python script (converter.py) contains two functions: encode() is used to return a list of Integer, one item per character

4 Jan 22, 2022
TextStatistics - Get a text file wich contains English text

TextStatistics This program get a text file wich contains English text. The program analyses the text, and print some information. For this program I

2 Nov 15, 2021
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

Python User Agents user_agents is a Python library that provides an easy way to identify/detect devices like mobile phones, tablets and their capabili

Selwin Ong 1.3k Dec 22, 2022