find-unicode-control

These scripts look for non-printable Unicode characters in all text files in a source tree. find_unicode_control.py should work with Python 2 as well as Python 3. It uses python-magic, if available, to determine each file's type; otherwise it spawns the file --mime-type command. The scripts should be functionally the same, and find_unicode_control.py could eventually be retired.
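The file-type check described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the helper names guess_mime_type and is_text are hypothetical, the magic module is assumed to be the python-magic package (which provides magic.from_file), and treating every text/* type as scannable is a guess.

```python
import subprocess


def guess_mime_type(path):
    """Return the MIME type of `path`, for example 'text/x-c'."""
    try:
        # python-magic, when installed, avoids spawning a process per file.
        import magic
        return magic.from_file(path, mime=True)
    except ImportError:
        # Fallback: ask file(1). The -b flag drops the leading "path: " prefix.
        out = subprocess.check_output(['file', '--mime-type', '-b', path])
        return out.decode('utf-8', 'replace').strip()


def is_text(mime_type):
    """Assumption: every text/* MIME type counts as a text file."""
    return mime_type.startswith('text/')
```

A caller would scan a file only when is_text(guess_mime_type(path)) is true.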

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-d] [-t] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If Unicode bidirectional (BIDI) control characters or other non-printable characters are found in a file, the script prints output like this:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
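For readers curious how a per-file result like the sets above could be computed, here is a minimal sketch. It is not taken from the script: the function name suspicious_chars is hypothetical, the list of bidirectional controls is an assumed one (LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI), and "all non-printable" is assumed to mean Unicode categories Cc and Cf minus tab, newline and carriage return.

```python
import unicodedata

# Assumed list of explicit bidirectional formatting characters.
BIDI_CONTROLS = set(u'\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069')


def suspicious_chars(text, mode='bidi'):
    """Return the set of suspicious characters found in `text`.

    mode='bidi' reports only bidirectional controls; mode='all' reports
    every control (Cc) or format (Cf) character except ordinary whitespace.
    """
    if mode == 'bidi':
        return BIDI_CONTROLS.intersection(text)
    return {ch for ch in text
            if unicodedata.category(ch) in ('Cc', 'Cf')
            and ch not in u'\t\n\r'}
```

Returning a set mirrors the quick-scan output, which lists each character once per file regardless of how often it occurs.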

With the -d flag, the output is more detailed and shows the line numbers where the characters occur, but this mode is also slower:

$ find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
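The detailed mode can be approximated by walking the text line by line. The sketch below is an assumption about the approach, not the script's code; scan_lines is a hypothetical name and the character list is assumed. It keeps repeated characters in order of appearance, as the sample output above appears to.

```python
# Assumed list of explicit bidirectional formatting characters.
BIDI_CONTROLS = set(u'\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069')


def scan_lines(text):
    """Yield (line_number, characters) for each line containing bidi controls.

    Lines are split on '\n' only, so the numbers match what most editors
    show; `characters` keeps duplicates in order of appearance.
    """
    for number, line in enumerate(text.split(u'\n'), start=1):
        found = [ch for ch in line if ch in BIDI_CONTROLS]
        if found:
            yield number, found
```

Checking every line separately is one plausible reason the detailed mode is slower than a single whole-file check.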

The most efficient workflow is to run a quick scan over the source tree first and, if it reports any issues, run a detailed scan on only the affected files.
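This two-pass workflow could be automated along the following lines. The sketch assumes the quick scan was run with -p bidi and without -v, so that every relevant line has the "file: bidirectional control characters: ..." shape shown earlier; the helper names flagged_files and detailed_command are hypothetical.

```python
MARKER = ': bidirectional control characters: '


def flagged_files(quick_output):
    """Extract the file names reported by a quick (non -d) bidi scan."""
    files = []
    for line in quick_output.splitlines():
        name, sep, _ = line.partition(MARKER)
        # Skip lines without the marker (warnings, blank lines) and repeats.
        if sep and name not in files:
            files.append(name)
    return files


def detailed_command(files, script='find_unicode_control.py'):
    """Build the argument list for a detailed scan of `files` only."""
    return ['python3', script, '-p', 'bidi', '-d'] + list(files)
```

The resulting list can be passed to subprocess.call once flagged_files returns a non-empty result.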

Configuration file

If files need to be excluded from the scan, create a configuration file and set a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list of MIME types to ignore; its entries can also be regular expressions. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']
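To illustrate how such lists might be applied, here is a small sketch. It assumes the patterns are matched anywhere in the path with re.search, which fits patterns such as r'/iconvdata/testdata/' but is not confirmed by the text above. The scan_exclude_mime entry and the is_excluded helper are hypothetical examples.

```python
import re

scan_exclude = [
    r'/iconvdata/testdata/',
    r'libio/tst-widetext.input$',
]

# Hypothetical entry: ignore every application/* MIME type.
scan_exclude_mime = [r'application/.*']


def is_excluded(path, mime_type):
    """Return True if either the path or the MIME type matches an exclusion."""
    return (any(re.search(p, path) for p in scan_exclude) or
            any(re.search(p, mime_type) for p in scan_exclude_mime))
```

Because the patterns are regular expressions, an unescaped dot as in tst-widetext.input matches any character; that is harmless here but worth keeping in mind when writing stricter rules.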

Notes

This script was quickly hacked together to scan repositories with mostly left-to-right (LTR) Unicode content. If you have right-to-left (RTL) content, whether in comments, literals or even identifiers in code, it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude. A long-term wish-list item is to handle RTL a bit more intelligently, if this script remains relevant; hopefully more sophisticated RTL diagnostics will make it obsolete.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh