These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
ColorController is a Pythonic interface for managing colors by english-language name and various color values.

ColorController.py Table of Contents Encode color data in various formats. 1.1: Create a ColorController object using a familiar, english-language col

Tal Zaken 2 Feb 12, 2022
Allows you to canibalize methods from classes effectively implementing trait-oriented programming

About This package enables code reuse in non-inheritance way from existing classes, effectively implementing traits-oriented programming pattern. Stor

1 Dec 13, 2021
cssOrganizer - organize a css file by grouping them into categories

This python project was created to scan through a CSS file and produce a more organized CSS file by grouping related CSS Properties within selectors. Created in my spare time for fun and my own utili

Andrew Espindola 0 Aug 31, 2022
A python mathematics module

A python mathematics module

Fayas Noushad 4 Nov 28, 2021
Python type-checker written in Rust

pravda Python type-checker written in Rust Features Fully typed with annotations and checked with mypy, PEP561 compatible Add yours! Installation pip

wemake.services 31 Oct 21, 2022
Color getter (including method to get random color or complementary color) made out of Python

python-color-getter Color getter (including method to get random color or complementary color) made out of Python Setup pip3 install git+https://githu

Jung Gyu Yoon 2 Sep 17, 2022
A fast Python implementation of Ac Auto Mechine

A fast Python implementation of Ac Auto Mechine

Jin Zitian 1 Dec 07, 2021
Easy compression and extraction for any compression or archival format.

Tzar: Tar, Zip, Anything Really Easy compression and extraction for any compression or archival format. Usage/Examples tzar compress large-dir compres

DanielVZ 37 Nov 02, 2022
extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTsite python Package extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install Get

laojunjun 7 Nov 21, 2022
Manage your exceptions in Python like a PRO

A linter to manage all your python exceptions and try/except blocks (limited only for those who like dinosaurs).

Guilherme Latrova 353 Dec 31, 2022
A small utility that sorts your files.

FileSorter A small utility that sorts your files. TODO: Scan directory to find files(thanks @corruptmemry for this!) Split extensions to determine fil

2 Jun 16, 2022
Personal Toolbox Package

Jammy (Jam) A personal toolbox by Qsh.zh. Usage setup For core package, run pip install jammy To access functions in bin git clone https://gitlab.com/

5 Sep 16, 2022
A hashtag from string extract python module

A hashtag from string extract python module

Fayas Noushad 3 Aug 10, 2022
πŸ’‰ μ½”λ‘œλ‚˜ μž”μ—¬λ°±μ‹  μ˜ˆμ•½ 맀크둜 μ»€μŠ€ν…€ λΉŒλ“œ (속도 ν–₯상 버전)

Korea-Covid-19-Vaccine-Reservation μ½”λ‘œλ‚˜ μž”μ—¬ λ°±μ‹  μ˜ˆμ•½ 맀크둜λ₯Ό 기반으둜 ν•œ μ»€μŠ€ν…€ λΉŒλ“œμž…λ‹ˆλ‹€. 더 λΉ λ₯Έ λ°±μ‹  μ˜ˆμ•½μ„ λͺ©ν‘œλ‘œ ν•˜λ©°, 속도λ₯Ό μš°μ„ ν•˜κΈ° λ•Œλ¬Έμ— μ‚¬μš©μžλŠ” 이에 λŒ€μ²˜κ°€ κ°€λŠ₯ν•΄μ•Ό ν•©λ‹ˆλ‹€. μ§€μ •ν•œ μ’Œν‘œ λ‚΄ λŒ€κΈ°μ€‘μΈ λ³‘μ›μ—μ„œ μž”μ—¬ λ°±μ‹ 

Queue.ri 21 Aug 15, 2022
Python script to launch burp scans automatically

SimpleAutoBurp Python script that takes a config.json file as config and uses Burp Suite Pro to scan a list of websites.

Adan Álvarez 26 Jul 18, 2022
Customized python validations.

A customized python validations.

Wilfred V. Pine 2 Apr 20, 2022
A thing to simplify listening for PG notifications with asyncpg

asyncpg-listen This library simplifies usage of listen/notify with asyncpg: Handles loss of a connection Simplifies notifications processing from mult

ANNA 18 Dec 23, 2022
Plone Interface contracts, plus basic features and utilities

plone.base This package is the base package of the CMS Plone https://plone.org. It contains only interface contracts and basic features and utilitie

Plone Foundation 1 Oct 03, 2022
This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa

Patrice Neff 2 Feb 11, 2022
A simple package for handling variables in string.

A simple package for handling string variables. Welcome! This is a simple package for handling variables in string, You can add or remove variables wi

1 Dec 31, 2021