These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
ecowater-softner is a Python library for collecting information from Ecowater water softeners.

Ecowater Softner ecowater-softner is a Python library for collecting information from Ecowater water softeners. Installation Use the package manager p

6 Dec 08, 2022
HeadHunter parser

HHparser Description Program for finding work at HeadHunter service Features Find job Parse vacancies Dependencies python pip geckodriver firefox Inst

memphisboy 1 Oct 30, 2021
Gradually automate your procedures, one step at a time

Gradualist Gradually automate your procedures, one step at a time Inspired by https://blog.danslimmon.com/2019/07/15/ Features Main Features Converts

Ross Jacobs 8 Jul 24, 2022
Hot reloading for Python

Hot reloading for Python

Olivier Breuleux 769 Jan 03, 2023
Fcpy: A Python package for high performance, fast convergence and high precision numerical fractional calculus computing.

Fcpy: A Python package for high performance, fast convergence and high precision numerical fractional calculus computing.

SciFracX 1 Mar 23, 2022
Obsidian tools - a Python package for analysing an Obsidian.md vault

obsidiantools is a Python package for getting structured metadata about your Obsidian.md notes and analysing your vault.

Mark Farragher 153 Jan 04, 2023
A dictionary that can be flattened and re-inflated

deflatable-dict A dictionary that can be flattened and re-inflated. Particularly useful if you're interacting with yaml, for example. Installation wit

Lucas Sargent 2 Oct 18, 2021
JeNot - A tool to notify you when Jenkins builds are done.

JeNot - Jenkins Notifications NOTE: under construction, buggy, and not production-ready What A tool to notify you when Jenkins builds are done. Why Je

1 Jun 24, 2022
Dependency Injector is a dependency injection framework for Python.

What is Dependency Injector? Dependency Injector is a dependency injection framework for Python. It helps implementing the dependency injection princi

ETS Labs 2.6k Jan 04, 2023
This is discord nitro code generator and checker made with python. This will generate nitro codes and checks if the code is valid or not. If code is valid then it will print the code leaving 2 lines and if not then it will print '*'.

Discord Nitro Generator And Checker ⚙️ Rᴜɴ Oɴ Rᴇᴘʟɪᴛ 🛠️ Lᴀɴɢᴜᴀɢᴇs Aɴᴅ Tᴏᴏʟs If you are taking code from this repository without a fork, then atleast

Vɪɴᴀʏᴀᴋ Pᴀɴᴅᴇʏ 37 Jan 07, 2023
Create C bindings for python automatically with the help of libclang

Python C Import Dynamic library + header + ctypes = Module like object! Create C bindings for python automatically with the help of libclang. Examples

1 Jul 25, 2022
A python package for your Kali Linux distro that find the fastest mirror and configure your apt to use that mirror

Kali Mirror Finder Using Single Python File A python package for your Kali Linux distro that find the fastest mirror and configure your apt to use tha

MrSingh 6 Dec 12, 2022
Python program for analyzing the output files of phonopy.

PhononTools Description Python program to analyze the results generated by phonopy. Using the .yaml and .dat files that phonopy generates one can plot

Harry LaBollita 8 Nov 27, 2022
ticktock is a minimalist library to view Python time performance of Python code.

ticktock is a minimalist library to view Python time performance of Python code.

Victor Benichoux 30 Sep 28, 2022
A tiny Python library for generating public IDs from integers

pids Create short public identifiers based on integer IDs. Installation pip install pids Usage from pids import pid public_id = pid.from_int(1234) #

Simon Willison 7 Nov 11, 2021
Extract XML from the OS X dictionaries.

Extract XML from the OS X dictionaries.

Joshua Olson 13 Dec 11, 2022
New time-based UUID formats which are suited for use as a database key

uuid6 New time-based UUID formats which are suited for use as a database key. This module extends immutable UUID objects (the UUID class) with the fun

26 Dec 30, 2022
A simple python script to generate an iCalendar file for the university classes.

iCal Generator This is a simple python script to generate an iCalendar file for the university classes. Installation Clone the repository git clone ht

Foad Rashidi 2 Sep 01, 2022
python script to generate color coded resistor images

Resistor image generator I got nerdsniped into making this. It's not finished at all, and the code is messy. The end goal it generate a whole E-series

MichD 1 Nov 12, 2021
PyHook is an offensive API hooking tool written in python designed to catch various credentials within the API call.

PyHook is the python implementation of my SharpHook project, It uses various API hooks in order to give us the desired credentials. PyHook Uses

Ilan Kalendarov 158 Dec 22, 2022