ftfy: fixes text for you
>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
Full documentation: https://ftfy.readthedocs.org
Testimonials
- “My life is livable again!” — @planarrowspace
- “A handy piece of magic” — @simonw
- “Saved me a large amount of frustrating dev work” — @iancal
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
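- “Hat mir die Tage geholfen. Im Übrigen bin ich der Meinung, dass wir keine komplexen Maschinen mit Computern bauen sollten solange wir nicht einmal Umlaute sicher verarbeiten können. :D” — Bruno Ranieri (in English: “This helped me out the other day. By the way, I'm of the opinion that we shouldn't build complex machines with computers as long as we can't even process umlauts reliably. :D”)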
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
- “9.2/10” — pylint
Developed at Luminoso
Luminoso makes groundbreaking software for text analytics that really understands what words mean, in many languages. Our software is used by enterprise customers such as Sony, Intel, Mars, and Scotts, and it's built on Python and open-source technologies.
We use ftfy every day at Luminoso, because the first step in understanding text is making sure it has the correct characters in it!
Luminoso is growing fast and hiring. If you're interested in joining us, take a look at our careers page.
What it does
ftfy fixes Unicode that's broken in various ways.
The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code. This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn't designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.
Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.
ftfy will do everything it can to fix the problem.
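To make the "Unicode in, Unicode out" distinction concrete, here is a minimal sketch of the step that happens before ftfy gets involved: turning bytes into a Python str. The helper name `to_text` is hypothetical, and the UTF-8 default is an assumption about your data, not something ftfy decides for you.

```python
def to_text(data, encoding="utf-8"):
    """Return a str, decoding bytes with the given encoding (assumed, not detected)."""
    if isinstance(data, bytes):
        # Decoding is your code's job; it turns non-Unicode input into Unicode.
        return data.decode(encoding)
    # Already a str: pass it through unchanged.
    return data


raw = b"sch\xc3\xb6n"        # the UTF-8 bytes for "schön"
text = to_text(raw)
print(text)
```

Once you have a str like `text`, it is the kind of input that fix_text is meant to receive.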
Mojibake
The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called "mojibake"):
- The word schön might appear as schÃ¶n.
- An em dash (—) might appear as â€”.
- Text that was meant to be enclosed in quotation marks might end up instead enclosed in â€œ and â€<9d>, where <9d> represents an unprintable character.
ftfy uses heuristics to detect and undo this kind of mojibake, with a very low rate of false positives.
This part of ftfy now has an unofficial Web implementation by simonw: https://ftfy.now.sh/
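The mechanism behind mojibake can be reproduced with nothing but the standard library. The sketch below is an illustration of how the damage arises and how a single known mix-up can be reversed by hand; it is not ftfy's algorithm, which has to detect the mix-up heuristically. The function names are hypothetical, and Windows-1252 is just one assumed choice of wrong encoding.

```python
def make_mojibake(text, wrong_encoding="windows-1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled, wrong_encoding="windows-1252"):
    """Reverse make_mojibake, assuming we already know which codec was misused."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = make_mojibake("sch\u00f6n")    # "schön"
print(garbled)                           # schÃ¶n
print(undo_mojibake(garbled))            # schön
print(make_mojibake("\u2014"))           # em dash becomes â€”
```

Note that Python's strict windows-1252 codec leaves byte 0x9D undefined, so this sketch raises UnicodeDecodeError on a right curly quote (the <9d> case in the list above). Handling such cases is part of what makes the real problem harder than this two-line round trip.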
Examples
fix_text is the main function of ftfy. This section is meant to give you a taste of the things it can do. fix_encoding is the more specific function that only fixes mojibake.
Please read the documentation for more information on what ftfy does, and how to configure it for your needs.
>>> from ftfy import fix_text
>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".
>>> print(fix_text('Ã¼nicode'))
ünicode
>>> print(fix_text('Broken text… it’s flubberific!',
... normalization='NFKC'))
Broken text... it's flubberific!
>>> print(fix_text('HTML entities &lt;3'))
HTML entities <3
>>> print(fix_text('<em>HTML entities in HTML &lt;3</em>'))
<em>HTML entities in HTML &lt;3</em>
>>> print(fix_text('\001\033[36;44mI’m blue, da ba dee da ba '
... 'doo…\033[0m', normalization='NFKC'))
I'm blue, da ba dee da ba doo...
>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ'))
LOUD NOISES
>>> print(fix_text('ＬＯＵＤ　ＮＯＩＳＥＳ', fix_character_width=False))
ＬＯＵＤ　ＮＯＩＳＥＳ
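Some of the effects in the examples above can be approximated with the standard library, which may help in seeing what each one does to the text. This is only a sketch: it uses unicodedata and html directly and does not show how ftfy implements or sequences these steps. The helpers `to_fullwidth` and `nfkc` are hypothetical names introduced for this demo.

```python
import html
import unicodedata


def to_fullwidth(text):
    """Map printable ASCII to fullwidth forms (demo helper, not part of ftfy)."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")               # ideographic (fullwidth) space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # offset into the fullwidth block
        else:
            out.append(ch)
    return "".join(out)


def nfkc(text):
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(nfkc(to_fullwidth("LOUD NOISES")))     # LOUD NOISES
print(nfkc("\ufb02ubberi\ufb01c"))           # ligatures expand: flubberific
print(nfkc("doo\u2026"))                     # ellipsis expands: doo...
print(html.unescape("HTML entities &lt;3"))  # HTML entities <3
```

NFKC happens to fold fullwidth characters as well as ligatures, but as the fix_character_width example shows, ftfy treats width fixing as its own option rather than tying it to the normalization argument.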
Installing
ftfy is a Python 3 package that can be installed using pip:
pip install ftfy
(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)
If you're on Python 2.7, you can install an older version:
pip install 'ftfy<5'
You can also clone this Git repository and install it with python setup.py install.
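If you want to confirm which version ended up in your environment (for example, to check whether you got the older Python 2.7-compatible release), one possible check uses importlib.metadata from the standard library, available in Python 3.8 and later. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if ftfy is installed in this environment, else None.
print(installed_version("ftfy"))
```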
Who maintains ftfy?
I'm Robyn Speer ([email protected]). I develop this tool as part of my text-understanding company, Luminoso, where it has proven essential.
Luminoso provides ftfy as free, open source software under the extremely permissive MIT license.
You can report bugs regarding ftfy on GitHub and we'll handle them.
Citing ftfy
ftfy has been used as a crucial data processing step in major NLP research.
It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.
ftfy has a citable record on Zenodo. A citation of ftfy may look like this:
Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652
In BibTeX format, the citation is:
@misc{speer-2019-ftfy,
  author = {Robyn Speer},
  title = {ftfy},
  note = {Version 5.5},
  year = 2019,
  howpublished = {Zenodo},
  doi = {10.5281/zenodo.2591652},
  url = {https://doi.org/10.5281/zenodo.2591652}
}