Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

Last update: Jan 01, 2023

Related tags

Text Processing thefuzz

Overview

TheFuzz

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

Requirements

Python 2.7 or higher
difflib
python-Levenshtein (optional, provides a 4-10x speedup in String Matching, though may result in differing results for certain cases)

For testing

pycodestyle
hypothesis
pytest

Installation

Using PIP via PyPI

pip install thefuzz

or the following to install python-Levenshtein too

pip install thefuzz[speedup]

Using PIP via Github

pip install git+git://github.com/seatgeek/[email protected]#egg=thefuzz

Adding to your requirements.txt file (run pip install -r requirements.txt afterwards)

git+ssh://[email protected]/seatgeek/[email protected]#egg=thefuzz

Manually via GIT

git clone git://github.com/seatgeek/thefuzz.git thefuzz
cd thefuzz
python setup.py install

Usage

>>> from thefuzz import fuzz
>>> from thefuzz import process

Simple Ratio

>>> fuzz.ratio("this is a test", "this is a test!")
    97

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") 100 ">

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear") 100 ">

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>> process.extract("new york jets", choices, limit=2) [('New York Jets', 100), ('New York Giants', 78)] >>> process.extractOne("cowboys", choices) ("Dallas Cowboys", 90) ">

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)

You can also pass additional parameters to extractOne method to make it use a specific scorer. A typical use case is to match file paths:

>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio) ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61) ">

>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

Related tags

Overview

TheFuzz

Requirements

For testing

Installation

Usage

Simple Ratio

Partial Ratio

Token Sort Ratio

Token Set Ratio

Process

Owner

SeatGeek

Python character encoding detector

Wikipedia Reader for the GNOME Desktop

RSS Reader application for the Emacs Application Framework.

Microsoft's Cascadia Code font customized to my liking.

Vastasanuli - Vastasanuli pelaa Sanuli-peliä.

WorldCloud Orçamento de Estado 2022

Map Reduce Wordcount in Python using gRPC

Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Free & simple way to encipher text

Getting git-style versioning working on RDFlib

Auto translate Localizable.strings for multiple languages in Xcode

Open-source linguistic ethnography tool for framing public opinion in mediatized groups.

Wikipedia Extractive Text Summarizer + Keywords Identification (entropy-based)

This is REST-API for Indonesian Text Summarization using Non-Negative Matrix Factorization for the algorithm to summarize documents and FastAPI for the framework.

Redlines produces a Markdown text showing the differences between two strings/text

pydantic-i18n is an extension to support an i18n for the pydantic error messages.

A simple Python module for parsing human names into their individual components

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

基于Pytex的数学建模工具,实现将md文件转换成pdf/tex文档的前后端

Etranslate is a free and unlimited python library for transiting your texts