Compute distances between sequences. 30+ algorithms, pure Python implementation, common interface, optional external libraries for speed.

Overview

TextDistance


TextDistance -- a Python library for comparing the distance between two or more sequences using many algorithms.

Features:

  • 30+ algorithms
  • Pure Python implementation
  • Simple usage
  • Comparison of more than two sequences at once
  • Some algorithms have more than one implementation in one class
  • Optional numpy usage for maximum speed

Algorithms

Edit based

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |
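
For example, two of the edit-based functions above, used with their default parameters:

import textdistance

textdistance.levenshtein('test', 'text')
# 1

textdistance.damerau_levenshtein('test', 'tset')
# 1 -- a single transposition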

Token based

Algorithm Class Functions
Jaccard index Jaccard jaccard
Sørensen–Dice coefficient Sorensen sorensen, sorensen_dice, dice
Tversky index Tversky tversky
Overlap coefficient Overlap overlap
Tanimoto distance Tanimoto tanimoto
Cosine similarity Cosine cosine
Monge-Elkan MongeElkan monge_elkan
Bag distance Bag bag
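
Token-based algorithms compare sequences as bags of tokens (characters by default; see the as_set option under Usage). A small sketch, assuming the default bag semantics:

import textdistance

textdistance.jaccard('test', 'text')
# 0.6 -- the bags share 't', 't', 'e'; their union holds 5 characters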

Sequence based

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| longest common subsequence similarity | LCSSeq | lcsseq |
| longest common substring similarity | LCSStr | lcsstr |
| Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |
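
For the sequence-based algorithms, similarity counts the matched subsequence or substring; a small sketch:

import textdistance

textdistance.lcsseq.similarity('test', 'text')
# 3 -- the longest common subsequence is 'tet'

textdistance.lcsstr.similarity('test', 'text')
# 2 -- the longest common substring is 'te'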

Compression based

Normalized compression distance with different compression algorithms.

Classic compression algorithms:

| Algorithm | Class | Function |
|-----------|-------|----------|
| Arithmetic coding | ArithNCD | arith_ncd |
| RLE | RLENCD | rle_ncd |
| BWT RLE | BWTRLENCD | bwtrle_ncd |

Normal compression algorithms:

| Algorithm | Class | Function |
|-----------|-------|----------|
| Square Root | SqrtNCD | sqrt_ncd |
| Entropy | EntropyNCD | entropy_ncd |

Work-in-progress algorithms that compare two strings as arrays of bits:

| Algorithm | Class | Function |
|-----------|-------|----------|
| BZ2 | BZ2NCD | bz2_ncd |
| LZMA | LZMANCD | lzma_ncd |
| ZLib | ZLIBNCD | zlib_ncd |

See the blog post for more details about NCD.
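
The NCD classes share the common interface described under Usage; a minimal example with the zlib-based variant:

import textdistance

textdistance.zlib_ncd('test', 'text')
# a float close to 0 for similar strings, larger for unrelated ones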

Phonetic

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| MRA | MRA | mra |
| Editex | Editex | editex |

Simple

| Algorithm | Class | Functions |
|-----------|-------|-----------|
| Prefix similarity | Prefix | prefix |
| Postfix similarity | Postfix | postfix |
| Length distance | Length | length |
| Identity similarity | Identity | identity |
| Matrix similarity | Matrix | matrix |

Installation

Stable

Pure Python implementation only:

pip install textdistance

With extra libraries for maximum speed:

pip install "textdistance[extras]"

With all libraries (required for benchmarking and testing):

pip install "textdistance[benchmark]"

With algorithm-specific extras:

pip install "textdistance[Hamming]"

Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.

Dev

Via pip:

pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance

Or clone the repo and install with some extras:

git clone https://github.com/life4/textdistance.git
pip install -e ".[benchmark]"

Usage

All algorithms have 2 interfaces:

  1. A class with algorithm-specific params for customizing.
  2. A class instance with default params for quick and simple usage.

All algorithms have some common methods:

  1. .distance(*sequences) -- calculate the distance between sequences.
  2. .similarity(*sequences) -- calculate the similarity of sequences.
  3. .maximum(*sequences) -- the maximum possible value for distance and similarity. For any sequences: distance + similarity == maximum (see the example below).
  4. .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means the sequences are equal and 1 means they are totally different.
  5. .normalized_similarity(*sequences) -- normalized similarity of sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.
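
For example, for Hamming on two 4-character strings the maximum is 4, so distance (1) and similarity (3) sum to it:

import textdistance

textdistance.hamming.maximum('test', 'text')
# 4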

Most common init arguments:

  1. qval -- q-value for splitting sequences into q-grams. Possible values:
    • 1 (default) -- compare sequences by characters.
    • 2 or more -- transform sequences into q-grams.
    • None -- split sequences by words.
  2. as_set -- for token-based algorithms (illustrated below):
    • True -- 't' and 'ttt' are equal.
    • False (default) -- 't' and 'ttt' are different.
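
A short sketch of both options, using the token-based Jaccard as an illustration:

import textdistance

textdistance.Jaccard(as_set=True).similarity('t', 'ttt')
# 1.0 -- as sets, both sides reduce to {'t'}

textdistance.Jaccard(qval=None).similarity('a b c', 'a b d')
# 0.5 -- split into words: 2 shared tokens out of 4 in total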

Examples

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms have the same interface.

Articles

A few articles with examples of how to use textdistance in the real world:

Extra libraries

For the main algorithms, textdistance tries to call known external libraries (fastest first) when they are available (installed on your system) and applicable (the implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.

You can disable this by passing external=False argument on init:

import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3

Supported libraries:

  1. abydos
  2. Distance
  3. jellyfish
  4. py_stringmatching
  5. pylev
  6. python-Levenshtein
  7. pyxDamerauLevenshtein

Algorithms:

  1. DamerauLevenshtein
  2. Hamming
  3. Jaro
  4. JaroWinkler
  5. Levenshtein

Benchmarks

Without extras installation:

| algorithm | library | function | time |
|-----------|---------|----------|------|
| DamerauLevenshtein | jellyfish | damerau_levenshtein_distance | 0.00965294 |
| DamerauLevenshtein | pyxdameraulevenshtein | damerau_levenshtein_distance | 0.151378 |
| DamerauLevenshtein | pylev | damerau_levenshtein | 0.766461 |
| DamerauLevenshtein | textdistance | DamerauLevenshtein | 4.13463 |
| DamerauLevenshtein | abydos | damerau_levenshtein | 4.3831 |
| Hamming | Levenshtein | hamming | 0.0014428 |
| Hamming | jellyfish | hamming_distance | 0.00240262 |
| Hamming | distance | hamming | 0.036253 |
| Hamming | abydos | hamming | 0.0383933 |
| Hamming | textdistance | Hamming | 0.176781 |
| Jaro | Levenshtein | jaro | 0.00313561 |
| Jaro | jellyfish | jaro_distance | 0.0051885 |
| Jaro | py_stringmatching | jaro | 0.180628 |
| Jaro | textdistance | Jaro | 0.278917 |
| JaroWinkler | Levenshtein | jaro_winkler | 0.00319735 |
| JaroWinkler | jellyfish | jaro_winkler | 0.00540443 |
| JaroWinkler | textdistance | JaroWinkler | 0.289626 |
| Levenshtein | Levenshtein | distance | 0.00414404 |
| Levenshtein | jellyfish | levenshtein_distance | 0.00601647 |
| Levenshtein | py_stringmatching | levenshtein | 0.252901 |
| Levenshtein | pylev | levenshtein | 0.569182 |
| Levenshtein | distance | levenshtein | 1.15726 |
| Levenshtein | abydos | levenshtein | 3.68451 |
| Levenshtein | textdistance | Levenshtein | 8.63674 |

Total: 24 libs.

Yes, it is slow. Use TextDistance in production only with extras installed.

TextDistance uses these benchmark results to optimize the algorithms and tries to call the fastest external library first (when possible).

You can run benchmark manually on your system:

pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark

TextDistance shows a benchmark results table for your system and saves the library priorities into a libraries.json file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation first. A default libraries.json is already included in the package.

Running tests

You can run tests via dephell:

curl -L dephell.org/install | python3
dephell venv create --env=pytest-external
dephell deps install --env=pytest-external
dephell venv run --env=pytest-external

Contributing

PRs are welcome!

  • Found a bug? Fix it!
  • Want to add more algorithms? Sure! Just give it the same interface as the other algorithms in the lib and add some tests.
  • Can you make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.
  • Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
  • Have no time to code? Tell your friends and subscribers about textdistance. More users, more contributions, more amazing features.

Thank you ❤️

Comments
  • add support for rapidfuzz

    rapidfuzz implements the following algorithms:

    • Jaro/JaroWinkler (fastest by a large margin)
    • Hamming (slightly slower than python-Levenshtein)
    • Levenshtein (about as fast as python-Levenshtein for very short strings, and fastest for longer strings)

    Additionally, it supports any sequence of hashable types (e.g. lists of strings), not only text.
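
    As a hedged sketch of that hashable-sequence support (newer rapidfuzz versions expose rapidfuzz.distance; older ones used rapidfuzz.string_metric, as in the tables below):

    from rapidfuzz.distance import Levenshtein

    Levenshtein.distance('test', 'text')
    # 1

    Levenshtein.distance(['foo', 'bar'], ['foo', 'baz'])
    # 1 -- lists of strings work too, not only text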

    Here is the benchmark result:

    # Faster than textdistance:
    
    | algorithm          | library                 | function                     |        time |
    |--------------------+-------------------------+------------------------------+-------------|
    | DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0181046   |
    | DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.030925    |
    | Hamming            | Levenshtein             | hamming                      | 0.000351586 |
    | Hamming            | rapidfuzz.string_metric | hamming                      | 0.00040442  |
    | Hamming            | jellyfish               | hamming_distance             | 0.0143502   |
    | Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.000749048 |
    | Jaro               | jellyfish               | jaro_similarity              | 0.0152322   |
    | JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.000776006 |
    | JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0157833   |
    | Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.0010058   |
    | Levenshtein        | Levenshtein             | distance                     | 0.00103176  |
    | Levenshtein        | jellyfish               | levenshtein_distance         | 0.0147382   |
    | Levenshtein        | pylev                   | levenshtein                  | 0.14116     |
    Total: 13 libs.
    

    and the benchmark results when adding slightly longer strings:

    STMT = """
    func('text', 'test')
    func('qwer', 'asdf')
    func('a' * 15, 'b' * 15)
    func('a' * 30, 'b' * 30)
    """
    
    # Faster than textdistance:
    
    | algorithm          | library                 | function                     |        time |
    |--------------------+-------------------------+------------------------------+-------------|
    | DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0323887   |
    | DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.143235    |
    | Hamming            | Levenshtein             | hamming                      | 0.000489837 |
    | Hamming            | rapidfuzz.string_metric | hamming                      | 0.000517879 |
    | Hamming            | jellyfish               | hamming_distance             | 0.0182341   |
    | Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.00111363  |
    | Jaro               | jellyfish               | jaro_similarity              | 0.0201971   |
    | JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.00105238  |
    | JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0206678   |
    | Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.00138601  |
    | Levenshtein        | Levenshtein             | distance                     | 0.0034889   |
    | Levenshtein        | jellyfish               | levenshtein_distance         | 0.0232467   |
    | Levenshtein        | pylev                   | levenshtein                  | 0.599603    |
    Total: 13 libs.
    
    opened by maxbachmann 13
  • Add new DamerauLevenshtein... classes

    There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933. Some of the external libraries implement one of them, and others the other.

    This PR introduces two different classes, DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted, with DamerauLevenshtein being the unrestricted version, so that it is clear which is intended.
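
    A hedged sketch of the difference, using the class names this PR describes (assuming instances are callable like other textdistance algorithms):

    import textdistance

    textdistance.DamerauLevenshteinUnrestricted()('ca', 'abc')
    # 2 -- transpose 'ca' -> 'ac', then insert 'b'

    textdistance.DamerauLevenshteinRestricted()('ca', 'abc')
    # 3 -- the restricted (OSA) variant may not edit a transposed pair again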

    opened by juliangilbey 7
  • Ignore inconsistent timings on some comparison tests

    Two particular tests have timings that differ wildly between successive runs on arm64 architectures. This might be because some libraries take a long time to load or something like that - I don't know. But this patch turns off hypothesis's timing checks for these two tests. I'm going to apply it to Debian's package; you might or might not want to apply it upstream.

    opened by juliangilbey 5
  • Modify JaroWinkler boosting to match behaviour of jellyfish algorithm

    Jellyfish has recently modified its JaroWinkler algorithm to allow for boosting even when one of the strings is shorter than 4 characters: https://github.com/jamesturk/jellyfish/commit/87f9679910eba0dad6a1f6019f03cbdffba28392. It is very unclear whether this is a good idea or not. But as it is, the tests now fail, as the internal and external algorithms give different results on a pair of strings such as ":" and ":0".

    This patch replicates the change that jellyfish has made, which will then allow the external tests to pass once again. It also modifies the expected value of the comparison "fog" and "frog" to match this new algorithm behaviour.

    If you do not wish to apply this patch, then the external tests will need modifying to exclude the case where either of the strings has length < 4.

    hacktoberfest-accepted 
    opened by juliangilbey 5
  • Possible correction to Monge-Elkan calculation

    I might be wrong about this, but I think the code for the Monge-Elkan algorithm needs to be corrected.

    If you look at the implementation in the py_stringmatching library, line 81 of https://github.com/anhaidgroup/py_stringmatching/blob/master/py_stringmatching/similarity_measure/monge_elkan.py computes sim = float(sum_of_maxes) / float(len(bag1)), which is essentially the mean of the per-token maxima.

    But in the implementation for textdistance, the score is given on line 222 of https://github.com/life4/textdistance/blob/master/textdistance/algorithms/token_based.py as
    sum(maxes) / len(seq) / len(maxes)

    I think the further division by len(maxes) isn't needed; the line should just be sum(maxes) / len(seq).
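
    For reference, a minimal sketch of that mean-max formulation (illustrative code, not the library's):

    def monge_elkan(seq1, seq2, sim):
        # best score for each token of seq1 against seq2, then the mean
        maxes = [max(sim(t1, t2) for t2 in seq2) for t1 in seq1]
        return sum(maxes) / len(seq1)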

    The change in the code could mess up tests elsewhere, so I'm not changing anything else. But thought I should bring this to your attention.

    Below is some code and differing scores I got in textdistance and py_stringmatching.

    # score in textdistance
    from textdistance import MongeElkan, levenshtein
    ALG = MongeElkan
    score = ALG(algorithm=levenshtein, qval=None, symmetric=False).similarity('Good Times!', "The Good Times and The Bad Ones")
    score
    # Got 2.25

    # score in py_stringmatching
    from py_stringmatching import MongeElkan
    from py_stringmatching import Levenshtein as Levenshtein_2
    ALG_2 = MongeElkan(sim_func=Levenshtein_2().get_raw_score)
    source = 'Good Times!'
    source_split = source.split()
    target = "The Good Times and The Bad Ones"
    target_split = target.split()
    score2 = ALG_2.get_raw_score(source_split, target_split)
    score2
    # got 5.5
    
    opened by shijithpk 3
  • Handle newer versions of abydos and jellyfish

    abydos has changed its interface for distance metrics quite significantly, and jellyfish has changed the names of the functions. This patch addresses both of these issues.

    opened by juliangilbey 3
  • Ensure that maximum normalised distance is <= 1 and ...

    textdistance is currently failing its test suite on arm64 machines with Python 3.10, which is causing me problems on Debian. I have managed to track down the first of these bugs (and there are at least two more to come): some algorithms call upper() before comparing strings. As already noted in the code, these algorithms were designed for English (ASCII only), so upper() can change the length of a string containing non-English characters, and hypothesis generates exactly such strings when testing. This can result in the normalised distance being greater than 1. This patch addresses the problem by ensuring that the distance returned from the relevant algorithms is no greater than self.maximum().
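
    A minimal sketch of that clamping idea (method names here are illustrative, not the actual patch):

    def distance(self, *sequences):
        raw = self._compute(*sequences)  # hypothetical algorithm-specific raw distance
        # never report more than the theoretical maximum
        return min(raw, self.maximum(*sequences))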

    A second issue which arose when doing this was calculating the maximum distance for Editex(); the current function for calculating the maximum does not give the correct answer if match_cost > mismatch_cost, for example. But this would be a silly situation: why would we penalise matching characters more than mismatching ones? There are two ways of resolving this: the first is to calculate the maximum distance using max(match_cost, group_cost, mismatch_cost), the second is to force the inequalities match_cost <= group_cost <= mismatch_cost. I have gone for the latter option in this patch.

    All being well, there will be more patches to come in the next few weeks as I get to the bottom of them!

    opened by juliangilbey 2
  • update rapidfuzz

    This updates rapidfuzz to the latest version, which provides a Damerau-Levenshtein implementation. It is the fastest of the supported libraries:

    | algorithm          | library                               | function                     |        time |
    |--------------------+---------------------------------------+------------------------------+-------------|
    | DamerauLevenshtein | rapidfuzz.distance.DamerauLevenshtein | distance                     | 0.00267046  |
    | DamerauLevenshtein | jellyfish                             | damerau_levenshtein_distance | 0.022479    |
    | DamerauLevenshtein | pyxdameraulevenshtein                 | damerau_levenshtein_distance | 0.0393475   |
    | DamerauLevenshtein | **textdistance**                      | DamerauLevenshtein           | 0.589098    |
    

    In addition, it is the only implementation that requires only linear memory.

    opened by maxbachmann 1
  • Fix numpy types warnings

    Basic types have been deprecated in numpy 1.20. Here are the full warnings:

    DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
      Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    
    DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
      Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    

    I don’t know the code well enough to assess whether the specific numpy types are required, though.

    opened by ArchangeGabriel 1
  • Fix a setuptools warning

    UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead

    opened by ArchangeGabriel 1
  • Fix README links

    Hi,

    I noticed that the Travis CI link was wrong, and then found a few more links that appear to reference an old repository.

    This PR corrects those links by replacing orsinium with life4 in some URLs.

    And thanks for the great project, Bruno

    opened by kinow 1