A simple, fast, powerful, and easily extensible Python package for extracting patterns from text, with more than 60 predefined regular expressions.

Overview

patterns-finder

This library offers the following capabilities:

  • A set of predefined patterns covering the most useful regexes.
  • Extend the patterns by adding user-defined regexes.
  • Find and extract patterns from text.
  • Pandas DataFrame support.
  • Sort the results of extraction.
  • Summarize the results of extraction.
  • Display extractions with visually rich text annotation.
  • Build complex extraction rules based on regex (planned for a future release).

Installation

To install the latest version of the patterns-finder library, use pip:

pip install patterns-finder

Usage

Find a pattern in the text

Import a predefined pattern, such as emoji from patterns_finder.patterns.web, then call its find method to search for that pattern in a text:

from patterns_finder.patterns.web import emoji, url, email 

emoji.find("the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 ")
# Output:
# [(18, 19, 'EMOJI', '🦊'), (49, 50, 'EMOJI', '🐶')]

url.find("The lazy 🐶 has a website https://lazy.dog.com ")
# Output:
# [(25, 45, 'URL', 'https://lazy.dog.com')]

email.find("[email protected] is the email of 🦊 ")
# Output:
# [(0, 19, 'EMAIL', '[email protected]')]

The find method of each pattern returns a list of results in the following form:

[(0, 19, 'EMAIL', '[email protected]')]
  ^  ^       ^          ^ 
  |  |       |          |
 Offset      |          └ Text matching the pattern
  |  |       └ Label of the pattern
  |  └ End index
  └ Start index in the text
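Because each result is a plain tuple, it can be unpacked directly. The short sketch below reuses the emoji output shown earlier as a hard-coded list, so it runs without the package installed. It assumes the end index is exclusive, which is what the sample offsets suggest, and checks that slicing the text with the two indices gives back the matched text.

```python
# Results copied from the emoji example above; hard-coded so that this
# sketch runs without patterns-finder installed.
text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
results = [(18, 19, 'EMOJI', '🦊'), (49, 50, 'EMOJI', '🐶')]

for start, end, label, matched in results:
    # Assumption: the end index is exclusive, so a plain slice
    # recovers the matched text.
    assert text[start:end] == matched
    print(f"{label}: {matched!r} at [{start}, {end})")
```

The offsets count Python string positions (code points), which is why each emoji spans exactly one position.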

Find multiple patterns in the text

To search for several patterns in the same text, use the method finder.patterns_in_text(text, patterns) as follows:

from patterns_finder import finder
from patterns_finder.patterns.web import emoji, url, color_hex
from patterns_finder.patterns.number import integer

patterns = [emoji, color_hex, integer]
text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
finder.patterns_in_text(text, patterns)
# Output:
# [(18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (12, 14, 'INTEGER', '52'),
#  (15, 16, 'INTEGER', '2'),
#  (27, 28, 'INTEGER', '3')]
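The output above is grouped by pattern, and the digits inside the hex color are reported again as integers. A minimal sketch with the standard re module shows why a pattern-by-pattern scan behaves this way. The function find_all and the two regexes are assumptions made for illustration; they are not the library's implementation or its predefined expressions.

```python
import re

def find_all(text, patterns):
    """Scan the text once per pattern and collect
    (start, end, label, matched_text) tuples in pattern order."""
    results = []
    for regex, label in patterns:
        for m in re.finditer(regex, text):
            results.append((m.start(), m.end(), label, m.group()))
    return results

# Illustrative regexes only; the library's predefined ones may differ.
COLOR_HEX = (r"#[0-9A-Fa-f]{6}", "COLOR_HEX")
INTEGER = (r"\d+", "INTEGER")

text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
found = find_all(text, [COLOR_HEX, INTEGER])
print(found)
# [(10, 17, 'COLOR_HEX', '#A52A2A'), (12, 14, 'INTEGER', '52'),
#  (15, 16, 'INTEGER', '2'), (27, 28, 'INTEGER', '3')]
```

Each pattern is matched independently, so overlapping matches from different patterns all appear in the result. Filtering out overlaps, if needed, would be a separate step.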

Find user defined patterns in the text

To define a new pattern, you can use any regex that is supported by Python's regex and re packages. A user-defined pattern can be written either as a regex string or as a tuple of strings ('regex pattern', 'label').

from patterns_finder.patterns import web

patterns = [web.emoji, "quick|lazy", ("\\b[a-zA-Z]+\\b", "WORD")]
text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
finder.patterns_in_text(text, patterns)
# Output: 
# [(18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (4, 9, 'quick|lazy', 'quick'),
#  (44, 48, 'quick|lazy', 'lazy'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy')]
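In the output above, a bare regex string is labeled with the regex itself, while a tuple carries its own label. The sketch below shows one way such inputs could be normalized before matching. The helper names normalize and find_user_patterns are hypothetical, and the code uses only the standard re module rather than the library's internals.

```python
import re

def normalize(pattern):
    """Turn a user-defined pattern into a (regex, label) pair."""
    if isinstance(pattern, str):
        # Bare string: the regex doubles as its own label.
        return pattern, pattern
    regex, label = pattern
    return regex, label

def find_user_patterns(text, patterns):
    """Match each normalized pattern and collect result tuples."""
    results = []
    for regex, label in map(normalize, patterns):
        results.extend((m.start(), m.end(), label, m.group())
                       for m in re.finditer(regex, text))
    return results

text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
found = find_user_patterns(text, ["quick|lazy", (r"\b[a-zA-Z]+\b", "WORD")])
print(found[:2])
# [(4, 9, 'quick|lazy', 'quick'), (44, 48, 'quick|lazy', 'lazy')]
```

The word boundaries in the WORD regex keep it from matching the letters inside #A52A2A, because each of those letters touches a digit.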

Sort extracted patterns

The sort_by argument of the method finder.patterns_in_text sorts the extraction results according to one of the following options:

  • sort_by=finder.START sorts the results by the start index in the text
patterns = [web.emoji, color_hex, ('\\b[a-zA-Z]+\\b', 'WORD') ]
finder.patterns_in_text(text, patterns, sort_by=finder.START)
# Output:
# [(0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (49, 50, 'EMOJI', '🐶')]
  • sort_by=finder.END sorts the results by the end index in the text
finder.patterns_in_text(text, patterns, sort_by=finder.END)
# Output:
# [(0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (49, 50, 'EMOJI', '🐶')]
  • sort_by=finder.LABEL sorts the results by pattern's label
finder.patterns_in_text(text, patterns, sort_by=finder.LABEL)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy')]
  • sort_by=finder.TEXT sorts the results by the extracted text
finder.patterns_in_text(text, patterns, sort_by=finder.TEXT)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (20, 26, 'WORD', 'jumped'),
#  (44, 48, 'WORD', 'lazy'),
#  (35, 39, 'WORD', 'over'),
#  (4, 9, 'WORD', 'quick'),
#  (0, 3, 'WORD', 'the'),
#  (40, 43, 'WORD', 'the'),
#  (29, 34, 'WORD', 'times'),
#  (49, 50, 'EMOJI', '🐶'),
#  (18, 19, 'EMOJI', '🦊')]
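The four orderings correspond to sorting on one field of the result tuple. As a rough sketch, the same effect can be reproduced on a hard-coded result list with sorted and operator.itemgetter. The index names below are local to this example; they are not the library's finder.START, finder.END, finder.LABEL, or finder.TEXT constants, whose values the README does not show.

```python
from operator import itemgetter

# Positions of the fields inside each result tuple
# (hypothetical names, used only in this sketch).
START_IDX, END_IDX, LABEL_IDX, TEXT_IDX = range(4)

results = [
    (18, 19, 'EMOJI', '🦊'),
    (49, 50, 'EMOJI', '🐶'),
    (10, 17, 'COLOR_HEX', '#A52A2A'),
    (0, 3, 'WORD', 'the'),
    (4, 9, 'WORD', 'quick'),
]

by_start = sorted(results, key=itemgetter(START_IDX))
by_label = sorted(results, key=itemgetter(LABEL_IDX))
by_text = sorted(results, key=itemgetter(TEXT_IDX))

print([r[TEXT_IDX] for r in by_start])  # ['the', 'quick', '#A52A2A', '🦊', '🐶']
print([r[TEXT_IDX] for r in by_label])  # ['#A52A2A', '🦊', '🐶', 'the', 'quick']
print([r[TEXT_IDX] for r in by_text])   # ['#A52A2A', 'quick', 'the', '🐶', '🦊']
```

Python's sort is stable, so results that share a label keep their original relative order. Sorting by text compares code points, which places the emoji after the ASCII strings, as in the output above.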

Summarize results of extraction

By using the argument summary_type, one can choose the desired form of output results.

  • summary_type=finder.NONE returns a list with all details, without summarization.
patterns = [ color_hex, ('\\b[a-zA-Z]+\\b', 'WORD'), web.emoji ]
finder.patterns_in_text(text, patterns, summary_type=finder.NONE)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶')]
  • summary_type=finder.LABEL_TEXT_OFFSET returns a dictionary with pattern labels as keys and the corresponding offsets and text as values.
finder.patterns_in_text(text, patterns, summary_type=finder.LABEL_TEXT_OFFSET)
# Output:
# {
#  'COLOR_HEX': [[10, 17, '#A52A2A']],
#  'WORD': [[0, 3, 'the'], [4, 9, 'quick'], [20, 26, 'jumped'], [29, 34, 'times'], [35, 39, 'over'], [40, 43, 'the'], [44, 48, 'lazy']],
#  'EMOJI': [[18, 19, '🦊'], [49, 50, '🐶']]
# }
  • summary_type=finder.LABEL_TEXT returns a dictionary with pattern labels as keys and the corresponding text (without offsets) as values.
finder.patterns_in_text(text, patterns, summary_type=finder.LABEL_TEXT)
# Output:
# {
#  'COLOR_HEX': ['#A52A2A'],
#  'WORD': ['the', 'quick', 'jumped', 'times', 'over', 'the', 'lazy'],
#  'EMOJI': ['🦊', '🐶']
# }
  • summary_type=finder.TEXT_ONLY returns a list of the extracted text only.
finder.patterns_in_text(text, patterns, summary_type=finder.TEXT_ONLY)
# Output:
# ['#A52A2A', 'the', 'quick', 'jumped', 'times', 'over', 'the', 'lazy', '🦊', '🐶']
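Each summary type is a reshaping of the same list of result tuples. The sketch below derives the three summarized forms from a hard-coded, shortened result list. The function names are hypothetical and only mirror the shapes shown in the outputs above; they are not the library's own code.

```python
from collections import defaultdict

results = [
    (10, 17, 'COLOR_HEX', '#A52A2A'),
    (0, 3, 'WORD', 'the'),
    (4, 9, 'WORD', 'quick'),
    (18, 19, 'EMOJI', '🦊'),
]

def label_text_offset(results):
    """Group [start, end, text] triples under each label."""
    summary = defaultdict(list)
    for start, end, label, text in results:
        summary[label].append([start, end, text])
    return dict(summary)

def label_text(results):
    """Group the matched text under each label, dropping offsets."""
    summary = defaultdict(list)
    for _, _, label, text in results:
        summary[label].append(text)
    return dict(summary)

def text_only(results):
    """Keep only the matched text, in result order."""
    return [text for _, _, _, text in results]

print(label_text(results))
# {'COLOR_HEX': ['#A52A2A'], 'WORD': ['the', 'quick'], 'EMOJI': ['🦊']}
print(text_only(results))
# ['#A52A2A', 'the', 'quick', '🦊']
```

Dictionaries preserve insertion order in Python 3.7 and later, so the labels appear in the order in which they are first encountered.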

Extract patterns from Pandas DataFrame

This package can also extract patterns from a Pandas DataFrame, using the method finder.patterns_in_df(df, input_col, output_col, patterns, ...).

from patterns_finder import finder
from patterns_finder.patterns import web
import pandas as pd

patterns = [web.email, web.emoji, web.url]

df = pd.DataFrame(data={
    'text': ["the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶",
             "[email protected] is the email of 🦊",
             "The lazy 🐶 has a website https://lazy.dog.com"],
})

finder.patterns_in_df(df, "text", "extraction", patterns, summary_type=finder.LABEL_TEXT)
# Output:
# |    | text                                                 | extraction                                          |
# |---:|:-----------------------------------------------------|:----------------------------------------------------|
# |  0 | the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 | {'EMOJI': ['🦊', '🐶']}                            |
# |  1 | [email protected] is the email of 🦊               | {'EMAIL': ['[email protected]'], 'EMOJI': ['🦊']} |
# |  2 | The lazy 🐶 has a website https://lazy.dog.com       | {'EMOJI': ['🐶'], 'URL': ['https://lazy.dog.com']}  |

The method finder.patterns_in_df also accepts the arguments summary_type and sort_by.

List of all predefined patterns

  • Web
from patterns_finder.patterns.web import email, url, uri, mailto, html_link, sql, color_hex, copyright, alphanumeric, emoji, username, quotation, ipv4, ipv6
  • Phone
from patterns_finder.patterns.phone import generic, uk, us
  • Credit Cards
from patterns_finder.patterns.credit_card import generic, visa, mastercard, discover, american_express
  • Numbers
from patterns_finder.patterns.number import integer, float, scientific, hexadecimal, percent, roman
  • Currency
from patterns_finder.patterns.currency import monetary, symbol, code, name
  • Languages
from patterns_finder.patterns.language import english, french, spanish, arabic, hebrew, turkish, russian, german, chinese, greek, japanese, hindi, bangali, armenian, swedish, portoguese, balinese, georgian
  • Time and Date
from patterns_finder.patterns.time_date import time, date, year
  • Postal Code
from patterns_finder.patterns.postal_code import us, canada, uk, france, spain, switzerland, brazilian

Contact

Please email your questions or comments to me.

Comments
  • Add Support for Patents patterns

    Support Patent patterns w/ first implementation to support Patents globally

    Example usage:

    from patterns_finder.patterns.patents import global_patent
    global_patent.find("Patent US5960368A is titled Method for acid oxidation of radioactive, hazardous, and mixed organic waste materials ")
    # Output:
    # [(7, 16, 'PATENT', 'US5960368A')]
    
    

    requesting permission to add the patterns :p

    opened by mahzy 0
Releases(1.0.1)