Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

Overview

patterns-finder

Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

This library offers the capabilities:

  • A set of predefined patterns with the most useful regex.
  • Extend the patterns, by adding user defined regex.
  • Find and extarct patterns from text
  • Pandas' Dataframe support.
  • Sort the results of extraction.
  • Summarize the results of extraction.
  • Display extractions by visualy rich text annotation.
  • Build complex extraction rules based on regex (in future release).

Installation

To install the last version of patterns-finder library, use pip:

pip install patterns-finder

Usage

Find a pattern in the text

Just import patterns, like emoji from patterns_finder.patterns.web, then you can use them to find pattern in text:

from patterns_finder.patterns.web import emoji, url, email 

emoji.find("the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 ")
# Output:
# [(18, 19, 'EMOJI', '🦊'), (49, 50, 'EMOJI', '🐶')]

url.find("The lazy 🐶 has a website https://lazy.dog.com ")
# Output:
# [(25, 45, 'URL', 'https://lazy.dog.com')]

email.find("[email protected] is the email of 🦊 ")
# Output:
# [(0, 19, 'EMAIL', '[email protected]')]

The results provided by the method find for each of pattern are in the form:

[(0, 19, 'EMAIL', '[email protected]')]
  ^  ^       ^          ^ 
  |  |       |          |
 Offset      |          └ Text matching the pattern
  |  |       └ Label of the pattern
  |  └ End index
  └ Start index in the text

Find multiple patterns in the text

To search for different patterns in the text we can use the method finder.patterns_in_text(text, patterns) as follows:

from patterns_finder import finder
from patterns_finder.patterns.web import emoji, url, color_hex
from patterns_finder.patterns.number import integer

patterns = [emoji, color_hex, integer]
text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
finder.patterns_in_text(text, patterns)
# Output:
# [(18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (12, 14, 'INTEGER', '52'),
#  (15, 16, 'INTEGER', '2'),
#  (27, 28, 'INTEGER', '3')]

Find user defined patterns in the text

To define new pattern you can use any regex pattern that are supported by the regex and re packages of python. User defined patterns can be writen in the form of string regex pattern or tuple of string ('regex pattern', 'label').

patterns = [web.emoji, "quick|lazy", ("\\b[a-zA-Z]+\\b", "WORD") ]
text = "the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 "
finder.patterns_in_text(text, patterns)
# Output: 
# [(18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (4, 9, 'quick|lazy', 'quick'),
#  (44, 48, 'quick|lazy', 'lazy'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy')]

Sort extraxted patterns

By using the argument sort_by of the method finder.patterns_in_text we can sort the extraction accoring to different options:

  • sort_by=finder.START sorts the results by the start index in the text
patterns = [web.emoji, color_hex, ('\\b[a-zA-Z]+\\b', 'WORD') ]
finder.patterns_in_text(text, patterns, sort_by=finder.START)
# Output:
# [(0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (49, 50, 'EMOJI', '🐶')]
  • sort_by=finder.END sorts the results by the end index in the text
finder.patterns_in_text(text, patterns, sort_by=finder.END)
# Output:
# [(0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (49, 50, 'EMOJI', '🐶')]
  • sort_by=finder.LABEL sorts the results by pattern's label
finder.patterns_in_text(text, patterns, sort_by=finder.LABEL)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy')]
  • sort_by=finder.TEXT sorts the results by the extracted text
finder.patterns_in_text(text, patterns, sort_by=finder.TEXT)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (20, 26, 'WORD', 'jumped'),
#  (44, 48, 'WORD', 'lazy'),
#  (35, 39, 'WORD', 'over'),
#  (4, 9, 'WORD', 'quick'),
#  (0, 3, 'WORD', 'the'),
#  (40, 43, 'WORD', 'the'),
#  (29, 34, 'WORD', 'times'),
#  (49, 50, 'EMOJI', '🐶'),
#  (18, 19, 'EMOJI', '🦊')]

Summarize results of extraction

By using the argument summary_type, one can choose the desired form of output results.

  • summary_type=finder.NONE retruns a list with all details, without summarization.
patterns = [ color_hex, ('\\b[a-zA-Z]+\\b', 'WORD'), web.emoji ]
finder.patterns_in_text(text, patterns, summary_type=finder.NONE)
# Output:
# [(10, 17, 'COLOR_HEX', '#A52A2A'),
#  (0, 3, 'WORD', 'the'),
#  (4, 9, 'WORD', 'quick'),
#  (20, 26, 'WORD', 'jumped'),
#  (29, 34, 'WORD', 'times'),
#  (35, 39, 'WORD', 'over'),
#  (40, 43, 'WORD', 'the'),
#  (44, 48, 'WORD', 'lazy'),
#  (18, 19, 'EMOJI', '🦊'),
#  (49, 50, 'EMOJI', '🐶')]
  • summary_type=finder.LABEL_TEXT_OFFSET returns a dictionary of patterns labels as keys, with the corresponding offsets and text as values.
finder.patterns_in_text(text, patterns, summary_type=finder.LABEL_TEXT_OFFSET)
# Output:
# {
#  'COLOR_HEX': [[10, 17, '#A52A2A']],
#  'WORD': [[0, 3, 'the'], [4, 9, 'quick'], [20, 26, 'jumped'], [29, 34, 'times'], [35, 39, 'over'], [40, 43, 'the'], [44, 48, 'lazy']],
#  'EMOJI': [[18, 19, '🦊'], [49, 50, '🐶']]
# }
  • summary_type=finder.LABEL_TEXT returns a dictionary of patterns labels as keys, with the corresponding text (without offset) as values.
finder.patterns_in_text(text, patterns, summary_type=finder.LABEL_TEXT)
# Output:
# {
#  'COLOR_HEX': ['#A52A2A'],
#  'WORD': ['the', 'quick', 'jumped', 'times', 'over', 'the', 'lazy'],
#  'EMOJI': ['🦊', '🐶']
# }
  • summary_type=finder.TEXT_ONLY returns a list of the extracted text only.
finder.patterns_in_text(text, patterns, summary_type=finder.TEXT_ONLY)
# Output:
# ['#A52A2A', 'the', 'quick', 'jumped', 'times', 'over', 'the', 'lazy', '🦊', '🐶']

Extract patterns from Pandas DataFrame

This package provides the capability to extract patterns from Pandas' DataFrame easily, by using the method finder.patterns_in_df(df, input_col, output_col, patterns, ...).

from patterns_finder import finder
from patterns_finder.patterns import web
import pandas as pd

patterns = [web.email, web.emoji, web.url]

df = pd.DataFrame(data={
    'text': ["the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶",
                    "[email protected] is the email of 🦊",
                    "The lazy 🐶 has a website https://lazy.dog.com"],
    })

finder.patterns_in_df(df, "text", "extraction", patterns, summary_type=finder.LABEL_TEXT)
# Output:
# |    | text                                                 | extraction                                          |
# |---:|:-----------------------------------------------------|:----------------------------------------------------|
# |  0 | the quick #A52A2A 🦊 jumped 3 times over the lazy 🐶 | {'EMOJI': ['🦊', '🐶']}                            |
# |  1 | [email protected] is the email of 🦊               | {'EMAIL': ['[email protected]'], 'EMOJI': ['🦊']} |
# |  2 | The lazy 🐶 has a website https://lazy.dog.com       | {'EMOJI': ['🐶'], 'URL': ['https://lazy.dog.com']}  |

The method finder.patterns_in_df have also the arguments summary_type and sort_by.

List of all predefined patterns

  • Web
from patterns_finder.web import email, url, uri, mailto, html_link, sql, color_hex, copyright, alphanumeric, emoji, username, quotation, ipv4, ipv6
  • Phone
from patterns_finder.phone import generic, uk, us
  • Credit Cards
from patterns_finder.credit_card import generic, visa, mastercard, discover, american_express
  • Numbers
from patterns_finder.number import integer, float, scientific, hexadecimal, percent, roman
  • Currency
from patterns_finder.currency import monetary, symbol, code, name
  • Languages
from patterns_finder.language import english, french, spanish, arabic, hebrew, turkish, russian, german, chinese, greek, japanese, hindi, bangali, armenian, swedish, portoguese, balinese, georgian
  • Time and Date
from patterns_finder.time_date import time, date, year
  • Postal Code
from patterns_finder.postal_code import us, canada, uk, france, spain, switzerland, brazilian

Contact

Please email your questions or comments to me.

You might also like...
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

texlive expressions for documents

tex2nix Generate Texlive environment containing all dependencies for your document rather than downloading gigabytes of texlive packages. Installation

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task。涵盖68个领域、共计916万词的专业词典知识库,可用于文本分类、知识增强、领域词汇库扩充等自然语言处理应用。

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Lightning ASR Modular and extensible speech recognition library leveraging pytorch-lightning and hydra What is Lightning ASR • Installation • Get Star

Comments
  • Add Support for Patents patterns

    Add Support for Patents patterns

    Support Patent patterns w/ first implementation to support Patents globally

    Example usage:

    from patterns_finder.patterns.patents import global_patent
    global_patent.find("Patent US5960368A is titled Method for acid oxidation of radioactive, hazardous, and mixed organic waste materials ")
    # Output:
    # [(7, 16, 'PATENT', 'US5960368A')]
    
    

    requesting permission to add the patterns :p

    opened by mahzy 0
Releases(1.0.1)
The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model.

This repository contains the raw dataset used in NHNet [1] for the task of News Story Headline Generation. The code of data processing and training is available under Tensorflow Models - NHNet.

Google Research Datasets 31 Jul 15, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
Programme de chiffrement et de déchiffrement inverse d'un message en python3.

Chiffrement Inverse En Python3 Programme de chiffrement et de déchiffrement inverse d'un message en python3. Explication du chiffrement inverse avec c

Malik Makkes 2 Mar 26, 2022
Python generation script for BitBirds

BitBirds generation script Intro This is published under MIT license, which means you can do whatever you want with it - entirely at your own risk. Pl

286 Dec 06, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 07, 2023
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

325 Jan 05, 2023
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables

Mortgage-Application-Analysis Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables: age, in

1 Jan 29, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
Python api wrapper for JellyFish Lights

Python api wrapper for JellyFish Lights The hope is to make this a pip installable package Current capabalilities: Connects to a local JellyFish Light

10 Dec 18, 2022
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
(ACL-IJCNLP 2021) Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models.

BERT Convolutions Code for the paper Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models. Contains expe

mlpc-ucsd 21 Jul 18, 2022
Linking data between GBIF, Biodiverse, and Open Tree of Life

GBIF-biodiverse-OpenTree Linking data between GBIF, Biodiverse, and Open Tree of Life The python scripts will rely on opentree and Dendropy. To set up

2 Oct 03, 2022
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
Chinese NewsTitle Generation Project by GPT2.带有超级详细注释的中文GPT2新闻标题生成项目。

GPT2-NewsTitle 带有超详细注释的GPT2新闻标题生成项目 UpDate 01.02.2021 从网上收集数据,将清华新闻数据、搜狗新闻数据等新闻数据集,以及开源的一些摘要数据进行整理清洗,构建一个较完善的中文摘要数据集。 数据集清洗时,仅进行了简单地规则清洗。

logCong 785 Dec 29, 2022