An open-source NLP library: fast text cleaning and preprocessing.

Last update: Mar 18, 2022

Overview

🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

To install dobbi, either fork this GitHub repo or install it from PyPI via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('Check here: https://some-url.com')

Supported methods and patterns

The process consists of three stages:

Initialization methods: initialize a dobbi Work object
Intermediate methods: chain patterns in the needed order
Terminal methods: choose if you need a function or a result

Initialization functions:

dobbi.clean()
dobbi.collect()
dobbi.replace()
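
The three entry points differ in what they do with a match. The sketch below mimics that difference with Python's standard `re` module only; it does not call dobbi. The examples further down show `clean()` removing matches and `replace()` substituting a token. Reading `collect()` as "gather the matches" is an assumption based on its name, and the token `TOKEN_HASHTAG` is a made-up placeholder.

```python
import re

# Hashtag pattern, the same one used in the regexp example of this README.
HASHTAG = re.compile(r'#\w+')
text = '#fun #lol Why is this so funny?'

# "clean" mode: drop every match from the text.
cleaned = HASHTAG.sub('', text).strip()

# "replace" mode: swap every match for a token (hypothetical token name).
replaced = HASHTAG.sub('TOKEN_HASHTAG', text)

# "collect" mode (assumed meaning): return the matches, leave the text alone.
collected = HASHTAG.findall(text)

print(cleaned)
print(replaced)
print(collected)
```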

Intermediate methods (pattern processing choice):

regexp() - custom regular expressions
url() - URLs
html() - HTML and "<...>"-style markup
punctuation() - punctuation
hashtag() - hashtags
emoji() - emoji
emoticons() - emoticons
whitespace() - any type of whitespace
nickname() - @-starting nicknames

Terminal methods:

execute(str) - executes chosen methods on the provided string.
function() - returns a function which is a combination of the chosen methods.
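
The three stages can be pictured with a small stand-alone sketch. It is not dobbi's implementation: the class name `MiniWork`, its two patterns, and the whitespace tidy-up at the end are all assumptions made for illustration. It only shows how an intermediate method can return the same object so that calls chain, and how `execute()` can be a thin wrapper around the callable built by `function()`.

```python
import re


class MiniWork:
    """Hypothetical stand-in for a chainable work object (not dobbi's code)."""

    def __init__(self):
        # Initialization stage: start with an empty list of patterns.
        self._patterns = []

    def hashtag(self):
        # Intermediate stage: record a pattern, then return self so calls chain.
        self._patterns.append(re.compile(r'#\w+'))
        return self

    def url(self):
        # Intermediate stage: same idea for URLs.
        self._patterns.append(re.compile(r'https?://\S+'))
        return self

    def function(self):
        # Terminal stage: build one callable that applies the patterns in order.
        patterns = list(self._patterns)

        def apply(text):
            for pattern in patterns:
                text = pattern.sub('', text)
            # Assumed tidy-up: collapse leftover runs of whitespace.
            return ' '.join(text.split())

        return apply

    def execute(self, text):
        # Terminal stage: build the function and call it immediately.
        return self.function()(text)


cleaner = MiniWork().hashtag().url().function()
print(cleaner('#fun Check here: https://some-url.com'))
print(MiniWork().hashtag().url().execute('#lol  hello'))
```

A callable obtained this way is handy when the same pipeline has to run over many strings, for example inside a list comprehension or a `map` call.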

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and URLs with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

3) Get the text cleanup function (one-liner)

~~Please, try to avoid the in-line method chaining, as it is less readable.~~ Do as your heart tells you.

func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'

4) Chain regexp methods

dobbi.clean() \
    .regexp(r'#\w+') \
    .regexp(r'@\w+') \
    .regexp(r'https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

Additional

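Note that the functions are applied in the order you specify them. It is therefore best to chain .punctuation() as one of the last functions: if punctuation is removed first, characters such as "#", "@" and "://" may already be gone by the time the hashtag, nickname and URL patterns run.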
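
The effect of ordering can be reproduced with plain regular expressions. The sketch below does not use dobbi. It assumes that "punctuation" means the ASCII characters in `string.punctuation`, and the helper `tidy` is a hypothetical name used only to make the two results easy to compare.

```python
import re
import string

TEXT = 'Check here: https://some-url.com'

URL = re.compile(r'https?://\S+')
# Assumption: punctuation is the set of ASCII characters in string.punctuation.
PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')


def tidy(text):
    # Collapse whitespace so the two results are easy to compare.
    return ' '.join(text.split())


# URL first, punctuation last: the URL is still intact when its pattern runs.
url_first = tidy(PUNCT.sub('', URL.sub('', TEXT)))

# Punctuation first: "://", "-" and "." are stripped, so the URL pattern
# no longer matches and the remains of the URL stay in the text.
punct_first = tidy(URL.sub('', PUNCT.sub('', TEXT)))

print(url_first)    # Check here
print(punct_first)  # Check here httpssomeurlcom
```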

Call for collaboration 🤗

If you enjoyed the project, I would be grateful if you supported it :)

Below is a list of areas where I would welcome your help:

Finding bugs
Making code optimizations
Writing tests
Helping with new feature development

Releases (v0_13)

v0_13 (Oct 29, 2021)

v0_10 (Oct 19, 2021)

v0_06 (Oct 18, 2021)

v0_03 (Oct 16, 2021)

v0_02 (Oct 16, 2021)

v0_01 (Oct 16, 2021)

Owner

Iaroslav
