NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Overview

NLPretext

Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web? 😫

Need to efficiently extract email adresses from a document? Hashtags from tweets? Remove accents from a French post? 😥

NLPretext got you covered! 🚀

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

🔍 Quickly explore below our preprocessing pipelines and individual functions referential.

Cannot find what you were looking for? Feel free to open an issue.

Installation

This package has been tested on Python 3.6, 3.7 and 3.8.

We strongly advise you to do the remaining steps in a virtual environnement.

To install this library you just have to run the following command:

pip install nlpretext

This library uses Spacy as tokenizer. Current models supported are en_core_web_sm and fr_core_news_sm. If not installed, run the following commands:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.3.0/fr_core_news_sm-2.3.0.tar.gz

Preprocessing pipeline

Default pipeline

Need to preprocess your text data but no clue about what function to use and in which order? The default preprocessing pipeline got you covered:

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life !!! I recommend"

Create your custom pipeline

Another possibility is to create your custom pipeline if you know exactly what function to apply on your data, here's an example:

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"

Take a look at all the functions that are available here in the preprocess.py scripts in the different folders: basic, social, token.

Individual Functions

Replacing emails

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to [email protected]"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"

Replacing phone numbers

from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"

Removing Hashtags

from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"

Extracting emojis

from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']

Data augmentation

The augmentation module helps you to generate new texts based on your given examples by modifying some words in the initial ones and to keep associated entities unchanged, if any, in the case of NER tasks. If you want words other than entities to remain unchanged, you can specify it within the stopwords argument. Modifications depend on the chosen method, the ones currently supported by the module are substitutions with synonyms using Wordnet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method=wordnet_synonym”, entities=entities)
print(example)
# "I need to buy a small black pocketbook please."

Make HTML documentation

In order to make the html Sphinx documentation, you need to run at the nlpretext root path: sphinx-apidoc -f nlpretext -o docs/ This will generate the .rst files. You can generate the doc with cd docs && make html

You can now open the file index.html located in the build folder.

Project Organization


├── LICENSE
├── VERSION
├── CONTRIBUTING.md     <- Contribution guidelines
├── README.md           <- The top-level README for developers using this project.
├── .github/workflows   <- Where the CI lives
├── datasets/external   <- Bash scripts to download external datasets
├── docs                <- Sphinx HTML documentation
├── nlpretext           <- Main Package. This is where the code lives
│   ├── preprocessor.py <- Main preprocessing script
│   ├── augmentation    <- Text augmentation script
│   ├── basic           <- Basic text preprocessing 
│   ├── social          <- Social text preprocessing
│   ├── token           <- Token text preprocessing
│   ├── _config         <- Where the configuration and constants live
│   └── _utils          <- Where preprocessing utils scripts lives
├── tests               <- Where the tests lives
├── setup.py            <- makes project pip installable (pip install -e .) so the package can be imported
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
└── pylintrc            <- The linting configuration file
Comments
  • Bump actions/cache from 2.1.6 to 3.2.1

    Bump actions/cache from 2.1.6 to 3.2.1

    Bumps actions/cache from 2.1.6 to 3.2.1.

    Release notes

    Sourced from actions/cache's releases.

    v3.2.1

    What's Changed

    Full Changelog: https://github.com/actions/cache/compare/v3.2.0...v3.2.1

    v3.2.0

    What's Changed

    New Contributors

    Full Changelog: https://github.com/actions/cache/compare/v3...v3.2.0

    v3.2.0-beta.1

    What's Changed

    v3.1.0-beta.3

    What's Changed

    • Bug fixes for bsdtar fallback, if gnutar not available, and gzip fallback, if cache saved using old cache action, on windows.

    Full Changelog: https://github.com/actions/cache/compare/v3.1.0-beta.2...v3.1.0-beta.3

    ... (truncated)

    Changelog

    Sourced from actions/cache's changelog.

    3.2.1

    • Update @actions/cache on windows to use gnu tar and zstd by default and fallback to bsdtar and zstd if gnu tar is not available. (issue)
    • Added support for fallback to gzip to restore old caches on windows.
    • Added logs for cache version in case of a cache miss.
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies github_actions 
    opened by dependabot[bot] 0
  • Bump python from 3.9.7-slim-buster to 3.11.1-slim-buster in /docker

    Bump python from 3.9.7-slim-buster to 3.11.1-slim-buster in /docker

    Bumps python from 3.9.7-slim-buster to 3.11.1-slim-buster.

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft docker dependencies 
    opened by dependabot[bot] 0
  • The current release is not functional as emoji lib has changed

    The current release is not functional as emoji lib has changed

    🐛 Bug Report

    🔬 How To Reproduce

    Steps to reproduce the behavior:

    1. install nlpretext from pip (1.1.0)
    2. run from nlpretext._config import constants

    Code sample

    Environment

    • OS: macOS Silicon
    • Python version: 3.7, 3.8, 3.9

    📈 Expected behavior

    EMOJI_PATTERN = _emoji.get_emoji_regexp()

    AttributeError: module 'emoji' has no attribute 'get_emoji_regexp'

    bug 
    opened by Guillaume6606 1
  • Bump release-drafter/release-drafter from 5.15.0 to 5.21.1

    Bump release-drafter/release-drafter from 5.15.0 to 5.21.1

    Bumps release-drafter/release-drafter from 5.15.0 to 5.21.1.

    Release notes

    Sourced from release-drafter/release-drafter's releases.

    v5.21.1

    What's Changed

    Dependency Updates

    Full Changelog: https://github.com/release-drafter/release-drafter/compare/v5.21.0...v5.21.1

    v5.21.0

    What's Changed

    New

    Full Changelog: https://github.com/release-drafter/release-drafter/compare/v5.20.1...v5.21.0

    v5.20.1

    What's Changed

    Bug Fixes

    Documentation

    Dependency Updates

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies github_actions 
    opened by dependabot[bot] 0
  • Bump cloudpickle from 2.0.0 to 2.2.0

    Bump cloudpickle from 2.0.0 to 2.2.0

    Bumps cloudpickle from 2.0.0 to 2.2.0.

    Changelog

    Sourced from cloudpickle's changelog.

    2.2.0

    2.1.0

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies python 
    opened by dependabot[bot] 0
Releases(1.1.0)
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 | 한국어 State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained models

Hugging Face 77.1k Dec 31, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
Final Project for the Intel AI Readiness Boot Camp NLP (Jan)

NLP Boot Camp (Jan) Synopsis Full Name: Prameya Mohanty Name of your School: Delhi Public School, Rourkela Class: VIII Title of the Project: iTransect

TheCodingHub 1 Feb 01, 2022
Pytorch implementation of winner from VQA Chllange Workshop in CVPR'17

2017 VQA Challenge Winner (CVPR'17 Workshop) pytorch implementation of Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challeng

Mark Dong 166 Dec 11, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

652 Jan 06, 2023
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Amazon Web Services - Labs 1.1k Dec 27, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Neural network sequence labeling model

Sequence labeler This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. Can be u

Marek Rei 250 Nov 03, 2022
一个基于Nonebot2和go-cqhttp的娱乐性qq机器人

Takker - 一个普通的QQ机器人 此项目为基于 Nonebot2 和 go-cqhttp 开发,以 Sqlite 作为数据库的QQ群娱乐机器人 关于 纯兴趣开发,部分功能借鉴了大佬们的代码,作为Q群的娱乐+功能性Bot 声明 此项目仅用于学习交流,请勿用于非法用途 这是开发者的第一个Pytho

风屿 79 Dec 29, 2022
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
Word Bot for JKLM Bomb Party

Word Bot for JKLM Bomb Party A bot for Bomb Party on https://www.jklm.fun (Only English) Requirements pynput pyperclip pyautogui Usage: Step 1: Run th

Nicolas 7 Oct 30, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
This project converts your human voice input to its text transcript and to an automated voice too.

Human Voice to Automated Voice & Text Introduction: In this project, whenever you'll speak, it will turn your voice into a robot voice and furthermore

Hassan Shahzad 3 Oct 15, 2021
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

SITT The repo contains official PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation. Authors: Boyi Li Yin Cui T

Boyi Li 52 Jan 05, 2023
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022