📝 An easy-to-use package for restoring punctuation in text.

Last update: Dec 30, 2022

Overview

✏️ rpunct - Restore Punctuation

This repo contains code for punctuation restoration.

This package is intended for direct use as a punctuation restoration model for general English. Alternatively, you can use it as a starting point for further fine-tuning on domain-specific texts. It uses HuggingFace's bert-base-uncased model weights, fine-tuned for punctuation restoration.

Punctuation restoration works on arbitrarily long text. The model runs on a GPU if one is available and otherwise falls back to the CPU.
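One common way to handle arbitrarily long input with a fixed-length model is to split the text into word windows and process each window separately. The sketch below illustrates that idea only; the function names and the 250-word window are assumptions made for illustration, and rpunct's own splitting logic may differ (for example, by overlapping windows).

```python
from typing import Callable, Iterator, List


def split_into_chunks(text: str, words_per_chunk: int = 250) -> Iterator[List[str]]:
    """Yield consecutive word lists of at most `words_per_chunk` words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    for start in range(0, len(words), words_per_chunk):
        yield words[start:start + words_per_chunk]


def process_long_text(text: str,
                      punctuate_chunk: Callable[[str], str],
                      words_per_chunk: int = 250) -> str:
    """Apply `punctuate_chunk` (a hypothetical per-chunk model call) to each chunk."""
    pieces = [
        punctuate_chunk(" ".join(chunk))
        for chunk in split_into_chunks(text, words_per_chunk)
    ]
    return " ".join(pieces)


# Stand-in for a model call: upper-case each chunk so the flow is visible.
print(process_long_text("a b c d e", str.upper, words_per_chunk=2))  # A B C D E
```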

Punctuation marks and casing that the model restores:

Upper-casing
Period: .
Exclamation: !
Question Mark: ?
Comma: ,
Colon: :
Semi-colon: ;
Apostrophe: '
Dash: -
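To make the list above concrete, here is a minimal sketch of how per-word predictions could be turned back into text. It assumes a hypothetical two-character label per word: the first character is the punctuation mark to append ("O" for none) and the second is "U" if the word should be capitalized ("O" otherwise). The label scheme and the function name are illustrative assumptions, not a description of the package's internals.

```python
def apply_labels(words, labels):
    """Rebuild text from lowercase words and hypothetical two-character labels."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    out = []
    for word, label in zip(words, labels):
        punct, case = label[0], label[1]
        if case == "U":
            word = word.capitalize()  # upper-case the first letter
        if punct != "O":
            word += punct             # append the predicted punctuation mark
        out.append(word)
    return " ".join(out)


words = ["as", "successful", "as", "it", "was", "that", "approach", "had", "a", "weakness"]
labels = ["OU", "OO", "OO", "OO", ",O", "OO", "OO", "OO", "OO", ".O"]
print(apply_labels(words, labels))
# As successful as it was, that approach had a weakness.
```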

🚀 Usage

Below is a quick way to get up and running with the model.

First, install the package.

pip install rpunct

Sample Python code:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for fine-tuning the model:

Language	Number of text samples
English	560,000

We found the best convergence at around 3 epochs; that checkpoint is the one presented here and available for download.

The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy	Overall F1	Eval Support
91%	90%	45,990
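The source does not say exactly how these two numbers were computed. As one plausible reading, the sketch below computes token-level accuracy and a micro-averaged F1 that ignores the "no change" label. The label strings (such as "OO" for "no punctuation, no capitalization") and the function names are assumptions made for illustration.

```python
def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if not gold:
        raise ValueError("empty label sequence")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold, pred, ignore="OO"):
    """Micro-averaged F1 over all labels except the `ignore` label."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore:
            if p == g:
                tp += 1   # predicted a change and it was the right one
            else:
                fp += 1   # predicted a change that was wrong
        if g != ignore and p != g:
            fn += 1       # a required change was missed or mislabeled
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["OU", "OO", ",O", "OO", ".O"]
pred = ["OU", "OO", "OO", "OO", ".O"]  # the comma was missed
print(accuracy(gold, pred), round(micro_f1(gold, pred), 3))  # 0.8 0.8
```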

💻 🎯 Further Fine-Tuning

To start fine-tuning or training, see the training/train.py file. Running python training/train.py replicates the results of this model.
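For fine-tuning on your own domain, punctuated text first has to be converted into (word, label) pairs. The sketch below shows one way this could be done, using a hypothetical two-character label (punctuation mark or "O", then "U" for capitalized or "O"). It is not the contents of training/train.py, and the label scheme is an assumption for illustration.

```python
PUNCT = ".!?,:;'-"


def text_to_labels(text):
    """Convert punctuated text to (lowercase word, label) training pairs."""
    pairs = []
    for token in text.split():
        # Treat a trailing punctuation mark as the label's first character.
        punct = token[-1] if token[-1] in PUNCT else "O"
        core = token[:-1] if punct != "O" else token
        if not core:
            continue  # the token was a lone punctuation mark
        case = "U" if core[0].isupper() else "O"
        pairs.append((core.lower(), punct + case))
    return pairs


print(text_to_labels("Hello, world."))
# [('hello', ',U'), ('world', '.O')]
```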

☕ Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.

Comments

Update requirements.txt

ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0)
ERROR: No matching distribution found for torch==1.8.1

opened by Rukaya-lab 0
Forked repo with fixes
I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

pip install git+https://github.com/samwaterbury/rpunct.git

Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!
opened by samwaterbury 2
Requirements shouldn't ask for such specific versions

First, thanks a lot for providing this package :)

Currently, the requirements.txt, and thus the dependencies in the setup.py are for very specific versions of Pytorch etc. This shouldn't be the case if you want this package to be used as a general library (think of a second package that would do the same but ask for an incompatible version of PyTorch and would prevent any possible installation of the two together). The end user might also be needing a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirements for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.

opened by adefossez 2
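To illustrate the point about exact pins, here is a deliberately simplified version check. It handles only plain X.Y.Z versions with == and >= and is not a substitute for pip's real (PEP 440) resolution; the function names are made up for this example. The version list is the one from the pip error quoted above.

```python
def parse_version(version):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check `version` against a single '==X.Y.Z' or '>=X.Y.Z' specifier."""
    if spec.startswith("=="):
        return parse_version(version) == parse_version(spec[2:])
    if spec.startswith(">="):
        return parse_version(version) >= parse_version(spec[2:])
    raise ValueError(f"unsupported specifier: {spec}")


available = ["1.11.0", "1.12.0", "1.12.1", "1.13.0"]
print([v for v in available if satisfies(v, "==1.8.1")])  # []
print([v for v in available if satisfies(v, ">=1.8.1")])  # all four versions
```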
Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.
Thanks for the great library! When running this without a GPU I had problems, and I think there is a simple fix. The simpletransformers NER model enables CUDA by default. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model, so you can now run the code on a CPU by initializing rpunct like so:

rpunct = RestorePuncts(ner_args={"use_cuda": False})

Before this change, when running rpunct examples on the CPU the following error occurs:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")

ValueError                                Traceback (most recent call last)
/var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in
      1 from rpunct import RestorePuncts
      2 # The default language is 'english'
----> 3 rpunct = RestorePuncts()
      4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
      5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

~/repos/rpunct/rpunct/punctuate.py in __init__(self, wrds_per_pred, ner_args)
     19         if ner_args is None:
     20             ner_args = {}
---> 21         self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels,
     22                               args={"silent": True, "max_seq_length": 512}, **ner_args)
     23

~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in __init__(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs)
    209                 self.device = torch.device(f"cuda:{cuda_device}")
    210             else:
--> 211                 raise ValueError(
    212                     "'use_cuda' set to True when cuda is unavailable."
    213                     "Make sure CUDA is available or set use_cuda=False."

ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.
opened by nbertagnolli 1
add use_cuda parameter

Using the package in an environment without CUDA support causes it to fail. Adding a parameter to turn CUDA off when necessary allows it to function normally.

opened by mjfox3 1
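Building on the two CUDA-related requests above, the flag can be chosen at runtime instead of being hard-coded. The sketch below assumes a build of rpunct whose RestorePuncts accepts the ner_args dictionary proposed in the pull request; the released 1.0.1 package may not accept it. The helper name cuda_available is made up for this example.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# Arguments forwarded to the simpletransformers NER model (assumed interface).
ner_args = {"use_cuda": cuda_available()}
print(ner_args)

# With a build of rpunct that accepts `ner_args`:
# from rpunct import RestorePuncts
# rpunct = RestorePuncts(ner_args=ner_args)
```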

Releases(1.0.1)

1.0.1(May 24, 2021)


Owner

Daulet Nurmanbetov

Deep Learning, AI and Finance
