📝An easy-to-use package to restore punctuation of the text.

Related tags

Text Data & NLPrpunct
Overview

✏️ rpunct - Restore Punctuation

forthebadge

This repo contains code for Punctuation restoration.

This package is intended for direct use as a punctuation restoration model for the general English language. Alternatively, you can use this for further fine-tuning on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights that have been fine-tuned for Punctuation restoration.

Punctuation restoration works on arbitrarily large text. And uses GPU if it's available otherwise will default to CPU.

List of punctuations we restore:

  • Upper-casing
  • Period: .
  • Exclamation: !
  • Question Mark: ?
  • Comma: ,
  • Colon: :
  • Semi-colon: ;
  • Apostrophe: '
  • Dash: -

🚀 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  1. Sample python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for finetuning the model:

Language Number of text samples
English 560,000

We found the best convergence around 3 epochs, which is what presented here and available via a download.


The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy Overall F1 Eval Support
91% 90% 45,990

💻 🎯 Further Fine-Tuning

To start fine-tuning or training please look into training/train.py file. Running python training/train.py will replicate the results of this model.


Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.


Comments
  • Update requirements.txt

    Update requirements.txt

    ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0) ERROR: No matching distribution found for torch==1.8.1

    opened by Rukaya-lab 0
  • Forked repo with fixes

    Forked repo with fixes

    I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

    pip install git+https://github.com/samwaterbury/rpunct.git
    

    Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!

    opened by samwaterbury 2
  • Requirements shouldn't ask for such specific versions

    Requirements shouldn't ask for such specific versions

    First, thanks a lot for providing this package :)

    Currently, the requirements.txt, and thus the dependencies in the setup.py are for very specific versions of Pytorch etc. This shouldn't be the case if you want this package to be used as a general library (think of a second package that would do the same but ask for an incompatible version of PyTorch and would prevent any possible installation of the two together). The end user might also be needing a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirements for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.

    opened by adefossez 2
  • Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Thanks for the great library! When running this without a GPU I had problems. I think there is a simple fix. The simple transformer NER model defaults to enabling cuda. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model. So you can now run the code on a CPU by initializing rpunct like so

    rpunct = RestorePuncts(ner_args={"use_cuda": False})
    

    Before this change, when running rpunct examples on the CPU the following error occurs:

    from rpunct import RestorePuncts
    # The default language is 'english'
    rpunct = RestorePuncts()
    rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
    by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
    a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
    professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
    3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
    
    

    ValueError Traceback (most recent call last) /var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in 1 from rpunct import RestorePuncts 2 # The default language is 'english' ----> 3 rpunct = RestorePuncts() 4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record 5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

    ~/repos/rpunct/rpunct/punctuate.py in init(self, wrds_per_pred, ner_args) 19 if ner_args is None: 20 ner_args = {} ---> 21 self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels, 22 args={"silent": True, "max_seq_length": 512}, **ner_args) 23

    ~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in init(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs) 209 self.device = torch.device(f"cuda:{cuda_device}") 210 else: --> 211 raise ValueError( 212 "'use_cuda' set to True when cuda is unavailable." 213 "Make sure CUDA is available or set use_cuda=False."

    ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.

    opened by nbertagnolli 1
  • add use_cuda parameter

    add use_cuda parameter

    using the package in an environment without cuda support causes it to fail. Adding the parameter to shut it off if necessary allows it to function normall.

    opened by mjfox3 1
Releases(1.0.1)
Owner
Daulet Nurmanbetov
Deep Learning, AI and Finance
Daulet Nurmanbetov
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
kochat

Kochat 챗봇 빌더는 성에 안차고, 자신만의 딥러닝 챗봇 애플리케이션을 만드시고 싶으신가요? Kochat을 이용하면 손쉽게 자신만의 딥러닝 챗봇 애플리케이션을 빌드할 수 있습니다. # 1. 데이터셋 객체 생성 dataset = Dataset(ood=True) #

1 Oct 25, 2021
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in Tensorflow IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

1k Dec 26, 2022
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Facebook Research 409 Oct 28, 2022
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
A website which allows you to play with the GPT-2 transformer

transformers A website which allows you to play with the GPT-2 model Built with ❤️ by raphtlw Table of contents Model Setup About Contributors Model T

raphtlw 2 Jan 27, 2022
A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk.

Simple-Vosk A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk. Check out the official Vosk G

2 Jun 19, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Koniwa project 32 Dec 14, 2022
Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Code and dataset for the EMNLP 2021 Finding paper "Can NLI Models Verify QA Systems’ Predictions?"

Jifan Chen 22 Oct 21, 2022
NeoDays-based tileset for the roguelike CDDA (Cataclysm Dark Days Ahead)

NeoDaysPlus Reduced contrast, expanded, and continuously developed version of the CDDA tileset NeoDays that's being completed with new sprites for mis

0 Nov 12, 2022
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version) pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。 目录 主要亮点 编译和安装 各类分词工具包的性能对比 使用方式 论文引用 作者 常见问题及解答 主要

LancoPKU 6k Dec 29, 2022
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2

Google Research Datasets 52 Jun 21, 2022
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).

Splitter ⠀⠀ A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019). Abstract Recent inte

Benedek Rozemberczki 201 Nov 09, 2022
OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

Fredrik Carlsson 481 Dec 30, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023