Model for recasing and repunctuating ASR transcripts

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.
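
As an illustration of this multitask setup, here is a minimal sketch (the class name, checkpoint name, and parameters are placeholders, not the repository's actual identifiers):

import torch.nn as nn
from transformers import AutoModel

class RecasePuncSketch(nn.Module):
    # Illustrative sketch: a shared BERT encoder with two token-level
    # classification heads trained jointly, one for casing and one for
    # punctuation.
    def __init__(self, bert_name='bert-base-multilingual-cased',
                 num_case_labels=4, num_punc_labels=5):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        hidden_size = self.bert.config.hidden_size
        self.case_head = nn.Linear(hidden_size, num_case_labels)
        self.punc_head = nn.Linear(hidden_size, num_punc_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask)[0]
        return self.case_head(hidden), self.punc_head(hidden)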

The model predicts the following recasing labels:

  • lower: keep lowercase
  • upper: convert to upper case
  • capitalize: set first letter as upper case
  • other: left as is

And the following punctuation labels (a short example of applying both label sets follows the list):

  • o: no punctuation
  • period: .
  • comma: ,
  • question: ?
  • exclamation: !
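
To make the two label sets concrete, here is a small illustrative example (not the repository's code) of applying predicted label pairs back to a token sequence:

# Illustrative only: apply (case, punctuation) label pairs to tokens.
CASE_FN = {
    'lower': str.lower,
    'upper': str.upper,
    'capitalize': str.capitalize,
    'other': lambda token: token,  # left as is
}
PUNC_STR = {'o': '', 'period': '.', 'comma': ',', 'question': '?', 'exclamation': '!'}

def apply_labels(tokens, case_labels, punc_labels):
    # Recase each token, then append its predicted punctuation mark.
    return ' '.join(CASE_FN[c](t) + PUNC_STR[p]
                    for t, c, p in zip(tokens, case_labels, punc_labels))

print(apply_labels(['hello', 'how', 'are', 'you'],
                   ['capitalize', 'lower', 'lower', 'lower'],
                   ['comma', 'o', 'o', 'question']))
# -> Hello, how are you?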

Input tokens are batched as sequences of length 256 that are processed independently without overlap.

In training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a technique called Cut-drop).
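
A minimal sketch of the Cut-drop idea, assuming padded tensors and a padding id of 0 (both assumptions, not necessarily the repository's conventions):

import random

def cut_drop(x, y_case, y_punc, max_length=256, rate=0.1, pad=0):
    # With probability `rate`, draw a cut point uniformly and replace
    # everything after it with padding, so the model also sees
    # sequences shorter than max_length during training.
    if random.random() < rate:
        cut = random.randint(1, max_length - 1)
        x[cut:] = pad
        y_case[cut:] = pad
        y_punc[cut:] = pad
    return x, y_case, y_punc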

Changelog:

  • Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt

Models

  • French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
    • Iterations: 19000
    • Batch size: 16
    • Max length: 256
    • Seed: 871253
    • Cut-drop probability: 0.1
    • Train loss: 0.021128975618630648
    • Valid loss: 0.015684964135289192
    • Recasing accuracy: 96.73
    • Punctuation accuracy: 95.02
      • All punctuation F-score: 67.79
      • Comma F-score: 67.94
      • Period F-score: 72.91
      • Question F-score: 57.57
      • Exclamation mark F-score: 15.78
    • Training data: First 100M words from Common Crawl

Training

Note: adjust the file names below as appropriate. Training tensors are precomputed and loaded into CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with the Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc
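
For intuition, here is a rough sketch of the kind of label extraction this stage performs (simplified; the actual script uses the Moses tokenizer and more careful normalization):

# Simplified illustration of extracting recasing/punctuation labels
# from cased, punctuated text (the real script is more thorough).
PUNC_LABEL = {'.': 'period', ',': 'comma', '?': 'question', '!': 'exclamation'}

def extract_labels(words):
    examples = []
    for i, word in enumerate(words):
        if word in PUNC_LABEL:
            continue  # punctuation becomes a label on the previous token
        if word.islower():
            case = 'lower'
        elif word.isupper():
            case = 'upper'
        elif word[:1].isupper() and word[1:].islower():
            case = 'capitalize'
        else:
            case = 'other'
        nxt = words[i + 1] if i + 1 < len(words) else ''
        examples.append((word.lower(), case, PUNC_LABEL.get(nxt, 'o')))
    return examples

print(extract_labels(['Hello', ',', 'how', 'are', 'you', '?']))
# -> [('hello', 'capitalize', 'comma'), ('how', 'lower', 'o'),
#     ('are', 'lower', 'o'), ('you', 'lower', 'question')]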

Stage 2: sub-tokenize with the Flaubert tokenizer, and generate pytorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y
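
A rough sketch of what this step could produce (the checkpoint name, label-id maps, and padding conventions here are assumptions, not the repository's actual format):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('flaubert/flaubert_base_uncased')
CASE_ID = {'lower': 0, 'upper': 1, 'capitalize': 2, 'other': 3}
PUNC_ID = {'o': 0, 'period': 1, 'comma': 2, 'question': 3, 'exclamation': 4}

def tensorize(examples, max_length=256, pad=0):
    # examples: list of (word, case_label, punc_label) triples.
    ids, case_ids, punc_ids = [], [], []
    for word, case, punc in examples:
        pieces = tokenizer.encode(word, add_special_tokens=False)
        ids.extend(pieces)
        # Attach each label to the first sub-token; pad the rest.
        case_ids.extend([CASE_ID[case]] + [pad] * (len(pieces) - 1))
        punc_ids.extend([PUNC_ID[punc]] + [pad] * (len(pieces) - 1))
    padding = [pad] * max(0, max_length - len(ids))
    x = torch.tensor((ids + padding)[:max_length])
    y = torch.tensor([(case_ids + padding)[:max_length],
                      (punc_ids + padding)[:max_length]])
    return x, y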

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y

Comments
  • Is it possible to customize it for a new language?

    Dear Benoit Favre,

    Your project is really important! Is it possible to customize it for a new language? If so, could you give a few short hints on how?

    Thank you in advance!

    opened by ican24 5
  • Can't get attribute 'WordpieceTokenizer'

    Hi, thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

    I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

    Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint
    Traceback (most recent call last):
      File "main2.py", line 120, in <module>
        t = transcriber()
      File "main2.py", line 32, in __init__
        self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
      File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
        loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
        result = unpickler.load()
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
        return super().find_class(mod_name, name)
    AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>
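
    (Editorial note, not part of the original thread: this AttributeError usually means the checkpoint was pickled while WordpieceTokenizer was defined in the __main__ module, so pickle looks it up there at load time. A common workaround, sketched below under the assumption that recasepunc.py defines WordpieceTokenizer, is to alias the class into __main__ before calling torch.load:)

    import __main__
    from recasepunc import WordpieceTokenizer  # assumption: defined there

    # pickle resolves classes by module path, so a checkpoint saved from
    # __main__ must be able to find the class there again when loading.
    __main__.WordpieceTokenizer = WordpieceTokenizer
    # ... then call torch.load(checkpoint_path, map_location='cpu') as before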

    opened by padmalcom 4
  • Can't do inference

    Hello, I'm trying to use example.py on a French model (fr.22000 or fr-txt.large.19000), but I get this error:

    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".

    I also tried with the following command, with the same error as output:

    python recasepunc.py predict fr.22000 < toto.txt > output.txt

    Do you have any advice? Thanks

    opened by MatFrancois 3
  • Memory usage

    Hi, on startup the punctuation app uses about 9 GB of RAM, but only briefly (while loading the model). After that it needs about 1.5 GB. Can we reduce the 9 GB at startup? Maybe the model is checked at startup, and that feature could be turned off?

    opened by gubri 1
  • Russian model doesn't work, while English does

    When I use the Russian model, it gives me this error:

    WARNING: reverting to cpu as cuda is not available
    Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
    
     File "C:\pypy\rus\recasepunc.py", line 741, in <module>
        main(config, config.action, config.action_args)
      File "C:\pypy\rus\recasepunc.py", line 715, in main
        generate_predictions(config, *args)
      File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
        for line in sys.stdin:
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
    
     File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
        return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)      
      File "C:\pypy\app.py", line 32, in process_audio
        cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.
    

    Sorry for the long message, I'm not sure which of these messages is the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with the English model.
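
    (Editorial guess, not from the thread: an invalid 0xEF byte at position 0 suggests the input file is not UTF-8, e.g. a legacy Windows encoding such as cp1251 for Russian text. Re-encoding the input to UTF-8 before piping it in may help; cp1251 below is an assumption, adjust to the file's real encoding:)

    # Sketch: re-encode a legacy-encoded input file to UTF-8.
    with open('input.txt', encoding='cp1251') as f:
        text = f.read()
    with open('input.utf8.txt', 'w', encoding='utf-8') as f:
        f.write(text)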

    opened by xenia19 0
  • Export model to be used in C++

    Is it possible to export the model to something that can be used in C++ via libtorch?

    1. export an existing model (a checkpoint provided in this repo)
    2. export a model after I train it with my own data

    Which of the options above is possible, or both?
    opened by leohuang2013 0
  • RuntimeError when predicting with the French models

    I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

    j'aime les fleurs les olives et la raclette

    When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

    Traceback (most recent call last):
      File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
        main(config, config.action, config.action_args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
        generate_predictions(config, *args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
        model.load_state_dict(loaded['model_state_dict'])
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".
    

    I tried the same with the English model, and it worked perfectly. Looks like something is broken with the French ones?
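
    (Editorial note, not part of the original thread: an unexpected "bert.position_ids" key usually points to a transformers version mismatch between training and loading. A common workaround, sketched here and untested against this repository, is to drop the stray key before loading:)

    import torch

    # Sketch: remove the version-specific buffer key before loading.
    # `model` is assumed to be constructed as in recasepunc.py.
    loaded = torch.load('fr.22000', map_location='cpu')
    state_dict = loaded['model_state_dict']
    state_dict.pop('bert.position_ids', None)
    model.load_state_dict(state_dict)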

    opened by maelchiotti 2
  • Parameters like --dab_rate can't be set from the command line because they are bool

    Look at the parameters below: they really became bool. I found this bug while debugging.

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
        ...
        parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
        parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
        parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
        parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
        config = Config(**parser.parse_args().__dict__)

        main(config, config.action, config.action_args)
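
    (Editorial note: argparse's type=bool converts any non-empty string to True, so numeric options declared this way cannot be set from the command line. A sketch of the straightforward fix is to declare them with numeric types:)

    parser.add_argument("--updates", help="number of training updates to perform",
                        default=default_config.updates, type=int)
    parser.add_argument("--period", help="validation period in updates",
                        default=default_config.period, type=int)
    parser.add_argument("--lr", help="learning rate",
                        default=default_config.lr, type=float)
    parser.add_argument("--dab-rate", help="drop at boundaries rate",
                        default=default_config.dab_rate, type=float)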

    opened by al-zatv 0
  • Cannot use trained model for validation or prediction

    Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on my language's dataset. Everything goes well during preprocessing and training, and the results are comparable with those for English and French (97-99% for case and punctuation).

    But when I try to use the trained model, it gives very poor results even for sentences from the training dataset. It works, sometimes it puts capital letters or dots, but this is rare and mostly the model can't handle the input. Also, when I try to evaluate the model with the command from the README (also tried on already-used validation sets, for instance with python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

    File "recasepunc.py", line 220, in batchify
        x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length)
    TypeError: unhashable type: 'slice'

    Sorry for pointing to two different problems in one issue, but I thought maybe it could be one common mistake behind both cases.

    opened by khusainovaidar 5