Transformation spoken text to written text

Last update: Dec 28, 2022

Related tags

Overview

Transformation spoken text to written text

This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, id, ...). It also supports formatting "out of vocab" by using external vocabulary.

Some of examples:

input  : tám giờ chín phút ngày mười tám tháng năm năm hai nghìn không trăm hai mươi hai
output : 8h9 18/5/2022

input  : mã số quy đê tê tê đê hai tám chéo hai không không ba
output : mã số qdttd28/2003

input  : thể tích tám mét khối trọng lượng năm mươi ki lô gam
output : thể tích 8 m3 trọng lượng 50 kg

input    : ngày hai tám tháng tư cô vít bùng phát ở sờ cốt lờn chiếm tám mươi phần trăm là biến chủng đen ta và bê ta
ex_vocab : ['scotland', 'covid', 'delta', 'beta']
output   : 28/4 covid bùng phát ở scotland chiếm 80 % là biến chủng delta và beta

Model architecture

Infer model

Play around at Huggingface Space

import torch
import model_handling
from data_handling import DataCollatorForNormSeq2Seq
from model_handling import EncoderDecoderSpokenNorm
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

Init tokenizer and model

tokenizer = model_handling.init_tokenizer()
model = EncoderDecoderSpokenNorm.from_pretrained('nguyenvulebinh/spoken-norm', cache_dir=model_handling.cache_dir)
data_collator = DataCollatorForNormSeq2Seq(tokenizer)

Infer sample

bias_list = ['scotland', 'covid', 'delta', 'beta']
input_str = 'ngày hai tám tháng tư cô vít bùng phát ở sờ cốt lờn chiếm tám mươi phần trăm là biến chủng đen ta và bê ta'

inputs = tokenizer([input_str])
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
if len(bias_list) > 0:
    bias = data_collator.encode_list_string(bias_list)
    bias_input_ids = bias['input_ids']
    bias_attention_mask = bias['attention_mask']
else:
    bias_input_ids = None
    bias_attention_mask = None

inputs = {
    "input_ids": torch.tensor(input_ids),
    "attention_mask": torch.tensor(attention_mask),
    "bias_input_ids": bias_input_ids,
    "bias_attention_mask": bias_attention_mask,
}

Format input text with bias phrases

outputs = model.generate(**inputs, output_attentions=True, num_beams=1, num_return_sequences=1)

for output in outputs.cpu().detach().numpy().tolist():
    # print('\n', tokenizer.decode(output, skip_special_tokens=True).split(), '\n')
    print(tokenizer.sp_model.DecodePieces(tokenizer.decode(output, skip_special_tokens=True).split()))

28/4 covid bùng phát ở scotland chiếm 80 % là biến chủng delta và beta

Format input text without bias phrases

outputs = model.generate(**{
    "input_ids": torch.tensor(input_ids),
    "attention_mask": torch.tensor(attention_mask),
    "bias_input_ids": None,
    "bias_attention_mask": None,
}, output_attentions=True, num_beams=1, num_return_sequences=1)

for output in outputs.cpu().detach().numpy().tolist():
    # print('\n', tokenizer.decode(output, skip_special_tokens=True).split(), '\n')
    print(tokenizer.sp_model.DecodePieces(tokenizer.decode(output, skip_special_tokens=True).split()))

28/4 cô vít bùng phát ở sờ cốt lờn chiếm 80 % là biến chủng đen ta và bê ta

Contact

[email protected]

Transformation spoken text to written text

Related tags

Overview

Transformation spoken text to written text

Model architecture

Infer model

Init tokenizer and model

Infer sample

Format input text with bias phrases

Format input text without bias phrases

Contact

Owner

Nguyen Binh

The Easy-to-use Dialogue Response Selection Toolkit for Researchers

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

Contact Extraction with Question Answering.

Meta learning algorithms to train cross-lingual NLI (multi-task) models

Official Stanford NLP Python Library for Many Human Languages

edge-SR: Super-Resolution For The Masses

This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

This program do translate english words to portuguese

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

A Structured Self-attentive Sentence Embedding

Large-scale Knowledge Graph Construction with Prompting

Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and pther libraries.

A natural language modeling framework based on PyTorch

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Utilities for preprocessing text for deep learning with Keras

Pipeline for fast building text classification TF-IDF + LogReg baselines.

뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)