Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Last update: Oct 14, 2022

Overview

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

This is an implementation of the paper, along with the pipeline and pretrained model using an open dataset. Audio samples of the paper is available here.

Recipe

This open pipeline uses the Databaker dataset. Please refer to our previous pipeline for dataset preprocessing, while only the Databaker dataset is used. Besides, you need to run lexicon/build_databaker.py to build the vocabulary, download the lexicon from zdic.net, and encode them with XLM-R. Feel free to change the target directory to save the data, which is specified in build_databaker.py and lexicon_utils.py.

Below are the commands to train and evaluate. Default target directories specified in the preprocessing scripts are used, so please substitute them with your own. The evaluation script can be run simultaneously with the training script. You may also use the evaluation script to synthesize samples from pretrained models. Please refer to the help of the arguments for their meanings.

python -m torch.distributed.launch --nproc_per_node=NGPU --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --training_languages=zh-cn --eval_languages=zh-cn --training_speakers=databaker --eval_steps=100000:150000 --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --external_embed=D:\free_corpus\packed\embed.zip --vocab=D:\free_corpus\packed\db_vocab.json

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --eval_languages=zh-cn --eval_meta=D:\free_corpus\packed\metadata.eval.txt --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --start_step=100000 --vocab=D:\free_corpus\packed\db_vocab.json --external_embed=D:\free_corpus\packed\embed.zip --eval_speakers=databaker

Besides, to report CER, you need to create azure_key.json with your own Azure STT subscription, with content of {"subscription": "YOUR_KEY", "region": "YOUR_REGION"}, see utils/transcribe.py. Due to significant differences of the datasets used, the implementation is for demonstration only and could not fully reproduce the results in the paper.

Pretrained Model

The pretrained models on Databaker are available at OneDrive Link, which reaches a CER of 4.19%. Relevant files necessary for generation of speeches including lexicon texts, lexicon embeddings, the vocabulary file, and evaluation scripts are also included to aid fast reproduction.

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Related tags

Overview

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Recipe

Pretrained Model

Owner

Mutian He

Official code for "Decoupling Zero-Shot Semantic Segmentation"

PPLNN is a Primitive Library for Neural Network is a high-performance deep-learning inference engine for efficient AI inferencing

Repository for the Bias Benchmark for QA dataset.

3D-Reconstruction 基于深度学习方法的单目多视图三维重建

Analyzes your GitHub Profile and presents you with a report on how likely you are to become the next MLH Fellow!

Baseline powergrid model for NY

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation

Attack classification models with transferability, black-box attack; unrestricted adversarial attacks on imagenet

Repo for 2021 SDD assessment task 2, by Felix, Anna, and James.

[SIGMETRICS 2022] One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

PyTorch implementation of the wavelet analysis from Torrence & Compo

ShuttleNet: Position-aware Fusion of Rally Progress and Player Styles for Stroke Forecasting in Badminton (AAAI'22)

Code for NeurIPS2021 submission "A Surrogate Objective Framework for Prediction+Programming with Soft Constraints"

Package for extracting emotions from social media text. Tailored for financial data.

1st place solution in CCF BDCI 2021 ULSEG challenge

TalkingHead-1KH is a talking-head dataset consisting of YouTube videos

The first machine learning framework that encourages learning ML concepts instead of memorizing class functions.

Official Implementation for HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

PyTorch implementation of the Crafting Better Contrastive Views for Siamese Representation Learning

[NeurIPS 2021] ORL: Unsupervised Object-Level Representation Learning from Scene Images