Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Last update: Dec 05, 2022

Overview

Yoonhyung Lee, Joongbo Shin, Kyomin Jung

Abstract: Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive architectures have several limitations: (1) They require a lot of time to generate a mel-spectrogram consisting of hundreds of steps. (2) The autoregressive speech generation shows a lack of robustness due to its error propagation property. In this paper, we propose a novel non-autoregressive TTS model called BVAE-TTS, which eliminates the architectural limitations and generates a mel-spectrogram in parallel. BVAE-TTS adopts a bidirectional-inference variational autoencoder (BVAE) that learns hierarchical latent representations using both bottom-up and top-down paths to increase its expressiveness. To apply BVAE to TTS, we design our model to utilize text information via an attention mechanism. By using attention maps that BVAE-TTS generates, we train a duration predictor so that the model uses the predicted duration of each phoneme at inference. In experiments conducted on LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality. Furthermore, our BVAE-TTS outperforms Glow-TTS, which is one of the state-of-the-art non-autoregressive TTS models, in terms of both speech quality and inference speed while having 58% fewer parameters.

One-sentence Summary: In this paper, a novel non-autoregressive text-to-speech model based on bidirectional-inference variational autoencoder called BVAE-TTS is proposed.
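The abstract mentions that attention maps are used to train a duration predictor. As a rough illustration only, the sketch below shows one common way to turn an attention map into per-phoneme durations (each mel frame is assigned to the phoneme it attends to most) and to expand phoneme-level states to frame level. This is an assumption about the general technique, not a confirmed description of what this repository does; the function names `durations_from_attention` and `expand_by_duration` are hypothetical.

```python
from collections import Counter


def durations_from_attention(attention, num_phonemes):
    """Count, for each phoneme, how many mel frames attend to it most strongly.

    attention[t][n] is the weight that mel frame t places on phoneme n.
    Assumption: hard argmax assignment, as used in several non-autoregressive
    TTS systems; the authors' exact extraction rule may differ.
    """
    votes = Counter(
        max(range(num_phonemes), key=lambda n: row[n]) for row in attention
    )
    return [votes.get(n, 0) for n in range(num_phonemes)]


def expand_by_duration(phoneme_states, durations):
    """Repeat each phoneme-level state for as many frames as its duration."""
    expanded = []
    for state, duration in zip(phoneme_states, durations):
        expanded.extend([state] * duration)
    return expanded


# Toy attention map: 5 mel frames over 3 phonemes (made-up numbers).
toy_attention = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
toy_durations = durations_from_attention(toy_attention, num_phonemes=3)
toy_frames = expand_by_duration(["p0", "p1", "p2"], toy_durations)
```

The durations always sum to the number of mel frames, so the expanded sequence has the same length as the target mel-spectrogram. At inference, durations predicted from text would take the place of those read off the attention map.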

Training

1. Download and extract the LJ Speech dataset.
2. Create a folder named preprocessed inside the LJSpeech directory, then preprocess the data by running prepare_data.ipynb.
3. Set data_path in hparams.py to the preprocessed folder.
4. Train your own BVAE-TTS model:

python train.py --gpu=0 --logdir=baseline
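The download and folder-creation steps above can be scripted. The following is a minimal sketch, not part of this repository: the download URL and the extracted folder name LJSpeech-1.1 are assumptions based on the public LJ Speech 1.1 release, and both helper functions are hypothetical. The preprocessing itself still has to be run through prepare_data.ipynb.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: public download location of the LJ Speech 1.1 archive.
LJSPEECH_URL = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"


def download_and_extract(target_dir, url=LJSPEECH_URL):
    """Download the archive into target_dir (if missing) and extract it."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "LJSpeech-1.1.tar.bz2"
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(target_dir)
    # Assumption: the archive unpacks into a top-level LJSpeech-1.1 folder.
    return target_dir / "LJSpeech-1.1"


def make_preprocessed_dir(ljspeech_dir):
    """Create the preprocessed folder inside the LJSpeech directory."""
    preprocessed = Path(ljspeech_dir) / "preprocessed"
    preprocessed.mkdir(parents=True, exist_ok=True)
    return preprocessed


# Example usage (downloads several GB, so it is left commented out):
# ljspeech_dir = download_and_extract("data")
# print(make_preprocessed_dir(ljspeech_dir))
```

The path returned by `make_preprocessed_dir` is the value to assign to data_path in hparams.py.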

Pre-trained models

We provide a pre-trained BVAE-TTS model, which matches what you would obtain by training with the current settings (e.g. hyperparameters, dataset split). We also provide the pre-trained WaveGlow model that was used to generate the audio samples. After downloading both models, you can generate audio samples using inference.ipynb.

Audio Samples

You can hear the audio samples here

Reference

1. NVIDIA/tacotron2: https://github.com/NVIDIA/tacotron2
2. NVIDIA/waveglow: https://github.com/NVIDIA/waveglow
3. pclucas14/iaf-vae: https://github.com/pclucas14/iaf-vae
