VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Last update: Jan 08, 2023

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
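
To make the one-to-many idea concrete, here is a toy sketch of how sampling durations can give the same text different rhythms. It is not the model's stochastic duration predictor: plain Gaussian noise in the log-duration domain stands in for the flow-based predictor, and the names `sample_durations`, `expand`, `noise_scale`, and `length_scale`, as well as the example values, are assumptions made for illustration.

```python
import math
import random


def sample_durations(log_means, noise_scale, length_scale=1.0, rng=None):
    """Draw one integer frame count per phoneme.

    Toy stand-in: Gaussian noise on the log-duration replaces a learned,
    flow-based stochastic duration predictor.
    """
    if rng is None:
        rng = random.Random()
    durations = []
    for log_mean in log_means:
        log_d = log_mean + noise_scale * rng.gauss(0.0, 1.0)
        # Round up and clamp so every phoneme occupies at least one frame.
        durations.append(max(1, math.ceil(math.exp(log_d) * length_scale)))
    return durations


def expand(phonemes, durations):
    """Repeat each phoneme for its sampled number of frames."""
    frames = []
    for phoneme, d in zip(phonemes, durations):
        frames.extend([phoneme] * d)
    return frames


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]   # hypothetical phoneme sequence
    log_means = [1.0, 1.5, 1.2, 2.0]     # hypothetical predicted log-durations
    for seed in (0, 1):
        d = sample_durations(log_means, noise_scale=0.8, rng=random.Random(seed))
        print(seed, d, len(expand(phonemes, d)))
```

With `noise_scale=0.0` the sketch becomes deterministic, while larger values spread the frame counts, which is the rhythm variation the paragraph above describes.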

Visit our demo for audio samples.

We also provide the pretrained models.

Figures: VITS at training; VITS at inference

Pre-requisites

1. Python >= 3.6
2. Clone this repository
3. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
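
The VCTK downsampling step in the list above is not scripted in this README. One possible way to do it is sketched below, assuming `ffmpeg` is installed and on `PATH`; the repository does not prescribe this tool. The helper names (`resample_command`, `plan_resampling`, `resample_tree`) are hypothetical, and the script takes the source and destination folders as command-line arguments.

```python
import subprocess
import sys
from pathlib import Path

TARGET_RATE = 22050  # sampling rate named in the dataset step above


def resample_command(src, dst, rate=TARGET_RATE):
    """Build an ffmpeg call that rewrites src at the given sampling rate."""
    # -y overwrites existing output, -ar sets the output audio sampling rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(rate), str(dst)]


def plan_resampling(src_root, dst_root):
    """Map every .wav under src_root to the same relative path under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    return [(wav, dst_root / wav.relative_to(src_root))
            for wav in sorted(src_root.rglob("*.wav"))]


def resample_tree(src_root, dst_root):
    """Run ffmpeg once per file, creating output folders as needed."""
    for src, dst in plan_resampling(src_root, dst_root):
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(resample_command(src, dst), check=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        resample_tree(sys.argv[1], sys.argv[2])
    else:
        print("usage: python resample.py SRC_WAV_DIR DST_WAV_DIR")
```

Keeping the relative paths intact means the output folder can be linked as DUMMY2 without renaming individual files.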

# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
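
Before preprocessing your own filelists, it can help to check that every audio path resolves through the DUMMY1/DUMMY2 links. The sketch below assumes each filelist line is pipe-separated with the audio path in the first field and the text in the field selected by `--text_index` (1 for the LJ Speech lists, 2 for the VCTK lists, as in the commands above). The function names and the sample entry in the comment are hypothetical.

```python
from pathlib import Path


def parse_filelist_line(line, text_index):
    """Split one pipe-separated entry into (audio_path, text)."""
    fields = line.rstrip("\n").split("|")
    if text_index >= len(fields):
        raise ValueError(f"expected at least {text_index + 1} fields: {line!r}")
    return fields[0], fields[text_index]


def missing_audio(lines, text_index, root="."):
    """Return audio paths from the filelist that do not exist under root."""
    missing = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        audio_path, _ = parse_filelist_line(line, text_index)
        if not (Path(root) / audio_path).exists():
            missing.append(audio_path)
    return missing


if __name__ == "__main__":
    # Hypothetical VCTK-style entry: path | speaker id | text
    sample = ["DUMMY2/p225/p225_001.wav|0|Please call Stella."]
    print(parse_filelist_line(sample[0], text_index=2))
    print(missing_audio(sample, text_index=2))
```

An empty list from `missing_audio` suggests the dataset links are in place; any returned path points to a broken link or a misplaced file.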

Training Example

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
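
The two flags in the commands above can be read as a JSON config path (`-c`) and a name for the training run (`-m`). The sketch below shows one way an entry point could handle such flags. It is a toy stand-in, not the code in train.py: the long option names, the `logs` output root, and the helper names are assumptions.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the two flags used in the training commands."""
    parser = argparse.ArgumentParser(description="Toy stand-in for a training entry point")
    parser.add_argument("-c", "--config", required=True, help="path to a JSON config file")
    parser.add_argument("-m", "--model", required=True, help="name of this training run")
    return parser.parse_args(argv)


def load_config(path):
    """Read hyperparameters from a JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_dir(model_name, root="logs"):
    """Folder for checkpoints and logs of one run (root is an assumption)."""
    return Path(root) / model_name


if __name__ == "__main__":
    args = parse_args(["-c", "configs/ljs_base.json", "-m", "ljs_base"])
    print(args.config, run_dir(args.model))
```

Giving each run its own name makes it possible to keep the LJ Speech and VCTK experiments side by side without their outputs colliding.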

Inference Example

See inference.ipynb
