Unofficial Pytorch Implementation of WaveGrad2

Last update: Nov 29, 2022

Overview

WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch+Lightning Implementation of Chen et al.(JHU, Google Brain), WaveGrad2.
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

More training for WaveGrad-Base setup
Checkpoint release
WaveGrad-Large Decoder
Inference by reduced sampling steps

Requirements

Pytorch
Pytorch-Lightning==1.2.10
The requirements are highlighted in requirements.txt.
We also provide docker setup Dockerfile.

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
etc.

We take LJSpeech as an example hereafter.

Preprocessing

Adjust preprocess.yaml, especially path section.

path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'

run prepare_align.py for some preparations.

python prepare_align.py -c preprocess.yaml

Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/.
After that, run preprocess.py.

python preprocess.py -c preprocess.yaml

Alternately, you can align the corpus by yourself.
Download the official MFA package and run it to align the corpus.

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

And then run preprocess.py.

python preprocess.py -c preprocess.yaml

Training

Adjust hparameter.yaml, especially train section.

train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0

If you want to train with other dataset, adjust data section in hparameter.yaml

data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir'
  lexicon_path: 'lexicon/librispeech-lexicon.txt'

run trainer.py

python trainer.py

If you want to resume training from checkpoint, check parser.

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--resume_from', type =int,\
	required = False, help = "Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action = "store_true",\
	required = False, help = "Significant change occured, use this")
parser.add_argument('-e', '--ema', action = "store_true",
	required = False, help = "Start from ema checkpoint")
args = parser.parse_args()

During training, tensorboard logger is logging loss, spectrogram and audio.

tensorboard --logdir=./tensorboard --bind_all

Inference

run inference.py

python inference.py -c <checkpoint_path> --text <'text'>

Or you can run inference.ipynb.

Checkpoint file will be released!

Note

Since this repo is unofficial implementation and WaveGrad2 paper do not provide several details, a slight differences between paper could exist.
We listed modifications or arbitrary setups

Normal LSTM without ZoneOut is applied for encoder.
g2p_en is applied instead of Google's unknown G2P.
Trained with LJSpeech datasdet instead of Google's proprietary dataset.
- Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
MT + SpecAug are not implemented.
hyperparameters
- train.batch_size: 12 for 2 A100 (40GB) GPUs
- train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
- train.decay learning rate decay is applied during training
- train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
- ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
- encoder.channel is reduced to 512 from 1024 or 2048
Current sample page only contains samples from WaveGrad-Base decoder.
TODO things.

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Seungu Han at MINDs Lab [email protected]
Junhyeok Lee at MINDs Lab [email protected]

Special thanks to

Kang-wook Kim at MINDs Lab
Wonbin Jung at MINDs Lab
Sang Hoon Woo at MINDs Lab

References

Chen et al., WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Chen et al.,WaveGrad: Estimating Gradients for Waveform Generation
Ho et al., Denoising Diffusion Probabilistic Models
Shen et al., Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This implementation uses code from following repositories:

The webpage for the audio samples uses a template from:

WaveGrad2 Official Github.io

The audio samples on our webpage(TBD) are partially derived from:

LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
WaveGrad2 Official Github.io

Unofficial Pytorch Implementation of WaveGrad2

Related tags

Overview

WaveGrad 2 — Unofficial PyTorch Implementation

TODO

Requirements

Datasets

Preprocessing

Training

Inference

Note

Tree

Author

References

Owner

MINDs Lab

Physics-Informed Neural Networks (PINN) and Deep BSDE Solvers of Differential Equations for Scientific Machine Learning (SciML) accelerated simulation

Group-Free 3D Object Detection via Transformers

Official implementation of SIGIR'2021 paper: "Sequential Recommendation with Graph Neural Networks".

Code for NAACL 2021 full paper "Efficient Attentions for Long Document Summarization"

Multispectral Object Detection with Yolov5

Road Crack Detection Using Deep Learning Methods

Video Matting via Consistency-Regularized Graph Neural Networks

Microscopy Image Cytometry Toolkit

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

The Malware Open-source Threat Intelligence Family dataset contains 3,095 disarmed PE malware samples from 454 families

PAWS 🐾 Predicting View-Assignments with Support Samples

Learning trajectory representations using self-supervision and programmatic supervision.

Multi-scale discriminator feature-wise loss function

Official implementation of Monocular Quasi-Dense 3D Object Tracking

imbalanced-DL: Deep Imbalanced Learning in Python

A Closer Look at Structured Pruning for Neural Network Compression

X-modaler is a versatile and high-performance codebase for cross-modal analytics.

Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation

PyTorch implementation of the Pose Residual Network (PRN)

Predicting Price of house by considering ,house age, Distance from public transport