Implementation of Google Brain's WaveGrad high-fidelity vocoder

Overview

WaveGrad

PyTorch implementation of Google Brain's high-fidelity WaveGrad vocoder (paper). The first implementation on GitHub with high-quality generation in 6 iterations.

Status

  • Documented API.
  • High-fidelity generation.
  • Multi-iteration inference support (stable for low iterations).
  • Stable and fast training with mixed-precision support.
  • Distributed training support.
  • Training also successfully runs on a single 12GB GPU with batch size 96.
  • CLI inference support.
  • Flexible architecture configuration for your own data.
  • Estimated RTF on popular GPU and CPU devices (see below).
  • Inference with 100 or fewer iterations is faster than real time on an RTX 2080 Ti. 6-iteration inference is faster than the one reported in the paper.
  • Parallel grid search for the best noise schedule.
  • Uploaded generated samples for different numbers of iterations (see the generated_samples folder).
  • Pretrained checkpoint on the 22 kHz LJSpeech dataset with noise schedules.

Real-time factor (RTF)

Number of parameters: 15,810,401

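| Model           | Stable | RTX 2080 Ti | Tesla K80 | Intel Xeon 2.3GHz* |
|-----------------|--------|-------------|-----------|--------------------|
| 1000 iterations | +      | 9.59        | -         | -                  |
| 100 iterations  | +      | 0.94        | 5.85      | -                  |
| 50 iterations   | +      | 0.45        | 2.92      | -                  |
| 25 iterations   | +      | 0.22        | 1.45      | -                  |
| 12 iterations   | +      | 0.10        | 0.69      | 4.55               |
| 6 iterations    | +      | 0.04        | 0.33      | 2.09               |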

*Note: Used an old version of Intel Xeon CPU.
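
For readers unfamiliar with the metric: the real-time factor is the time spent on synthesis divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below only illustrates that arithmetic. The function names and the timing value are made up for the example and are not part of this repository; the 22050 Hz default matches the LJSpeech sampling rate.

```python
def audio_duration_seconds(num_samples, sample_rate=22050):
    """Duration of a waveform in seconds."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    """RTF = synthesis time / audio duration (lower is faster)."""
    duration = audio_duration_seconds(num_samples, sample_rate)
    if duration <= 0:
        raise ValueError("audio must contain at least one sample")
    return synthesis_seconds / duration


# Hypothetical measurement: 0.4 s of compute for 10 s of audio.
rtf = real_time_factor(synthesis_seconds=0.4, num_samples=10 * 22050)
print(f"RTF = {rtf:.2f}")
```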


About

WaveGrad is a conditional model for waveform generation that estimates gradients of the data density and reaches sampling quality similar to WaveNet. This vocoder is neither a GAN, nor a normalizing flow, nor a classical autoregressive model. Its main concept is based on Denoising Diffusion Probabilistic Models (DDPM), which build on the Langevin dynamics and score matching frameworks. Furthermore, compared to classic DDPM, WaveGrad converges very quickly (6 iterations and probably fewer) with respect to the Langevin dynamics iterative sampling scheme.


Installation

  1. Clone this repo:
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
  2. Install requirements:
pip install -r requirements.txt

Training

1 Preparing data

  1. Make train and test filelists of your audio data like the ones included in the filelists folder.
  2. Make a configuration file* in the configs folder.

*Note: if you change hop_length for the STFT, make sure that the product of the upsampling factors in your config equals the new hop_length.
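
The constraint in the note can be checked before starting a training run. This is a minimal sketch, not code from this repository: the function name is hypothetical, and the factor list [5, 5, 3, 2, 2] is just an illustrative set whose product is 300. Read the actual values from your own config file.

```python
import math


def check_upsampling(factors, hop_length):
    """Raise if the product of the upsampling factors differs from hop_length."""
    total = math.prod(factors)
    if total != hop_length:
        raise ValueError(
            f"product of upsampling factors is {total}, "
            f"but hop_length is {hop_length}"
        )
    return total


# Illustrative values only: 5 * 5 * 3 * 2 * 2 = 300.
print(check_upsampling([5, 5, 3, 2, 2], hop_length=300))
```

If you change hop_length to, say, 256, a factor list such as [4, 4, 4, 2, 2] would satisfy the same check.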

2 Single and Distributed GPU training

  1. Open the runs/train.sh script and specify the visible GPU devices and the path to your configuration file. If you specify more than one GPU, training runs in distributed mode.
  2. Run sh runs/train.sh

3 Tensorboard and logging

To track your training process, run TensorBoard with tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER. All logging information and checkpoints are stored in logs/YOUR_LOGDIR_FOLDER. The logdir is specified in the config file.

4 Noise schedule grid search

Once the model is trained, run a grid search for the best schedule* for the number of iterations you need in notebooks/inference.ipynb. The code supports parallelism, so you can specify more than one job to accelerate the search.

*Note: grid search is necessary only for a small number of iterations (such as 6 or 7). For larger numbers, try the Fibonacci sequence initialization benchmark.fibonacci(...): I used it for 25 iterations and it works well. From a good 25-iteration schedule, for example, you can build a higher-order schedule by copying elements.

Noise schedules for pretrained model
  • 6-iteration schedule was obtained using grid search. Afterwards, starting from the obtained schedule, I found a slightly better approximation by hand.
  • 7-iteration schedule was obtained in the same way.
  • 12-iteration schedule was obtained in the same way.
  • 25-iteration schedule was obtained using the Fibonacci sequence benchmark.fibonacci(...).
  • 50-iteration schedule was obtained by repeating elements from the 25-iteration schedule.
  • 100-iteration schedule was obtained in the same way.
  • 1000-iteration schedule was obtained in the same way.
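
One possible reading of "repeating elements" is sketched below: every value of a shorter schedule is duplicated in place, so a 25-element schedule becomes a 50-element one. This is an assumption about the procedure, not the author's exact code. The helper name is hypothetical, and the 25 placeholder values are not the pretrained schedule.

```python
def repeat_schedule(betas, times):
    """Build a longer schedule by repeating each element `times` times in place."""
    if times < 1:
        raise ValueError("times must be a positive integer")
    return [beta for beta in betas for _ in range(times)]


# Placeholder 25-element schedule (NOT the schedule shipped with the checkpoint).
betas_25 = [1e-6 + i * 4e-4 for i in range(25)]
betas_50 = repeat_schedule(betas_25, times=2)
print(len(betas_25), len(betas_50))
```

Whether a repeated schedule needs further adjustment is best judged by listening to the output, since quality depends strongly on the betas.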

Inference

CLI

Put your mel-spectrograms in some folder. Make a filelist. Then run this command with your own arguments:

sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"
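
If you need to build the mel filelist, the following sketch writes one path per line, assuming the same layout as the files in the filelists folder. The function name, the directory names and the "*.pt" pattern are assumptions for illustration; use whatever extension your mel-spectrogram files actually have.

```python
from pathlib import Path


def write_filelist(mel_dir, filelist_path, pattern="*.pt"):
    """Write the sorted paths of all files matching `pattern` to a filelist."""
    paths = sorted(str(p) for p in Path(mel_dir).glob(pattern))
    text = "\n".join(paths) + ("\n" if paths else "")
    Path(filelist_path).write_text(text)
    return paths


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own folders.
    written = write_filelist("mels", "mels_filelist.txt")
    print(f"wrote {len(written)} paths")
```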

Jupyter Notebook

More inference details are provided in notebooks/inference.ipynb. There you can also find how to set a noise schedule for the model and how to run a grid search for the best schedule.


Other

Generated audios

Examples of generated audio are provided in the generated_samples folder. The quality degradation between 1000-iteration and 6-iteration inference is not noticeable once the best schedule for the latter has been found.

Pretrained checkpoints

You can find a pretrained checkpoint file* trained on LJSpeech (22 kHz) via this Google Drive link.

*Note: the uploaded checkpoint is a dict with a single key 'model'.


Important details, issues and comments

  • During training, WaveGrad uses a default noise schedule with 1000 iterations and linearly spaced betas in the range (1e-6, 0.01). For inference you can set another schedule with fewer iterations. Tune the betas carefully: the output quality depends heavily on them.
  • By default the model runs in mixed precision. The batch size is reduced compared to the paper (256 -> 96), since the authors trained their model on TPU.
  • After ~10k training iterations (1-2 hours) on a single GPU, the model generates good audio with 50-iteration inference. Total training time is about 1-2 days (until full convergence).
  • At some point training might become unstable (the loss explodes), so I have introduced learning rate (LR) scheduling and gradient clipping. If the loss explodes on your data, try decreasing the LR scheduler gamma a bit. It should help.
  • By default the hop length of your STFT is 300 (and so is the total upsampling factor). Other values are not tested, but you can try them. Remember that the total upsampling factor must still equal your new hop length.
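
To make the default training schedule from the first bullet concrete, here is a minimal pure-Python sketch of 1000 linearly spaced betas between 1e-6 and 0.01, together with the cumulative noise level sqrt(prod(1 - beta)) used in DDPM-style models. Treating both endpoints as included is an assumption, and the function names are hypothetical rather than this repository's API.

```python
import math


def linear_betas(n_iter=1000, start=1e-6, end=0.01):
    """Linearly spaced betas from `start` to `end`, endpoints included (assumed)."""
    if n_iter < 2:
        raise ValueError("n_iter must be at least 2")
    step = (end - start) / (n_iter - 1)
    return [start + i * step for i in range(n_iter)]


def noise_levels(betas):
    """Square root of the running product of (1 - beta) at each iteration."""
    levels, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        levels.append(math.sqrt(running))
    return levels


betas = linear_betas()
levels = noise_levels(betas)
print(len(betas), betas[0], round(levels[-1], 3))
```

The noise level starts close to 1 (almost clean signal) and decays towards 0 as iterations accumulate, which is why a shorter inference schedule needs carefully chosen betas to cover the same range in fewer steps.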

History of updates

  • (NEW: 10/24/2020) Huge update. Distributed training and mixed-precision support. More correct positional encoding. CLI support for inference. Parallel grid search. Model size significantly decreased.
  • New RTF info for NVIDIA Tesla K80 GPU card (popular in Google Colab service) and CPU Intel Xeon 2.3GHz.
  • Huge update. New well-generated 6-iteration sample. New noise schedule setting API. Added grid search code for the best schedule.
  • Improved training by introducing smarter learning rate scheduler. Obtained high-fidelity synthesis.
  • Stable training and multi-iteration inference. 6-iteration noise scheduling is supported.
  • Stable training and fixed-iteration inference with significant background static noise left. All positional encoding issues are solved.
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Found that the linear scaling (C=5000 from the paper) of positional encoding was missing (bug).
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Fixed positional encoding downscaling. Parallel segment sampling is replaced by full-mel sampling.
  • (RELEASE, first on GitHub). Parallel segment sampling and broken positional encoding downscaling. Bad quality with clicks from concatenation from parallel-segment generation.

Owner
Ivan Vovk