Implementation of Google Brain's WaveGrad high-fidelity vocoder

Overview

alt-text-1

WaveGrad

Implementation (PyTorch) of Google Brain's high-fidelity WaveGrad vocoder (paper). First implementation on GitHub with high-quality generation for 6-iterations.

Status

  • Documented API.
  • High-fidelity generation.
  • Multi-iteration inference support (stable for low iterations).
  • Stable and fast training with mixed-precision support.
  • Distributed training support.
  • Training also successfully runs on a single 12GB GPU with batch size 96.
  • CLI inference support.
  • Flexible architecture configuration for your own data.
  • Estimated RTF on popular GPU and CPU devices (see below).
  • 100- and lower-iteration inferences are faster than real-time on RTX 2080 Ti. 6-iteration inference is faster than one reported in the paper.
  • Parallel grid search for the best noise schedule.
  • Uploaded generated samples for different number of iterations (see generated_samples folder).
  • Pretrained checkpoint on 22KHz LJSpeech dataset with noise schedules.

Real-time factor (RTF)

Number of parameters: 15.810.401

Model Stable RTX 2080 Ti Tesla K80 Intel Xeon 2.3GHz*
1000 iterations + 9.59 - -
100 iterations + 0.94 5.85 -
50 iterations + 0.45 2.92 -
25 iterations + 0.22 1.45 -
12 iterations + 0.10 0.69 4.55
6 iterations + 0.04 0.33 2.09

*Note: Used an old version of Intel Xeon CPU.


About

WaveGrad is a conditional model for waveform generation through estimating gradients of the data density with WaveNet-similar sampling quality. This vocoder is neither GAN, nor Normalizing Flow, nor classical autoregressive model. The main concept of vocoder is based on Denoising Diffusion Probabilistic Models (DDPM), which utilize Langevin dynamics and score matching frameworks. Furthemore, comparing to classic DDPM, WaveGrad achieves super-fast convergence (6 iterations and probably lower) w.r.t. Langevin dynamics iterative sampling scheme.


Installation

  1. Clone this repo:
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
  1. Install requirements:
pip install -r requirements.txt

Training

1 Preparing data

  1. Make train and test filelists of your audio data like ones included into filelists folder.
  2. Make a configuration file* in configs folder.

*Note: if you are going to change hop_length for STFT, then make sure that the product of your upsampling factors in config is equal to your new hop_length.

2 Single and Distributed GPU training

  1. Open runs/train.sh script and specify visible GPU devices and path to your configuration file. If you specify more than one GPU the training will run in distributed mode.
  2. Run sh runs/train.sh

3 Tensorboard and logging

To track your training process run tensorboard by tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER. All logging information and checkpoints will be stored in logs/YOUR_LOGDIR_FOLDER. logdir is specified in config file.

4 Noise schedule grid search

Once model is trained, grid search for the best schedule* for a needed number of iterations in notebooks/inference.ipynb. The code supports parallelism, so you can specify more than one number of jobs to accelerate the search.

*Note: grid search is necessary just for a small number of iterations (like 6 or 7). For larger number just try Fibonacci sequence benchmark.fibonacci(...) initialization: I used it for 25 iteration and it works well. From good 25-iteration schedule, for example, you can build a higher-order schedule by copying elements.

Noise schedules for pretrained model
  • 6-iteration schedule was obtained using grid search. After, based on obtained scheme, by hand, I found a slightly better approximation.
  • 7-iteration schedule was obtained in the same way.
  • 12-iteration schedule was obtained in the same way.
  • 25-iteration schedule was obtained using Fibonacci sequence benchmark.fibonacci(...).
  • 50-iteration schedule was obtained by repeating elements from 25-iteration scheme.
  • 100-iteration schedule was obtained in the same way.
  • 1000-iteration schedule was obtained in the same way.

Inference

CLI

Put your mel-spectrograms in some folder. Make a filelist. Then run this command with your own arguments:

sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"

Jupyter Notebook

More inference details are provided in notebooks/inference.ipynb. There you can also find how to set a noise schedule for the model and make grid search for the best scheme.


Other

Generated audios

Examples of generated audios are provided in generated_samples folder. Quality degradation between 1000-iteration and 6-iteration inferences is not noticeable if found the best schedule for the latter.

Pretrained checkpoints

You can find a pretrained checkpoint file* on LJSpeech (22KHz) via this Google Drive link.

*Note: uploaded checkpoint is a dict with a single key 'model'.


Important details, issues and comments

  • During training WaveGrad uses a default noise schedule with 1000 iterations and linear scale betas from range (1e-6, 0.01). For inference you can set another schedule with less iterations. Tune betas carefully, the output quality really highly depends on it.
  • By default model runs in a mixed-precision way. Batch size is modified compared to the paper (256 -> 96) since authors trained their model on TPU.
  • After ~10k training iterations (1-2 hours) on a single GPU the model performs good generation for 50-iteration inference. Total training time is about 1-2 days (for absolute convergence).
  • At some point training might start to behave weird and crazy (loss explodes), so I have introduced learning rate (LR) scheduling and gradient clipping. If loss explodes for your data, then try to decrease LR scheduler gamma a bit. It should help.
  • By default hop length of your STFT is equal 300 (thus total upsampling factor). Other cases are not tested, but you can try. Remember, that total upsampling factor should be still equal to your new hop length.

History of updates

  • (NEW: 10/24/2020) Huge update. Distributed training and mixed-precision support. More correct positional encoding. CLI support for inference. Parallel grid search. Model size significantly decreased.
  • New RTF info for NVIDIA Tesla K80 GPU card (popular in Google Colab service) and CPU Intel Xeon 2.3GHz.
  • Huge update. New 6-iteration well generated sample example. New noise schedule setting API. Added the best schedule grid search code.
  • Improved training by introducing smarter learning rate scheduler. Obtained high-fidelity synthesis.
  • Stable training and multi-iteration inference. 6-iteration noise scheduling is supported.
  • Stable training and fixed-iteration inference with significant background static noise left. All positional encoding issues are solved.
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Found no linear scaling (C=5000 from paper) of positional encoding (bug).
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Fixed positional encoding downscaling. Parallel segment sampling is replaced by full-mel sampling.
  • (RELEASE, first on GitHub). Parallel segment sampling and broken positional encoding downscaling. Bad quality with clicks from concatenation from parallel-segment generation.

References

Owner
Ivan Vovk
• Mathematics • Machine Learning • Speech technologies
Ivan Vovk
A modern pure-Python library for reading PDF files

pdf A modern pure-Python library for reading PDF files. The goal is to have a modern interface to handle PDF files which is consistent with itself and

6 Apr 06, 2022
buildseg is a building extraction plugin of QGIS based on PaddlePaddle.

buildseg buildseg is a Building Extraction plugin for QGIS based on PaddlePaddle. How to use Download and install QGIS and clone the repo : git clone

39 Dec 09, 2022
Official implementation of "StyleCariGAN: Caricature Generation via StyleGAN Feature Map Modulation" (SIGGRAPH 2021)

StyleCariGAN: Caricature Generation via StyleGAN Feature Map Modulation This repository contains the official PyTorch implementation of the following

Wonjong Jang 270 Dec 30, 2022
[ACL 20] Probing Linguistic Features of Sentence-level Representations in Neural Relation Extraction

REval Table of Contents Introduction Overview Requirements Installation Probing Usage Citation License 🎓 Introduction REval is a simple framework for

13 Jan 06, 2023
Deep Illuminator is a data augmentation tool designed for image relighting. It can be used to easily and efficiently generate a wide range of illumination variants of a single image.

Deep Illuminator Deep Illuminator is a data augmentation tool designed for image relighting. It can be used to easily and efficiently generate a wide

George Chogovadze 52 Nov 29, 2022
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30

Aiden Nibali 25 Jun 20, 2021
The repository offers the official implementation of our BMVC 2021 paper in PyTorch.

CrossMLP Cascaded Cross MLP-Mixer GANs for Cross-View Image Translation Bin Ren1, Hao Tang2, Nicu Sebe1. 1University of Trento, Italy, 2ETH, Switzerla

Bingoren 16 Jul 27, 2022
Official implementation of VaxNeRF (Voxel-Accelearated NeRF).

VaxNeRF Paper | Google Colab This is the official implementation of VaxNeRF (Voxel-Accelearated NeRF). VaxNeRF provides very fast training and slightl

naruya 132 Nov 21, 2022
Face and other object detection using OpenCV and ML Yolo

Object-and-Face-Detection-Using-Yolo- Opencv and YOLO object and face detection is implemented. You only look once (YOLO) is a state-of-the-art, real-

Happy N. Monday 3 Feb 15, 2022
A simple, clean TensorFlow implementation of Generative Adversarial Networks with a focus on modeling illustrations.

IllustrationGAN A simple, clean TensorFlow implementation of Generative Adversarial Networks with a focus on modeling illustrations. Generated Images

268 Nov 27, 2022
Official Repository for the paper "Improving Baselines in the Wild".

iWildCam and FMoW baselines (WILDS) This repository was originally forked from the official repository of WILDS datasets (commit 7e103ed) For general

Kazuki Irie 3 Nov 24, 2022
Benchmark for the generalization of 3D machine learning models across different remeshing/samplings of a surface.

Discretization Robust Correspondence Benchmark One challenge of machine learning on 3D surfaces is that there are many different representations/sampl

Nicholas Sharp 10 Sep 30, 2022
AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation

AtlasNet [Project Page] [Paper] [Talk] AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation Thibault Groueix, Matthew Fisher, Vladimir

577 Dec 17, 2022
Implementation of QuickDraw - an online game developed by Google, combined with AirGesture - a simple gesture recognition application

QuickDraw - AirGesture Introduction Here is my python source code for QuickDraw - an online game developed by google, combined with AirGesture - a sim

Viet Nguyen 89 Dec 18, 2022
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
Python-based Informatics Kit for Analysing Chemical Units

INSTALLATION Python-based Informatics Kit for the Analysis of Chemical Units Step 1: Make a conda environment: conda create -n pikachu python=3.9 cond

47 Dec 23, 2022
Configure SRX interfaces with Scrapli

Configure SRX interfaces with Scrapli Overview This example will show how to configure interfaces on Juniper's SRX firewalls. In addition to the Pytho

Calvin Remsburg 1 Jan 07, 2022
Exposure Time Calculator (ETC) and radial velocity precision estimator for the Near InfraRed Planet Searcher (NIRPS) spectrograph

NIRPS-ETC Exposure Time Calculator (ETC) and radial velocity precision estimator for the Near InfraRed Planet Searcher (NIRPS) spectrograph February 2

Nolan Grieves 2 Sep 15, 2022
Simple cross-platform application for DaVinci surgical video frame annotation

About DaVid is a simple cross-platform GUI for annotating robotic and endoscopic surgical actions for use in deep-learning research. Features Simple a

Cyril Zakka 4 Oct 09, 2021
Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World

Legged Robots that Keep on Learning Official codebase for Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World, whic

Laura Smith 70 Dec 07, 2022