BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Related tags

Deep Learningbddm
Overview

Bilateral Denoising Diffusion Models (BDDMs)

GitHub Stars visitors arXiv demo

This is the official PyTorch implementation of the following paper:

BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).

Paper: Published at ICLR 2022 on OpenReview

BDDM

This implementation supports model training and audio generation, and also provides the pre-trained models for the benchmark LJSpeech and VCTK dataset.

Visit our demo page for audio samples.

Updates:

  • May 20, 2021: Released our follow-up work FastDiff on GitHub, where we futher optimized the speed-and-quality trade-off.
  • May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
  • May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
  • March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.

Recipes:

  • (Option 1) To train the BDDM scheduling network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
  • (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl . (Start from Step III.)
  • (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)

Getting Started

We provide an example of how you can generate high-fidelity samples using BDDMs.

To try BDDM on your own dataset, simply clone this repo in your local machine provided with NVIDIA GPU + CUDA cuDNN and follow the below intructions.

Dependencies

Step I. Data Preparation and Configuraion

Download the LJSpeech dataset.

For training, we first need to setup a file conf.yml for configuring the data loader, the score and the schedule networks, the training procedure, the noise scheduling and sampling parameters.

Note: Appropriately modify the paths in "train_data_dir" and "valid_data_dir" for training; and the path in "gen_data_dir" for sampling. All dir paths should be link to a directory that store the waveform audios (in .wav) or the Mel-spectrogram files (in .mel).

Step II. Training a Schedule Network

Suppose that a well-trained score network (theta) is stored at $theta_path, we start by modifying "load": $theta_path in conf.yml.

After modifying the relevant hyperparameters for a schedule network (especially "tau"), we can train the schedule network (f_phi in paper) using:

# Training on device 0
sh train.sh 0 conf.yml

Note: In practice, we found that 10K training steps would be enough to obtain a promising scheduling network. This normally takes no more than half an hour for training with one GPU.

Step III. Searching for Noise Schedules

Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).

First, we set "load" in conf.yml to the path of the trained BDDM.

After setting the maximum number of sampling steps in scheduling ("N"), we run:

# Scheduling on device 0
sh schedule.sh 0 conf.yml

Step IV. Evaluation or Generation

For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test set of audios (in .wav).

For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrogram (by default in .mel generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).

Then, we run:

# Generation/evaluation on device 0 (only support single-GPU scheduling)
sh generate.sh 0 conf.yml

Acknowledgements

This implementation uses parts of the code from the following Github repos:
Tacotron2
DiffWave-Vocoder
as described in our code.

Citations

@inproceedings{lam2022bddm,
  title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
  author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

License

Copyright 2022 Tencent

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an officially supported Tencent product.

Owner
Research repositories.
Code repository for the work "Multi-Domain Incremental Learning for Semantic Segmentation", accepted at WACV 2022

Multi-Domain Incremental Learning for Semantic Segmentation This is the Pytorch implementation of our work "Multi-Domain Incremental Learning for Sema

Pgxo20 24 Jan 02, 2023
FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks

FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks This is our implementation for the paper: FinGAT: A Financial Graph At

Yu-Che Tsai 64 Dec 13, 2022
AntroPy: entropy and complexity of (EEG) time-series in Python

AntroPy is a Python 3 package providing several time-efficient algorithms for computing the complexity of time-series. It can be used for example to e

Raphael Vallat 153 Dec 27, 2022
Lite-HRNet: A Lightweight High-Resolution Network

LiteHRNet Benchmark πŸ”₯ πŸ”₯ Based on MMsegmentation πŸ”₯ πŸ”₯ Cityscapes FCN resize concat config mIoU last mAcc last eval last mIoU best mAcc best eval bes

16 Dec 12, 2022
Programming with Neural Surrogates of Programs

Programming with Neural Surrogates of Programs

0 Dec 12, 2021
Python scripts form performing stereo depth estimation using the CoEx model in ONNX.

ONNX-CoEx-Stereo-Depth-estimation Python scripts form performing stereo depth estimation using the CoEx model in ONNX. Stereo depth estimation on the

Ibai Gorordo 8 Dec 29, 2022
A toy project using OpenCV and PyMunk

A toy project using OpenCV, PyMunk and Mediapipe the source code for my LindkedIn post It's just a toy project and I didn't write a documentation yet,

Amirabbas Asadi 82 Oct 28, 2022
EsViT: Efficient self-supervised Vision Transformers

Efficient Self-Supervised Vision Transformers (EsViT) PyTorch implementation for EsViT, built with two techniques: A multi-stage Transformer architect

Microsoft 352 Dec 25, 2022
AI-generated-characters for Learning and Wellbeing

AI-generated-characters for Learning and Wellbeing Click here for the full project page. This repository contains the source code for the paper AI-gen

MIT Media Lab 214 Jan 01, 2023
Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure.

Event Queue Dialect Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure. Motivation The m

Cornell Capra 23 Dec 08, 2022
This project contains an implemented version of Face Detection using OpenCV and Mediapipe. This is a code snippet and can be used in projects.

Live-Face-Detection Project Description: In this project, we will be using the live video feed from the camera to detect Faces. It will also detect so

Hassan Shahzad 3 Oct 02, 2021
Make your master artistic punk avatar through machine learning world famous paintings.

Master-art-punk Make your master artistic punk avatar through machine learning world famous paintings. ι€šθΏ‡ζœΊε™¨ε­¦δΉ δΈ–η•Œεη”»εˆΆδ½œε±žδΊŽδ½ ηš„ε€§εΈˆηΊ§θ‰Ίζœ―ζœ‹ε…‹ε€΄εƒ Nowadays, NFT is beco

Philipjhc 53 Dec 27, 2022
A toy compiler that can convert Python scripts to pickle bytecode πŸ₯’

Pickora 🐰 A small compiler that can convert Python scripts to pickle bytecode. Requirements Python 3.8+ No third-party modules are required. Usage us

κŒ—α–˜κ’’κ€€κ“„κ’’κ€€κˆ€κŸ 68 Jan 04, 2023
Deep Learning Interviews book: Hundreds of fully solved job interview questions from a wide range of key topics in AI.

This book was written for you: an aspiring data scientist with a quantitative background, facing down the gauntlet of the interview process in an increasingly competitive field. For most of you, the

4.1k Dec 28, 2022
[NeurIPS'21] Projected GANs Converge Faster

[Project] [PDF] [Supplementary] [Talk] This repository contains the code for our NeurIPS 2021 paper "Projected GANs Converge Faster" by Axel Sauer, Ka

798 Jan 04, 2023
PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"

Transparency-by-Design networks (TbD-nets) This repository contains code for replicating the experiments and visualizations from the paper Transparenc

David Mascharka 351 Nov 18, 2022
Deep Networks with Recurrent Layer Aggregation

RLA-Net: Recurrent Layer Aggregation Recurrence along Depth: Deep Networks with Recurrent Layer Aggregation This is an implementation of RLA-Net (acce

Joy Fang 21 Aug 16, 2022
This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

Dasha 4 Oct 17, 2021
Learning to Segment Instances in Videos with Spatial Propagation Network

Learning to Segment Instances in Videos with Spatial Propagation Network This paper is available at the 2017 DAVIS Challenge website. Check our result

Jingchun Cheng 145 Sep 28, 2022
Deep learning based hand gesture recognition using LSTM and MediaPipie.

Hand Gesture Recognition Deep learning based hand gesture recognition using LSTM and MediaPipie. Demo video using PingPong Robot Files Pretrained mode

Brad 24 Nov 11, 2022