BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Related tags

Deep Learningbddm
Overview

Bilateral Denoising Diffusion Models (BDDMs)

GitHub Stars visitors arXiv demo

This is the official PyTorch implementation of the following paper:

BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).

Paper: Published at ICLR 2022 on OpenReview

BDDM

This implementation supports model training and audio generation, and also provides the pre-trained models for the benchmark LJSpeech and VCTK dataset.

Visit our demo page for audio samples.

Updates:

  • May 20, 2021: Released our follow-up work FastDiff on GitHub, where we futher optimized the speed-and-quality trade-off.
  • May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
  • May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
  • March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.

Recipes:

  • (Option 1) To train the BDDM scheduling network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
  • (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl . (Start from Step III.)
  • (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)

Getting Started

We provide an example of how you can generate high-fidelity samples using BDDMs.

To try BDDM on your own dataset, simply clone this repo in your local machine provided with NVIDIA GPU + CUDA cuDNN and follow the below intructions.

Dependencies

Step I. Data Preparation and Configuraion

Download the LJSpeech dataset.

For training, we first need to setup a file conf.yml for configuring the data loader, the score and the schedule networks, the training procedure, the noise scheduling and sampling parameters.

Note: Appropriately modify the paths in "train_data_dir" and "valid_data_dir" for training; and the path in "gen_data_dir" for sampling. All dir paths should be link to a directory that store the waveform audios (in .wav) or the Mel-spectrogram files (in .mel).

Step II. Training a Schedule Network

Suppose that a well-trained score network (theta) is stored at $theta_path, we start by modifying "load": $theta_path in conf.yml.

After modifying the relevant hyperparameters for a schedule network (especially "tau"), we can train the schedule network (f_phi in paper) using:

# Training on device 0
sh train.sh 0 conf.yml

Note: In practice, we found that 10K training steps would be enough to obtain a promising scheduling network. This normally takes no more than half an hour for training with one GPU.

Step III. Searching for Noise Schedules

Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).

First, we set "load" in conf.yml to the path of the trained BDDM.

After setting the maximum number of sampling steps in scheduling ("N"), we run:

# Scheduling on device 0
sh schedule.sh 0 conf.yml

Step IV. Evaluation or Generation

For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test set of audios (in .wav).

For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrogram (by default in .mel generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).

Then, we run:

# Generation/evaluation on device 0 (only support single-GPU scheduling)
sh generate.sh 0 conf.yml

Acknowledgements

This implementation uses parts of the code from the following Github repos:
Tacotron2
DiffWave-Vocoder
as described in our code.

Citations

@inproceedings{lam2022bddm,
  title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
  author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

License

Copyright 2022 Tencent

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an officially supported Tencent product.

Owner
Research repositories.
This is an official implementation for "DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation"

DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation This repo is the official implementation of "DeciWatch: A Simple Baseline for

117 Dec 24, 2022
Open source person re-identification library in python

Open-ReID Open-ReID is a lightweight library of person re-identification for research purpose. It aims to provide a uniform interface for different da

Tong Xiao 1.3k Jan 01, 2023
Meta-Learning Sparse Implicit Neural Representations (NeurIPS 2021)

Meta-SparseINR Official PyTorch implementation of "Meta-learning Sparse Implicit Neural Representations" (NeurIPS 2021) by Jaeho Lee*, Jihoon Tack*, N

Jaeho Lee 41 Nov 10, 2022
This is the pytorch implementation of the paper - Axiomatic Attribution for Deep Networks.

Integrated Gradients This is the pytorch implementation of "Axiomatic Attribution for Deep Networks". The original tensorflow version could be found h

Tianhong Dai 150 Dec 23, 2022
Multi-layer convolutional LSTM with Pytorch

Convolution_LSTM_pytorch Thanks for your attention. I haven't got time to maintain this repo for a long time. I recommend this repo which provides an

Zijie Zhuang 733 Dec 30, 2022
Detector for Log4Shell exploitation attempts

log4shell-detector Detector for Log4Shell exploitation attempts Idea The problem with the log4j CVE-2021-44228 exploitation is that the string can be

Florian Roth 729 Dec 25, 2022
A strongly-typed genetic programming framework for Python

monkeys "If an army of monkeys were strumming on typewriters they might write all the books in the British Museum." monkeys is a framework designed to

H. Chase Stevens 115 Nov 27, 2022
Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF shows significant improvements over baseline fine-tuning without data filtration.

Information Gain Filtration Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF sho

4 Jul 28, 2022
🔎 Monitor deep learning model training and hardware usage from your mobile phone 📱

Monitor deep learning model training and hardware usage from mobile. 🔥 Features Monitor running experiments from mobile phone (or laptop) Monitor har

labml.ai 1.2k Dec 25, 2022
Implementation of the Triangle Multiplicative module, used in Alphafold2 as an efficient way to mix rows or columns of a 2d feature map, as a standalone package for Pytorch

Triangle Multiplicative Module - Pytorch Implementation of the Triangle Multiplicative module, used in Alphafold2 as an efficient way to mix rows or c

Phil Wang 22 Oct 28, 2022
classification task on dataset-CIFAR10,by using Tensorflow/keras

CIFAR10-Tensorflow classification task on dataset-CIFAR10,by using Tensorflow/keras 在这一个库中,我使用Tensorflow与keras框架搭建了几个卷积神经网络模型,针对CIFAR10数据集进行了训练与测试。分别使

3 Oct 17, 2021
Hand Gesture Volume Control is AIML based project which uses image processing to control the volume of your Computer.

Hand Gesture Volume Control Modules There are basically three modules Handtracking Program Handtracking Module Volume Control Program Handtracking Pro

VITTAL 1 Jan 12, 2022
Official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution"

RealBasicVSR [Paper] This is the official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution, arXiv". This repository contain

Kelvin C.K. Chan 566 Dec 28, 2022
Weakly-supervised semantic image segmentation with CNNs using point supervision

Code for our ECCV paper What's the Point: Semantic Segmentation with Point Supervision. Summary This library is a custom build of Caffe for semantic i

27 Sep 14, 2022
[ICCV 2021] Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain

Amplitude-Phase Recombination (ICCV'21) Official PyTorch implementation of "Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neur

Guangyao Chen 53 Oct 05, 2022
A lane detection integrated Real-time Instance Segmentation based on YOLACT (You Only Look At CoefficienTs)

Real-time Instance Segmentation and Lane Detection This is a lane detection integrated Real-time Instance Segmentation based on YOLACT (You Only Look

Jin 4 Dec 30, 2022
Codes for “A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection”

DSAMNet The pytorch implementation for "A Deeply-supervised Attention Metric-based Network and an Open Aerial Image Dataset for Remote Sensing Change

Mengxi Liu 41 Dec 14, 2022
A clean and scalable template to kickstart your deep learning project 🚀 ⚡ 🔥

Lightning-Hydra-Template A clean and scalable template to kickstart your deep learning project 🚀 ⚡ 🔥 Click on Use this template to initialize new re

Hyunsoo Cho 1 Dec 20, 2021
Transformer part of 12th place solution in Riiid! Answer Correctness Prediction

kaggle_riiid Transformer part of 12th place solution in Riiid! Answer Correctness Prediction. Please see here for more information. Execution You need

Sakami Kosuke 2 Apr 23, 2022
[NeurIPS 2021] Official implementation of paper "Learning to Simulate Self-driven Particles System with Coordinated Policy Optimization".

Code for Coordinated Policy Optimization Webpage | Code | Paper | Talk (English) | Talk (Chinese) Hi there! This is the source code of the paper “Lear

DeciForce: Crossroads of Machine Perception and Autonomy 81 Dec 19, 2022