BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Bilateral Denoising Diffusion Models (BDDMs)

This is the official PyTorch implementation of the following paper:

BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).
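
The speed-up factors quoted in the abstract can be checked against the ratio of sampling steps. The sketch below assumes 1000 steps for WaveGrad and 200 steps for DiffWave; those baseline step counts are not stated in this README, so treat them as assumptions.

```python
# Baseline step counts are assumptions (not stated in this README).
BDDM_STEPS = 7
BASELINE_STEPS = {"WaveGrad": 1000, "DiffWave": 200}


def step_ratio(baseline_steps: int, bddm_steps: int = BDDM_STEPS) -> float:
    """Ratio of network evaluations per sample: baseline vs. BDDM."""
    return baseline_steps / bddm_steps


# Speed-up per baseline, rounded to one decimal place.
speedups = {name: round(step_ratio(steps), 1) for name, steps in BASELINE_STEPS.items()}
print(speedups)
```

Under these assumptions the ratios agree with the quoted 143x and 28.6x once rounded, which suggests the speed-ups are counted in sampling steps rather than wall-clock time.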

Paper: Published at ICLR 2022 on OpenReview

This implementation supports model training and audio generation, and also provides pre-trained models for the benchmark LJSpeech and VCTK datasets.

Visit our demo page for audio samples.

Updates:

  • May 20, 2021: Released our follow-up work FastDiff on GitHub, where we further optimized the speed-and-quality trade-off.
  • May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
  • May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
  • March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.

Recipes:

  • (Option 1) To train the BDDM scheduling network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
  • (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl. (Start from Step III.)
  • (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)

Getting Started

We provide an example of how you can generate high-fidelity samples using BDDMs.

To try BDDM on your own dataset, clone this repo on a local machine equipped with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.

Dependencies

Step I. Data Preparation and Configuration

Download the LJSpeech dataset.

For training, we first need to set up a conf.yml file that configures the data loader, the score and schedule networks, the training procedure, and the noise scheduling and sampling parameters.

Note: Modify the paths in "train_data_dir" and "valid_data_dir" for training, and the path in "gen_data_dir" for sampling. Each path should point to a directory that stores either waveform audio files (.wav) or Mel-spectrogram files (.mel).
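
Before launching a run, it can help to confirm that the configured directories exist and actually contain .wav or .mel files. The following is a minimal sketch of such a check, not part of this repo: only the three key names come from the note above, while the helper `check_data_dirs`, the plain-dict view of the configuration, and the choice to look at top-level files only are assumptions made for illustration.

```python
from pathlib import Path

# Key names come from the note above; the rest of this helper is hypothetical.
DATA_DIR_KEYS = ("train_data_dir", "valid_data_dir", "gen_data_dir")
AUDIO_SUFFIXES = {".wav", ".mel"}


def check_data_dirs(conf: dict) -> dict:
    """Count the .wav/.mel files found directly inside each configured directory.

    Keys missing from conf are skipped. Raises FileNotFoundError if a
    configured path is not an existing directory.
    """
    counts = {}
    for key in DATA_DIR_KEYS:
        if key not in conf:
            continue
        directory = Path(conf[key])
        if not directory.is_dir():
            raise FileNotFoundError(f"{key}: {directory} is not a directory")
        # Assumption: only top-level files are counted (no recursion).
        counts[key] = sum(
            1 for p in directory.iterdir() if p.suffix.lower() in AUDIO_SUFFIXES
        )
    return counts
```

A count of zero for any key is a hint that the path points at the wrong directory or that the files use a different extension.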

Step II. Training a Schedule Network

Suppose that a well-trained score network (theta) is stored at $theta_path. We start by setting "load": $theta_path in conf.yml.

After modifying the relevant hyperparameters for a schedule network (especially "tau"), we can train the schedule network (f_phi in paper) using:

# Training on device 0
sh train.sh 0 conf.yml

Note: In practice, we found that 10K training steps are enough to obtain a promising scheduling network. This normally takes no more than half an hour on one GPU.

Step III. Searching for Noise Schedules

Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).

First, we set "load" in conf.yml to the path of the trained BDDM.

After setting the maximum number of sampling steps in scheduling ("N"), we run:

# Scheduling on device 0
sh schedule.sh 0 conf.yml
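
To make the quality-versus-speed trade-off concrete, the sketch below uses the standard DDPM bookkeeping for a noise schedule: a list of per-step variances beta_t, whose length is the number of sampling steps, and the cumulative product alpha_bar_t = prod(1 - beta_s) that gives the remaining signal level. This is generic diffusion-model arithmetic and not the BDDM search algorithm itself; the function names and the example values are hypothetical.

```python
import math


def cumulative_signal(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for each step t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        if not 0.0 < beta < 1.0:
            raise ValueError("each beta must lie strictly between 0 and 1")
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_std(alpha_bar):
    """Standard deviation of the noise in x_t under q(x_t | x_0)."""
    return math.sqrt(1.0 - alpha_bar)


# Hypothetical 3-step schedule for illustration only (not a searched schedule).
example_betas = [0.01, 0.1, 0.5]
alpha_bars = cumulative_signal(example_betas)
print(len(example_betas), "sampling steps, final noise std:",
      round(noise_std(alpha_bars[-1]), 3))
```

A shorter schedule means fewer score-network evaluations per sample, so the search has to choose each beta carefully to keep quality high with few steps.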

Step IV. Evaluation or Generation

For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test-set audio files (in .wav).

For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrograms (by default in .mel format, generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).

Then, we run:

# Generation/evaluation on device 0 (only supports single-GPU scheduling)
sh generate.sh 0 conf.yml
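
The three recipe options differ only in where they enter the pipeline, which can be summarized as a small lookup. The sketch below lists the commands each option would run, using the script names and argument order from Steps II to IV. The helper `commands_for_option` is hypothetical and only prints commands: conf.yml (for example the "load" entry) still has to be edited by hand between steps, as described above.

```python
# Script names and argument order follow the commands shown in Steps II-IV.
STEP_SCRIPTS = [("II", "train.sh"), ("III", "schedule.sh"), ("IV", "generate.sh")]

# First scripted step for each recipe option. Option 1 starts at Step I,
# which is manual configuration, so its first script belongs to Step II.
FIRST_SCRIPT_STEP = {1: "II", 2: "III", 3: "IV"}


def commands_for_option(option, device=0, conf="conf.yml"):
    """Return the shell commands (as argument lists) for a recipe option."""
    step_names = [name for name, _ in STEP_SCRIPTS]
    first = step_names.index(FIRST_SCRIPT_STEP[option])
    return [["sh", script, str(device), conf] for _, script in STEP_SCRIPTS[first:]]


for option in (1, 2, 3):
    print(f"Option {option}:")
    for cmd in commands_for_option(option):
        print("  " + " ".join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to a process launcher later, once the configuration edits between steps are handled.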

Acknowledgements

This implementation uses parts of the code from the following GitHub repos, as described in our code:

  • Tacotron2
  • DiffWave-Vocoder

Citations

@inproceedings{lam2022bddm,
  title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
  author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

License

Copyright 2022 Tencent

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an officially supported Tencent product.
