BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Bilateral Denoising Diffusion Models (BDDMs)

This is the official PyTorch implementation of the following paper:

BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).

Paper: Published at ICLR 2022 on OpenReview

This implementation supports model training and audio generation, and also provides pre-trained models for the benchmark LJSpeech and VCTK datasets.

Visit our demo page for audio samples.

Updates:

  • May 20, 2021: Released our follow-up work FastDiff on GitHub, where we further optimized the speed-and-quality trade-off.
  • May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
  • May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
  • March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.

Recipes:

  • (Option 1) To train the BDDM schedule network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
  • (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl . (Start from Step III.)
  • (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)

Getting Started

We provide an example of how you can generate high-fidelity samples using BDDMs.

To try BDDM on your own dataset, clone this repo onto a local machine equipped with an NVIDIA GPU, CUDA, and cuDNN, and follow the instructions below.

Dependencies

Step I. Data Preparation and Configuration

Download the LJSpeech dataset.

For training, we first need to set up a file conf.yml that configures the data loader, the score and schedule networks, the training procedure, and the noise scheduling and sampling parameters.

Note: Modify the paths in "train_data_dir" and "valid_data_dir" for training, and the path in "gen_data_dir" for sampling. Each path should point to a directory that stores the waveform audio files (in .wav) or the Mel-spectrogram files (in .mel).
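
Before launching a run, it can help to confirm that each configured directory exists and actually holds .wav or .mel files. The following is a minimal sketch, not part of this repo: the helper names count_data_files and check_conf_dirs are hypothetical, and it assumes the files sit directly inside each directory (the repo's loader may or may not search subdirectories).

```python
from pathlib import Path

# File extensions named in the note above; the helpers below are illustrative only.
DATA_EXTS = {".wav", ".mel"}


def count_data_files(data_dir):
    """Count the .wav/.mel files located directly inside data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{data_dir!r} is not a directory")
    return sum(
        1 for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in DATA_EXTS
    )


def check_conf_dirs(paths):
    """Map each config key to its file count; raise if a directory has no data."""
    counts = {}
    for key, data_dir in paths.items():
        n = count_data_files(data_dir)
        if n == 0:
            raise ValueError(f"{key}: no .wav or .mel files found in {data_dir!r}")
        counts[key] = n
    return counts


# Example call (the paths are placeholders, replace them with your own):
# check_conf_dirs({
#     "train_data_dir": "/path/to/train",
#     "valid_data_dir": "/path/to/valid",
#     "gen_data_dir": "/path/to/gen",
# })
```

The check fails fast on a missing or empty directory, which is cheaper than discovering the problem after the training script has started.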

Step II. Training a Schedule Network

Suppose that a well-trained score network (theta) is stored at $theta_path. We start by setting "load": $theta_path in conf.yml.

After modifying the relevant hyperparameters for a schedule network (especially "tau"), we can train the schedule network (f_phi in paper) using:

# Training on device 0
sh train.sh 0 conf.yml

Note: In practice, we found that about 10K training steps are enough to obtain a promising schedule network. This normally takes no more than half an hour on one GPU.

Step III. Searching for Noise Schedules

Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).

First, we set "load" in conf.yml to the path of the trained BDDM.

After setting the maximum number of sampling steps in scheduling ("N"), we run:

# Scheduling on device 0
sh schedule.sh 0 conf.yml
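
To see why the schedule matters when the number of steps is small, the sketch below uses the standard DDPM convention, in which a noise schedule is a list of per-step variances beta_t and the remaining signal level after t steps is the cumulative product of (1 - beta_s). This is an assumption made for illustration: this README does not specify the format of the searched schedules, and the linear schedule and its numeric range are placeholder values, not a BDDM result.

```python
def cumulative_signal_levels(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    levels = []
    alpha_bar = 1.0
    for beta in betas:
        if not 0.0 < beta < 1.0:
            raise ValueError("each beta must lie strictly between 0 and 1")
        alpha_bar *= 1.0 - beta
        levels.append(alpha_bar)
    return levels


def linear_schedule(n_steps, beta_start, beta_end):
    """Evenly spaced betas: a common hand-designed baseline, not a searched schedule."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if n_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


# Illustrative values only: the same beta range spread over many vs. few steps.
long_levels = cumulative_signal_levels(linear_schedule(200, 1e-4, 0.02))
short_levels = cumulative_signal_levels(linear_schedule(7, 1e-4, 0.02))
print("final signal level, 200 steps:", long_levels[-1])
print("final signal level,   7 steps:", short_levels[-1])
```

Reusing the same beta range over only a few steps leaves far more signal at the final step, so a short schedule cannot simply be a subsampled long one. Choosing the step sizes for a small N is the problem the scheduling step addresses.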

Step IV. Evaluation or Generation

For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test-set audio files (in .wav).

For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrogram files (by default, .mel files generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).

Then, we run:

# Generation/evaluation on device 0 (only single-GPU scheduling is supported)
sh generate.sh 0 conf.yml

Acknowledgements

This implementation uses parts of the code from the following GitHub repos, as described in our code:

  • Tacotron2
  • DiffWave-Vocoder

Citations

@inproceedings{lam2022bddm,
  title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
  author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

License

Copyright 2022 Tencent

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an officially supported Tencent product.
