PyTorch Implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension)

Last update: Jan 02, 2023

Overview

DiffSinger - PyTorch Implementation

PyTorch implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension).

Status (2021.06.03)

Naive Version of DiffSinger
Shallow Diffusion Mechanism: Training boundary predictor by leveraging pre-trained auxiliary decoder + Training denoiser using k as a maximum time step
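
The idea of training the denoiser with k as the maximum time step can be sketched as follows. This is a minimal, framework-free illustration and not the repository's code: the linear noise schedule, the values of T and k, and all function names are assumptions chosen for the example.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.06):
    """Linearly spaced noise schedule (illustrative values, not the repo's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Diffuse clean values x0 to step t: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    a = alpha_bars[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


def sample_training_step(k, rng):
    """Shallow diffusion: draw t from the first k steps only, not all T."""
    return rng.randrange(k)


T, k = 100, 54  # hypothetical total steps and boundary step
alpha_bars = cumulative_alphas(linear_betas(T))
rng = random.Random(0)
t = sample_training_step(k, rng)
noisy = q_sample([0.5, -0.2, 0.1], t, alpha_bars, rng)
```

Restricting t to the first k steps means the denoiser only ever sees lightly to moderately noised inputs, which is the regime it operates in when starting from the auxiliary decoder's output rather than from pure noise.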

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/LJSpeech/.

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 160000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 160000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.
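
If you call the script from other Python code, the two modes can be wrapped in a small helper. This is a sketch only: build_synthesize_cmd and CONFIG_DIR are hypothetical names, and the helper uses just the flags shown in the commands above.

```python
import shlex

CONFIG_DIR = "config/LJSpeech"  # assumed location, taken from the commands above


def build_synthesize_cmd(mode, restore_step, text=None, source=None):
    """Build the synthesize.py argument list for single or batch mode."""
    if mode not in ("single", "batch"):
        raise ValueError("mode must be 'single' or 'batch'")
    cmd = ["python3", "synthesize.py"]
    if mode == "single":
        if text is None:
            raise ValueError("single mode needs text")
        cmd += ["--text", text]
    else:
        if source is None:
            raise ValueError("batch mode needs a source file")
        cmd += ["--source", source]
    cmd += [
        "--restore_step", str(restore_step),
        "--mode", mode,
        "-p", f"{CONFIG_DIR}/preprocess.yaml",
        "-m", f"{CONFIG_DIR}/model.yaml",
        "-t", f"{CONFIG_DIR}/train.yaml",
    ]
    return cmd


cmd = build_synthesize_cmd(
    "batch", 160000, source="preprocessed_data/LJSpeech/val.txt"
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```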

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can shorten the phoneme durations by 20% (a faster speaking rate) and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 160000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
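
To make the effect of the ratios concrete, here is a small sketch of how a duration ratio changes the number of output frames. It assumes the ratios act as multiplicative factors on the predicted values, as in FastSpeech2-style variance adaptors; the function names and the sample numbers are hypothetical and not taken from this repository.

```python
def apply_duration_control(durations, ratio):
    """Scale per-phoneme frame counts by `ratio` and round, never below zero."""
    return [max(0, round(d * ratio)) for d in durations]


def length_regulate(phonemes, durations):
    """Repeat each phoneme once per frame it should occupy."""
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


phonemes = ["HH", "AH", "L", "OW"]
durations = [5, 10, 5, 10]  # hypothetical predicted frame counts

faster = apply_duration_control(durations, 0.8)
original_frames = length_regulate(phonemes, durations)
faster_frames = length_regulate(phonemes, faster)

# Energy control is assumed to scale the predicted energy the same way.
quieter = [e * 0.8 for e in [1.0, 2.0]]  # hypothetical energy values
```

With a ratio of 0.8 the utterance occupies fewer frames, so the same phonemes are spoken in less time.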

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
(more will be added)

Preprocessing

First, run

python3 prepare_align.py config/LJSpeech/preprocess.yaml

for some preparations.

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech dataset are provided here from ming024's FastSpeech2. You have to unzip the files into preprocessed_data/LJSpeech/TextGrid/.

After that, run the preprocessing script by

python3 preprocess.py config/LJSpeech/preprocess.yaml

Alternatively, you can align the corpus yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

to align the corpus and then run the preprocessing script.

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Implementation Issues

Pitch extractor comparison (on LJ001-0006.wav)

pyworld is used to extract f0 (fundamental frequency) as pitch information in this implementation. Empirically, however, I found that all three methods in the comparison figures were equally acceptable for clean datasets (e.g., LJSpeech). Note that pysptk would work better for noisy datasets (as described in STYLER).
Stack two layers of FFTBlock for the lyrics encoder (text encoder).
(Naive version) The number of learnable parameters is 34.337M, which is larger than that of the original paper (26.744M). The diffusion module accounts for a significant portion of the total parameters.
I did not remove the energy prediction of FastSpeech2 since it is not critical to the model training or performance (as described in LightSpeech). It should be easily removed without any performance degradation.
Use HiFi-GAN instead of Parallel WaveGAN (PWG) for vocoding.

Citation

@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}

References

Authors' codebase
ming024's FastSpeech2 (Later than 2021.02.26 ver.)
hojonathanho's diffusion
lmnt-com's diffwave

Comments

Training Error

In this case, I ran the script

python3 train.py -p config/vietnam/preprocess.yaml -m config/vietnam/model.yaml -t config/vietnam/train.yaml

and got the following error:

File "train.py", line 199, in <module>
main(args, configs)
File "train.py", line 85, in main
losses = Loss(batch, output)
File "/home/thanhdo/envs/diffsinger_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/thanhdo/Documents/DiffSinger/model/loss.py", line 69, in forward
log_duration_targets = log_duration_targets.masked_select(src_masks)
RuntimeError: The size of tensor a (39) must match the size of tensor b (136) at non-singleton dimension 1
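
The error message says the duration targets (39) and the source mask (136) disagree along the sequence dimension. One possible cause, offered here only as an assumption and not as a diagnosis of this report, is that some utterances in a custom dataset have a different number of phonemes than duration entries after preprocessing. A plain-Python check along these lines could narrow that down; find_length_mismatches and the sample data are hypothetical.

```python
def find_length_mismatches(samples):
    """Return ids whose phoneme count differs from their duration count."""
    return [
        sample_id
        for sample_id, phonemes, durations in samples
        if len(phonemes) != len(durations)
    ]


# Hypothetical (id, phonemes, durations) triples loaded from preprocessed data.
samples = [
    ("utt_ok", ["HH", "AH"], [3, 4]),
    ("utt_bad", ["L", "OW", "sp"], [5, 6]),
]
mismatched = find_length_mismatches(samples)
print(mismatched)
```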

opened by thanhdo99 8
diffusion_projection in ResidualBlock

Your implementation has diffusion_projection for every residual block similar to DiffWave, but this is inconsistent with the paper as the original architecture directly adds E_t (output of the step embedding module) to the input before the first convolution layer. Is there a reason behind this change?

opened by tebin 1

Releases

v0.1.0 (Jun 4, 2021)

Owner

Keon Lee
