Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Last update: Dec 27, 2022

Overview

Parallel Tacotron2

Updates

2021.05.15: Implementation is done and sanity checks on training and inference have been run, but the model still cannot converge.

Contributions are welcome! Please let me know if you find any mistakes in my implementation or have any advice on training the model successfully. See the Implementation Issues section.

Training

Requirements

You can install the Python dependencies with
```
pip3 install -r requirements.txt
```
In addition, install fairseq (see its official documentation and GitHub repository) to use LConvBlock.

Datasets

The supported datasets:

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
(More will be added.)

Preprocessing

After downloading the datasets, set the corpus_path in preprocess.yaml and run the preparation script:

```
python3 prepare_data.py config/LJSpeech/preprocess.yaml
```

Then, run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

Training

Train your model with

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The model cannot converge yet. I'm still debugging, and any contribution that speeds this up is very welcome!

TensorBoard

Use

```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.

Implementation Issues

Overall, normalization and activation layers that are not suggested in the original paper are added where needed to prevent NaN values (and NaN gradients) in the forward and backward passes.

Text Encoder

Use the FFTBlock of FastSpeech2 for the transformer block of the text encoder.
Use dropout 0.2 for the ConvBlock of the text encoder.
To replace the paper's "proprietary normalization engine":
- Apply the same text normalization as in FastSpeech2.
- Implement a grapheme_to_phoneme function (see ./text/__init__.py).

Residual Encoder

Use an 80-channel mel-spectrogram instead of a 128-bin one.
A regular sinusoidal positional embedding is used at the frame level instead of the combination of three positional embeddings in Parallel Tacotron. As the model depends entirely on unsupervised learning for the position, this choice may be one reason the model fails to converge.
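
For readers unfamiliar with the "regular" sinusoidal embedding mentioned above, here is a minimal sketch of the standard formulation from Attention Is All You Need, written in plain Python rather than PyTorch. The function name `sinusoidal_position_table` and its arguments are hypothetical and not taken from this repository; the actual implementation may differ in details such as padding handling.

```python
import math

def sinusoidal_position_table(n_positions, d_model):
    """Build an n_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            # Channels 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            # Even channels use sine, odd channels use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# One row per mel frame; each row would be added to that frame's features.
table = sinusoidal_position_table(n_positions=4, d_model=6)
```

Because the table depends only on the frame index, it carries no token-level or speaker-level position information, which is the difference from the three combined embeddings described in Parallel Tacotron.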

Duration Predictor & Learned Upsampling (The most important but ambiguous part)

Use log durations with the prior that there should be at least one frame in total per sequence.
Use nn.SiLU() for the swish activation.
When obtaining W and C, S, E, and V are concatenated after V is broadcast along the frame (T) domain. As the detailed process is not described in the original paper, this choice may be one reason the model fails to converge.
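
To make S and E concrete, the sketch below computes them from per-token durations, assuming the definitions as I read them in the paper: token k starts at the cumulative duration of the tokens before it and ends one duration later, with S[t][k] = t - start_k and E[t][k] = end_k - t for frames t = 1..T. The helper names are hypothetical and the code is plain Python, not this repository's implementation; the projection of S, E, and V into W and C is deliberately left out because the paper does not specify it in detail.

```python
import math

def durations_from_log(log_durations):
    # Predicting in the log domain keeps every duration strictly positive.
    return [math.exp(x) for x in log_durations]

def token_boundary_grids(durations):
    """Return (S, E, n_frames) for a single sequence of token durations."""
    starts, ends, total = [], [], 0.0
    for d in durations:
        starts.append(total)   # start of token k = sum of earlier durations
        total += d
        ends.append(total)     # end of token k = start + its own duration
    # Prior from the text above: at least one frame in total per sequence.
    n_frames = max(1, round(total))
    frames = range(1, n_frames + 1)
    S = [[t - s for s in starts] for t in frames]  # distance from token start
    E = [[e - t for e in ends] for t in frames]    # distance to token end
    return S, E, n_frames

S, E, n_frames = token_boundary_grids([2.0, 3.0])
```

Each of S and E has shape T x K (frames by tokens), which is why V, with one vector per token, has to be broadcast along the frame axis before the three can be concatenated.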

Decoder

Use (Multi-head) Self-attention and LConvBlock.
The mel-spectrogram of each iteration is projected by a linear layer.
Apply nn.Tanh() to each LConvBlock output (following the activation pattern of the decoder in FastSpeech2).

Loss

Use the optimizer and scheduler of FastSpeech2 (which come from Attention Is All You Need, as described in the original paper).
The soft-DTW is based on pytorch-softdtw-cuda (post).
1. Implement a customized soft-DTW in model/soft_dtw_cuda.py, reflecting the recursion suggested in the original paper.
2. The original soft-DTW does not assume that it is used as the final loss, so only E is computed. Because it is used as a loss function here, a Jacobian product is added to return the target derivative of R w.r.t. the input X.
3. Currently, the maximum batch size is 6 on a 24 GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
  - In the original paper, a custom differentiable diagonal band operation was implemented to address the O(T^2) complexity, but this has not been explored in the current implementation yet.
For stability, mel-spectrograms are compressed by a sigmoid function before the soft-DTW. If the sigmoid is removed, the soft-DTW value becomes too large and produces NaN in the backward pass.
A guided attention loss is applied for fast convergence of the attention module in the residual encoder.
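
The forward recursion of soft-DTW can be sketched in a few lines of plain Python. This is an illustration only, not the CUDA kernel in model/soft_dtw_cuda.py: it assumes an L1 frame distance and a warp penalty added to the non-diagonal moves, which is how I read the recursion in the paper. The function names and the default values of `gamma` and `warp` are illustrative assumptions.

```python
import math

def l1_distance(x, y):
    # Manhattan distance between two mel frames.
    return sum(abs(a - b) for a, b in zip(x, y))

def soft_min(values, gamma):
    # Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), shifted for stability.
    m = min(values)
    if math.isinf(m):
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(pred, target, gamma=0.05, warp=256.0):
    """Soft-DTW value between two sequences of frames (forward pass only)."""
    n, m = len(pred), len(target)
    inf = float("inf")
    # r[i][j] holds the soft alignment cost of pred[:i] against target[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = l1_distance(pred[i - 1], target[j - 1])
            # Diagonal moves are free; vertical and horizontal moves pay `warp`.
            r[i][j] = cost + soft_min(
                [r[i - 1][j - 1], r[i - 1][j] + warp, r[i][j - 1] + warp], gamma
            )
    return r[n][m]

a = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
same = soft_dtw(a, a)
shifted = soft_dtw(a, b)
```

The table `r` has (N + 1) x (M + 1) entries per utterance, and the backward pass needs a second table of the same size, which illustrates why memory grows quadratically with mel length and limits the batch size.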

Citation

@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}

References

ming024's FastSpeech2 (Later than 2021.02.26 ver.)
Parallel Tacotron: Non-Autoregressive and Controllable TTS
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Comments

LightWeightConv layer warnings during training

If you install only the specified requirements plus Pillow and fairseq, the following warning appears when training starts:

No module named 'lightconv_cuda'

If you install lightconv-layer from fairseq, the following warning is displayed:

WARNING: Unsupported filter length passed - skipping forward pass

PyTorch 1.7, CUDA 10.2, fairseq 1.0.0a0+19793a7

opened by idcore 10
Suggestion for adding open German "Thorsten" dataset

Hi.

According to the text in the README ("will be added more"), I would like to suggest adding my open German "Thorsten" dataset.

Thorsten: a single-speaker German open dataset consisting of 22,668 short audio clips of a male speaker, approximately 23 hours in total (LJSpeech file/directory syntax).

https://github.com/thorstenMueller/deep-learning-german-tts/

opened by thorstenMueller 4
Soft DTW with Cython implementation

Hi @keonlee9420, have you tried the Cython version of Soft DTW from this repo?

https://github.com/mblondel/soft-dtw

Can it be applied to Parallel Tacotron 2? I am trying that repo because the current batch size is too small when using the CUDA implementation by @Maghoumi.

I just wonder that @Maghoumi in https://github.com/Maghoumi/pytorch-softdtw-cuda claims that experiment with batch size

But when it is applied to Parallel Tacotron, the batch size is too small. Is there any gap?

opened by v-nhandt21 2
Handle audios with long duration
When I load audio whose mel-spectrogram has more frames than the maximum mel sequence length (1000 frames):

There is a problem when concatenating pos + speaker + mels, so I tried setting max_seq_len larger (1500).

This then leads to a problem with Soft DTW, which reports that the maximum is 1024.

As a workaround, I tried trimming the mels to fit 1024, but it seems complicated, so for now I filter out all audio with more than 1024 frames.

Any suggestions for handling long audio? I wonder how it works at inference time.
opened by v-nhandt21 2
cannot import name II from omegaconf
Great work. But I encounter one problem when training this model :( The error message:

ImportError: cannot import name II from omegaconf

The version of fairseq is 0.10.2 (latest release version) and omegaconf is 1.4.1. How can I fix it?

Thank you
opened by cnlinxi 2
It seems it cannot run

I followed your commands to run the code, but I get the following error:

  File "train.py", line 87, in main
    output = model(*(batch[2:]))
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [1, 474, 80], but expected [1, 302, 80]

opened by yangdongchao 2
fix mask and soft-dtw loss

1. Fix the mask problem when calculating W in the LearnUpsampling module and the attention matrix in the VaribleLengthAttention module.
2. Add a new Jacobian matrix of the Manhattan distance.
3. Deal with mel-spectrograms of different lengths.

opened by zhang-wy15 1
Why doesn't the LConv block have a stride argument?

Hi, thanks for the implementation.

I think Parallel Tacotron 2 uses the same residual encoder as Parallel Tacotron 1, which uses five 17 × 1 LConv blocks interleaved with strided 3 × 1 convolutions.

But in your implementation, LConvBlock doesn't have a stride argument. How did you handle this part?

Thanks.

opened by yw0nam 0
Soft DTW

Hello, has anybody been able to train with the soft-DTW loss? It doesn't converge at all. I think there is a problem with the implementation, but I couldn't spot it. When I train with the real alignments, it works well.

opened by talipturkmen 0
weights required

Can someone share the weights file link? I couldn't synthesize it or use its inference. If I am wrong please tell me the correct method of using it. Thanks

opened by mrqasimasif 0
Why no alignment at all?
I cloned the code, prepared data according to README, and just updated:

ljspeech data path in config/LJSpeech/train.yaml

unzip generator_LJSpeech.pth.tar.zip to get generator_LJSpeech.pth.tar and the code can run! But, no matter how many steps I trained, the images are always like this and demo audio sounds like noise:
opened by mikesun4096 2

training problem

  File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/parallel_tacotron2.py", line 68, in forward
    self.learned_upsampling(durations, V, src_lens, src_masks, max_src_len)
  File "/home/huangjiahong.dracu/miniconda2/envs/parallel_tc2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/modules.py", line 335, in forward
    mel_mask = get_mask_from_lengths(mel_len, max_mel_len)
  File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/utils/tools.py", line 87, in get_mask_from_lengths
    ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
RuntimeError: upper bound and larger bound inconsistent with step sign

Thank you for your work. I got the above problem when training. I guess it's a duration prediction problem. How can I solve it?

opened by aijianiula0601 0

Could you please share your audio samples, pretrained models and loss curves?

Hi, Thanks for your excellent work! Could you possibly share your audio samples, pretrained models and loss curves with me? Thanks so much for your help!

opened by CocoWang1010 0
fix in implementation of S-DTW backward @taras-sereda
Hey, I've found that in your implementation of the S-DTW backward pass, the E matrices are not used; instead you are using the G matrices, and their entries ignore the scaling factors a, b, c. What's the reason for this? My guess is that you are doing this in order to preserve and propagate gradients, because they vanish due to the small values of a, b, c. But I might be wrong, so I'd be glad to hear your motivation for doing this.

Playing with your code, I also found that gradients are vanishing, especially when bandwidth=None. So I'm solving this problem by normalizing the distance matrix by n_mel_channel. With this normalization and an exact implementation of the S-DTW backward pass, I'm able to converge on overfit experiments more quickly than with the non-exact computation of the S-DTW backward pass. I'm using these S-DTW hparams:

gamma = 0.05
warp = 256
bandwidth = 50

here is a small test I'm using for checks:

```
import numpy as np
import torch
from torch.optim import Adam

# Excerpt from a class method, hence `self` below.
target_spectro = np.load('')
target_spectro = torch.from_numpy(target_spectro)
target_spectro = target_spectro.unsqueeze(0).cuda()
pred_spectro = torch.randn_like(target_spectro, requires_grad=True)

optimizer = Adam([pred_spectro])

# model fits in ~3k iterations
n_iter = 4_000
for i in range(n_iter):
    loss = self.numba_soft_dtw(pred_spectro, target_spectro)
    loss = loss / pred_spectro.size(1)
    loss.backward()
    if i % 1_000 == 0:
        print(f'iter: {i}, loss: {loss.item():.6f}')
        print(f'd_loss_pred {pred_spectro.grad.mean()}')
    optimizer.step()
    optimizer.zero_grad()
```

Curious to hear how your training is going! Best. Taras
opened by taras-sereda 1

Releases(v0.1.0)

v0.1.0(May 16, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Keon Lee
