STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Last update: Dec 12, 2022

Overview

Keon Lee, Kyumin Park, Daeyoung Kim

In our paper, we propose STYLER, a non-autoregressive TTS framework with style factor modeling that achieves rapidity, robustness, expressivity, and controllability at the same time.

Abstract: Previous works on neural text-to-speech (TTS) have been addressed on limited speed in training and inference time, robustness for difficult synthesis conditions, expressiveness, and controllability. Although several approaches resolve some limitations, there has been no attempt to solve all weaknesses at once. In this paper, we propose STYLER, an expressive and controllable TTS framework with high-speed and robust synthesis. Our novel audio-text aligning method called Mel Calibrator and excluding autoregressive decoding enable rapid training and inference and robust synthesis on unseen data. Also, disentangled style factor modeling under supervision enlarges the controllability in synthesizing process leading to expressive TTS. On top of it, a novel noise modeling pipeline using domain adversarial training and Residual Decoding empowers noise-robust style transfer, decomposing the noise without any additional label. Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading style non-autoregressive TTS. Synthesis samples and experiment results are provided via our demo page, and code is available publicly.

Dependencies

Please install the Python dependencies listed in requirements.txt.

pip3 install -r requirements.txt

Training

Preparation

Clean Data

Download the VCTK dataset and resample the audio to a 22050 Hz sampling rate.
We provide a bash script for the resampling; see data/resample.sh for details.
Put the audio files and their corresponding text (transcript) files in the same directory. Each audio file and its text file must share the same name, excluding the extension.
You may need to trim the audio for stable model convergence. Refer to Yeongtae's preprocess_audio.py for helpful preprocessing, including trimming.
Modify hp.data_dir in hparams.py accordingly.
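
The naming requirement above (each audio file paired with a transcript of the same name) can be checked before preprocessing. The following is a minimal sketch, not part of this repo: the function name find_unpaired and the default extensions .wav and .txt are assumptions, so adjust them to match your data.

```python
from pathlib import Path


def find_unpaired(data_dir, audio_ext=".wav", text_ext=".txt"):
    """Return (audio stems without a transcript, transcript stems without audio).

    Hypothetical helper: the extensions are assumptions, not repo settings.
    """
    root = Path(data_dir)
    audio = {p.stem for p in root.glob(f"*{audio_ext}")}
    text = {p.stem for p in root.glob(f"*{text_ext}")}
    return sorted(audio - text), sorted(text - audio)
```

Calling find_unpaired on the directory you set as hp.data_dir should return two empty lists when every file is paired.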

Noisy Data

Download the WHAM! dataset and resample the audio to a 22050 Hz sampling rate.
Modify hp.noise_dir in hparams.py accordingly.

Vocoder

Unzip hifigan/generator_universal.pth.tar.zip in the same directory.

Preprocess

First, download the pretrained ResCNN Softmax+Triplet model of philipperemy's DeepSpeaker, which provides the speaker embedding described in our paper, and place it in hp.speaker_embedder_dir.

Second, download the Montreal Forced Aligner (MFA) package and the pretrained (LibriSpeech) lexicon file with the following commands. As in FastSpeech2, MFA is used to obtain the alignments between the utterances and the phoneme sequences.

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then, process all the necessary features. This produces a stat.txt file in your hp.preprocessed_path/. You have to modify the f0 and energy parameters in hparams.py according to the content of stat.txt.

python3 preprocess.py

Finally, generate the noisy data, kept separately from the clean data, by mixing each utterance with a randomly selected piece of background noise from the WHAM! dataset.

python3 preprocess_noisy.py
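
To illustrate the idea of mixing an utterance with a random piece of background noise, here is a minimal pure-Python sketch. It is not the code in preprocess_noisy.py: the function names, the use of a target signal-to-noise ratio (snr_db), and the tiling of short noise clips are all assumptions made for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_with_noise(clean, noise, snr_db, rng=random):
    """Overlay a random crop of `noise` onto `clean` at roughly `snr_db` dB SNR.

    Hypothetical helper; the SNR-based scaling is an assumption.
    """
    if len(noise) < len(clean):
        # Tile the noise so that it covers the whole utterance.
        noise = list(noise) * math.ceil(len(clean) / len(noise))
    start = rng.randint(0, len(noise) - len(clean))
    crop = noise[start:start + len(clean)]
    eps = 1e-8  # guards against division by zero on silent noise
    # Scale the crop so that rms(clean) / rms(scaled crop) matches the target SNR.
    gain = rms(clean) / (rms(crop) + eps) / (10 ** (snr_db / 20))
    return [c + gain * n for c, n in zip(clean, crop)]
```

A real pipeline would operate on arrays loaded from audio files rather than Python lists; lists are used here only to keep the sketch dependency-free.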

Train

Now you have all the prerequisites! Train the model using the following command:

python3 train.py

Inference

Prepare Texts

Create data/sentences.py containing a Python list named sentences that holds the texts to be synthesized. The list can contain more than one text.

# In 'data/sentences.py',
sentences = [
    "Nothing is lost, everything is recycled."
]
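
If you want to validate the file before running synthesis, a small check along the following lines may help. This is a sketch under assumptions: load_sentences is a hypothetical helper and not how synthesize.py reads the file; it uses the standard-library runpy module to execute the file and inspect the resulting sentences variable.

```python
import runpy


def load_sentences(path="data/sentences.py"):
    """Run the file and return its `sentences` list after basic validation.

    Hypothetical helper for illustration only.
    """
    namespace = runpy.run_path(path)
    sentences = namespace.get("sentences")
    if not isinstance(sentences, list) or not sentences:
        raise ValueError(f"{path} must define a non-empty list named 'sentences'")
    for i, text in enumerate(sentences):
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"sentences[{i}] must be a non-empty string")
    return sentences
```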

Prepare Reference Audios

Preparing reference audios is similar to preparing the training data. There are two kinds of references: clean and noisy.

First, put the clean audios with their corresponding texts in a single directory, modify hp.ref_audio_dir in hparams.py accordingly, and process all the necessary features. Refer to the Clean Data section under Training > Preparation.

python3 preprocess_refs.py

Then, get the noisy references.

python3 preprocess_noisy.py --refs

Synthesize

The following command will synthesize all combinations of texts in data/sentences.py and audios in hp.ref_audio_dir.

python3 synthesize.py --ckpt CHECKPOINT_PATH

Alternatively, you can specify a single reference audio in hp.ref_audio_dir as follows.

python3 synthesize.py --ckpt CHECKPOINT_PATH --ref_name AUDIO_FILENAME
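
The two modes above (all text and reference combinations, or a single reference) can be pictured as a Cartesian product. The sketch below is only an illustration of that pairing: synthesis_jobs and its parameters are hypothetical names, not functions from synthesize.py.

```python
from itertools import product


def synthesis_jobs(sentences, ref_names, ref_name=None):
    """Pair every text with every reference, or with one chosen reference.

    Hypothetical helper mirroring the behaviour described for --ref_name.
    """
    if ref_name is not None:
        ref_names = [ref_name]
    return list(product(sentences, ref_names))
```

With two texts and three references this yields six (text, reference) pairs, or two pairs when a single reference is selected.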

Also, there are several useful options.

--speaker_id specifies the speaker. The specified speaker's embedding must be in hp.preprocessed_path/spker_embed. The default value is None, in which case the speaker embedding is calculated at runtime from each input audio.
--inspection produces additional outputs that show the effect of each encoder of STYLER. The samples correspond to those in the Style Factor Modeling section of our demo page.
--cont generates samples like those in the Style Factor Control section of our demo page.
```
python3 synthesize.py --ckpt CHECKPOINT_PATH --cont --r1 AUDIO_FILENAME_1 --r2 AUDIO_FILENAME_2
```
Note that the --cont option only works on preprocessed data. Specifically, the audio names must follow the same format as the VCTK dataset (e.g., p323_229), and the preprocessed data must exist in hp.preprocessed_path.
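
A quick way to check reference names against the VCTK-style format before using --cont is sketched below. The pattern is inferred only from the example p323_229 and is an assumption; is_vctk_style is a hypothetical helper, and it expects the name without a file extension.

```python
import re

# Assumption: "p", digits, underscore, digits, as in the example "p323_229".
VCTK_NAME = re.compile(r"p\d+_\d+")


def is_vctk_style(name):
    """Return True if `name` (without extension) looks like a VCTK utterance name."""
    return VCTK_NAME.fullmatch(name) is not None
```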

TensorBoard

The TensorBoard logs are stored in the log directory. Use

tensorboard --logdir log

to serve the TensorBoard on your localhost. Here are some logging views of the model training on VCTK for 560k steps.

Notes

For much of the noisy data, f0 could not be extracted with pyworld as it was for the clean data. To resolve this, pysptk is used to extract the log f0 of the noisy data. The --noisy_input option automates this process during synthesis.

If MFA-related problems occur while running preprocess.py, try running MFA manually with the following command.

# Replace $YOUR_data_dir and $YOUR_PREPROCESSED_PATH with, for example, ./VCTK-Corpus-92/wav48_silence_trimmed and ./preprocessed/VCTK/TextGrid
./montreal-forced-aligner/bin/mfa_align $YOUR_data_dir montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8

DeepSpeaker on the VCTK dataset clearly distinguishes between speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings in our experiments.
Currently, preprocess.py divides the dataset into two subsets: a train set and a validation set. If you need other sets, such as a test set, you only need to modify the text files (train.txt or val.txt) in hp.preprocessed_path/.
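
One way to carve out a held-out set is to move some lines from train.txt into a separate file. The sketch below treats each line as an opaque entry and does not depend on the line format. It is an illustration under assumptions: carve_test_set and the output name test.txt are hypothetical, and the training code may need changes before it reads such a file.

```python
import random
from pathlib import Path


def carve_test_set(preprocessed_path, n_test, seed=0):
    """Move `n_test` randomly chosen lines from train.txt into a new test.txt.

    Hypothetical helper; "test.txt" is an assumed file name.
    """
    root = Path(preprocessed_path)
    lines = (root / "train.txt").read_text(encoding="utf-8").splitlines()
    if not 0 < n_test < len(lines):
        raise ValueError("n_test must be between 1 and len(train) - 1")
    picked = set(random.Random(seed).sample(range(len(lines)), n_test))
    test = [line for i, line in enumerate(lines) if i in picked]
    train = [line for i, line in enumerate(lines) if i not in picked]
    (root / "train.txt").write_text("\n".join(train) + "\n", encoding="utf-8")
    (root / "test.txt").write_text("\n".join(test) + "\n", encoding="utf-8")
    return len(train), len(test)
```

Back up train.txt first, since the function rewrites it in place.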

Citation

If you would like to use or refer to this implementation, please cite our paper along with this repo.

@article{lee2021styler,
  title={STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech},
  author={Lee, Keon and Park, Kyumin and Kim, Daeyoung},
  journal={arXiv preprint arXiv:2103.09474},
  year={2021}
}

Comments

some questions

1. In the paper, why is the encoder output (text_encoding) upsampled and then downsampled?
2. What is the meaning of text_encoding_neck + pitch_encoding and text_encoding_neck + energy_encoding? Why not concatenate them instead?

opened by Pydataman 3

Low resource languages that won't work with MFA?

Is there a way to fine-tune a model, or to train two languages side by side, so that a very low-resource language can be trained with the voices of a high-resource language?

opened by michael-conrad 3

Undefined names

Hi, I noticed some undefined names around the code:

synthesize.py:495:67: F821 undefined name 'reference'
noise_mixer_refs.py:56:42: F821 undefined name 'eps'
noise_mixer_refs.py:59:40: F821 undefined name 'eps'

opened by L3str4nge 2

About the pre-process

Hi, I want to ask whether the trimming operation is very important for training your model. Furthermore, can you share the scripts for trimming the VCTK dataset?

opened by yangdongchao 0

Bump tensorflow from 2.4.0 to 2.5.1

Bumps tensorflow from 2.4.0 to 2.5.1.

Release notes

Sourced from tensorflow's releases.

TensorFlow 2.5.1

Release 2.5.1

This release introduces several vulnerability fixes:

Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)

Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)

Fixes a null pointer dereference in CompressElement (CVE-2021-37637)

Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)

Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)

Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)

Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)

Fixes a heap OOB in RaggedGather (CVE-2021-37641)

Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)

Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)

Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)

Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)

Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)

Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)

Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)

Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)

Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)

Fixes a use after free in boosted trees creation (CVE-2021-37652)

Fixes a division by 0 in ResourceGather (CVE-2021-37653)

Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)

Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)

Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)

Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)

Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)

Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)

Fixes a division by 0 in inplace operations (CVE-2021-37660)

Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)

Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)

Fixes a heap OOB in boosted trees (CVE-2021-37664)

Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)

Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)

Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)

Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)

Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)

Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)

Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)

Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)

Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)

Fixes a CHECK-fail in MapStage (CVE-2021-37673)

Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)

Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)

Fixes a division by 0 in most convolution operators (CVE-2021-37675)

Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)

Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)

Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

... (truncated)

dependencies
opened by dependabot[bot] 0

Releases(v1.0.0)

v1.0.0(Dec 27, 2021)

v0.1.0(May 15, 2021)

Owner

Keon Lee

Expressive Speech Synthesis | Disentangled Representation | Generative Models | NLP | HCI

GitHub Repository https://keonlee9420.github.io/STYLER-Demo/
