StyleSpeech - PyTorch Implementation
PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation.
Status (2021.06.13)
-  StyleSpeech (naive branch)
-  Meta-StyleSpeech (main branch)
Quickstart
Dependencies
You can install the Python dependencies with
pip3 install -r requirements.txt
Inference
You have to download pretrained models and put them in output/ckpt/LibriTTS/.
For English single-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --ref_audio path/to/reference_audio.wav --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
The generated utterances will be put in output/result/. Your synthesized speech will have ref_audio's style.
Batch Inference
Batch inference is also supported. Try
python3 synthesize.py --source preprocessed_data/LibriTTS/val.txt --restore_step 200000 --mode batch -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
to synthesize all utterances in preprocessed_data/LibriTTS/val.txt. This can be viewed as a reconstruction of the validation set, with each utterance serving as its own style reference.
Controllability
The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml --duration_control 0.8 --energy_control 0.8
Note that this controllability originates from FastSpeech2 and is not a primary focus of StyleSpeech.
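As a rough illustration of what a duration ratio does, the sketch below scales per-phoneme frame counts the way a FastSpeech2-style length regulator typically would. The function name `apply_duration_control` and the sample durations are hypothetical and not taken from this repository; the actual implementation may differ, for example in how it rounds.

```python
def apply_duration_control(durations, ratio):
    """Scale per-phoneme frame counts by `ratio` (hypothetical helper).

    A ratio below 1.0 shortens every phoneme, so speech gets faster.
    """
    # Round to whole frames and never go below zero frames.
    return [max(0, round(d * ratio)) for d in durations]


# Hypothetical predicted durations, in mel frames, for five phonemes.
durations = [10, 5, 20, 15, 10]
scaled = apply_duration_control(durations, 0.8)
total_before = sum(durations)
total_after = sum(scaled)
```

With a ratio of 0.8, the 60 frames in this example shrink to 48, so the utterance takes 80% of its original time.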
Training
Datasets
The supported datasets are
- LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2456 speakers.
- (more will be added)
Preprocessing
First, run
python3 prepare_align.py config/LibriTTS/preprocess.yaml
for some preparations.
In this implementation, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Download the official MFA package and run
./montreal-forced-aligner/bin/mfa_align raw_data/LibriTTS/ lexicon/librispeech-lexicon.txt english preprocessed_data/LibriTTS
or
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LibriTTS/ lexicon/librispeech-lexicon.txt preprocessed_data/LibriTTS
to align the corpus and then run the preprocessing script.
python3 preprocess.py config/LibriTTS/preprocess.yaml
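To make the role of the alignments concrete, here is a minimal sketch of how phoneme intervals given in seconds could be converted into per-phoneme frame counts for use as duration targets. The 22050 Hz sampling rate follows this README; the hop length of 256 samples, the function name, and the example intervals are assumptions for illustration only. The actual values and logic are defined by config/LibriTTS/preprocess.yaml and preprocess.py.

```python
SAMPLING_RATE = 22050  # sampling rate used by this implementation
HOP_LENGTH = 256       # assumed hop size in samples; check preprocess.yaml


def intervals_to_frames(intervals, sampling_rate=SAMPLING_RATE, hop_length=HOP_LENGTH):
    """Convert (phone, start_sec, end_sec) intervals to (phone, n_frames)."""
    result = []
    for phone, start, end in intervals:
        # Round both boundaries to frame indices first so that
        # consecutive durations add up without drift.
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        result.append((phone, end_frame - start_frame))
    return result


# Hypothetical alignment for the word "hi".
alignment = [("HH", 0.00, 0.10), ("AY1", 0.10, 0.35)]
frames = intervals_to_frames(alignment)
```

Rounding the boundaries rather than the individual durations keeps the sum of the frame counts equal to the frame index of the final boundary.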
Training
Train your model with
python3 train.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
As described in the paper, the script first pre-trains the naive model for meta_learning_warmup steps and then meta-trains it for additional steps via episodic training.
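The two-phase schedule can be pictured with the following sketch. Only the name `meta_learning_warmup` comes from this README; the function, the phase labels, the exact boundary handling, and the example value of 40000 are placeholders rather than the repository's actual code or defaults.

```python
def training_phase(step, meta_learning_warmup):
    """Return which phase a given training step belongs to (illustrative)."""
    # Steps before the warm-up threshold train the naive model;
    # later steps use episodic meta-training. Treating the threshold
    # step itself as meta-training is an assumption.
    if step < meta_learning_warmup:
        return "pretrain"
    return "meta-train"


# Placeholder threshold; the real value comes from the configuration files.
warmup = 40000
phases = [training_phase(s, warmup) for s in (0, 39999, 40000, 100000)]
```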
TensorBoard
Use
tensorboard --logdir output/log/LibriTTS
to serve TensorBoard on your localhost.
Implementation Issues
- Use a 22050 Hz sampling rate instead of 16 kHz.
- Add one fully connected layer at the beginning of the Mel-Style Encoder to upsample the input mel-spectrogram from 80 to 128.
- The model size including meta-learner is 28.197M.
- Use a maximum batch size of 16 for training instead of 48 or 20, mainly due to the limited memory of a single 24 GiB TITAN RTX. This can be achieved by running the following script to filter out data longer than max_seq_len:
python3 filelist_filtering.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml
This will generate train_filtered.txt in the same location as train.txt.
- Since the total batch size is decreased, the number of training steps is doubled compared to the original paper.
- Use HiFi-GAN instead of MelGAN for vocoding.
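The filtering idea from the batch-size item above can be sketched as follows. This is not the code of filelist_filtering.py: the function name, the `lengths` lookup, and the assumption that each row starts with an utterance ID followed by `|` are all hypothetical, chosen only to show how utterances longer than max_seq_len could be dropped.

```python
def filter_filelist(lines, lengths, max_seq_len):
    """Keep only entries whose sequence length fits within max_seq_len.

    `lines` are file-list rows assumed to start with "<utterance_id>|";
    `lengths` maps each utterance ID to its length in frames.
    """
    kept = []
    for line in lines:
        utt_id = line.split("|", 1)[0]
        if lengths[utt_id] <= max_seq_len:
            kept.append(line)
    return kept


# Hypothetical rows and lengths for illustration.
rows = ["utt_a|spk1|{HH AY1}|hi", "utt_b|spk2|{L AO1 NG}|long"]
lengths = {"utt_a": 120, "utt_b": 1500}
filtered = filter_filelist(rows, lengths, max_seq_len=1000)
```

Dropping the longest utterances bounds the memory needed per batch, which is what allows a fixed batch size to fit on a single GPU.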
Citation
@misc{lee2021stylespeech,
  author = {Lee, Keon},
  title = {StyleSpeech},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/StyleSpeech}}
}
References
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
- A Style-Based Generator Architecture for Generative Adversarial Networks
- Matching Networks for One Shot Learning
- Prototypical Networks for Few-shot Learning
- TADAM: Task dependent adaptive metric for improved few-shot learning
- ming024's FastSpeech2


This is what my loss curve looks like.
Can you help me figure out what I can do now to improve my synthesized audio results?