PortaSpeech - PyTorch Implementation

PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech.

Audio Samples

Audio samples are available at /demo.

Model Size

| Module | Normal | Small | Normal (paper) | Small (paper) |
| --- | --- | --- | --- | --- |
| Total | 24M | 7.6M | 21.8M | 6.7M |
| LinguisticEncoder | 3.7M | 1.4M | - | - |
| VariationalGenerator | 11M | 2.8M | - | - |
| FlowPostNet | 9.3M | 3.4M | - | - |

Quickstart

In the commands below, DATASET stands for the name of a dataset, such as LJSpeech.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported. Run

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratio. For example, a ratio of 0.8 shortens all durations by 20%, which raises the speaking rate:

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8

Note that this controllability originates from FastSpeech2 and is not a core feature of PortaSpeech.
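The effect of a duration ratio can be illustrated with a minimal sketch. It assumes a FastSpeech2-style length regulator in which each phoneme-level state is repeated for its predicted number of frames. The function names `scale_durations` and `length_regulate` and the toy durations are hypothetical, and the repository's actual implementation may differ.

```python
def scale_durations(durations, duration_control=1.0):
    """Scale per-phoneme frame counts by a ratio and round to whole frames."""
    # Clamp at zero so a very small ratio never yields a negative frame count.
    return [max(0, round(d * duration_control)) for d in durations]


def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level state once per frame assigned to it."""
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames


predicted = [5, 10, 20]  # hypothetical predicted durations, in frames
faster = scale_durations(predicted, 0.8)
frames = length_regulate(["a", "b", "c"], faster)
print(faster, len(frames))  # [4, 8, 16] 28
```

With a ratio of 0.8 the utterance shrinks from 35 to 28 frames, so the same content is spoken in 80% of the time.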

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Preprocessing

Run

python3 prepare_align.py --dataset DATASET

to prepare the corpus for forced alignment.

For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. Unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

After that, run the preprocessing script by

python3 preprocess.py --dataset DATASET
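As a rough illustration of how forced-alignment intervals are commonly turned into per-phoneme frame durations, consider the sketch below. It is not taken from `preprocess.py`: the function name `intervals_to_durations`, the sampling rate of 22050 Hz, the hop length of 256 samples, and the toy alignment are all assumptions, and the real values belong to the dataset configuration.

```python
def intervals_to_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert (phone, start_sec, end_sec) intervals to per-phone frame counts."""
    phones, durations = [], []
    for phone, start, end in intervals:
        # Round both boundaries to frame indices, so adjacent durations add up
        # to the frame index of the final boundary without accumulating drift.
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        phones.append(phone)
        durations.append(end_frame - start_frame)
    return phones, durations


# Hypothetical alignment for one short word; times are in seconds.
alignment = [
    ("HH", 0.00, 0.08),
    ("AH", 0.08, 0.20),
    ("L", 0.20, 0.29),
    ("OW", 0.29, 0.50),
]
phones, durations = intervals_to_durations(alignment)
print(phones, durations, sum(durations))
```

Rounding the boundaries rather than the individual durations keeps the total frame count consistent with the length of the mel-spectrogram.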

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

To use Automatic Mixed Precision, append the --use_amp argument to the above command.
The trainer assumes single-node multi-GPU training. To use specific GPUs, set CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on localhost. It shows the loss curves, synthesized mel-spectrograms, and audio samples.

Normal Model

Small Model Loss

Notes

For the vocoder, HiFi-GAN and MelGAN are supported.
ReLU activation and LayerNorm are omitted from VariationalGenerator to avoid mashed output.
Convergence of the word-to-phoneme alignment in LinguisticEncoder is sped up by dividing long words into subwords and sorting the dataset by mel-spectrogram frame length.
There are two kinds of helper loss to improve word-to-phoneme alignment: "ctc" and "dga". You can toggle them as follows:
```
# In the train.yaml
aligner:
    helper_type: "dga" # ["dga", "ctc", "none"]
```
- "dga": Diagonal Guided Attention (DGA) Loss
- "ctc": Connectionist Temporal Classification (CTC) Loss with forward-sum algorithm
- If you set "none", no helper loss will be applied during training.
- Alignment comparison of the three methods ("dga", "ctc", and "none" from top to bottom): [figure not reproduced here]
- The default setting is "dga". Although "ctc" produces the strongest alignment, its output quality and accuracy are worse than those of "dga".
- There is still room for improvement in output quality. Audio quality and alignment accuracy appear to be a trade-off.
This implementation will be extended to multi-speaker TTS.
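To make the "dga" helper loss more concrete, here is a minimal sketch of a commonly used form of diagonal guided attention. Each attention weight is multiplied by a penalty that grows with its distance from the diagonal, and the products are averaged. The function names, the width `g=0.2`, and the toy attention maps are assumptions for illustration; the loss used in this repository may be defined differently.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2))."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention * penalty; small when the mass lies near the diagonal."""
    n_text, n_frames = len(attention), len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)


# Toy 3x3 attention maps (rows: text units, columns: output positions).
diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
anti_diagonal = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(guided_attention_loss(diagonal), guided_attention_loss(anti_diagonal))
```

A perfectly diagonal map incurs no penalty, while the anti-diagonal map is penalized. A smaller `g` narrows the band around the diagonal that goes unpenalized.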

Citation

Please cite this repository using the "Cite this repository" link in the About section (top right of the main page).

References

jaywalnut310's VITS
jaywalnut310's Glow-TTS
keonlee9420's VAENAR-TTS
keonlee9420's Comprehensive-Transformer-TTS (CTC Loss)
keonlee9420's Comprehensive-Tacotron2 (DGA Loss)
