Official implementation of Meta-StyleSpeech and StyleSpeech

Last update: Jan 05, 2023

Overview

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang

This is the official code for our recent paper, Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. This repository provides our implementation and pretrained models as open source.

Abstract : With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
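As a rough illustration of the SALN idea described in the abstract, the sketch below normalizes one hidden vector and then applies a gain and bias predicted from a style vector. It is plain Python, not the repository's implementation; the function names, the weight layout, and the tiny dimensions are assumptions made for illustration only.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(weights, bias, v):
    """Apply a linear map (given as a list of rows) plus a bias to vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def saln(h, style, gain_w, gain_b, bias_w, bias_b):
    """Style-adaptive layer norm: gain and bias are predicted from the style vector.

    Hypothetical layout: gain_w and bias_w each map style_dim -> hidden_dim.
    """
    gain = affine(gain_w, gain_b, style)
    bias = affine(bias_w, bias_b, style)
    return [g * x + b for g, x, b in zip(gain, layer_norm(h), bias)]

# Toy example: hidden_dim = 4, style_dim = 2 (sizes chosen for illustration).
hidden = [1.0, 2.0, 3.0, 4.0]
style = [0.5, -1.0]
gain_w = [[0.0, 0.0]] * 4   # zero weights, so the gain equals gain_b
gain_b = [1.0] * 4
bias_w = [[1.0, 0.0]] * 4   # the bias equals the first style component
bias_b = [0.0] * 4
out = saln(hidden, style, gain_w, gain_b, bias_w, bias_b)
```

With these toy weights the output is the normalized hidden vector shifted by the first style component, which shows how a different reference style changes the result without retraining the layer.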

Demo audio samples are available on the demo page.

Recent Updates

A few modifications to the Variance Adaptor were found to improve the quality of the model. 1) We changed the architecture of the variance embedding from one Conv1D layer to two Conv1D layers followed by a linear layer. 2) We added a layernorm and phoneme-wise positional encoding. Please refer to here.

Getting the pretrained models

Model	Link to the model
Meta-StyleSpeech	Link
StyleSpeech	Link

Prerequisites

Clone this repository.
Install the Python requirements listed in requirements.txt.

Inference

Download the pretrained models and prepare an audio file to use as the reference speech sample.

python synthesize.py --text <raw text to synthesize> --ref_audio <path to reference speech audio> --checkpoint_path <path to pretrained model>

The generated mel-spectrogram will be saved in the results/ folder.
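If you call the synthesis script from Python, building the argument list explicitly avoids shell-quoting problems when the text contains spaces or punctuation. The sketch below uses only the three flags shown in the command above; the helper name and the example paths are placeholders, not files shipped with this repository.

```python
import shlex

def build_synthesize_command(text, ref_audio, checkpoint_path):
    """Return the argv list for synthesize.py using the flags from this README."""
    return [
        "python", "synthesize.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--checkpoint_path", checkpoint_path,
    ]

# Placeholder paths (hypothetical); replace them with your own files.
argv = build_synthesize_command(
    "Hello, this is a test.",
    "samples/reference.wav",
    "checkpoints/meta_stylespeech.pth.tar",
)

# shlex.join quotes the text argument so the line is safe to paste into a shell.
command_line = shlex.join(argv)
print(command_line)

# To actually run it, pass the list directly:
#   import subprocess; subprocess.run(argv, check=True)
```

Passing the list to subprocess.run keeps the text as a single argument, so no manual quoting is needed.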

Preprocessing the dataset

Our models are trained on the LibriTTS dataset. Download and extract it, then place it in the dataset/ folder.

To preprocess the dataset, first run

python prepare_align.py

to resample the audio to 16kHz and perform some other preparation steps.

Second, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.

./montreal-forced-aligner/bin/mfa_align dataset/wav16/ lexicon/librispeech-lexicon.txt english dataset/TextGrid/ -j 10 -v

Third, preprocess the dataset to prepare mel-spectrogram, duration, pitch and energy for fast training.

python preprocess.py
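To make the duration feature concrete: a forced alignment gives start and end times per phoneme, and a duration target is usually the number of mel frames each phoneme spans. The sketch below shows that conversion under stated assumptions. The 16 kHz rate comes from the resampling step above, while the hop length of 256, the tuple format, and the function name are illustrative assumptions and may differ from what preprocess.py does.

```python
def frames_per_phoneme(intervals, sampling_rate=16000, hop_length=256):
    """Convert (phoneme, start_sec, end_sec) intervals to mel-frame counts.

    Rounding the boundaries (rather than the lengths) keeps the total
    frame count consistent with the utterance length.
    """
    durations = []
    for phoneme, start, end in intervals:
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        durations.append((phoneme, end_frame - start_frame))
    return durations

# Hypothetical alignment for the word "hi" (times in seconds).
alignment = [("HH", 0.00, 0.08), ("AY1", 0.08, 0.32)]
durations = frames_per_phoneme(alignment)
total_frames = sum(count for _, count in durations)
```

Because the frame boundaries are shared between neighbouring phonemes, the per-phoneme counts always add up to the frame index of the final boundary.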

Train!

Train StyleSpeech from scratch with

python train.py

Train Meta-StyleSpeech from a pretrained StyleSpeech model with

python train_meta.py --checkpoint_path <path to pretrained StyleSpeech model>

Acknowledgements

We referred to

Owner

min95
