Byte-based multilingual transformer TTS for low-resource/few-shot language adaptation.

Overview

One model to speak them all 🌎

Audio Language Text
Chinese 人人生而自由,在尊严和权利上一律平等。
English All human beings are born free and equal in dignity and rights.
Japanese すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。
Korean 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.
German Alle Menschen sind frei und gleich an Würde und Rechten geboren.
Russian Все люди рождаются свободными и равными в своем достоинстве и правах.
Spanish Todos los seres humanos nacen libres e iguales en dignidad y derechos.
Gujarati પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.
...even when there are only 30 utterances for training
Norwegian Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter.
Romanian Toate ființele umane se nasc libere și egale în demnitate și în drepturi.
Greek Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.

This is an implementation of the paper Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis. A single model can handle 40+ languages and learn a brand-new language from a few shots, or a few minutes, of recordings. The code is partially based on the open-source Tacotron2 and Transformer-TTS. More audio samples from the paper are available here.

Quickstart

We follow the paper's training recipe, but with open datasets. By combining 15 speech datasets covering 572 speakers in 38 languages, we reach results that are, to an extent, similar to those demonstrated in the paper, as the audio samples above show. The datasets are listed below, and the corresponding preprocessor scripts are located in corpora/. Download locations and details for each dataset are given in the respective preprocessor.

Name              Preprocessor script name  Languages
M-AILABS          caito                     es-es, fr-fr, de-de, uk-ua, ru-ru, pl-pl, it-it, en-us, en-uk
CSS-10            css10                     es-es, fr-fr, ja-jp, de-de, fi-fi, hu-hu, nl-nl, ru-ru, zh-cn
SIWIS             siwis                     fr-fr
JSUT              jsut                      ja-jp
KSS               kss                       ko-kr
Databaker         databaker                 zh-cn
LJSpeech          ljspeech                  en-us
NST               nst                       da-dk, nb-no
TTS-Portuguese    portuguese                pt-br
Thorsten Mueller  thorsten                  de-de
Google            google                    bn-bd, bn-in, ca-es, eu-es, gl-es, gu-in, jv-id, km-kh, kn-in, ml-in, mr-in, my-mm, ne-np, si-lk, su-id, ta-in, te-in, yo-ng
RuLS              lsru                      ru-ru
English Bible     enbible                   en-us
Hifi-TTS          hifitts                   en-us, en-uk
RSS               rss                       ro-ro

Preprocessing

  1. Download and extract these datasets to the dataset_path specified in corpora/__init__.py. You can change dataset_path, transformed_path and packed_path to your own locations.
  2. Run the preprocessor for each dataset given in corpora/. The results are saved to transformed_path. Modify include_corpus in corpora/__init__.py to add or remove datasets. To train on your own dataset, write a preprocessor modeled on the existing ones, then add the dataset to include_corpus and dataset_language in corpora/__init__.py.
  3. Run corpora/process_corpus.py, which filters the dataset, trims the audio, produces the metadata, generates the mel spectrograms, and packs all the features into a single zip file. The processed dataset is written to packed_path and takes around 100 GB of disk space. See the script for details.

Training

As in the paper, we split the dataset into three tiers. Below are the commands to train and evaluate on each tier; substitute the directories with your own. The evaluation script can run concurrently with the training script, and you can also use it to synthesize samples from pretrained models. See each argument's help text for its meaning.

To report CER, create azure_key.json with your own Azure Speech-to-Text subscription, with the content {"subscription": "YOUR_KEY", "region": "YOUR_REGION"}; see utils/transcribe.py. Because the datasets differ significantly from those used in the paper, this implementation is for demonstration only and cannot fully reproduce the paper's results.
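As a minimal sketch, the credentials file can be written from Python rather than by hand, which avoids quoting mistakes in the JSON. The helper name write_azure_key and its path argument are illustrative and not part of the repository; where the file is expected to live is not stated here, so check utils/transcribe.py before choosing a location.

```python
import json
from pathlib import Path


def write_azure_key(subscription, region, path="azure_key.json"):
    """Write a credentials file of the form {"subscription": ..., "region": ...}.

    Hypothetical helper: only the two JSON keys come from the README.
    """
    payload = {"subscription": subscription, "region": region}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


# Example (placeholders only; substitute your own key and region):
# write_azure_key("YOUR_KEY", "YOUR_REGION")
```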

T1

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:ja-jp:es-es --warmup_languages=en-us --ddp=True --eval_steps=40000:100000

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=100000 --eval_languages=en-us:de-de:ja-jp

T2

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn --ddp=True --hparams="warmup_steps=350000" --restore_from=T1_MODEL_DIR/model.ckpt-350000 --eval_steps=400000:450000 --eval_languages=zh-cn:ru-ru:it-it

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=400000 --eval_languages=zh-cn:ru-ru:it-it

T3

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --ddp=True --hparams="warmup_steps=650000,batch_frame_quad_limit=6500000" --restore_from=T2_MODEL_DIR/model.ckpt-650000 --eval_steps=700000:750000 --eval_languages=ko-kr:da-dk:te-in

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=ko-kr:da-dk:te-in

Few-shot adaptation

Norwegian Bokmål (nb-no), Greek (el-gr), and Romanian (ro-ro) are excluded from the training dataset and can be used for few-shot/low-resource adaptation. The command below is an example of adaptation to el-gr with 100 samples; substitute your own values for --adapt_languages and --downsample_languages.

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi:ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd:bn-in:si-lk --adapt_languages=el-gr --downsample_languages=el-gr:100 --ddp=True --hparams="warmup_steps=800000" --restore_from=T3_MODEL_DIR/model.ckpt-700000

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=el-gr
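The language lists in these commands are long and colon-separated, so a stray space is easy to introduce and will split the argument. One way to avoid this is to assemble the adaptation command from a Python list. The sketch below is only an illustration: build_adapt_command is a hypothetical helper, and it uses nothing beyond the flags shown in the command above, with the same placeholder directories.

```python
import shlex


def build_adapt_command(training_languages, adapt_language, n_samples,
                        restore_from, warmup_steps, ngpu="NGPU",
                        model_dir="MODEL_DIR", log_dir="LOG_DIR",
                        data_dir="DATA_DIR"):
    """Return the argument list for a few-shot adaptation run of train.py."""
    # Strip whitespace so the joined list cannot be split by the shell.
    langs = ":".join(lang.strip() for lang in training_languages)
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={ngpu}",
        "train.py",
        f"--model-dir={model_dir}",
        f"--log-dir={log_dir}",
        f"--data-dir={data_dir}",
        f"--training_languages={langs}",
        f"--adapt_languages={adapt_language}",
        f"--downsample_languages={adapt_language}:{n_samples}",
        "--ddp=True",
        f"--hparams=warmup_steps={warmup_steps}",
        f"--restore_from={restore_from}",
    ]


if __name__ == "__main__":
    cmd = build_adapt_command(
        ["en-us", "de-de", "fr-fr"],  # shortened list, for illustration only
        adapt_language="el-gr",
        n_samples=100,
        restore_from="T3_MODEL_DIR/model.ckpt-700000",
        warmup_steps=800000,
    )
    print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```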

Performance

Listed below are the best CERs (character error rates) reached on selected languages by the models from each tier on these open datasets, along with the CERs for few-shot adaptation. The CERs are based on Azure Speech-to-Text.

T1    en-us    de-de    ja-jp
      2.68%    2.17%    19.06%
T2    it-it    ru-ru    zh-cn
      1.95%    3.21%    7.30%
T3    da-dk    ko-kr    te-in
      1.31%    0.94%    4.41%

Adaptation

#Samples nb-no el-gr ro-ro
30 9.18% 5.71% 5.58%
100 3.63% 4.63% 4.89%
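For readers unfamiliar with the metric, the sketch below computes a character error rate in the standard way: the Levenshtein edit distance between the input text and the recognized transcript, divided by the length of the input text. This is an assumption about the definition, not a copy of the repository's code; any text normalization applied in utils/transcribe.py is not reproduced here, so the numbers it yields may differ from those in the tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four.
    print(f"{cer('abcd', 'abed'):.2%}")
```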

Pretrained Models

The pretrained models are available at OneDrive Link. Metadata for evaluation is also provided to aid fast reproduction. The provided models are listed below.

Base models

  • T1 350k steps, ready for T2
  • T2 650k steps, ready for T3
  • T3 700k steps, ready for adaptation
  • T3 1.16M steps, which reaches satisfactory performance on most languages

Few-shot adaptation

  • nb-no, 30 samples, at 710k steps
  • nb-no, 100 samples, at 750k steps
  • el-gr, 30 samples, at 1M steps
  • el-gr, 100 samples, at 820k steps
  • ro-ro, 30 samples, at 970k steps
  • ro-ro, 100 samples, at 910k steps

Synthesis

To synthesize audio from the pretrained models, download the models along with the metadata files (lang_id.json and spk_id.json). Since there are no ground-truth mels, you need to create a metadata file with dummy mel target information, and run eval.py with --zipfilepath unspecified and no mels.zip present in --data-dir. Each line of the metadata file takes the form SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG. For example, you can generate the audio examples above by saving the following metadata to script.txt:

databaker_0|500|人人生而自由,在尊严和权利上一律平等。|zh-cn
ljspeech_0|500|All human beings are born free and equal in dignity and rights.|en-us
jsut_0|500|すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。|ja-jp
kss_0|500|모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.|ko-kr
thorsten_0|500|Alle Menschen sind frei und gleich an Würde und Rechten geboren.|de-de
hajdurova_0|500|Все люди рождаются свободными и равными в своем достоинстве и правах.|ru-ru
tux_0|500|Todos los seres humanos nacen libres e iguales en dignidad y derechos.|es-es
guf02858_0|500|પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.|gu-in

Then run the command python eval.py --model-dir=T3_MODEL_DIR --log-dir=OUTPUT_DIR --data-dir=METADATA_DIR --eval_meta=script.txt --eval_step=1160000 --no_wait=True. You may refer to lang_id.json and spk_id.json to synthesize audio in other languages or with other speakers.
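When synthesizing many sentences, the metadata file can be generated rather than typed. The sketch below follows the SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG form described above. The function names are hypothetical, the dummy length of 500 is simply the value used in the example lines, and rejecting "|" inside the text is a precaution assumed here, not a documented rule.

```python
def metadata_line(speaker, file_id, text, lang, dummy_length=500):
    """Format one line as SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG."""
    if "|" in text or "\n" in text:
        # Assumption: the field separator and newlines cannot appear in TEXT.
        raise ValueError("text must not contain '|' or a newline")
    return f"{speaker}_{file_id}|{dummy_length}|{text}|{lang}"


def write_metadata(entries, path="script.txt"):
    """Write (speaker, file_id, text, lang) tuples to a metadata file."""
    lines = [metadata_line(*entry) for entry in entries]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    entries = [
        ("ljspeech", 0,
         "All human beings are born free and equal in dignity and rights.",
         "en-us"),
        ("thorsten", 0,
         "Alle Menschen sind frei und gleich an Würde und Rechten geboren.",
         "de-de"),
    ]
    for line in write_metadata(entries):
        print(line)
```

Speaker names and language codes should be taken from spk_id.json and lang_id.json.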

The waveforms are produced by Griffin-Lim, and the mel spectrograms, normalized to a [-4, 4] range, are also saved to SPEAKERNAME_FILEID.npy. Pretrained vocoders such as WaveNet can be used to reach better quality. Vocoders trained with recipes similar to Tacotron2 should be applicable to these mels, although you need to map the mels to a range of [0, 1], simply by mels = (mels + 4) / 8.
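A linear map from [-4, 4] onto [0, 1] subtracts the lower bound and divides by the width of the range, which gives (mels + 4) / 8. The sketch below writes that out as a small function. The name to_unit_range is illustrative, and the function relies only on arithmetic, so it accepts a float or a NumPy array such as the one loaded from a saved .npy file.

```python
def to_unit_range(mels, low=-4.0, high=4.0):
    """Linearly map values from [low, high] to [0, 1].

    Works on plain floats and on NumPy arrays (element-wise).
    With the defaults this is (mels + 4) / 8.
    """
    return (mels - low) / (high - low)


# Usage with a saved spectrogram (requires NumPy; file name is a placeholder):
#   import numpy as np
#   mels = np.load("SPEAKERNAME_FILEID.npy")
#   mels01 = to_unit_range(mels)
```

Values outside [-4, 4] map outside [0, 1]; whether clipping is needed depends on the vocoder and is not covered here.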

Owner
Mutian He