A fast Text-to-Speech (TTS) model. Works well for English, Mandarin Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Overview

Simplified Chinese | English

Parallel Speech Synthesis

What's New

Directory Structure

.
|--- config/      # configuration files
     |--- default.yaml
     |--- ...
|--- datasets/    # data processing
|--- encoder/     # speaker (voiceprint) encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # helper classes
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # directory where training runs are saved
|--- losses/      # loss functions
|--- models/      # synthesis models
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # pretrained models (LJSpeech dataset)
|--- samples/     # synthesized samples
|--- utils/       # general-purpose utilities
|--- vocoder/     # vocoder
     |--- melgan.py
     |--- ...
|--- wandb/       # Wandb output directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt    # dependency list
|--- synthesize.py       # synthesis script
|--- train-duration.py   # training scripts
|--- train-parallel.py

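Synthesized Samples

Some synthesized samples are available here.

Pretrained Models

Some pretrained models are available here.

Quick Start

Step (1): Clone the repository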

$ git clone https://github.com/atomicoo/ParallelTTS.git

Step (2): Install dependencies

$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt

Step (3): Synthesize speech

$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/

To synthesize speech in another language, specify the corresponding configuration file with --config.
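The same command can be assembled from Python, which is convenient when switching between languages. The following is a minimal sketch and not part of the repository: the helper `build_synthesize_command` is hypothetical, and the config path `./config/your-language.yaml` is only a placeholder for whichever file in config/ matches your language. Only the flags shown in this README are used.

```python
import shlex
from typing import List, Optional


def build_synthesize_command(
    checkpoint: str,
    melgan_checkpoint: str,
    input_texts: str,
    outputs_dir: str,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for synthesize.py (hypothetical helper)."""
    command = [
        "python", "synthesize.py",
        "--checkpoint", checkpoint,
        "--melgan_checkpoint", melgan_checkpoint,
        "--input_texts", input_texts,
        "--outputs_dir", outputs_dir,
    ]
    # --config is appended only when a non-default configuration is requested.
    if config is not None:
        command += ["--config", config]
    return command


# English example, using the paths from the command above.
english_command = build_synthesize_command(
    checkpoint="./pretrained/ljspeech-parallel-epoch0100.pth",
    melgan_checkpoint="./pretrained/ljspeech-melgan-epoch3200.pth",
    input_texts="./samples/english/synthesize.txt",
    outputs_dir="./outputs/",
)

# Another language: same call plus a config file (placeholder path).
other_command = build_synthesize_command(
    checkpoint="./pretrained/ljspeech-parallel-epoch0100.pth",
    melgan_checkpoint="./pretrained/ljspeech-melgan-epoch3200.pth",
    input_texts="./samples/english/synthesize.txt",
    outputs_dir="./outputs/",
    config="./config/your-language.yaml",  # hypothetical placeholder
)

# Print shell-quoted versions for inspection.
print(" ".join(shlex.quote(arg) for arg in english_command))
print(" ".join(shlex.quote(arg) for arg in other_command))
```

To actually run a command, pass the list to `subprocess.run(command, check=True)` from the repository root. In practice the checkpoints and input text would also change with the language; they are reused here only to keep the sketch short.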

How to Train

Step (1): Prepare the data

$ python prepare-dataset.py

Use --config to specify a configuration file; the default, default.yaml, targets the LJSpeech dataset.

Step (2): Train the alignment model

$ python train-duration.py

Step (3): Extract durations

$ python extract-duration.py

Use --ground_truth to specify whether to generate ground-truth spectrograms with the alignment model.

Step (4): Train the synthesis model

$ python train-parallel.py

Use --ground_truth to specify whether to train the model on ground-truth spectrograms.
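The four steps above have to run in order, since each one consumes the output of the previous one. A small driver script can make that explicit. This is only a sketch: `build_pipeline` and `run_pipeline` are hypothetical helpers, and passing --config to every script is an assumption, because this README documents that flag only for prepare-dataset.py. The --ground_truth option is left out because its exact form is not specified here.

```python
import subprocess
import sys
from typing import List, Optional

# The four training steps, in the order given above.
PIPELINE_SCRIPTS = [
    "prepare-dataset.py",
    "train-duration.py",
    "extract-duration.py",
    "train-parallel.py",
]


def build_pipeline(config: Optional[str] = None) -> List[List[str]]:
    """Return one argument list per step (hypothetical helper)."""
    commands = []
    for script in PIPELINE_SCRIPTS:
        command = [sys.executable, script]
        # Assumption: every script accepts --config; the README states
        # this explicitly only for prepare-dataset.py.
        if config is not None:
            command += ["--config", config]
        commands.append(command)
    return commands


def run_pipeline(config: Optional[str] = None, dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it."""
    for command in build_pipeline(config):
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError, so a failing step
            # stops the pipeline before the next step starts.
            subprocess.run(command, check=True)


# Dry run: only prints the commands, nothing is executed.
run_pipeline(config=None, dry_run=True)
```

Call `run_pipeline(dry_run=False)` from the repository root to execute the steps for real.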

Training Logs

If you use TensorBoardX, run the following command:

$ tensorboard --logdir logdir/[DIR]/

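Wandb (Weights & Biases) is strongly recommended; just add the --enable_wandb option to the training commands above.

Datasets

  • LJSpeech: English, female, 22050 Hz, about 24 hours
  • LibriSpeech: English, multi-speaker (only the train-clean-100 subset is used), 16000 Hz, about 1000 hours in total
  • JSUT: Japanese, female, 48000 Hz, about 10 hours
  • BiaoBei: Mandarin, female, 48000 Hz, about 12 hours
  • KSS: Korean, female, 44100 Hz, about 12 hours
  • RuLS: Russian, multi-speaker (only a single speaker's audio is used), 16000 Hz, about 98 hours in total
  • TWLSpeech (non-public, lower quality): Tibetan, female (multiple speakers with similar timbre), 16000 Hz, about 23 hours

Quality Evaluation

TODO: to be added

Speed

Training speed: on the LJSpeech dataset with a batch size of 64, the model can be trained on a single GTX 1080 GPU with 8 GB of memory. After about 8 hours of training (about 300 epochs) it synthesizes high-quality speech.

Synthesis speed: the following tests were run on an Intel Core i7-8550U CPU and an NVIDIA GeForce MX150 GPU. Each synthesized clip is about 8 seconds long (about 20 words).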

Batch size   Spec (GPU)   Audio (GPU)   Spec (CPU)   Audio (CPU)
1            0.042        0.218         0.100        2.004
2            0.046        0.453         0.209        3.922
4            0.053        0.863         0.407        7.897
8            0.062        2.386         0.878        14.599

Note that the results were not averaged over multiple runs and are for reference only.
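One way to read the table is as a real-time factor (RTF): the synthesis time divided by the duration of the audio produced. The sketch below works through that arithmetic under three assumptions the table does not state: the timings are in seconds, each timing covers the whole batch, and every clip lasts exactly 8 seconds. It uses only the Audio columns; whether those already include the Spec time is also unstated.

```python
AUDIO_SECONDS_PER_CLIP = 8.0  # assumption: "about 8 seconds" treated as exact

# batch size -> (Audio GPU, Audio CPU), copied from the table above;
# the unit is assumed to be seconds.
AUDIO_TIMINGS = {
    1: (0.218, 2.004),
    2: (0.453, 3.922),
    4: (0.863, 7.897),
    8: (2.386, 14.599),
}


def real_time_factor(elapsed, batch_size, seconds_per_clip=AUDIO_SECONDS_PER_CLIP):
    """Synthesis time divided by the total duration of the generated audio."""
    return elapsed / (batch_size * seconds_per_clip)


for batch_size, (gpu_time, cpu_time) in AUDIO_TIMINGS.items():
    gpu_rtf = real_time_factor(gpu_time, batch_size)
    cpu_rtf = real_time_factor(cpu_time, batch_size)
    print(f"batch {batch_size}: GPU RTF {gpu_rtf:.3f}, CPU RTF {cpu_rtf:.3f}")
```

Under these assumptions every RTF comes out below 1, meaning synthesis is faster than real time on both devices. Since the measurements were single runs, the figures are indicative only.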

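Known Issues

  • In the wavegan branch, the vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction is incompatible, the features must be converted; the conversion code is here.
  • The Mandarin model takes pinyin sequences as text input. Because BiaoBei's original pinyin sequences contain no punctuation and the alignment model is not fully trained, the rhythm of the synthesized speech is somewhat off.
  • No dedicated vocoder was trained for the Korean model; it uses the LJSpeech vocoder directly (both at 22050 Hz), which may slightly reduce the quality of the synthesized speech.

References

TODO

  • Quality evaluation of synthesized speech (MOS)
  • Tests on more languages
  • Speech style transfer (timbre)

Get in Touch

  • WeChat: Joee1995

  • QQ: 793071559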

Owner
Atomicoo