A fast Text-to-Speech (TTS) model. Works well for English, Mandarin Chinese, Japanese, Korean, Russian, and Tibetan (tested so far).

Overview


Parallel Speech Synthesis


What's New

Directory Structure

.
|--- config/      # configuration files
     |--- default.yaml
     |--- ...
|--- datasets/    # data processing
|--- encoder/     # speaker (voiceprint) encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # helper classes
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # training output directory
|--- losses/      # loss functions
|--- models/      # synthesis models
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # pretrained models (LJSpeech dataset)
|--- samples/     # synthesized samples
|--- utils/       # common utilities
|--- vocoder/     # vocoder
     |--- melgan.py
     |--- ...
|--- wandb/       # Wandb output directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt    # dependencies
|--- synthesize.py       # synthesis script
|--- train-duration.py   # training scripts
|--- train-parallel.py

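Synthesized Samples

Some synthesized samples are available here.

Pretrained Models

Some pretrained models are available here.

Quick Start

Step (1): Clone the repository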

$ git clone https://github.com/atomicoo/ParallelTTS.git

Step (2): Install the dependencies

$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt

Step (3): Synthesize speech

$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/

To synthesize speech in another language, specify the corresponding configuration file with --config.
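
The synthesis command above can also be assembled programmatically, which makes switching languages less error-prone. The following is a minimal sketch and not part of the repository: the helper name build_synthesize_command is hypothetical, and it uses only the flags shown in this README.

```python
import shlex
from typing import List, Optional


def build_synthesize_command(
    checkpoint: str,
    melgan_checkpoint: str,
    input_texts: str,
    outputs_dir: str,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for synthesize.py (hypothetical helper)."""
    cmd = [
        "python", "synthesize.py",
        "--checkpoint", checkpoint,
        "--melgan_checkpoint", melgan_checkpoint,
        "--input_texts", input_texts,
        "--outputs_dir", outputs_dir,
    ]
    # The README only calls for --config when synthesizing another language,
    # so the flag is appended only when a config path is given.
    if config is not None:
        cmd += ["--config", config]
    return cmd


# Same arguments as the quick-start command above.
english_cmd = build_synthesize_command(
    checkpoint="./pretrained/ljspeech-parallel-epoch0100.pth",
    melgan_checkpoint="./pretrained/ljspeech-melgan-epoch3200.pth",
    input_texts="./samples/english/synthesize.txt",
    outputs_dir="./outputs/",
)
# Print a copy-pasteable shell line (shlex.quote protects unusual paths).
print(" ".join(shlex.quote(arg) for arg in english_cmd))
```

To run the command rather than print it, the list can be passed to subprocess.run(english_cmd, check=True) from the repository root. For another language, pass the path of that language's configuration file as config; the actual file names under config/ are not listed in this README.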

How to Train

Step (1): Prepare the data

$ python prepare-dataset.py

Use --config to specify a configuration file; the default, default.yaml, targets the LJSpeech dataset.

Step (2): Train the alignment model

$ python train-duration.py

Step (3): Extract durations

$ python extract-duration.py

Use --ground_truth to specify whether the alignment model is used to generate ground-truth spectrograms.

Step (4): Train the synthesis model

$ python train-parallel.py

Use --ground_truth to specify whether ground-truth spectrograms are used for model training.
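
The four steps must run in order, since each one consumes the output of the previous one. The sketch below lists the commands in that order. It is an illustration only: build_training_commands is a hypothetical helper, and it assumes every script accepts --config, which this README confirms only for prepare-dataset.py and synthesize.py.

```python
from typing import List, Optional

# Scripts in the order of training steps (1) to (4).
TRAINING_SCRIPTS = [
    "prepare-dataset.py",   # step 1: prepare the data
    "train-duration.py",    # step 2: train the alignment model
    "extract-duration.py",  # step 3: extract durations
    "train-parallel.py",    # step 4: train the synthesis model
]


def build_training_commands(config: Optional[str] = None) -> List[List[str]]:
    """Return one argument list per training step (hypothetical helper)."""
    commands = []
    for script in TRAINING_SCRIPTS:
        cmd = ["python", script]
        # Assumption: all four scripts take --config in the same way.
        if config is not None:
            cmd += ["--config", config]
        commands.append(cmd)
    return commands


for step, cmd in enumerate(build_training_commands(), start=1):
    print(f"step {step}: {' '.join(cmd)}")
```

Passing each list to subprocess.run(cmd, check=True) in a loop would stop the pipeline at the first failing step, so a later step never starts on incomplete data.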

Training Logs

If you use TensorBoardX, run the following command:

$ tensorboard --logdir logdir/[DIR]/

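Wandb (Weights & Biases) is strongly recommended; simply add the --enable_wandb option to the training commands above.

Datasets

  • LJSpeech: English, female, 22050 Hz, about 24 hours
  • LibriSpeech: English, multi-speaker (only the train-clean-100 subset is used), 16000 Hz, about 1000 hours in total
  • JSUT: Japanese, female, 48000 Hz, about 10 hours
  • BiaoBei: Mandarin, female, 48000 Hz, about 12 hours
  • KSS: Korean, female, 44100 Hz, about 12 hours
  • RuLS: Russian, multi-speaker (only audio from a single speaker is used), 16000 Hz, about 98 hours in total
  • TWLSpeech (non-public, relatively poor quality): Tibetan, female (multi-speaker, similar timbres), 16000 Hz, about 23 hours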

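Quality Evaluation

TODO: to be added.

Speed

Training speed: on the LJSpeech dataset with a batch size of 64, training fits on a single GTX 1080 with 8 GB of memory; after ~8 h (~300 epochs) the model synthesizes fairly high-quality speech.

Synthesis speed: the tests below were run on an Intel Core i7-8550U CPU and an NVIDIA GeForce MX150 GPU; each synthesized clip is about 8 seconds long (roughly 20 words).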

Batch size   Spec (GPU)   Audio (GPU)   Spec (CPU)   Audio (CPU)
1            0.042        0.218         0.100        2.004
2            0.046        0.453         0.209        3.922
4            0.053        0.863         0.407        7.897
8            0.062        2.386         0.878        14.599

Note: the measurements were not averaged over multiple runs, so the results are for reference only.
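
One way to read the table is as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1 mean faster than real time. The sketch below rests on two assumptions not stated in the source: that the timings are in seconds, and that each timing covers the whole batch. The clip length of 8 seconds is the approximate figure given above.

```python
CLIP_SECONDS = 8.0  # approximate length of each synthesized clip

# batch size -> (Spec GPU, Audio GPU, Spec CPU, Audio CPU), copied from the table.
# Assumption: values are wall-clock seconds for the whole batch.
TIMINGS = {
    1: (0.042, 0.218, 0.100, 2.004),
    2: (0.046, 0.453, 0.209, 3.922),
    4: (0.053, 0.863, 0.407, 7.897),
    8: (0.062, 2.386, 0.878, 14.599),
}


def real_time_factor(elapsed_seconds: float, batch_size: int,
                     clip_seconds: float = CLIP_SECONDS) -> float:
    """Synthesis time divided by the total duration of audio produced."""
    return elapsed_seconds / (batch_size * clip_seconds)


for batch_size, (_, audio_gpu, _, audio_cpu) in TIMINGS.items():
    gpu_rtf = real_time_factor(audio_gpu, batch_size)
    cpu_rtf = real_time_factor(audio_cpu, batch_size)
    print(f"batch {batch_size}: audio RTF  GPU {gpu_rtf:.3f}  CPU {cpu_rtf:.3f}")
```

Under these assumptions, a batch size of 1 gives 0.218 / 8 ≈ 0.027 on the GPU and 2.004 / 8 ≈ 0.25 on the CPU for the Audio columns.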

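Known Issues

  • In the wavegan branch, the vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction is not compatible, the features must be converted; the conversion code is available here.
  • The Mandarin model takes pinyin sequences as text input. Because the original BiaoBei pinyin sequences contain no punctuation and the alignment model was not fully trained, the rhythm of the synthesized speech is somewhat off.
  • No dedicated vocoder was trained for the Korean model; it directly uses the LJSpeech vocoder (also 22050 Hz), which may slightly degrade the quality of the synthesized speech.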

References

TODO

  • Quality evaluation of synthesized speech (MOS)
  • Tests on more languages
  • Speech style transfer (timbre)

Contact

  • WeChat: Joee1995

  • QQ: 793071559

Owner
Atomicoo