Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

arXiv | GitHub Stars | downloads | Hugging Face | 中文文档

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.



Our SyntaSpeech is built on the basis of PortaSpeech (NeurIPS 2021) with three new features:

  1. We propose Syntactic Graph Builder (Sec. 3.1) and Syntactic Graph Encoder (Sec. 3.2), which is proved to be an effective unit to extract syntactic features to improve the prosody modeling and duration accuracy of TTS model.
  2. We introduce Multi-Length Adversarial Training (Sec. 3.3), which could replace the flow-based post-net in PortaSpeech, speeding up the inference time and improving the audio quality naturalness.
  3. We support three datasets: LJSpeech (single-speaker English dataset), Biaobei (single-speaker Chinese dataset) , and LibriTTS (multi-speaker English dataset).

Environments

conda create -n synta python=3.7
condac activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural network, dgl-cu102 supports rtx2080, dgl-cu113 support rtx3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install force alignment tools

Run SyntaSpeech!

Please follow the following steps to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

As for LibriTTS, you can download the raw datasets and process them with our data_gen modules. Detailed instructions can be found in dosc/prepare_data.

Vocoder Preparation

We provide the pre-trained model of vocoders for three datasets. Specifically, Hifi-GAN for LJSpeech and Biaobei, ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

Then you can train SyntaSpeech in the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name libritts_synta --reset ---infer # inference in LibriTTS

Audio Demos

Audio samples in the paper can be found in our demo page.

We also provide HuggingFace Demo Page for LJSpeech. Try your interesting sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our codes are based on the following repos:

Comments
  • pinyin preprocess problem

    pinyin preprocess problem

    005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4? ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

    txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

    ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

    what is 'a_?_n_ao3'

    in the mfa_dict it appears ch_a1_d_ou1 ,a_?_n_ao3 and so on

    opened by windowxiaoming 2
  • discriminator output['y_c'] never used

    discriminator output['y_c'] never used

    Discriminator's output['y_c'] never used, and never calculated in discriminator forward func. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

    opened by mayfool 2
  • A question of KL divergence calculation

    A question of KL divergence calculation

    In modules/tts/portaspeech/fvae.py, SyntaFVAE compute loss_kl (line 121) , Can someone help explain why loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1],I think loss_kl should be compute by loss_kl = logqx.exp()*(logqx - logpx) I would be very grateful if you could reply to me!

    opened by JiaYK 2
  • mfa for multi speaker.

    mfa for multi speaker.

    In the code, group MFA inputs for better parallelism. For multi speaker, it maybe go wrong. For input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1. The TexGrid is

    	item [1]:
    		class = "IntervalTier"
    		name = "words"
    		xmin = 0.0
    		xmax = 9.4444
    		intervals: size = 56
    			intervals [1]:
    				xmin = 0
    				xmax = 0.5700000000000001
    				text = ""
    			intervals [2]:
    				xmin = 0.5700000000000001
    				xmax = 0.61
    				text = "eng"
    			intervals [3]:
    				xmin = 0.61
    				xmax = 0.79
    				text = "s_an1"
    			intervals [4]:
    				xmin = 0.79
    				xmax = 0.89
    				text = "eng"
    			intervals [5]:
    				xmin = 0.89
    				xmax = 1.06
    				text = "i1"
    			intervals [6]:
    				xmin = 1.06
    				xmax = 1.24
    				text = "eng"
    			intervals [7]:
    				xmin = 1.24
    				xmax = 1.3
    				text = ""
    			intervals [8]:
    				xmin = 1.3
    				xmax = 1.36
    				text = "s_an1"
    			intervals [9]:
    				xmin = 1.36
    				xmax = 1.42
    				text = ""
    			intervals [10]:
    				xmin = 1.42
    				xmax = 1.49
    				text = "eng"
    			intervals [11]:
    				xmin = 1.49
    				xmax = 1.67
    				text = "s_i4"
    			intervals [12]:
    				xmin = 1.67
    				xmax = 1.78
    				text = "eng"
    			intervals [13]:
    				xmin = 1.78
    				xmax = 1.91
    				text = ""
    			intervals [14]:
    				xmin = 1.91
    				xmax = 1.96
    				text = "er4"
    			intervals [15]:
    				xmin = 1.96
    				xmax = 2.06
    				text = "eng"
    			intervals [16]:
    				xmin = 2.06
    				xmax = 2.19
    				text = ""
    			intervals [17]:
    				xmin = 2.19
    				xmax = 2.35
    				text = "i1"
    			intervals [18]:
    				xmin = 2.35
    				xmax = 2.53
    				text = "eng"
    			intervals [19]:
    				xmin = 2.53
    				xmax = 3.03
    				text = "i1"
    			intervals [20]:
    				xmin = 3.03
    				xmax = 3.42
    				text = "eng"
    			intervals [21]:
    				xmin = 3.42
    				xmax = 3.48
    				text = "i1"
    			intervals [22]:
    				xmin = 3.48
    				xmax = 3.6
    				text = ""
    			intervals [23]:
    				xmin = 3.6
    				xmax = 3.64
    				text = "eng"
    			intervals [24]:
    				xmin = 3.64
    				xmax = 3.86
    				text = "i1"
    			intervals [25]:
    				xmin = 3.86
    				xmax = 3.99
    				text = "eng"
    			intervals [26]:
    				xmin = 3.99
    				xmax = 4.59
    				text = ""
    			intervals [27]:
    				xmin = 4.59
    				xmax = 4.869999999999999
    				text = "er4"
    			intervals [28]:
    				xmin = 4.869999999999999
    				xmax = 4.9799999999999995
    				text = "eng"
    			intervals [29]:
    				xmin = 4.9799999999999995
    				xmax = 5.1899999999999995
    				text = "s_i4"
    			intervals [30]:
    				xmin = 5.1899999999999995
    				xmax = 5.34
    				text = ""
    			intervals [31]:
    				xmin = 5.34
    				xmax = 5.43
    				text = "eng"
    			intervals [32]:
    				xmin = 5.43
    				xmax = 5.6
    				text = ""
    			intervals [33]:
    				xmin = 5.6
    				xmax = 5.76
    				text = "i1"
    			intervals [34]:
    				xmin = 5.76
    				xmax = 6.279999999999999
    				text = "eng"
    			intervals [35]:
    				xmin = 6.279999999999999
    				xmax = 6.359999999999999
    				text = "s_an1"
    			intervals [36]:
    				xmin = 6.359999999999999
    				xmax = 6.47
    				text = ""
    			intervals [37]:
    				xmin = 6.47
    				xmax = 6.6
    				text = "eng"
    			intervals [38]:
    				xmin = 6.6
    				xmax = 6.9399999999999995
    				text = "i1"
    			intervals [39]:
    				xmin = 6.9399999999999995
    				xmax = 7.039999999999999
    				text = "eng"
    			intervals [40]:
    				xmin = 7.039999999999999
    				xmax = 7.289999999999999
    				text = "s_an1"
    			intervals [41]:
    				xmin = 7.289999999999999
    				xmax = 7.369999999999999
    				text = "eng"
    			intervals [42]:
    				xmin = 7.369999999999999
    				xmax = 7.6
    				text = "s_i4"
    			intervals [43]:
    				xmin = 7.6
    				xmax = 7.699999999999999
    				text = "eng"
    			intervals [44]:
    				xmin = 7.699999999999999
    				xmax = 7.869999999999999
    				text = ""
    			intervals [45]:
    				xmin = 7.869999999999999
    				xmax = 8.049999999999999
    				text = "er4"
    			intervals [46]:
    				xmin = 8.049999999999999
    				xmax = 8.26
    				text = ""
    			intervals [47]:
    				xmin = 8.26
    				xmax = 8.299999999999999
    				text = "eng"
    			intervals [48]:
    				xmin = 8.299999999999999
    				xmax = 8.36
    				text = "s_i4"
    			intervals [49]:
    				xmin = 8.36
    				xmax = 8.389999999999999
    				text = ""
    			intervals [50]:
    				xmin = 8.389999999999999
    				xmax = 8.42
    				text = "eng"
    			intervals [51]:
    				xmin = 8.42
    				xmax = 8.45
    				text = ""
    			intervals [52]:
    				xmin = 8.45
    				xmax = 8.59
    				text = "s_an1"
    			intervals [53]:
    				xmin = 8.59
    				xmax = 8.83
    				text = ""
    			intervals [54]:
    				xmin = 8.83
    				xmax = 9.1
    				text = "eng"
    			intervals [55]:
    				xmin = 9.1
    				xmax = 9.44
    				text = "i1"
    			intervals [56]:
    				xmin = 9.44
    				xmax = 9.4444
    				text = ""
    
    opened by leon2milan 2
  • Problem with DDP

    Problem with DDP

    Hello, I have experimented on your excellent job with this repo. But I found the ddp is not effective. I wonder if the way I used is wrong?

    CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

    opened by zhazl 0
Releases(v1.0.0)
Owner
Zhenhui YE
I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.
Zhenhui YE
Code for ICCV 2021 paper Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes using Scene Graphs

Graph-to-3D This is the official implementation of the paper Graph-to-3d: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs | arx

Helisa Dhamo 33 Jan 06, 2023
Winners of the Facebook Image Similarity Challenge

Winners of the Facebook Image Similarity Challenge

DrivenData 111 Jan 05, 2023
Omnidirectional Scene Text Detection with Sequential-free Box Discretization (IJCAI 2019). Including competition model, online demo, etc.

Box_Discretization_Network This repository is built on the pytorch [maskrcnn_benchmark]. The method is the foundation of our ReCTs-competition method

Yuliang Liu 266 Nov 24, 2022
This is an official pytorch implementation of Lite-HRNet: A Lightweight High-Resolution Network.

Lite-HRNet: A Lightweight High-Resolution Network Introduction This is an official pytorch implementation of Lite-HRNet: A Lightweight High-Resolution

HRNet 675 Dec 25, 2022
EfficientNetv2 TensorRT int8

EfficientNetv2_TensorRT_int8 EfficientNetv2模型实现来自https://github.com/d-li14/efficientnetv2.pytorch 环境配置 ubuntu:18.04 cuda:11.0 cudnn:8.0 tensorrt:7

34 Apr 24, 2022
A knowledge base construction engine for richly formatted data

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data. Note that Fonduer is

HazyResearch 386 Dec 05, 2022
This repository contains small projects related to Neural Networks and Deep Learning in general.

ILearnDeepLearning.py Description People say that nothing develops and teaches you like getting your hands dirty. This repository contains small proje

Piotr Skalski 1.2k Dec 22, 2022
Learning High-Speed Flight in the Wild

Learning High-Speed Flight in the Wild This repo contains the code associated to the paper Learning Agile Flight in the Wild. For more information, pl

Robotics and Perception Group 391 Dec 29, 2022
LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transf

Studio Ousia 587 Dec 30, 2022
Code repo for "Cross-Scale Internal Graph Neural Network for Image Super-Resolution" (NeurIPS'20)

IGNN Code repo for "Cross-Scale Internal Graph Neural Network for Image Super-Resolution" [paper] [supp] Prepare datasets 1 Download training dataset

Shangchen Zhou 278 Jan 03, 2023
Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn! gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API. While Genetic Programming (GP)

Trevor Stephens 1.3k Jan 03, 2023
Predict multi paths to a moving person depending on his trajectory history.

Multi-future Trajectory Prediction The project is about using the Multiverse model to make possible multible-future trajectory prediction for a seen p

Said Gamal 1 Jan 18, 2022
Roger Labbe 13k Dec 29, 2022
A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

Deep Reinforcement Learning Agents This repository contains a collection of reinforcement learning algorithms written in Tensorflow. The ipython noteb

Arthur Juliani 2.2k Jan 01, 2023
Testing the Facial Emotion Recognition (FER) algorithm on animations

PegHeads-Tutorial-3 Testing the Facial Emotion Recognition (FER) algorithm on animations

PegHeads Inc 2 Jan 03, 2022
Re-implementation of the vector capsule with dynamic routing

VectorCapsule Re-implementation of the vector capsule with dynamic routing We implement the vector capsule and dynamic routing via graph neural networ

ZhenchaoTang 10 Feb 10, 2022
Runtime type annotations for the shape, dtype etc. of PyTorch Tensors.

torchtyping Type annotations for a tensor's shape, dtype, names, ... Turn this: def batch_outer_product(x: torch.Tensor, y: torch.Tensor) - torch.Ten

Patrick Kidger 1.2k Jan 03, 2023
Unofficial pytorch implementation of 'Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization'

pytorch-AdaIN This is an unofficial pytorch implementation of a paper, Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization [Hua

Naoto Inoue 873 Jan 06, 2023
Multi Camera Calibration

Multi Camera Calibration 'modules/camera_calibration/app/camera_calibration.cpp' is for calculating extrinsic parameter of each individual cameras. 'm

7 Dec 01, 2022
Few-Shot-Intent-Detection includes popular challenging intent detection datasets with/without OOS queries and state-of-the-art baselines and results.

Few-Shot-Intent-Detection Few-Shot-Intent-Detection is a repository designed for few-shot intent detection with/without Out-of-Scope (OOS) intents. It

Jian-Guo Zhang 73 Dec 26, 2022