Non-Attentive-Tacotron - This is Pytorch Implementation of Google's Non-attentive Tacotron.

Overview

Non-attentive Tacotron - PyTorch Implementation

This is Pytorch Implementation of Google's Non-attentive Tacotron, text-to-speech system. There is some minor modifications to the original paper. We use grapheme directly, not phoneme. For that reason, we use grapheme based forced aligner by using Wav2vec 2.0. We also separate special characters from basic characters, and each is used for embedding respectively. This project is based on NVIDIA tacotron2. Feel free to use this code.

Install

  • Before you start the code, you have to check your python>=3.6, torch>=1.10.1, torchaudio>=0.10.0 version.
  • Torchaudio version is strongly restrict because of recent modification.
  • We support docker image file that we used for this implementation.
  • or You can install a package through the command below:
## download the git repository
git clone https://github.com/JoungheeKim/Non-Attentive-Tacotron.git
cd Non-Attentive-Tacotron

## install python dependency
pip install -r requirements.txt

## install this implementation locally for further development
python setup.py develop

Quickstart

  • Install a package.
  • Download Pretrained tacotron models through links below:
    • LJSpeech-1.1 (English, single-female speaker)
      • trained for 40,000 steps with 32 batch size, 8 accumulation) [LINK]
    • KSS Dataset (Korean, single-female speaker)
      • trained for 40,000 steps with 32 batch size, 8 accumulation) [LINK]
      • trained for 110,000 steps with 32 batch size, 8 accumulation) [LINK]
  • Download Pretrained VocGAN vocoder corresponding tacotron model in this [LINK]
  • Run a python code below:
## import library
from tacotron import get_vocgan
from tacotron.model import NonAttentiveTacotron
from tacotron.tokenizer import BaseTokenizer
import torch

## set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## set pretrained model path
generator_path = '???'
tacotron_path = '???'

## load generator model
generator = get_vocgan(generator_path)
generator.eval()

## load tacotron model
tacotron = NonAttentiveTacotron.from_pretrained(tacotron_path)
tacotron.eval()

## load tokenizer
tokenizer = BaseTokenizer.from_pretrained(tacotron_path)

## Inference
text = 'This is a non attentive tacotron.'
encoded_text = tokenizer.encode(text)
encoded_torch_text = {key: torch.tensor(item, dtype=torch.long).unsqueeze(0).to(device) for key, item in encoded_text.items()}

with torch.no_grad():
    ## make log mel-spectrogram
    tacotron_output = tacotron.inference(**encoded_torch_text)
    
    ## make audio
    audio = generator.generate_audio(**tacotron_output)

Preprocess & Train

1. Download Dataset

2. Build Forced Aligned Information.

  • Non-Attentive Tacotron is duration based model.
  • So, alignment information between grapheme and audio is essential.
  • We make alignment information using Wav2vec 2.0 released from fairseq.
  • We also support pretrained wav2vec 2.0 model for Korean in this [LINK].
  • The Korean Wav2vec 2.0 model is trained on aihub korean dialog dataset to generate grapheme based prediction described in K-Wav2vec 2.0.
  • The English model is automatically downloaded when you run the code.
  • Run the command below:
## 1. LJSpeech example
## set your data path and audio path(examples are below:)
AUDIO_PATH=/code/gitRepo/data/LJSpeech-1.1/wavs
SCRIPT_PATH=/code/gitRepo/data/LJSpeech-1.1/metadata.csv

## ljspeech forced aligner
## check config options in [configs/preprocess_ljspeech.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    --config-name preprocess_ljspeech
    
    
## 2. KSS Dataset 
## set your data path and audio path(examples are below:)
AUDIO_PATH=/code/gitRepo/data/kss
SCRIPT_PATH=/code/gitRepo/data/kss/transcript.v.1.4.txt
PRETRAINED_WAV2VEC=korean_wav2vec2

## kss forced aligner
## check config options in [configs/preprocess_kss.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    base.pretrained_model=${PRETRAINED_WAV2VEC} \
    --config-name preprocess_kss

3. Train & Evaluate

  • It is recommeded to download the pre-trained vocoder before training the non-attentive tacotron model to evaluate the model performance in training phrase.
  • You can download pre-trained VocGAN in this [LINK].
  • We only experiment with our codes on a one gpu such as 2080ti or TITAN RTX.
  • The robotic sounds are gone when I use batch size 32 with 8 accumulation corresponding to 256 batch size.
  • Run the command below:
## 1. LJSpeech example
## set your data generator path and save path(examples are below:)
GENERATOR_PATH=checkpoints_g/ljspeech_29de09d_4000.pt
SAVE_PATH=results/ljspeech

## train ljspeech non-attentive tacotron
## check config options in [configs/train_ljspeech.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_ljspeech
  
  
    
## 2. KSS Dataset   
## set your data generator path and save path(examples are below:)
GENERATOR_PATH=checkpoints_g/vocgan_kss_pretrained_model_epoch_4500.pt
SAVE_PATH=results/kss

## train kss non-attentive tacotron
## check config options in [configs/train_kss.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_kss

Audio Examples

Language Text with Accent(bold) Audio Sample
Korean 이 타코트론은 잘 작동한다. Sample
Korean 타코트론은 잘 작동한다. Sample
Korean 타코트론은 잘 작동한다. Sample
Korean 이 타코트론은 작동한다. Sample

Forced Aligned Information Examples

ToDo

  • Sometimes get torch NAN errors.(help me)
  • Remove robotic sounds in synthetic audio.

References

Owner
Jounghee Kim
I am interested in NLP, Representation Learning, Speech Recognition, Speech Generation.
Jounghee Kim
Code release for the paper “Worldsheet Wrapping the World in a 3D Sheet for View Synthesis from a Single Image”, ICCV 2021.

Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image This repository contains the code for the following paper: R. Hu,

Meta Research 37 Jan 04, 2023
General Vision Benchmark, a project from OpenGVLab

Introduction We build GV-B(General Vision Benchmark) on Classification, Detection, Segmentation and Depth Estimation including 26 datasets for model e

174 Dec 27, 2022
Official implementation of the paper DeFlow: Learning Complex Image Degradations from Unpaired Data with Conditional Flows

DeFlow: Learning Complex Image Degradations from Unpaired Data with Conditional Flows Official implementation of the paper DeFlow: Learning Complex Im

Valentin Wolf 86 Nov 16, 2022
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge This is an implementation of the paper,

Mutian He 19 Oct 14, 2022
Official PyTorch implementation of the Fishr regularization for out-of-distribution generalization

Fishr: Invariant Gradient Variances for Out-of-distribution Generalization Official PyTorch implementation of the Fishr regularization for out-of-dist

62 Dec 22, 2022
This code is for eCaReNet: explainable Cancer Relapse Prediction Network.

eCaReNet This code is for eCaReNet: explainable Cancer Relapse Prediction Network. (Towards Explainable End-to-End Prostate Cancer Relapse Prediction

Institute of Medical Systems Biology 2 Jul 28, 2022
Some toy examples of score matching algorithms written in PyTorch

toy_gradlogp This repo implements some toy examples of the following score matching algorithms in PyTorch: ssm-vr: sliced score matching with variance

Ending Hsiao 21 Dec 26, 2022
PyTorch trainer and model for Sequence Classification

PyTorch-trainer-and-model-for-Sequence-Classification After cloning the repository, modify your training data so that the training data is a .csv file

NhanTieu 2 Dec 09, 2022
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

Microsoft 8.4k Jan 01, 2023
A lossless neural compression framework built on top of JAX.

Kompressor Branch CI Coverage main (active) main development A neural compression framework built on top of JAX. Install setup.py assumes a compatible

Rosalind Franklin Institute 2 Mar 14, 2022
MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Resolution (CVPR2021)

MASA-SR Official PyTorch implementation of our CVPR2021 paper MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Re

DV Lab 126 Dec 20, 2022
Unofficial pytorch implementation of 'Image Inpainting for Irregular Holes Using Partial Convolutions'

pytorch-inpainting-with-partial-conv Official implementation is released by the authors. Note that this is an ongoing re-implementation and I cannot f

Naoto Inoue 525 Jan 01, 2023
Chainer Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

fcn - Fully Convolutional Networks Chainer implementation of Fully Convolutional Networks. Installation pip install fcn Inference Inference is done as

Kentaro Wada 218 Oct 27, 2022
Unofficial PyTorch Implementation of AHDRNet (CVPR 2019)

AHDRNet-PyTorch This is the PyTorch implementation of Attention-guided Network for Ghost-free High Dynamic Range Imaging (CVPR 2019). The official cod

Yutong Zhang 4 Sep 08, 2022
Think Big, Teach Small: Do Language Models Distil Occam’s Razor?

Think Big, Teach Small: Do Language Models Distil Occam’s Razor? Software related to the paper "Think Big, Teach Small: Do Language Models Distil Occa

0 Dec 07, 2021
Computer vision - fun segmentation experience using classic and deep tools :)

Computer_Vision_Segmentation_Fun Segmentation of Images and Video. Tools: pytorch Models: Classic model - GrabCut Deep model - Deeplabv3_resnet101 Flo

Mor Ventura 1 Dec 18, 2021
Python scripts for performing stereo depth estimation using the MobileStereoNet model in ONNX

ONNX-MobileStereoNet Python scripts for performing stereo depth estimation using the MobileStereoNet model in ONNX Stereo depth estimation on the cone

Ibai Gorordo 23 Nov 29, 2022
Implementation of U-Net and SegNet for building segmentation

Specialized project Created by Katrine Nguyen and Martin Wangen-Eriksen as a part of our specialized project at Norwegian University of Science and Te

Martin.w-e 3 Dec 07, 2022
Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286

Pytorch-DPPO Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286 Using PPO with clip loss (from https

Alexis David Jacq 163 Dec 26, 2022
Lightweight library to build and train neural networks in Theano

Lasagne Lasagne is a lightweight library to build and train neural networks in Theano. Its main features are: Supports feed-forward networks such as C

Lasagne 3.8k Dec 29, 2022