Non-Attentive-Tacotron - This is Pytorch Implementation of Google's Non-attentive Tacotron.

Overview

Non-attentive Tacotron - PyTorch Implementation

This is Pytorch Implementation of Google's Non-attentive Tacotron, text-to-speech system. There is some minor modifications to the original paper. We use grapheme directly, not phoneme. For that reason, we use grapheme based forced aligner by using Wav2vec 2.0. We also separate special characters from basic characters, and each is used for embedding respectively. This project is based on NVIDIA tacotron2. Feel free to use this code.

Install

  • Before you start the code, you have to check your python>=3.6, torch>=1.10.1, torchaudio>=0.10.0 version.
  • Torchaudio version is strongly restrict because of recent modification.
  • We support docker image file that we used for this implementation.
  • or You can install a package through the command below:
## download the git repository
git clone https://github.com/JoungheeKim/Non-Attentive-Tacotron.git
cd Non-Attentive-Tacotron

## install python dependency
pip install -r requirements.txt

## install this implementation locally for further development
python setup.py develop

Quickstart

  • Install a package.
  • Download Pretrained tacotron models through links below:
    • LJSpeech-1.1 (English, single-female speaker)
      • trained for 40,000 steps with 32 batch size, 8 accumulation) [LINK]
    • KSS Dataset (Korean, single-female speaker)
      • trained for 40,000 steps with 32 batch size, 8 accumulation) [LINK]
      • trained for 110,000 steps with 32 batch size, 8 accumulation) [LINK]
  • Download Pretrained VocGAN vocoder corresponding tacotron model in this [LINK]
  • Run a python code below:
## import library
from tacotron import get_vocgan
from tacotron.model import NonAttentiveTacotron
from tacotron.tokenizer import BaseTokenizer
import torch

## set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## set pretrained model path
generator_path = '???'
tacotron_path = '???'

## load generator model
generator = get_vocgan(generator_path)
generator.eval()

## load tacotron model
tacotron = NonAttentiveTacotron.from_pretrained(tacotron_path)
tacotron.eval()

## load tokenizer
tokenizer = BaseTokenizer.from_pretrained(tacotron_path)

## Inference
text = 'This is a non attentive tacotron.'
encoded_text = tokenizer.encode(text)
encoded_torch_text = {key: torch.tensor(item, dtype=torch.long).unsqueeze(0).to(device) for key, item in encoded_text.items()}

with torch.no_grad():
    ## make log mel-spectrogram
    tacotron_output = tacotron.inference(**encoded_torch_text)
    
    ## make audio
    audio = generator.generate_audio(**tacotron_output)

Preprocess & Train

1. Download Dataset

2. Build Forced Aligned Information.

  • Non-Attentive Tacotron is duration based model.
  • So, alignment information between grapheme and audio is essential.
  • We make alignment information using Wav2vec 2.0 released from fairseq.
  • We also support pretrained wav2vec 2.0 model for Korean in this [LINK].
  • The Korean Wav2vec 2.0 model is trained on aihub korean dialog dataset to generate grapheme based prediction described in K-Wav2vec 2.0.
  • The English model is automatically downloaded when you run the code.
  • Run the command below:
## 1. LJSpeech example
## set your data path and audio path(examples are below:)
AUDIO_PATH=/code/gitRepo/data/LJSpeech-1.1/wavs
SCRIPT_PATH=/code/gitRepo/data/LJSpeech-1.1/metadata.csv

## ljspeech forced aligner
## check config options in [configs/preprocess_ljspeech.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    --config-name preprocess_ljspeech
    
    
## 2. KSS Dataset 
## set your data path and audio path(examples are below:)
AUDIO_PATH=/code/gitRepo/data/kss
SCRIPT_PATH=/code/gitRepo/data/kss/transcript.v.1.4.txt
PRETRAINED_WAV2VEC=korean_wav2vec2

## kss forced aligner
## check config options in [configs/preprocess_kss.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    base.pretrained_model=${PRETRAINED_WAV2VEC} \
    --config-name preprocess_kss

3. Train & Evaluate

  • It is recommeded to download the pre-trained vocoder before training the non-attentive tacotron model to evaluate the model performance in training phrase.
  • You can download pre-trained VocGAN in this [LINK].
  • We only experiment with our codes on a one gpu such as 2080ti or TITAN RTX.
  • The robotic sounds are gone when I use batch size 32 with 8 accumulation corresponding to 256 batch size.
  • Run the command below:
## 1. LJSpeech example
## set your data generator path and save path(examples are below:)
GENERATOR_PATH=checkpoints_g/ljspeech_29de09d_4000.pt
SAVE_PATH=results/ljspeech

## train ljspeech non-attentive tacotron
## check config options in [configs/train_ljspeech.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_ljspeech
  
  
    
## 2. KSS Dataset   
## set your data generator path and save path(examples are below:)
GENERATOR_PATH=checkpoints_g/vocgan_kss_pretrained_model_epoch_4500.pt
SAVE_PATH=results/kss

## train kss non-attentive tacotron
## check config options in [configs/train_kss.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_kss

Audio Examples

Language Text with Accent(bold) Audio Sample
Korean 이 타코트론은 잘 작동한다. Sample
Korean 타코트론은 잘 작동한다. Sample
Korean 타코트론은 잘 작동한다. Sample
Korean 이 타코트론은 작동한다. Sample

Forced Aligned Information Examples

ToDo

  • Sometimes get torch NAN errors.(help me)
  • Remove robotic sounds in synthetic audio.

References

Owner
Jounghee Kim
I am interested in NLP, Representation Learning, Speech Recognition, Speech Generation.
Jounghee Kim
Official PyTorch Implementation of SSMix (Findings of ACL 2021)

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021) Official PyTorch Implementation of SSMix | Paper Abstract Data augment

Clova AI Research 52 Dec 27, 2022
GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

Guidedog Authors: Kyuhee Jo, Steven Gunarso, Jacky Wang, Raghav Sharma GuideDog is an AI/ML-based mobile app designed to assist the lives of the visua

Kyuhee Jo 5 Nov 24, 2021
[IJCAI'21] Deep Automatic Natural Image Matting

Deep Automatic Natural Image Matting [IJCAI-21] This is the official repository of the paper Deep Automatic Natural Image Matting. Introduction | Netw

Jizhizi_Li 316 Jan 06, 2023
Extreme Dynamic Classifier Chains - XGBoost for Multi-label Classification

Extreme Dynamic Classifier Chains Classifier chains is a key technique in multi-label classification, sinceit allows to consider label dependencies ef

6 Oct 08, 2022
A collection of easy-to-use, ready-to-use, interesting deep neural network models

Interesting and reproducible research works should be conserved. This repository wraps a collection of deep neural network models into a simple and un

Aria Ghora Prabono 16 Jun 16, 2022
A simple Tensorflow based library for deep and/or denoising AutoEncoder.

libsdae - deep-Autoencoder & denoising autoencoder A simple Tensorflow based library for Deep autoencoder and denoising AE. Library follows sklearn st

Rajarshee Mitra 147 Nov 18, 2022
Implementation of Continuous Sparsification, a method for pruning and ticket search in deep networks

Continuous Sparsification Implementation of Continuous Sparsification (CS), a method based on l_0 regularization to find sparse neural networks, propo

Pedro Savarese 23 Dec 07, 2022
Scripts for training an AI to play the endless runner Subway Surfers using a supervised machine learning approach by imitation and a convolutional neural network (CNN) for image classification

About subwAI subwAI - a project for training an AI to play the endless runner Subway Surfers using a supervised machine learning approach by imitation

82 Jan 01, 2023
Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling

TGraM Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling, Qibin He, Xian Sun, Zhiyuan Yan, Beibei Li, Kun Fu Abstract Rece

Qibin He 6 Nov 25, 2022
Companion code for "Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees"

Companion code for "Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees" Installa

0 Oct 13, 2021
This project uses reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can learn to read tape. The project is dedicated to hero in life great Jesse Livermore.

Reinforcement-trading This project uses Reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can

Deepender Singla 1.4k Dec 22, 2022
A Partition Filter Network for Joint Entity and Relation Extraction EMNLP 2021

EMNLP 2021 - A Partition Filter Network for Joint Entity and Relation Extraction

zhy 127 Jan 04, 2023
MAT: Mask-Aware Transformer for Large Hole Image Inpainting

MAT: Mask-Aware Transformer for Large Hole Image Inpainting (CVPR2022, Oral) Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, Jiaya Jia [Paper] News This

254 Dec 29, 2022
A new benchmark for Icon Question Answering (IconQA) and a large-scale icon dataset Icon645.

IconQA About IconQA is a new diverse abstract visual question answering dataset that highlights the importance of abstract diagram understanding and c

Pan Lu 24 Dec 30, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022
PyTorch Implementation for AAAI'21 "Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection"

UMS for Multi-turn Response Selection Implements the model described in the following paper Do Response Selection Models Really Know What's Next? Utte

Taesun Whang 47 Nov 22, 2022
Semantic Segmentation Architectures Implemented in PyTorch

pytorch-semseg Semantic Segmentation Algorithms Implemented in PyTorch This repository aims at mirroring popular semantic segmentation architectures i

Meet Shah 3.3k Dec 29, 2022
This repository contains small projects related to Neural Networks and Deep Learning in general.

ILearnDeepLearning.py Description People say that nothing develops and teaches you like getting your hands dirty. This repository contains small proje

Piotr Skalski 1.2k Dec 22, 2022
Continuous Augmented Positional Embeddings (CAPE) implementation for PyTorch

PyTorch implementation of Continuous Augmented Positional Embeddings (CAPE), by Likhomanenko et al. Enhance your Transformer positional embeddings with easy-to-use augmentations!

Guillermo Cámbara 26 Dec 13, 2022
A tight inclusion function for continuous collision detection

Tight-Inclusion Continuous Collision Detection A conservative Continuous Collision Detection (CCD) method with support for minimum separation. You can

Continuous Collision Detection 89 Jan 01, 2023