Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Last update: Dec 25, 2022

Overview

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Audio samples are available on our demo page.

Abstract

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
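The masking idea in the abstract can be illustrated with a toy example. The following is only a minimal sketch, not the code in this repository: it assumes 1-D per-frame values, a moving-average softening kernel, and hypothetical helper names (`soften_mask`, `blend`). It shows how a softened mask confines an edit to a target region while avoiding hard boundaries.

```python
def soften_mask(mask, width):
    """Smooth a 0/1 mask with a centered moving average of odd length `width`.

    Assumption: a moving average stands in for the softening kernel; the
    actual kernel used by EdiTTS may differ.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    n = len(mask)
    soft = []
    for i in range(n):
        # Average over the window, clamping indices at the sequence edges.
        window = [mask[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        soft.append(sum(window) / width)
    return soft


def blend(original, edited, soft_mask):
    """Per-frame convex combination: edited where the mask is 1, original where it is 0."""
    return [m * e + (1.0 - m) * o for o, e, m in zip(original, edited, soft_mask)]


# Toy sequences with 8 frames; frames 3..5 form the edit region.
mask = [0, 0, 0, 1, 1, 1, 0, 0]
soft = soften_mask(mask, width=3)
original = [0.0] * 8
edited = [1.0] * 8
mixed = blend(original, edited, soft)
```

Frames far from the edit region keep the original values, frames inside it take the edited values, and frames at the boundary receive a weighted mix of the two.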

Citation

Please cite this work as follows.

```
@misc{tae&kim2021editts,
      title={EdiTTS: Score-based Editing for Controllable Text-to-Speech},
      author={Jaesung Tae and Hyeongju Kim and Taesu Kim},
      year={2021}
}
```

Setup

Create a Python virtual environment (venv or conda) and install package requirements as specified in requirements.txt.
```
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Build the monotonic alignment module.

```
cd model/monotonic_align
python setup.py build_ext --inplace
```

For more information, refer to the official repository of Grad-TTS.

Checkpoints

Model checkpoints are already included as part of this repository, under checkpts. The inference commands below use checkpts/grad-tts-old.pt.

Pitch Shifting

Prepare an input file containing samples for speech generation. Mark the segment to be edited via a vertical bar separator, |. For instance, a single sample might look like

```
In | the face of impediments confessedly discouraging |
```

We provide a sample input file in resources/filelists/edit_pitch_example.txt.
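To check an input file before running inference, each line can be split on the vertical bars. The following is a minimal sketch that assumes exactly two bars per sample; `parse_pitch_line` is a hypothetical helper, not a function from this repository, and edit_pitch.py may parse its input differently.

```python
def parse_pitch_line(line):
    """Split a sample into (before, target, after) around the two `|` markers.

    Hypothetical helper for validating input files; not part of the repository.
    """
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        raise ValueError("expected exactly two '|' separators, got %d" % (len(parts) - 1))
    before, target, after = parts
    return before, target, after


example = "In | the face of impediments confessedly discouraging |"
before, target, after = parse_pitch_line(example)
```

Here `target` holds the segment to be edited, and `after` is empty because the marked segment runs to the end of the sample.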

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_pitch.py \
    -f resources/filelists/edit_pitch_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/pitch/wavs
```

Adjust CUDA_VISIBLE_DEVICES as appropriate.

Content Replacement

Prepare an input file containing pairs of sentences. Concatenate each pair with # and mark the parts to be replaced with a vertical bar separator. For instance, a single pair might look like

```
Three others subsequently | identified | Oswald from a photograph. #Three others subsequently | recognized | Oswald from a photograph.
```

We provide a sample input file in resources/filelists/edit_content_example.txt.
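The pair format can be validated in a similar way. The following is a minimal sketch that assumes one `#` per line and exactly two bars per sentence; `split_marked` and `parse_content_pair` are hypothetical helpers, not functions from this repository, and edit_content.py may parse its input differently.

```python
def split_marked(sentence):
    """Split one sentence into (before, marked, after) around its two `|` markers."""
    parts = [part.strip() for part in sentence.split("|")]
    if len(parts) != 3:
        raise ValueError("expected exactly two '|' separators per sentence")
    return tuple(parts)


def parse_content_pair(line):
    """Split a `source #target` line and parse both sentences.

    Hypothetical helper for validating input files; not part of the repository.
    """
    if line.count("#") != 1:
        raise ValueError("expected exactly one '#' joining the two sentences")
    source, target = line.split("#")
    return split_marked(source), split_marked(target)


example = (
    "Three others subsequently | identified | Oswald from a photograph. "
    "#Three others subsequently | recognized | Oswald from a photograph."
)
src, tgt = parse_content_pair(example)
```

In this example the two sentences share the same unmarked context, and only the marked word differs between `src` and `tgt`.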

To run inference, type

```
CUDA_VISIBLE_DEVICES=0 python edit_content.py \
    -f resources/filelists/edit_content_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/content/wavs
```

License

Released under the modified GNU General Public License.

Owner

Neosapience
