Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Last update: Jan 05, 2023

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples were produced by feeding audio reconstructions from this VAE-GAN to an implementation of WaveNet.

1. Data Preparation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# Each speaker directory must be flat: .wav files cannot sit in nested subdirectories.
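The layout above can be checked before preprocessing. The following is a minimal sketch, not part of the repository: `check_dataset_layout` is a hypothetical helper that assumes every directory directly under the dataset root is a speaker and that audio files end in `.wav`.

```python
from pathlib import Path


def check_dataset_layout(root):
    """Return {speaker_name: number_of_wav_files}; raise ValueError on nesting.

    Hypothetical helper: assumes each directory directly under `root`
    is one speaker and that audio files use the .wav extension.
    """
    root = Path(root)
    counts = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Speaker directories must be flat, so any subdirectory is an error.
        nested = [p.name for p in speaker_dir.iterdir() if p.is_dir()]
        if nested:
            raise ValueError(
                f"{speaker_dir.name} contains nested directories: {nested}"
            )
        counts[speaker_dir.name] = len(list(speaker_dir.glob("*.wav")))
    return counts
```

Calling `check_dataset_layout("/path/to/database")` returns a per-speaker file count, which also makes empty speaker directories easy to spot.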

Here is an example script for preparing data from the Flickr8k Audio Corpus. It uses the same speakers of interest as the paper, but you may change them to other speakers if desired.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed, and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]
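To illustrate what the two size arguments mean, here is a minimal sketch of a fractional train/eval/test split. It is not the code in preprocess.py: `split_files`, its default values, and the `seed` argument are assumptions for illustration, and the actual script may split per speaker or in a different order.

```python
import random


def split_files(files, test_size=0.1, eval_size=0.1, seed=0):
    """Shuffle files and split them into (train, eval, test) by fraction.

    Hypothetical helper: the defaults and the seed are illustrative only.
    """
    if test_size < 0 or eval_size < 0 or test_size + eval_size >= 1:
        raise ValueError(
            "test_size and eval_size must be non-negative and sum to less than 1"
        )
    files = sorted(files)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_size)
    n_eval = int(len(files) * eval_size)
    test = files[:n_test]
    eval_ = files[n_test:n_test + n_eval]
    train = files[n_test + n_eval:]  # everything left over is training data
    return train, eval_, test
```

With 20 files and both fractions set to 0.25, this yields 10 training, 5 evaluation, and 5 test files, with no file appearing in more than one subset.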

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable it. Alternatively you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. With datasets smaller than Flickr8k, it may be more appropriate to save less frequently, for example by adding --checkpoint_interval 20 to save every 20 epochs.

3.3. Epochs

The maximum number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this beyond the default of 100. To load a pretrained model, use --epoch and set it to the epoch number of the saved model.
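The training flags described in this section can be summarised in a small argparse sketch. This is not the parser in train.py. Only the flag names and the default of 100 epochs come from the text above; the other defaults, the 0-based epoch counting, and the `is_due` helper are assumptions. Treating -1 as "disabled" follows the plotting description and is only assumed to apply to checkpointing.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the training flags described above."""
    parser = argparse.ArgumentParser(description="Sketch of the training options")
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--dataset", required=True)
    # Default of 100 is stated in the README.
    parser.add_argument("--n_epochs", type=int, default=100)
    # Assumed default: 0 means start training from scratch.
    parser.add_argument("--epoch", type=int, default=0)
    # Assumed defaults: plot and save every epoch.
    parser.add_argument("--plot-interval", type=int, default=1)
    parser.add_argument("--checkpoint_interval", type=int, default=1)
    return parser


def is_due(epoch, interval):
    """True when a periodic action should run at this epoch; -1 disables it."""
    return interval > 0 and epoch % interval == 0
```

For example, parsing `--plot-interval -1 --checkpoint_interval 20` gives `args.plot_interval == -1`, so `is_due` never triggers plotting, while checkpoints are due at epochs 0, 20, 40, and so on.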

3.4. Pretrained Model

You can access pretrained model files here. After downloading them and storing them in the directory src/saved_models/pretrained, you may use them for training or inference with:

--model_name pretrained --epoch 99

Note that the discriminator files D1 and D2 are not required for inference, but they are required to continue training. Here, G1 refers to the decoding generator for speaker 1 (female) and G2 to the decoding generator for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by sliding a window over the full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlapping regions. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]
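The sliding-window step described above can be sketched in plain Python. This is an illustration of overlap averaging, not the code in inference.py: `convert_with_sliding_window`, the `window` and `hop` values, and the extra tail window are assumptions, and `convert_window` is a stand-in for the trained generator.

```python
def convert_with_sliding_window(frames, convert_window, window=4, hop=2):
    """Convert overlapping windows of frames and average the overlaps.

    frames: list of frames, each a list of mel-bin values.
    convert_window: callable mapping a list of frames to a list of the
        same shape (hypothetical stand-in for the trained generator).
    """
    n_frames = len(frames)
    if n_frames < window:
        raise ValueError("input is shorter than one window")
    n_bins = len(frames[0])
    sums = [[0.0] * n_bins for _ in range(n_frames)]
    counts = [0] * n_frames  # how many windows covered each frame

    starts = list(range(0, n_frames - window + 1, hop))
    if starts[-1] != n_frames - window:
        starts.append(n_frames - window)  # extra window so the tail is covered

    for start in starts:
        out = convert_window(frames[start:start + window])
        for offset, frame in enumerate(out):
            counts[start + offset] += 1
            for b, value in enumerate(frame):
                sums[start + offset][b] += value

    # Average every frame over the number of windows that produced it.
    return [[s / counts[t] for s in sums[t]] for t in range(n_frames)]
```

Passing an identity function as `convert_window` returns the input unchanged, which is a quick way to confirm that the averaging itself does not distort the melspectrogram.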

To achieve high-quality results like those in the paper, you can feed the reconstructed audio to a trained vocoder such as WaveNet. An example pipeline using this model with WaveNet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.

4.2. Visualization

By default, plotting input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable it, set --plot -1.

4.3. Reconstructive Evaluation

Alongside generation, reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio with --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This extra feature helps compare the reconstructive capabilities of different models. The higher the SSIM, the higher the quality of the reconstruction.
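For intuition about the metric, here is a simplified sketch of SSIM computed over a whole melspectrogram as a single window. It is not the implementation used by the inference script: library implementations typically average the same expression over local windows, and `global_ssim` and its `data_range` default are assumptions for illustration.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized 2-D lists of floats.

    Hypothetical helper: a simplification of windowed SSIM that treats
    the whole input as one window.
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and the same size")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    def cov(a, mean_a, b, mean_b):
        return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n

    var_x = cov(xs, mean_x, xs, mean_x)
    var_y = cov(ys, mean_y, ys, mean_y)
    cov_xy = cov(xs, mean_x, ys, mean_y)

    # Standard SSIM stabilising constants, scaled by the data range.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs score 1, and the score drops as the reconstruction diverges from the original in mean level, contrast, or structure.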

References

Citation

If you find this code useful, please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

Rewrite preprocess.py to handle:
- multi-process feature extraction
- error messages for failed cases
Create:
- Notebook for data visualisation
Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.

Owner

Ehab AlBadawy
