Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Last update: Jan 04, 2023

Overview

UnivNet

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

This is an unofficial PyTorch implementation of Jang et al. (Kakao), UnivNet.

To-Do List

Release checkpoint of pre-trained model
Extract wav samples for audio sample page
Add results including validation loss graph

Key Features

According to the authors of the paper, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and also outperformed HiFi-GAN in a subjective evaluation. Its inference is also 1.5 times faster than HiFi-GAN's.
This repository uses the same mel-spectrogram function as the Official HiFi-GAN, which is compatible with NVIDIA/tacotron2.

Our default mel calculation hyperparameters are as below, following the original paper.

audio:
  n_mel_channels: 100
  filter_length: 1024
  hop_length: 256 # WARNING: this can't be changed.
  win_length: 1024
  sampling_rate: 24000
  mel_fmin: 0.0
  mel_fmax: 12000.0

You can modify the hyperparameters to be compatible with your acoustic model.
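When editing these values, a quick consistency check can catch obvious mistakes before training. The following is a minimal sketch and not part of this repository: the function names `check_audio_config` and `frames_per_second` are hypothetical, and the checks only encode general signal-processing constraints (for example, `mel_fmax` should not exceed the Nyquist frequency).

```python
# Default values copied from the audio section above.
audio = {
    "n_mel_channels": 100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 24000,
    "mel_fmin": 0.0,
    "mel_fmax": 12000.0,
}


def check_audio_config(cfg):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: these are generic sanity checks, not rules
    enforced by the repository itself.
    """
    problems = []
    nyquist = cfg["sampling_rate"] / 2
    if cfg["mel_fmax"] > nyquist:
        problems.append("mel_fmax exceeds the Nyquist frequency")
    if cfg["win_length"] > cfg["filter_length"]:
        problems.append("win_length is larger than filter_length")
    if not 0 <= cfg["mel_fmin"] < cfg["mel_fmax"]:
        problems.append("mel_fmin must satisfy 0 <= mel_fmin < mel_fmax")
    return problems


def frames_per_second(cfg):
    """Number of mel frames produced per second of audio."""
    return cfg["sampling_rate"] / cfg["hop_length"]


print(check_audio_config(audio))
print(frames_per_second(audio))
```

With the defaults, the mel frame rate is 24000 / 256 = 93.75 frames per second, which is the number an acoustic model would need to match.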

Prerequisites

This implementation requires the following dependencies.

Python 3.6
PyTorch 1.6.0
NumPy 1.17.4 and SciPy 1.5.4
Install other dependencies in requirements.txt.
```
pip install -r requirements.txt
```

Datasets

Preparing Data

Download the training dataset. Any set of wav files with a sampling rate of 24,000 Hz will work. The original paper used LibriTTS.
- LibriTTS train-clean-360 split tar.gz link
- Unzip and place its contents under datasets/LibriTTS/train-clean-360.
If you want to use wav files with a different sampling rate, please edit the configuration file (see below).
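Before training, it can help to confirm that every file really has the expected sampling rate. The sketch below is not part of this repository; `find_mismatched_wavs` is a hypothetical helper that uses only the Python standard library and assumes the files are uncompressed PCM WAV, which is what the `wave` module can read.

```python
import os
import wave


def find_mismatched_wavs(root, expected_rate=24000):
    """Walk `root` recursively and list wav files whose rate differs.

    Returns a list of (path, actual_rate) tuples. Hypothetical helper,
    assuming PCM WAV files readable by the standard `wave` module.
    """
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(".wav"):
                continue
            path = os.path.join(dirpath, name)
            with wave.open(path, "rb") as wav_file:
                rate = wav_file.getframerate()
            if rate != expected_rate:
                mismatched.append((path, rate))
    return mismatched


if __name__ == "__main__":
    # The dataset location follows the layout described above.
    for path, rate in find_mismatched_wavs("datasets/LibriTTS/train-clean-360"):
        print(f"{path}: {rate} Hz")
```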

Note: The mel-spectrogram computed from each audio file is saved to disk as **.mel the first time it is needed and is loaded from disk afterwards.

Preparing Metadata

Following the format from NVIDIA/tacotron2, the metadata should be formatted as:

path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...

Train/validation metadata for the LibriTTS train-clean-360 split are already prepared in datasets/metadata. 5% of the train-clean-360 utterances were randomly sampled for validation.
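For a custom dataset, a similar random split can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: `split_metadata` is a hypothetical helper, and the seed and exact procedure used to create the provided metadata files are not documented here, so this will not reproduce them.

```python
import random


def split_metadata(lines, val_ratio=0.05, seed=1234):
    """Randomly split metadata lines into (train, val) lists.

    Hypothetical helper: `seed` is an arbitrary value chosen so the
    split is repeatable, not the one used for the provided metadata.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # Illustrative entries in the path|transcript|speaker_id format.
    lines = [f"wavs/utt_{i:03d}.wav|some text|0" for i in range(100)]
    train, val = split_metadata(lines)
    print(len(train), len(val))
```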

Since this model is a vocoder, the transcripts are NOT used during training.
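A metadata file in this format can be read by splitting each line on the `|` character. The sketch below is illustrative only and is not the repository's data loader; `parse_metadata` is a hypothetical name, and it assumes that transcripts never contain a `|` character.

```python
def parse_metadata(text):
    """Parse `path_to_wav|transcript|speaker_id` lines.

    Returns a list of (wav_path, speaker_id) tuples. The transcript is
    discarded because a vocoder does not need it. Hypothetical helper.
    """
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        wav_path, _transcript, speaker_id = fields
        entries.append((wav_path, speaker_id))
    return entries


if __name__ == "__main__":
    # Made-up example lines, not taken from the provided metadata.
    sample = "a/b/utt1.wav|Hello there.|100\na/b/utt2.wav|Good morning.|200\n"
    print(parse_metadata(sample))
```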

Train

Preparing Configuration Files

Run cp config/default.yaml config/config.yaml and then edit config.yaml.

Set the root paths of the training and validation data in the data section. The data loader recursively collects the list of files under each path.

data:
  train_dir: 'datasets/'	# root path of train data (either a relative or an absolute path is ok)
  train_meta: 'metadata/libritts_train_clean_360_train.txt'	# relative path of metadata file from train_dir
  val_dir: 'datasets/'		# root path of validation data
  val_meta: 'metadata/libritts_train_clean_360_val.txt'		# relative path of metadata file from val_dir

We provide the default metadata for LibriTTS train-clean-360 split.
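Going by the comments in the data section, each metadata path is interpreted relative to its root directory. The following sketch shows how the default values would resolve under that reading; `resolve_meta_path` is a hypothetical helper, not a function from this repository.

```python
import os

# Default values copied from the data section above.
data = {
    "train_dir": "datasets/",
    "train_meta": "metadata/libritts_train_clean_360_train.txt",
    "val_dir": "datasets/",
    "val_meta": "metadata/libritts_train_clean_360_val.txt",
}


def resolve_meta_path(root_dir, meta_path):
    """Join a metadata path onto its root directory (hypothetical helper)."""
    return os.path.normpath(os.path.join(root_dir, meta_path))


print(resolve_meta_path(data["train_dir"], data["train_meta"]))
print(resolve_meta_path(data["val_dir"], data["val_meta"]))
```

Both paths end up inside datasets/metadata, which is where the prepared metadata files are located.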

Modify channel_size in gen to switch between UnivNet-c16 and c32.

gen:
  noise_dim: 64
  channel_size: 32 # 32 or 16
  dilations: [1, 3, 9, 27]
  strides: [8, 8, 4]
  lReLU_slope: 0.2
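Note that the default strides multiply to 8 × 8 × 4 = 256, which equals the default hop_length in the audio section. The sketch below illustrates that relationship, assuming the generator upsamples each mel frame by the product of its strides; that assumption would explain the warning on hop_length, but it is an interpretation and is not stated by the configuration itself. The function names are hypothetical.

```python
from functools import reduce
from operator import mul


def total_upsampling(strides):
    """Product of all upsampling strides (hypothetical helper)."""
    return reduce(mul, strides, 1)


def samples_for_frames(n_frames, strides):
    """Waveform length for `n_frames` mel frames.

    Assumes each frame is upsampled by the product of the strides.
    """
    return n_frames * total_upsampling(strides)


strides = [8, 8, 4]   # default value from the gen section
hop_length = 256      # default value from the audio section

print(total_upsampling(strides) == hop_length)
print(samples_for_frames(100, strides))
```

Under this assumption, changing hop_length would also require choosing strides whose product matches the new value.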

Training

python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN

Tensorboard

tensorboard --logdir logs/

If you are running tensorboard on a remote machine, you can open the tensorboard page by adding the --bind_all option.

Inference

python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH

Pre-trained Model

A pre-trained model will be released soon. The model was trained on LibriTTS train-clean-360 split.

Results

See audio samples at https://mindslab-ai.github.io/univnet/

Comparison with the results on paper

| Model | MOS | PESQ(↑) | RMSE(↓) |
|---|---|---|---|
| Recordings | 4.16±0.09 | 4.50 | 0.000 |
| Results in Paper (UnivNet-c32) | 3.93±0.09 | 3.70 | 0.316 |
| Ours (UnivNet-c32) | - | TBD | TBD |

Note

This code is an unofficial implementation, so there may be some differences from the original paper.

Our UnivNet generator has a smaller number of parameters (c32: 5.11M, c16: 1.42M) than the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues from using a smaller model size. If you run into any problems, please report them as an issue.

Implementation Authors

Implementation authors are:

Special thanks to

License

This code is licensed under BSD 3-Clause License.

We referred to the following code and repositories.

The overall structure of the repository is based on https://github.com/seungwonpark/melgan.
datasets/dataloader.py from https://github.com/NVIDIA/waveglow (BSD 3-Clause License)
model/mpd.py from https://github.com/jik876/hifi-gan (MIT License)
model/lvcnet.py from https://github.com/zceng/LVCNet (Apache License 2.0)
utils/stft_loss.py # Copyright 2019 Tomoki Hayashi # MIT License (https://opensource.org/licenses/MIT)

References

Papers

Datasets

LibriTTS

Owner

MINDs Lab