Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch


Parallel WaveGAN implementation with Pytorch

This repository provides UNOFFICIAL PyTorch implementations of the following models:

  • Parallel WaveGAN
  • MelGAN
  • Multi-band MelGAN
  • HiFi-GAN
  • StyleMelGAN

You can combine these state-of-the-art non-autoregressive models to build your own great vocoder!

Please check the samples on our demo page.

The goal of this repository is to provide a real-time neural vocoder that is compatible with ESPnet-TTS.
This repository can also be combined with NVIDIA/tacotron2-based implementations (see this comment).

You can try the real-time end-to-end text-to-speech demonstration in Google Colab!

  • Real-time demonstration with ESPnet2
  • Real-time demonstration with ESPnet1


Requirements

This repository is tested on Ubuntu 20.04 with a Titan V GPU.

  • Python 3.6+
  • CUDA 10.0+
  • cuDNN 7+
  • NCCL 2+ (for distributed multi-GPU training)
  • libsndfile (on Ubuntu, install via sudo apt install libsndfile-dev)
  • jq (on Ubuntu, install via sudo apt install jq)
  • sox (on Ubuntu, install via sudo apt install sox)
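Before running a recipe, it can help to confirm that the command-line tools from the list above are on your PATH. The following is a minimal sketch using only the Python standard library; the function name `find_missing_tools` is hypothetical and not part of this repository.

```python
import shutil

# Command-line tools named in the requirements list above.
REQUIRED_TOOLS = ("jq", "sox")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

This only checks executables; libraries such as libsndfile, CUDA, and cuDNN need to be verified separately.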

Other CUDA versions should work but have not been explicitly tested.
All of the code is tested on PyTorch 1.4, 1.5.1, 1.7.1, 1.8.1, and 1.9.

PyTorch 1.6 works, but there are some issues in CPU mode (see #198).


Setup

You can choose between two installation methods.

A. Use pip

$ git clone
$ cd ParallelWaveGAN
$ pip install -e .
# If you want to use distributed training, please install
# apex manually by following
$ ...

Note that, to install apex, your CUDA version must exactly match the version used to build the PyTorch binary.
To install PyTorch compiled with a different CUDA version, see tools/Makefile.

B. Make virtualenv

$ git clone
$ cd ParallelWaveGAN/tools
$ make
# If you want to use distributed training, please run the following
# command to install apex.
$ make apex

Note that we pin the CUDA version used to compile the PyTorch wheel.
If you want to use a different CUDA version, please edit tools/Makefile to change the PyTorch wheel to be installed.


Recipe

This repository provides Kaldi-style recipes, in the same way as ESPnet.
Currently, the following recipes are supported.

  • LJSpeech: English female speaker
  • JSUT: Japanese female speaker
  • JSSS: Japanese female speaker
  • CSMSC: Mandarin female speaker
  • CMU Arctic: English speakers
  • JNAS: Japanese multi-speaker
  • VCTK: English multi-speaker
  • LibriTTS: English multi-speaker
  • YesNo: English speaker (For debugging)

To run a recipe, please follow the instructions below.

# Move to the recipe directory
$ cd egs/ljspeech/voc1

# Run the recipe from scratch
$ ./

# You can change config via command line
$ ./ --conf <your_customized_yaml_config>

# You can select the stage to start and stop
$ ./ --stage 2 --stop_stage 2

# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./ --stage 2

# If you want to resume training from the 10000-step checkpoint
$ ./ --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl

See more info about the recipes in this README.


Speed

The decoding speed is RTF = 0.016 on a TITAN V, much faster than real time.

[decode]: 100%|██████████| 250/250 [00:30<00:00,  8.31it/s, RTF=0.0156]
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).
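For readers unfamiliar with the metric, the real-time factor (RTF) is conventionally the processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below illustrates that convention; it assumes the logs above follow the same definition, and the function name is hypothetical.

```python
def real_time_factor(elapsed_sec, num_samples, fs):
    """Return processing time divided by the duration of the generated audio.

    elapsed_sec: wall-clock time spent generating the waveform, in seconds
    num_samples: number of generated waveform samples
    fs: sampling rate in Hz
    """
    audio_sec = num_samples / fs
    return elapsed_sec / audio_sec


# Example: generating 10 seconds of 22.05 kHz audio in 0.16 seconds.
rtf = real_time_factor(elapsed_sec=0.16, num_samples=220500, fs=22050)
print(f"RTF = {rtf:.3f}")
```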

Even on a CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 16 threads), it generates faster than real time.

[decode]: 100%|██████████| 250/250 [22:16<00:00,  5.35s/it, RTF=0.841]
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).

If you use MelGAN's generator, decoding is even faster.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [04:00<00:00,  1.04it/s, RTF=0.0882]
2020-02-08 10:45:14,111 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.137).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:06<00:00, 36.38it/s, RTF=0.00189]
2020-02-08 05:44:42,231 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.002).

If you use Multi-band MelGAN's generator, decoding is faster still.

# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [01:47<00:00,  2.95it/s, RTF=0.048]
2020-05-22 15:37:19,771 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.059).

# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:05<00:00, 43.67it/s, RTF=0.000928]
2020-05-22 15:35:13,302 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.001).

If you want to accelerate inference further, it is worth trying the conversion from PyTorch to TensorFlow.
An example of the conversion is available in the notebook (provided by @dathudeptrai).


Results

The results are summarized in the table below.
You can listen to the samples and download pretrained models via the links to our Google Drive.

| Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
|---|---|---|---|---|---|---|
| ljspeech_parallel_wavegan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 400k |
| ljspeech_parallel_wavegan.v1.long | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_parallel_wavegan.v1.no_limit | link | EN | 22.05k | None | 1024 / 256 / None | 400k |
| ljspeech_parallel_wavegan.v3 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 3M |
| ljspeech_melgan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 400k |
| ljspeech_melgan.v1.long | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_melgan_large.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 400k |
| ljspeech_melgan_large.v1.long | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_melgan.v3 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 2M |
| ljspeech_melgan.v3.long | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 4M |
| ljspeech_full_band_melgan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_full_band_melgan.v2 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_multi_band_melgan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_multi_band_melgan.v2 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1M |
| ljspeech_hifigan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 2.5M |
| ljspeech_style_melgan.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 1.5M |
| jsut_parallel_wavegan.v1 | link | JP | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| jsut_multi_band_melgan.v2 | link | JP | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| jsut_hifigan.v1 | link | JP | 24k | 80-7600 | 2048 / 300 / 1200 | 2.5M |
| jsut_style_melgan.v1 | link | JP | 24k | 80-7600 | 2048 / 300 / 1200 | 1.5M |
| csmsc_parallel_wavegan.v1 | link | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| csmsc_multi_band_melgan.v2 | link | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| csmsc_hifigan.v1 | link | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | 2.5M |
| csmsc_style_melgan.v1 | link | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | 1.5M |
| arctic_slt_parallel_wavegan.v1 | link | EN | 16k | 80-7600 | 1024 / 256 / None | 400k |
| jnas_parallel_wavegan.v1 | link | JP | 16k | 80-7600 | 1024 / 256 / None | 400k |
| vctk_parallel_wavegan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| vctk_parallel_wavegan.v1.long | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| vctk_multi_band_melgan.v2 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| vctk_hifigan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 2.5M |
| vctk_style_melgan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1.5M |
| libritts_parallel_wavegan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| libritts_parallel_wavegan.v1.long | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| libritts_multi_band_melgan.v2 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1M |
| libritts_hifigan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 2.5M |
| libritts_style_melgan.v1 | link | EN | 24k | 80-7600 | 2048 / 300 / 1200 | 1.5M |
| kss_parallel_wavegan.v1 | link | KO | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| hui_acg_hokuspokus_parallel_wavegan.v1 | link | DE | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |
| ruslan_parallel_wavegan.v1 | link | RU | 24k | 80-7600 | 2048 / 300 / 1200 | 400k |

Please visit our Google Drive to check more results.
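When pairing a vocoder from the table with a Text2Mel model, the sampling rate and hop size together determine the frame shift, which is one of the settings that have to match. As a small worked example (the function names are hypothetical), the frame shift and frame rate can be computed like this:

```python
def frame_shift_ms(fs, hop_size):
    """Time between consecutive mel frames, in milliseconds."""
    return 1000.0 * hop_size / fs


def frames_per_second(fs, hop_size):
    """Number of mel frames per second of audio."""
    return fs / hop_size


# Settings taken from the table above.
for name, fs, hop in [("22.05k / hop 256", 22050, 256), ("24k / hop 300", 24000, 300)]:
    print(f"{name}: {frame_shift_ms(fs, hop):.2f} ms per frame, "
          f"{frames_per_second(fs, hop):.2f} frames per second")
```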

How-to-use pretrained models


Analysis-synthesis

The minimal code below performs analysis-synthesis using a pretrained model.

# Please make sure you installed `parallel_wavegan`
# If not, please install via pip
$ pip install parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can list all available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(list(PRETRAINED_MODEL_LIST))
EOF

# Now you can find the downloaded pretrained model in `pretrained_model/<pretrained_model_tag>/`
$ ls pretrained_model/<pretrained_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

# Please put an audio file in `sample` directory to perform analysis-synthesis
$ ls sample/
  sample.wav

# Then perform feature extraction -> feature normalization -> synthesis
$ parallel-wavegan-preprocess \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-normalize \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir dump/sample/raw \
    --dumpdir dump/sample/norm \
    --stats pretrained_model/<pretrained_model_tag>/stats.h5
2019-11-13 13:44:29,574 (normalize:87) INFO: the number of files = 1.
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 513.13it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/norm \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# You can skip normalization step (on-the-fly normalization, feature extraction -> synthesis)
$ parallel-wavegan-preprocess \
    --config pretrained_model/<pretrained_model_tag>/config.yml \
    --rootdir sample \
    --dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --dumpdir dump/sample/raw \
    --normalize-before \
    --outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).

# you can find the generated speech in `sample` directory
$ ls sample
  sample.wav    sample_gen.wav

Decoding with ESPnet-TTS model's features

Here, I show the procedure to generate waveforms with features generated by ESPnet-TTS models.

# Make sure you have already finished running the ESPnet-TTS recipe.
# You must use the same feature settings for both Text2Mel and Mel2Wav models.
# Move to the ESPnet recipe directory
$ cd /path/to/espnet/egs/<recipe_name>/tts1
$ pwd

# If you use ESPnet2, move to `egs2/`
$ cd /path/to/espnet/egs2/<recipe_name>/tts1
$ pwd

# Please install this repository in ESPnet conda (or virtualenv) environment
$ . ./ && pip install -U parallel_wavegan

# You can download the pretrained model from terminal
$ python << EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrained_model_tag>", "pretrained_model")
EOF

# You can list all available pretrained models as follows:
$ python << EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(list(PRETRAINED_MODEL_LIST))
EOF

# You can find the downloaded pretrained model in `pretrained_model/<pretrained_model_tag>/`
$ ls pretrained_model/<pretrained_model_tag>
  checkpoint-400000steps.pkl    config.yml    stats.h5

# These files can also be downloaded manually from the above results

Case 1: If you use the same dataset for both Text2Mel and Mel2Wav

# In this case, you can directly use generated features for decoding.
# Please specify `feats.scp` path for `--feats-scp`, which is located in
# exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp.
# Do not use outputs_*_decode_denorm/<set_name>/feats.scp, since it contains
# de-normalized features (the input for PWG must be normalized features).
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp \
    --outdir <path_to_outdir>

# In the case of ESPnet2, the generated feature can be found in
# exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp.
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp \
    --outdir <path_to_outdir>

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav

Case 2: If you use different datasets for Text2Mel and Mel2Wav models

# In this case, you must provide `--normalize-before` option additionally.
# And use `feats.scp` of de-normalized generated features.

# ESPnet1 case
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/outputs_*_decode_denorm/<set_name>/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# ESPnet2 case
$ parallel-wavegan-decode \
    --checkpoint pretrained_model/<pretrained_model_tag>/checkpoint-400000steps.pkl \
    --feats-scp exp/<your_model_dir>/decode_*/<set_name>/denorm/feats.scp \
    --outdir <path_to_outdir> \
    --normalize-before

# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
  utt_id_1_gen.wav    utt_id_2_gen.wav  ...    utt_id_N_gen.wav
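The two cases differ only in which feats.scp is passed and whether `--normalize-before` is added. If you drive decoding from a script, that choice can be captured in a small helper. This is a sketch under stated assumptions: `build_decode_command` is a hypothetical name, and it only assembles the command-line flags already shown above.

```python
import shlex


def build_decode_command(checkpoint, feats_scp, outdir, normalize_before=False):
    """Assemble the parallel-wavegan-decode argument list.

    Set normalize_before=True for Case 2 (de-normalized features from a
    Text2Mel model trained on a different dataset).
    """
    cmd = [
        "parallel-wavegan-decode",
        "--checkpoint", checkpoint,
        "--feats-scp", feats_scp,
        "--outdir", outdir,
    ]
    if normalize_before:
        cmd.append("--normalize-before")
    return cmd


# Example with placeholder paths; pass the list to subprocess.run to execute it.
cmd = build_decode_command("checkpoint.pkl", "denorm/feats.scp", "wav", normalize_before=True)
print(" ".join(shlex.quote(arg) for arg in cmd))
```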

If you want to combine these models in python, you can try the real-time demonstration in Google Colab!

  • Real-time demonstration with ESPnet2
  • Real-time demonstration with ESPnet1

Decoding with dumped npy files

Sometimes we want to decode with dumped npy files, which are mel-spectrograms generated by TTS models. Please make sure you use the same feature extraction settings as the pretrained vocoder (fs, fft_size, hop_size, win_length, fmin, and fmax).
Only a difference in log_base can be compensated for with some post-processing (we use log 10 instead of natural log by default). See the details in the comment.
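As an illustration of the log_base post-processing, a mel-spectrogram computed with the natural log can be rescaled to log 10 using the identity log10(x) = ln(x) / ln(10). The sketch below assumes that the log base is the only mismatch and that the conversion is applied before any normalization; the function name is hypothetical.

```python
import numpy as np


def natural_log_to_log10(mel_ln):
    """Convert a natural-log mel-spectrogram to log10 scale.

    mel_ln: array of shape (#frames, #mels) holding ln(mel) values.
    """
    return np.asarray(mel_ln) / np.log(10.0)


# Example: ln-scaled values of 1, 10 and 100 become 0, 1 and 2 in log10.
mel_ln = np.log(np.array([[1.0, 10.0, 100.0]]))
print(natural_log_to_log10(mel_ln))
```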

# Generate dummy npy file of mel-spectrogram
$ ipython
[ins] In [1]: import numpy as np
[ins] In [2]: x = np.random.randn(512, 80)  # (#frames, #mels)
[ins] In [3]: np.save("dummy_1.npy", x)
[ins] In [4]: y = np.random.randn(256, 80)  # (#frames, #mels)
[ins] In [5]: np.save("dummy_2.npy", y)
[ins] In [6]: exit

# Make scp file (key-path format)
$ find . -name "*.npy" | sort | awk '{print "dummy_" NR " " $1}' > feats.scp

# Check (<utt_id> <path>)
$ cat feats.scp
dummy_1 ./dummy_1.npy
dummy_2 ./dummy_2.npy
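The same key-path file can also be written from Python, which avoids relying on the order in which files are listed. This is a minimal sketch using only the standard library; `write_feats_scp` is a hypothetical helper, and it uses each file name without the extension as the utterance ID.

```python
from pathlib import Path


def write_feats_scp(npy_dir, scp_path):
    """Write one '<utt_id> <path>' line per .npy file found in npy_dir."""
    npy_dir = Path(npy_dir)
    # Sort for a stable order; the file stem (e.g. "dummy_1") becomes the key.
    lines = [f"{path.stem} {path}" for path in sorted(npy_dir.glob("*.npy"))]
    Path(scp_path).write_text("".join(line + "\n" for line in lines))
    return lines


if __name__ == "__main__":
    for line in write_feats_scp(".", "feats.scp"):
        print(line)
```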

# Decode without feature normalization
# This case assumes that the input mel-spectrogram is normalized with the same statistics of the pretrained model.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).

# Decode with feature normalization
# This case assumes that the input mel-spectrogram is not normalized.
$ parallel-wavegan-decode \
    --checkpoint /path/to/checkpoint-400000steps.pkl \
    --feats-scp ./feats.scp \
    --normalize-before \
    --outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).
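Conceptually, feature normalization of this kind is a per-dimension mean-variance normalization of the mel-spectrogram. The sketch below shows the operation under the assumption that the statistics are a per-mel-bin mean and scale (standard deviation); the names `normalize_mel`, `mean`, and `scale` are hypothetical, and loading the values from stats.h5 is not shown.

```python
import numpy as np


def normalize_mel(mel, mean, scale):
    """Apply per-dimension mean-variance normalization.

    mel:   array of shape (#frames, #mels)
    mean:  array of shape (#mels,)
    scale: array of shape (#mels,), e.g. the per-dimension standard deviation
    """
    return (np.asarray(mel) - mean) / scale


# Example with statistics computed from the data itself, for illustration only.
mel = np.random.randn(512, 80) * 3.0 + 1.0
normalized = normalize_mel(mel, mel.mean(axis=0), mel.std(axis=0))
print(normalized.mean(), normalized.std())
```

In practice the statistics should be those of the pretrained vocoder's training data, not statistics recomputed from the input.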



The author would like to thank Ryuichi Yamamoto (@r9y9) for his great repository, paper, and valuable discussions.


Tomoki Hayashi (@kan-bayashi)
E-mail: hayashi.tomoki<at>

  • Multi-band MelGAN


    just found

    It seems to provide significantly better quality than regular MelGAN, and is also stunningly fast (0.03 RTF on CPU). The authors will be publishing the code shortly.

    Any chances we will see an implementation in this great repo? =)

    opened by alexdemartos 154
  • Generator exploded after ~138K iters.

    I observed an interesting behaviour after 138K iters, where the discriminator dominated the training and the generator exploded in both train and validation losses. Do you have any idea why, and how to prevent it?

    I am training on LJSpeech and I basically use the same learning schedule you released with the v2 config for LJSpeech. (Train generator until 100K and enable the discriminator)

    Here is the tensorboard screenshot.


    opened by erogol 58
  • Cannot use WaveGAN with Glow-TTS and NVIDIA's Tacotron2

    Hi. I trained the Tacotron2 model and the Glow-TTS model on the LJ Speech dataset and can successfully synthesize voice using WaveGlow as the vocoder. However, when I switched to Parallel WaveGAN, the synthesized waveform is quite strange.

    (At training time, the hop_size, sample_rate and window_size were set to the same values for the Tacotron, WaveGlow and WaveGAN models.)

    I successfully synthesized speech using WaveGAN with ESPnet's FastSpeech, but I failed to use WaveGAN to synthesize intelligible voice with any model derived from NVIDIA's Tacotron2 implementation (e.g. Glow-TTS). Could you please give me any advice? (In NVIDIA's Tacotron2, no CMVN is applied to the input mel-spectrogram features, so I didn't calculate the CMVN of the training waves and didn't invert it at inference time.)

    Thank you very much!

    opened by Charlottecuc 31
  • Many iterations of discriminator training causes strange noise

    I compared the following two models:

    • (Red) The model which trains the discriminator from 200k iters
    • (Blue) The model which trains the discriminator from the first iter

    Here is the training curve.

    From the curve, the blue one is better than the red in terms of log STFT magnitude loss.

    However, the blue model causes strange noise.

    You can listen to the samples.

    I think this is caused by the discriminator (v1 is red and v2 is blue). If you have any idea or suggestion to avoid this issue, please share with me.

    help wanted discussion 
    opened by kan-bayashi 21
  • How is the runtime on CPU?

    Hi! Thx for the repo. I was curious about the performance on CPU. AFAIK, it is 8x real-time on GPU but could you also share some values about CPU performance?

    opened by erogol 20
  • TTS + ParallelWaveGAN progress

    If you don't mind, I like to share my progress with PWGAN with TTS.

    Here is the first try results:

    Results are not better than what we have with WaveRNN, but I should say it is much faster.

    There is a hissing noise in the background. If you have any idea how to get rid of this, please let me know.

    The only difference in training (I guess) is that I don't apply mean-normalization to mel-spectrograms and I normalize to the [-4, 4] range.

    opened by erogol 18
  • training time for HiFiGAN LJSpeech


    I am training the HiFiGAN vocoder on LJSpeech, using the recipe provided. It has been running for more than a week.

    I am using 4 Tesla GPUs with 32 GB memory.

    May I know how much time it took for you?


    opened by nellorebhanuteja 17
  • Training StyleMelGan on custom dataset

    Hello again :)

    Are lab mono files required for training, or can that step be skipped using this script?

    opened by skol101 17
  • StyleMelGAN tuning


    • v1
      • MSE loss
      • batch size 32
      • repeats 4
    • v2
      • MSE Loss
      • batch size 8
      • repeats 4
    • v3
      • Hinge loss
      • batch size 8
      • repeats 4

    Learning rate scheduling may need investigation.

    opened by kan-bayashi 17
  • WaveGAN training on Tacotron outputs.

    Hey. I trained a Rayhane-mamah Tacotron 2 synthesizer without vocoder. As a vocoder, I wanted to use your repository, could you please tell me how to properly train WaveGAN? Need to train on GTA mels? If so, how to do it, if the preprocessing procedure in itself prepares mel spectrograms from ground truth audio on step 1?

    opened by Alexey322 17
  • How can we know multi GPU is working?

    It is mentioned in the paper that using more GPUs accelerates the training. I have three NVIDIA K80s and using the flags

    --nnodes 1 --nproc_per_node 3 -c

    binds all three GPUs and ramps them up to 98% usage. However, I cannot see any decrease in waiting time or epoch rounds, and leaving it overnight did not return any noticeably better results. Am I doing anything wrong? How can we know it is actually working? I tried to set --nnodes 3, but training never even started.

    opened by george-roussos 17
  • Unclear signal flow related to usage of mel spectrograms in StyleMelGAN


    This is probably just a documentation problem.

    It is unclear how mel spectrograms are used by the StyleMelGAN generator module.

    I've been trying to figure out how to format mel spectrograms so the generator will accept them. To figure that out, I've been looking at the initialization parameters of the StyleMelGANGenerator module.

    The only obvious candidate for defining the format/dimensions of the input spectrogram is the aux_channels parameter. But that wouldn't make sense, for these reasons:

    1. Its default value is 80, but a mel spectrogram contains much more than 80 points of data.
    2. aux_channels controls only one parameter: the in_channels parameter of the first layer in the first TADEResBlock. That would make sense if the mel spectrograms' dimensions corresponded to this parameter, but...
    3. The diagram of StyleMelGAN's signal path in the original StyleMelGan paper conflicts with point 2); the diagram shows the spectrograms being inserted into every TADEResBlock, not just the first.

    So my questions are:

    1. What is aux_channels? (What kind of data is considered "auxiliary input" - am I correct that this is the spectrograms?)
    2. If aux_channels does not determine how the input spectrograms should be formatted, what does?

    If you can answer these questions for me, I would be happy to improve the documentation/comments myself.

    Thank you!

    opened by andrewrose43 1
  • how to convert model to torchscript?

    import sys
    sys.path.insert(1, '/root/Downloads/ParallelWaveGAN-0.5.3/parallel_wavegan/utils')
    import torch
    import utils
    module = utils.load_model('pretrained_model/checkpoint-400000steps.pkl')
    print(module)
    #model = torch.load('pretrained_model/checkpoint-400000steps.pkl',map_location=torch.device('cpu'))
    #print('load model successful!')
    x = torch.zeros(5, 10, 5, dtype=torch.float64)
    x = x + (0.1**0.5)*torch.randn(5, 10, 5)
    c = torch.rand(80,80,5)
    print(x)
    print('-------------------')
    print(c)
    print('-------------------')
    print(x.size(-1))
    print('-------------------')
    print(c.size(-1))
    trace_model = torch.jit.trace(module,(x,c))

    The error is:

    Traceback (most recent call last):
      File "", line 19, in
        trace_model = torch.jit.trace(module,(x,c))
      File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/jit/", line 768, in trace
        _module_class,
      File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/jit/", line 983, in trace_module
        argument_names,
      File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/nn/modules/", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/root/anaconda3/envs/pwgan/lib/python3.7/site-packages/torch/nn/modules/", line 1178, in _slow_forward
        result = self.forward(*input, **kwargs)
      File "/root/Downloads/ParallelWaveGAN-0.5.3/parallel_wavegan/models/", line 159, in forward
        assert c.size(-1) == x.size(-1)
    AssertionError

    How should the values of the parameters x and c be set?

    opened by zhuziying 0
  • Low inference speed of TTS on GPU

    May I ask why the RTF of TTS is only 0.09 for a 12-second sentence? I use the fastspeech2_HIFiGAN model and the GPU is an A2000 (8.0 capability). I thought it should be at least a 50x speedup, because the FastSpeech2 paper reports a 50x speedup over Transformer and the HiFiGAN paper reports a 1000x speedup. Can anyone tell me what's wrong? Thank you!

    opened by dalvlv 2
  • Avocodo Discriminators

    A new interesting vocoder was described in a paper yesterday. It's called Avocodo and supposedly helps with the artifacts that are typical for GAN based vocoding. It supposedly also works better for unseen speakers than HiFiGAN, although I never had any issues with HiFiGAN and unseen speakers anyways.

    The generator seems to be pretty much the same as HiFiGAN's, but it has some new discriminators, which I think would be a nice addition to this repository. Combining Avocodo with e.g. the MultiPeriodDiscriminator would be very interesting!

    feature request 
    opened by Flux9665 0
  • If fine-tuning from pre-trained  should generator_scheduler_params be updated?

    I'm fine-tuning HiFiGAN from the 2.5M-step pretrained model to 3M steps.

    I wonder if updating the milestones like this is the way to go?

    generator_optimizer_type: Adam
    generator_optimizer_params:
        lr: 2.0e-4
        betas: [0.5, 0.9]
        weight_decay: 0.0
    generator_scheduler_type: MultiStepLR
    generator_scheduler_params:
        gamma: 0.5
        milestones:
            - 2600000
            - 2700000
            - 2800000
            - 2900000
    generator_grad_norm: -1
    discriminator_optimizer_type: Adam
    discriminator_optimizer_params:
        lr: 2.0e-4
        betas: [0.5, 0.9]
        weight_decay: 0.0
    discriminator_scheduler_type: MultiStepLR
    discriminator_scheduler_params:
        gamma: 0.5
        milestones:
            - 2600000
            - 2700000
            - 2800000
            - 2900000
    discriminator_grad_norm: -1
    opened by skol101 0
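    To make the effect of the milestones in the question above concrete: a MultiStepLR-style schedule multiplies the learning rate by gamma each time a milestone is reached. The sketch below reproduces that rule in plain Python for the values quoted in the config. It is only an illustration; it assumes the scheduler counts steps from zero and does not model how a resumed checkpoint restores optimizer or scheduler state.

```python
from bisect import bisect_right


def multistep_lr(step, base_lr, milestones, gamma):
    """Learning rate after `step` updates under a MultiStepLR-style schedule.

    The rate is base_lr * gamma ** (number of milestones <= step).
    """
    return base_lr * gamma ** bisect_right(sorted(milestones), step)


milestones = [2600000, 2700000, 2800000, 2900000]
for step in (2500000, 2600000, 2750000, 3000000):
    print(step, multistep_lr(step, base_lr=2.0e-4, milestones=milestones, gamma=0.5))
```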
  • v0.5.5(May 17, 2022)

    What's Changed

    • add recipe for kiritan & ofuton_p_utagoe db (singing voice synthesis) by @PeterGuoRuc
    • Add recipe for Opencpop by @ftshijt
    • add causal option for HiFiGAN by @chomeyama
    • Fix HiFiGAN compatibility by @kan-bayashi
    • add recipe for natsume (singing voice synthesis) by @PeterGuoRuc
    • Update readme with pre-trained models on svs and demonstration by @ftshijt
    • Add recipes and pretrained models for CSD (Korean&English) and KiSIng (Mandarin) databases by @ftshijt
    • Add new recipe PJS (singing voice synthesis) by @A-Quarter-Mile
    • add no7singing training by @frankxu2004
    • Add icelandic by @G-Thor
    • add tag_or_url for download_pretrained_model by @roholazandie
    • Apply black by @kan-bayashi
    • Update to v0.5.5 by @kan-bayashi

    New Contributors

    • @PeterGuoRuc made their first contribution
    • @chomeyama made their first contribution
    • @A-Quarter-Mile made their first contribution
    • @frankxu2004 made their first contribution
    • @G-Thor made their first contribution
    • @roholazandie made their first contribution

  • v0.5.4(Feb 10, 2022)

    What's Changed

    • add kss recipe by @windtoker
    • Fix a noise shape of StyleMelGANGenerator to export ONNX model by @c-bata
    • add recipe for oniku_kurumi_utagoe db (singing voice synthesis) by @ftshijt
    • update documentation and correct the download link for oniku-db by @ftshijt
    • Fix an error librosa update by @kan-bayashi
    • Add pytorch 1.10.x CI by @kan-bayashi

    New Contributors

    • @windtoker made their first contribution
    • @c-bata made their first contribution
    • @ftshijt made their first contribution

  • v0.5.3(Aug 26, 2021)

  • v0.5.2(Aug 24, 2021)

  • v0.5.1(Aug 8, 2021)

  • v0.5.0(Aug 7, 2021)

  • v0.4.8(Nov 2, 2020)

  • v0.4.6(Aug 31, 2020)

  • v0.4.5(Aug 18, 2020)

    • Simplify decoding part
    • Add load_model function
    • Add inference method in each generator
    • Add download_pretrained_model function to download from google drive directly
  • v0.4.3(Aug 16, 2020)

  • v0.4.2(Aug 15, 2020)

  • v0.4.1(Jun 28, 2020)

  • v0.4.0(May 28, 2020)

  • v0.3.5(May 11, 2020)

  • v0.3.4(Mar 12, 2020)

    What's new


    • Support --pretrain option in training
    • Support --skip-wav-copy option in normalization
    • Support scp style input for all /bin scripts for ESPnet compatibility
    • Better parallelization (much faster in the case of the large dataset)


    • Fix format: npy support
    • Add VCTK recipe
    • Add melgan.v3 config
    • Add parallel_wavegan.v3 config
  • v0.3.0(Feb 15, 2020)

    What's new

    • Support more recipes
    • Support new residual discriminator
    • Support MelGAN generator
    • Support MelGAN discriminator
    • And more refactoring...
  • v0.2.5(Nov 16, 2019)

Tomoki Hayashi
Postdoctoral researcher @ Nagoya University / COO @ Human Dataware Lab. Co., Ltd.