An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Overview

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.

Quick Links

Setup

Software

Requirements:

  • Python >= 3.6
  • PyTorch v1.8
  • Install dependencies
    git clone https://github.com/facebookresearch/speech-resynthesis.git
    cd speech-resynthesis
    pip install -r requirements.txt

Data

For LJSpeech:

  1. Download LJSpeech dataset from here into data/LJSpeech-1.1 folder.
  2. Downsample audio from 22.05 kHz to 16 kHz and pad
    bash
    python ./scripts/preprocess.py \
    --srcdir data/LJSpeech-1.1/wavs \
    --outdir data/LJSpeech-1.1/wavs_16khz \
    --pad
    

For VCTK:

  1. Download VCTK dataset from here into data/VCTK-Corpus folder.
  2. Downsample audio from 48 kHz to 16 kHz, trim trailing silences and pad
    python ./scripts/preprocess.py \
    --srcdir data/VCTK-Corpus/wav48 \
    --outdir data/VCTK-Corpus/wav16 \
    --trim --pad

Training

F0 Quantizer Model

To train F0 quantizer model, use the following command:

python -m torch.distributed.launch --nproc_per_node 8 train_f0_vq.py \
--checkpoint_path checkpoints/lj_f0_vq \
--config configs/LJSpeech/f0_vqvae.json

Set <NUM_GPUS> to the number of availalbe GPUs on your machine.

Resynthesis Model

To train a resynthesis model, use the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/lj_vqvae \
--config configs/LJSpeech/vqvae256_lut.json

Supported Configurations

Currently, we support the following training schemes:

Dataset SSL Method Dictionary Size Config Path
LJSpeech HuBERT 100 configs/LJSpeech/hubert100_lut.json
LJSpeech CPC 100 configs/LJSpeech/cpc100_lut.json
LJSpeech VQVAE 256 configs/LJSpeech/vqvae256_lut.json
VCTK HuBERT 100 configs/VCTK/hubert100_lut.json
VCTK CPC 100 configs/VCTK/cpc100_lut.json
VCTK VQVAE 256 configs/VCTK/vqvae256_lut.json

Inference

To generate, simply run:

python inference.py \
--checkpoint_file checkpoints/0 \
-n 10 \
--output_dir generations

To synthesize multiple speakers:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--vc \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_multispkr

You can also generate with codes from a different dataset:

python inference.py \
--checkpoint_file checkpoints/lj_cpc100 \
-n 10 \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_vctk_to_lj

Preprocessing New Datasets

CPC / HuBERT Coding

To quantize new datasets with CPC or HuBERT follow the instructions described in the GSLM code.

To parse CPC output:

python scripts/parse_cpc_codes.py \
--manifest cpc_output_file \
--wav-root wav_root_dir \
--outdir parsed_cpc

To parse HuBERT output:

python parse_hubert_codes.py \
--codes hubert_output_file \
--manifest hubert_tsv_file \
--outdir parsed_hubert 

VQVAE Coding

First, you will need to download LibriLight dataset and move it to data/LibriLight.

For VQVAE, train a vqvae model using the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/ll_vq \
--config configs/LibriLight/vqvae256.json

To extract VQVAE codes:

python infer_vqvae_codes.py \
--input_dir folder_with_wavs_to_code \
--output_dir vqvae_output_folder \
--checkpoint_file checkpoints/ll_vq

To parse VQVAE output:

 python parse_vqvae_codes.py \
 --manifest vqvae_output_file \
 --outdir parsed_vqvae

License

You may find out more about the license here.

Citation

@inproceedings{polyak21_interspeech,
  author={Adam Polyak and Yossi Adi and Jade Copet and 
          Eugene Kharitonov and Kushal Lakhotia and 
          Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
  title={{Speech Resynthesis from Discrete Disentangled Self-Supervised Representations}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
}

Acknowledgements

This implementation uses code from the following repos: HiFi-GAN and Jukebox, as described in our code.

Owner
Facebook Research
Facebook Research
[CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

TransFuser This repository contains the code for the CVPR 2021 paper Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. If you find our

695 Jan 05, 2023
Bringing Characters to Life with Computer Brains in Unity

AI4Animation: Deep Learning for Character Control This project explores the opportunities of deep learning for character animation and control as part

Sebastian Starke 5.5k Jan 04, 2023
Implementation of Vaswani, Ashish, et al. "Attention is all you need."

Attention Is All You Need Paper Implementation This is my from-scratch implementation of the original transformer architecture from the following pape

Brando Koch 195 Dec 30, 2022
MODNet: Trimap-Free Portrait Matting in Real Time

MODNet is a model for real-time portrait matting with only RGB image input.

Zhanghan Ke 2.8k Dec 30, 2022
The Few-Shot Bot: Prompt-Based Learning for Dialogue Systems

Few-Shot Bot: Prompt-Based Learning for Dialogue Systems This repository includes the dataset, experiments results, and code for the paper: Few-Shot B

Andrea Madotto 103 Dec 28, 2022
Contenido del curso Bases de datos del DCC PUC versión 2021-2

IIC2413 - Bases de Datos Tabla de contenidos Equipo Profesores Ayudantes Contenidos Calendario Evaluaciones Resumen de notas Foro Política de integrid

54 Nov 23, 2022
modelvshuman is a Python library to benchmark the gap between human and machine vision

modelvshuman is a Python library to benchmark the gap between human and machine vision. Using this library, both PyTorch and TensorFlow models can be evaluated on 17 out-of-distribution datasets with

Bethge Lab 244 Jan 03, 2023
Robot Hacking Manual (RHM). From robotics to cybersecurity. Papers, notes and writeups from a journey into robot cybersecurity.

RHM: Robot Hacking Manual Download in PDF RHM v0.4 ┃ Read online The Robot Hacking Manual (RHM) is an introductory series about cybersecurity for robo

Víctor Mayoral Vilches 233 Dec 30, 2022
Code for the paper: On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

Non-Parametric Prior Actor-Critic (N-PPAC) This repository contains the code for On Pathologies in KL-Regularized Reinforcement Learning from Expert D

Cong Lu 5 May 13, 2022
Pytorch Implementation of the paper "Cross-domain Correspondence Learning for Exemplar-based Image Translation"

CoCosNet Pytorch Implementation of the paper "Cross-domain Correspondence Learning for Exemplar-based Image Translation" (CVPR 2020 oral). Update: 202

Lingbo Yang 38 Sep 22, 2021
PyTorch implementation of PP-LCNet

PP-LCNet-Pytorch Pre-Trained Models Google Drive p018 Accuracy Models Top1 Top5 PPLCNet_x0_25 0.5186 0.7565 PPLCNet_x0_35 0.5809 0.8083 PPLCNet_x0_5 0

24 Dec 12, 2022
FaceVerse: a Fine-grained and Detail-controllable 3D Face Morphable Model from a Hybrid Dataset (CVPR2022)

FaceVerse FaceVerse: a Fine-grained and Detail-controllable 3D Face Morphable Model from a Hybrid Dataset Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang

Lizhen Wang 219 Dec 28, 2022
Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Ehab AlBadawy 93 Jan 05, 2023
mmdetection version of TinyBenchmark.

introduction This project is an mmdetection version of TinyBenchmark. TODO list: add TinyPerson dataset and evaluation add crop and merge for image du

34 Aug 27, 2022
Json2Xml tool will help you convert from json COCO format to VOC xml format in Object Detection Problem.

JSON 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Json2Xml t

Nguyễn Trường Lâu 6 Aug 22, 2022
Customer-Transaction-Analysis - This analysis is based on a synthesised transaction dataset containing 3 months worth of transactions for 100 hypothetical customers.

Customer-Transaction-Analysis - This analysis is based on a synthesised transaction dataset containing 3 months worth of transactions for 100 hypothetical customers. It contains purchases, recurring

Ayodeji Yekeen 1 Jan 01, 2022
CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search

CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search This repository is the official implementation of CAPITAL: Optimal Subgrou

Hengrui Cai 0 Oct 19, 2021
This is an implementation of PIFuhd based on Pytorch

Open-PIFuhd This is a unofficial implementation of PIFuhd PIFuHD: Multi-Level Pixel-Aligned Implicit Function forHigh-Resolution 3D Human Digitization

Lingteng Qiu 235 Dec 19, 2022
Convert ONNX model graph to Keras model format.

Convert ONNX model graph to Keras model format.

Grigory Malivenko 175 Dec 28, 2022
Convnet transfer - Code for paper How transferable are features in deep neural networks?

How transferable are features in deep neural networks? This repository contains source code necessary to reproduce the results presented in the follow

Jason Yosinski 143 Sep 13, 2022