Here is the implementation of our paper S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations.

Related tags

Deep LearningS2VC
Overview

S2VC

Here is the implementation of our paper S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations. In this paper, we proposed S2VC which utilizes Self-Supervised pretrained representation to provide the latent phonetic structure of the utterance from the source speaker and the spectral features of the utterance from the target speaker.

The following is the overall model architecture.

Model architecture

For the audio samples, please refer to our demo page.

Usage

You can download the pretrained model as well as the vocoder following the link under Releases section on the sidebar.

The whole project was developed using Python 3.8, torch 1.7.1, and the pretrained model, as well as the vocoder, were turned to TorchScript, so it's not guaranteed to be backward compatible. You can install the dependencies with

pip install -r requirements.txt

If you encounter any problems while installing fairseq, please refer to pytorch/fairseq for the installation instruction.

Self-Supervised representations

Wav2vec2

In our implementation, we're using Wav2Vec 2.0 Base w/o finetuning which is trained on LibriSpeech. You can download the checkpoint wav2vec_small.pt from pytorch/fairseq.

APC(Autoregressive Predictive Coding), CPC(Contrastive Predictive Coding)

These two representations are extracted using this speech toolkit S3PRL. You can check how to extract various representations from that repo.

Vocoder

The WaveRNN-based neural vocoder is from yistLin/universal-vocoder which is based on the paper, Towards achieving robust universal neural vocoding.

Voice conversion with pretrained models

You can convert an utterance from the source speaker with multiple utterances from the target speaker by preparing a conversion pairs information file in YAML format, like

# pairs_info.yaml
pair1:
    source: VCTK-Corpus/wav48/p225/p225_001.wav
    target:
        - VCTK-Corpus/wav48/p227/p227_001.wav
pair2:
    source: VCTK-Corpus/wav48/p225/p225_001.wav
    target:
        - VCTK-Corpus/wav48/p227/p227_002.wav
        - VCTK-Corpus/wav48/p227/p227_003.wav
        - VCTK-Corpus/wav48/p227/p227_004.wav

And convert multiple pairs at the same time, e.g.

python convert_batch.py \
    -w <WAV2VEC_PATH> \
    -v <VOCODER_PATH> \
    -c <CHECKPOINT_PATH> \
    -s <SOURCE_FEATURE_NAME> \
    -r <REFERENCE_FEATURE_NAME> \
    pairs_info.yaml \
    outputs # the output directory of conversion results

After the conversion, the output directory, outputs, will be containing

pair1.wav
pair1.mel.png
pair1.attn.png
pair2.wav
pair2.mel.png
pair2.attn.png

Train from scratch

Preprocessing

You can preprocess multiple corpora by passing multiple paths. But each path should be the directory that directly contains the speaker directories. And you have to specify the feature you want to extract. Currently, we support apc, cpc, wav2vec2, and timit_posteriorgram. i.e.

python3 preprocess.py
    VCTK-Corpus/wav48 \
    <SECOND_Corpus_PATH> \ # more corpus if you want
    <FEATURE_NAME> \
    <WAV2VEC_PATH> \
    processed/<FEATURE_NAME>  # the output directory of preprocessed features

After preprocessing, the output directory will be containing:

metadata.json
utterance-000x7gsj.tar
utterance-00wq7b0f.tar
utterance-01lpqlnr.tar
...

You may need to preprocess multiple times for different features. i.e.

python3 preprocess.py
    VCTK-Corpus/wav48 apc <WAV2VEC_PATH> processed/apc
python3 preprocess.py
    VCTK-Corpus/wav48 cpc <WAV2VEC_PATH> processed/cpc
    ...

Then merge the metadata of different features.

i.e.

python3 merger.py processed

Training

python train.py processed
    --save_dir ./ckpts \
    -s <SOURCE_FEATURE_NAME> \
    -r <REFERENCE_FEATURE_NAME>

You can further specify --preload for preloading all training data into RAM to boost training speed. If --comment is specified, e.g. --comment CPC-CPC, the training logs will be placed under a newly created directory like, logs/2020-02-02_12:34:56_CPC-CPC, otherwise there won't be any logging. For more details, you can refer to the usage by python train.py -h.

You might also like...
Phonetic PosteriorGram (PPG)-Based Voice Conversion (VC)

ppg-vc Phonetic PosteriorGram (PPG)-Based Voice Conversion (VC) This repo implements different kinds of PPG-based VC models. Pretrained models. More m

The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.
The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.

Published by SpaceML • About SpaceML • Quick Colab Example Self-Supervised Learner The Self-Supervised Learner can be used to train a classifier with

Repository providing a wide range of self-supervised pretrained models for computer vision tasks.

Hierarchical Pretraining: Research Repository This is a research repository for reproducing the results from the project "Self-supervised pretraining

The PASS dataset: pretrained models and how to get the data -  PASS: Pictures without humAns for Self-Supervised Pretraining
The PASS dataset: pretrained models and how to get the data - PASS: Pictures without humAns for Self-Supervised Pretraining

The PASS dataset: pretrained models and how to get the data - PASS: Pictures without humAns for Self-Supervised Pretraining

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.
Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation
PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation

StructDepth PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimat

We evaluate our method on different datasets (including ShapeNet, CUB-200-2011, and Pascal3D+) and achieve state-of-the-art results, outperforming all the other supervised and unsupervised methods and 3D representations, all in terms of performance, accuracy, and training time. [CVPR2021] The source code for our paper 《Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning》.
[CVPR2021] The source code for our paper 《Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning》.

TBE The source code for our paper "Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Le

Code for our paper Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
Code for our paper Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation

CorDA Code for our paper Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation Prerequisite Please create and activate the follo

Comments
  • Cannot find f2114342ff9e813e18a580fa41418aee9925414e in https://github.com/s3prl/s3prl

    Cannot find f2114342ff9e813e18a580fa41418aee9925414e in https://github.com/s3prl/s3prl

    Running convert_batch.py throws ValueError: Cannot find f2114342ff9e813e18a580fa41418aee9925414e in https://github.com/s3prl/s3prl that originates from https://github.com/howard1337/S2VC/blob/8a6dcebc052424c41c62be0b22cb581258c5b4aa/data/feature_extract.py#L18

    File "convert_batch.py", line 61, in main
    src_feat_model = FeatureExtractor(src_feat_name, wav2vec_path, device)
    File "/deepmind/experiments/howard1337/s2vc/data/feature_extract.py", line 18, in __init__
    torch.hub.load("s3prl/s3prl:f2114342ff9e813e18a580fa41418aee9925414e", feature_name, refresh=True).eval().to(device)
    File "/storage/usr/conda/envs/s2vc/lib/python3.8/site-packages/torch/hub.py", line 402, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose, skip_validation)
    File "/storage/usr/conda/envs/s2vc/lib/python3.8/site-packages/torch/hub.py", line 190, in _get_cache_or_reload
    _validate_not_a_forked_repo(repo_owner, repo_name, branch)
    File "/storage/usr/conda/envs/s2vc/lib/python3.8/site-packages/torch/hub.py", line 160, in _validate_not_a_forked_repo
    raise ValueError(f'Cannot find {branch} in https://github.com/{repo_owner}/{repo_name}. '
    ValueError: Cannot find f2114342ff9e813e18a580fa41418aee9925414e in https://github.com/s3prl/s3prl. If it's a commit from a forked repo, please call hub.load() with forked repo directly.
    

    Any idea on how to solve this?

    opened by jerrymatjila 1
  • Could you provide ppg-extracting code?

    Could you provide ppg-extracting code?

    Dear author,

    In your paper, you mentioned you extracted ppg and SSL features by s3prl toolkit. However, I cannot find in s3prl on how to extract ppg. Could you provide the code or guideline on extracting ppgs? Thanks a lot!
    
    opened by hongchengzhu 0
  • What are vocoder-ckpt-*.pt?

    What are vocoder-ckpt-*.pt?

    You release the following vocoder checkpoints:

    vocoder-ckpt-apc.pt
    vocoder-ckpt-cpc.pt
    vocoder-ckpt-wav2vec2.pt
    

    What are they?

    Are they vocoders fine-tuned on the output of a particular model? I didn't see that described in the paper. Why is this needed, if the S2VC output is a mel? If it's because different models produce different mels, do you use vocoder-ckpt-cpc.pt when target model is cpc? And if so, how did you do the fine-tuning?

    opened by turian 0
  • Training of other features (apc, timit_posteriorgram etc.) do not work

    Training of other features (apc, timit_posteriorgram etc.) do not work

    I have tried training with other than the cpc feature on my prepared corpus. However, the training script fails when the loss function (train.py , line 69). I found that the size of the output vector out is hard-coded, which is inconsistent with the size of the target Mel spectrogram of other features.

    The size of some vectors of the model are:

    • apc case: Input dim: 512, Reference dim: 512, Target dim: 240
    • cpc case: Input dim: 256, Reference dim: 256, Target dim: 80

    I prepared the input feature vectors by using preprocess.py, e.g. python .\preprocess.py (my own corpus) apc .\checkpoints\wav2vec_small.pt processed/apc.

    I have modified the model by changing the size of the vectors and can run train.py now. In the model.py, __init__() of S2VC function, I replace 80 with a function argument and pass the size of Mel vector size. But I cannot determine the modification is appropriate, for I am not familiar with NLP.

    convert_batch.py with pre-trained models works well as you described in README.md.

    Other details of my situation are:

    • Windows 10, PowerShell
    • pytorch 1.7.1 + cu110
    • torchaudio 0.7.1
    • sox 1.4.1
    • tqdm 4.42.0
    • librosa 0.8.1
    opened by sage-git 0
Releases(v1.0)
Source code, datasets and trained models for the paper Learning Advanced Mathematical Computations from Examples (ICLR 2021), by François Charton, Amaury Hayat (ENPC-Rutgers) and Guillaume Lample

Maths from examples - Learning advanced mathematical computations from examples This is the source code and data sets relevant to the paper Learning a

Facebook Research 171 Nov 23, 2022
Library extending Jupyter notebooks to integrate with Apache TinkerPop and RDF SPARQL.

Graph Notebook: easily query and visualize graphs The graph notebook provides an easy way to interact with graph databases using Jupyter notebooks. Us

Amazon Web Services 501 Dec 28, 2022
Generate high quality pictures. GAN. Generative Adversarial Networks

ESRGAN generate high quality pictures. GAN. Generative Adversarial Networks """ Super-resolution of CelebA using Generative Adversarial Networks. The

Lieon 1 Dec 14, 2021
Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation, CVPR 2018

Learning Pixel-level Semantic Affinity with Image-level Supervision This code is deprecated. Please see https://github.com/jiwoon-ahn/irn instead. Int

Jiwoon Ahn 337 Dec 15, 2022
This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by Divam Gupta, Wei Pu, Trenton Tabor, Jeff Schneider

SBEVNet: End-to-End Deep Stereo Layout Estimation This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by D

Divam Gupta 19 Dec 17, 2022
Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Optimizing Dense Retrieval Model Training with Hard Negatives Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma 🔥 News 2021-10

Jingtao Zhan 99 Dec 27, 2022
Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechanism for Generalized Face Presentation Attack Detection

LMFD-PAD Note This is the official repository of the paper: LMFD-PAD: Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechani

28 Dec 02, 2022
CS583: Deep Learning

CS583: Deep Learning

Shusen Wang 2.6k Dec 30, 2022
Code for the Active Speakers in Context Paper (CVPR2020)

Active Speakers in Context This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper. Before Training The c

43 Oct 14, 2022
Official repository of the paper "A Variational Approximation for Analyzing the Dynamics of Panel Data". Mixed Effect Neural ODE. UAI 2021.

Official repository of the paper (UAI 2021) "A Variational Approximation for Analyzing the Dynamics of Panel Data", Mixed Effect Neural ODE. Panel dat

Jurijs Nazarovs 7 Nov 26, 2022
End-To-End Crowdsourcing

End-To-End Crowdsourcing Comparison of traditional crowdsourcing approaches to a state-of-the-art end-to-end crowdsourcing approach LTNet on sentiment

Andreas Koch 1 Mar 06, 2022
Code for "Adversarial Attack Generation Empowered by Min-Max Optimization", NeurIPS 2021

Min-Max Adversarial Attacks [Paper] [arXiv] [Video] [Slide] Adversarial Attack Generation Empowered by Min-Max Optimization Jingkang Wang, Tianyun Zha

Jingkang Wang 12 Nov 23, 2022
Codebase for arXiv preprint "NeRF++: Analyzing and Improving Neural Radiance Fields"

NeRF++ Codebase for arXiv preprint "NeRF++: Analyzing and Improving Neural Radiance Fields" Work with 360 capture of large-scale unbounded scenes. Sup

Kai Zhang 722 Dec 28, 2022
SMCA replication There are no extra compiled components in SMCA DETR and package dependencies are minimal

Usage There are no extra compiled components in SMCA DETR and package dependencies are minimal, so the code is very simple to use. We provide instruct

22 May 06, 2022
MoViNets PyTorch implementation: Mobile Video Networks for Efficient Video Recognition;

MoViNet-pytorch Pytorch unofficial implementation of MoViNets: Mobile Video Networks for Efficient Video Recognition. Authors: Dan Kondratyuk, Liangzh

189 Dec 20, 2022
The implementation of PEMP in paper "Prior-Enhanced Few-Shot Segmentation with Meta-Prototypes"

Prior-Enhanced network with Meta-Prototypes (PEMP) This is the PyTorch implementation of PEMP. Overview of PEMP Meta-Prototypes & Adaptive Prototypes

Jianwei ZHANG 8 Oct 14, 2021
Predicting Student Attentiveness using OpenCV

Predicting-Student-Attentiveness-using-OpenCV The model will predict if a student is attentive or not through facial parameter received through the st

Johann Pinto 2 Aug 20, 2022
Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite.

TFlite Ultra Fast Lane Detection Inference Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite. So

Ibai Gorordo 12 Aug 27, 2022
Clustering is a popular approach to detect patterns in unlabeled data

Visual Clustering Clustering is a popular approach to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a data

Tarek Naous 24 Nov 11, 2022
Unofficial pytorch implementation of the paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution"

DFSA Unofficial pytorch implementation of the ICCV 2021 paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution" (p

2 Nov 15, 2021