Pytorch Implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Related tags

Deep LearningNANSY
Overview

NANSY:

Unofficial Pytorch Implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Notice

Papers' Demo

Check Authors' Demo page

Sample-Only Demo Page

Check Demo Page

Concerns

Among the various controllabilities, it is rather obvious that the voice conversion technique can be misused and potentially harm other people. 
More concretely, there are possible scenarios where it is being used by random unidentified users and contributing to spreading fake news. 
In addition, it can raise concerns about biometric security systems based on speech. 
To mitigate such issues, the proposed system should not be released without a consent so that it cannot be easily used by random users with malicious intentions. 
That being said, there is still a potential for this technology to be used by unidentified users. 
As a more solid solution, therefore, we believe a detection system that can discriminate between fake and real speech should be developed.

We provide both pretrained checkpoint of Discriminator network and inference code for this concern.

Environment

Requirements

pip install -r requirements.txt

Docker

Image

If using cu113 compatible environment, use Dockerfile
If using cu102 compatible environment, use Dockerfile-cu102

docker build -f Dockerfile -t nansy:v0.0 .

Container

After building appropriate image, use docker-compose or docker to run a container.
You may want to modify docker-compose.yml or docker_run_script.sh

docker-compose -f docker-compose.yml run --service-ports --name CONTAINER_NAME nansy_container bash
or
bash docker_run_script.sh

Pretrained hifi-gan

Download pretrained hifi-gan config and checkpoint
from hifi-gan to ./configs/hifi-gan/UNIVERSAL_V1

Pretrained Checkpoints

TODO

Datasets

Datasets used when training are:

Custom Datasets

Write your own code!
If inheriting datasets.custom.CustomDataset, self.data should be as:

self.data: list
self.data[i]: dict must have:
    'wav_path_22k': str = path_to_22k_wav_file
    'wav_path_16k': str = (optional) path_to_16k_wav_file
    'speaker_id': str = speaker_id

Train

If you prefer pytorch-lightning, python train.py -g 1

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, default="configs/train_nansy.yaml")
parser.add_argument('-g', '--gpus', type=str,
                    help="number of gpus to use")
parser.add_argument('-p', '--resume_checkpoint_path', type=str, default=None,
                    help="path of checkpoint for resuming")
args = parser.parse_args()
return args

else python train_torch.py # TODO, not completely supported now

Configs Description

Edit configs/train_nansy.yaml.

Dataset settings

  • Adjust datasets.*.datasets list.
    • Paths to dataset config files should be in the list
datasets:
  train:
    class: datasets.base.MultiDataset
    datasets: [
      # 'configs/datasets/css10.yaml',
        'configs/datasets/vctk.yaml',
        'configs/datasets/libritts360.yaml',
    ]

    mode: train
    batch_size: 32 # Depends on GPU Memory, Original paper used 32
    shuffle: True
    num_workers: 16 # Depends on available CPU cores

  eval:
    class: datasets.base.MultiDataset
    datasets: [
      # 'configs/datasets/css10.yaml',
        'configs/datasets/vctk.yaml',
        'configs/datasets/libritts360.yaml',
    ]

    mode: eval
    batch_size: 32
    shuffle: False
    num_workers: 4
Dataset Config

Dataset configs are at ./configs/datasets/.
You might want to replace /raid/vision/dhchoi/data to YOUR_PATH_DO_DATA, especially at path section.

class: datasets.vctk.VCTKDataset # implemented Dataset class name
load:
  audio: 'configs/audio/22k.yaml'

path:
  root: /raid/vision/dhchoi/data/
  wav22: /raid/vision/dhchoi/data/VCTK-Corpus/wav22
  wav16: /raid/vision/dhchoi/data/VCTK-Corpus/wav16
  txt: /raid/vision/dhchoi/data/VCTK-Corpus/txt
  timestamp: ./vctk-silence-labels/vctk-silences.0.92.txt

  configs:
    train: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_train.txt
    eval: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_val.txt
    test: /raid/vision/dhchoi/data/VCTK-Corpus/vctk_22k_test.txt

Model Settings

  • Comment out or Delete Discriminator section if no Discriminator needed.
  • Adjust optimizer class, lr and betas if needed.
models:
  Analysis:
    class: models.analysis.Analysis

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]

  Synthesis:
    class: models.synthesis.Synthesis

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]

  Discriminator:
    class: models.synthesis.Discriminator

    optim:
      class: torch.optim.Adam
      kwargs:
        lr: 1e-4
        betas: [ 0.5, 0.9 ]

Logging & Pytorch-lightning settings

For pytorch-lightning configs in section pl, check official docs

pl:
  checkpoint:
    callback:
      save_top_k: -1
      monitor: "train/backward"
      verbose: True
      every_n_epochs: 1 # epochs

  trainer:
    gradient_clip_val: 0 # don't clip (default value)
    max_epochs: 10000
    num_sanity_val_steps: 1
    fast_dev_run: False
    check_val_every_n_epoch: 1
    progress_bar_refresh_rate: 1
    accelerator: "ddp"
    benchmark: True

logging:
  log_dir: /raid/vision/dhchoi/log/nansy/ # PATH TO SAVE TENSORBOARD LOG FILES
  seed: "31" # Experiment Seed
  freq: 100 # Logging frequency (step)
  device: cuda # Training Device (used only in train_torch.py) 
  nepochs: 1000 # Max epochs to run

  save_files: [ # Files To save for each experiment
      './*.py',
      './*.sh',
      'configs/*.*',
      'datasets/*.*',
      'models/*.*',
      'utils/*.*',
  ]

Tensorboard

During training, tensorboard logger logs loss, spectrogram and audio.

tensorboard --logdir YOUR_LOG_DIR_AT_CONFIG/YOUR_SEED --bind_all

Inference

Generator

python inference.py or bash inference.sh

You may want to edit inferece.py for custom manipulation.

parser = argparse.ArgumentParser()
parser.add_argument('--path_audio_conf', type=str, default='configs/audio/22k.yaml',
                    help='')
parser.add_argument('--path_ckpt', type=str, required=True,
                    help='path to pl checkpoint')
parser.add_argument('--path_audio_source', type=str, required=True,
                    help='path to source audio file, sr=22k')
parser.add_argument('--path_audio_target', type=str, required=True,
                    help='path to target audio file, sr=16k')
parser.add_argument('--tsa_loop', type=int, default=100,
                    help='iterations for tsa')
parser.add_argument('--device', type=str, default='cuda',
                    help='')
args = parser.parse_args()
return args

Discriminator

Note that 0=gt, 1=gen

python classify.py or bash classify.sh

parser = argparse.ArgumentParser()
parser.add_argument('--path_audio_conf', type=str, default='configs/audio/22k.yaml',
                    help='')
parser.add_argument('--path_ckpt', type=str, required=True,
                    help='path to pl checkpoint')
parser.add_argument('--path_audio_gt', type=str, required=True,
                    help='path to audio with same speaker')
parser.add_argument('--path_audio_gen', type=str, required=True,
                    help='path to generated audio ')
parser.add_argument('--device', type=str, default='cuda')
args = parser.parse_args()

License

NEEDS WORK

BSD 3-Clause License.

References

  • Choi, Hyeong-Seok, et al. "Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations."

  • Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations."

  • Desplanques, Brecht, Jenthe Thienpondt, and Kris Demuynck. "Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification."

  • Chen, Mingjian, et al. "Adaspeech: Adaptive text to speech for custom voice."

  • Cookbook formulae for audio equalizer biquad filter coefficients

This implementation uses codes/data from following repositories:

Provided Checkpoints are trained from:

Special Thanks

MINDsLab Inc. for GPU support

Special Thanks to:

for help with Audio-domain knowledge

Owner
Dongho Choi 최동호
Dongho Choi 최동호
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17.3k Dec 29, 2022
PyTorch implemention of ICCV'21 paper SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation

SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation This is the PyTorch implemention of ICCV'21 paper SGPA: Structure

Chen Kai 24 Dec 05, 2022
This is a official repository of SimViT.

SimViT This is a official repository of SimViT. We will open our models and codes about object detection and semantic segmentation soon. Our code refe

ligang 57 Dec 15, 2022
Simple-System-Convert--C--F - Simple System Convert With Python

Simple-System-Convert--C--F REQUIREMENTS Python version : 3 HOW TO USE Run the c

Jonathan Santos 2 Feb 16, 2022
Minimisation of a negative log likelihood fit to extract the lifetime of the D^0 meson (MNLL2ELDM)

Minimisation of a negative log likelihood fit to extract the lifetime of the D^0 meson (MNLL2ELDM) Introduction The average lifetime of the $D^{0}$ me

Son Gyo Jung 1 Dec 17, 2021
Milano is a tool for automating hyper-parameters search for your models on a backend of your choice.

Milano (This is a research project, not an official NVIDIA product.) Documentation https://nvidia.github.io/Milano Milano (Machine learning autotuner

NVIDIA Corporation 147 Dec 17, 2022
EigenGAN Tensorflow, EigenGAN: Layer-Wise Eigen-Learning for GANs

Gender Bangs Body Side Pose (Yaw) Lighting Smile Face Shape Lipstick Color Painting Style Pose (Yaw) Pose (Pitch) Zoom & Rotate Flush & Eye Color Mout

Zhenliang He 321 Dec 01, 2022
Learning to Map Large-scale Sparse Graphs on Memristive Crossbar

Release of AutoGMap:Learning to Map Large-scale Sparse Graphs on Memristive Crossbar For reproduction of our searched model, the Ubuntu OS is recommen

2 Aug 23, 2022
Vrcwatch - Supply the local time to VRChat as Avatar Parameters through OSC

English: README-EN.md VRCWatch VRCWatch は、VRChat 内のアバター向けに現在時刻を送信するためのプログラムです。 使

Kosaki Mezumona 17 Nov 30, 2022
Code for CVPR2021 paper "Learning Salient Boundary Feature for Anchor-free Temporal Action Localization"

AFSD: Learning Salient Boundary Feature for Anchor-free Temporal Action Localization This is an official implementation in PyTorch of AFSD. Our paper

Tencent YouTu Research 146 Dec 24, 2022
Madanalysis5 - A package for event file analysis and recasting of LHC results

Welcome to MadAnalysis 5 Outline What is MadAnalysis 5? Requirements Downloading

MadAnalysis 15 Jan 01, 2023
End-to-end machine learning project for rices detection

Basmatinet Welcome to this project folks ! Whether you like it or not this project is all about riiiiice or riz in french. It is also about Deep Learn

Béranger 47 Jun 18, 2022
Generative Modelling of BRDF Textures from Flash Images [SIGGRAPH Asia, 2021]

Neural Material Official code repository for the paper: Generative Modelling of BRDF Textures from Flash Images [SIGGRAPH Asia, 2021] Henzler, Deschai

Philipp Henzler 80 Dec 20, 2022
Automatic library of congress classification, using word embeddings from book titles and synopses.

Automatic Library of Congress Classification The Library of Congress Classification (LCC) is a comprehensive classification system that was first deve

Ahmad Pourihosseini 3 Oct 01, 2022
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022
Dataset for the Research2Clinics @ NeurIPS 2021 Paper: What Do You See in this Patient? Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022
Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery (ICCV 2021)

Change is Everywhere Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery by Zhuo Zheng, Ailong Ma, Liangpei Zhang and Yanfei

Zhuo Zheng 125 Dec 13, 2022
Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Weakly Supervised Text-to-SQL Parsing through Question Decomposition The official repository for the paper "Weakly Supervised Text-to-SQL Parsing thro

14 Dec 19, 2022
Towards Debiasing NLU Models from Unknown Biases

Towards Debiasing NLU Models from Unknown Biases Abstract: NLU models often exploit biased features to achieve high dataset-specific performance witho

Ubiquitous Knowledge Processing Lab 22 Jun 14, 2022
Exploring the link between uncertainty estimates obtained via "exact" Bayesian inference and out-of-distribution (OOD) detection.

Uncertainty-based OOD detection Exploring the link between uncertainty estimates obtained by "exact" Bayesian inference and out-of-distribution (OOD)

Christian Henning 1 Nov 05, 2022