A PyTorch Implementation of the paper - Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21th International Society for Music Information Retrieval Conference, ISMIR. 2020.

Overview

Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation

A Pytorch Implementation of the paper "Investigating U-NETS With Various Intermediate Blocks For Spectrogram-based Singing Voice Separation (ISMIR 2020)"

Installation

conda install pytorch=1.6 cudatoolkit=10.2 -c pytorch
conda install -c conda-forge ffmpeg librosa
conda install -c anaconda jupyter
pip install musdb museval pytorch_lightning effortless_config wandb pydub nltk spacy 

Dataset

  1. Download Musdb18
  2. Unzip files
  3. We recommend you to use the wav file mode for the fast data preparation.
    musdbconvert path/to/musdb-stems-root path/to/new/musdb-wav-root

Demonstration: A Pretrained Model (TFC_TDF_Net (large))

Colab Link

Tutorial

1. activate your conda

conda activate yourcondaname

2. Training a default UNet with TFC_TDFs

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode train --gpus 4 --distributed_backend ddp --sync_batchnorm True --pin_memory True --num_workers 32 --precision 16 --run_id debug --optimizer adam --lr 0.001 --save_top_k 3 --patience 100 --min_epochs 1000 --max_epochs 2000 --n_fft 2048 --hop_length 1024 --num_frame 128  --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net  --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24  --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu  --first_conv_activation relu --last_activation identity --seed 2020

3. Evaluation

After training is done, checkpoints are saved in the following directory.

etc/modelname/run_id/*.ckpt

For evaluation,

python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --target_name vocals --mode eval --gpus 1 --pin_memory True --num_workers 64 --precision 32 --run_id debug --batch_size 4 --n_fft 2048 --hop_length 1024 --num_frame 128 --train_loss spec_mse --val_loss raw_l1 --model tfc_tdf_net --spec_est_mode mapping --spec_type complex --n_blocks 7 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --min_bn_units 16 --tfc_tdf_activation relu --first_conv_activation relu --last_activation identity --log wandb --ckpt vocals_epoch=891.ckpt

Below is the result.

wandb:          test_result/agg/vocals_SDR 6.954695
wandb:   test_result/agg/accompaniment_SAR 14.3738075
wandb:          test_result/agg/vocals_SIR 15.5527
wandb:   test_result/agg/accompaniment_SDR 13.561705
wandb:   test_result/agg/accompaniment_ISR 22.69328
wandb:   test_result/agg/accompaniment_SIR 18.68421
wandb:          test_result/agg/vocals_SAR 6.77698
wandb:          test_result/agg/vocals_ISR 12.45371

4. Interactive Report (wandb)

wandb report

Indermediate Blocks

Please see this document.

How to use

1. Training

1.1. Intermediate Block independent Parameters

1.1.A. General Parameters
  • --musdb_root musdb path
  • --musdb_is_wav whether the path contains wav files or not
  • --filed_mode whether you want to use filed mode or not. recommend to use it for the fast data preparation.
  • --target_name one of vocals, drum, bass, other
1.1.B. Training Environment
  • --mode train or eval
  • --gpus number of gpus
    • (WARN) gpus > 1 might be problematic when evaluating models.
  • distributed_backend use this option only when you are using multi-gpus. distributed backend, one of ddp, dp, ... we recommend you to use ddp.
  • --sync_batchnorm True only when you are using ddp
  • --pin_memory
  • --num_workers
  • --precision 16 or 32
  • --dev_mode whether you want a developement mode or not. dev mode is much faster because it uses only a small subset of the dataset.
  • --run_id (optional) directory path where you want to store logs and etc. if none then the timestamp.
  • --log True for default pytorch lightning log. wandb is also available.
  • --seed random seed for a deterministic result.
1.1.C. Training hyperparmeters
  • --batch_size trivial :)
  • --optimizer adam, rmsprop, etc
  • --lr learning rate
  • --save_top_k how many top-k epochs you want to save the training state (criterion: validation loss)
  • --patience early stop control parameter. see pytorch lightning docs.
  • --min_epochs trivial :)
  • --max_epochs trivial :)
  • --model
    • tfc_tdf_net
    • tfc_net
    • tdc_net
1.1.D. Fourier parameters
  • --n_fft
  • --hop_length
  • num_frame number of frames (time slices)
1.1.F. criterion
  • --train_loss: spec_mse, raw_l1, etc...
  • --val_loss: spec_mse, raw_l1, etc...

1.2. U-net Parameters

  • --n_blocks: number of intermediate blocks. must be an odd integer. (default=7)
  • --input_channels:
    • if you use two-channeled complex-valued spectrogram, then 4
    • if you use two-channeled manginutde spectrogram, then 2
  • --internal_channels: number of internal chennels (default=24)
  • --first_conv_activation: (default='relu')
  • --last_activation: (default='sigmoid')
  • --t_down_layers: list of layer where you want to doubles/halves the time resolution. if None, ds/us applied to every single layer. (default=None)
  • --f_down_layers: list of layer where you want to doubles/halves the frequency resolution. if None, ds/us applied to every single layer. (default=None)

1.3. SVS Framework

  • --spec_type: type of a spectrogram. ['complex', 'magnitude']

  • --spec_est_mode: spectrogram estimation method. ['mapping', 'masking']

  • CaC Framework

    • you can use cac framework [1] by setting
      • --spec_type complex --spec_est_mode mapping --last_activation identity
  • Mag-only Framework

    • if you want to use the traditional magnitude-only estimation with sigmoid, then try
      • --spec_type magnitude --spec_est_mode masking --last_activation sigmoid
    • you can also change the last activation as follows
      • --spec_type magnitude --spec_est_mode masking --last_activation relu
  • Alternatives

    • you can build an svs framework with any combination of these parameters
    • e.g. --spec_type complex --spec_est_mode masking --last_activation tanh

1.4. Block-dependent Parameters

1.4.A. TDF Net
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when target frequency domain size is too small, we just use this value instead of $\frac{f}{bn}$. (default=16)
  • --bias: (default=False)
  • --tdf_activation: activation function of each block (default=relu)

1.4.B. TDC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tdc_activation: activation function of each block (default=relu)

1.4.C. TFC Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_activation: activation function of each block (default=relu)

1.4.D. TFC_TDF Net
  • --n_internal_layers: number of 1-d CNNs in a block (default=5)
  • --kernel_size_t: size of kernel of time-dimension (default=3)
  • --kernel_size_f: size of kernel of frequency-dimension (default=3)
  • --tfc_tdf_activation: activation function of each block (default=relu)
  • --bn_factor: bottleneck factor $bn$ (default=16)
  • --min_bn_units: when target frequency domain size is too small, we just use this value instead of $\frac{f}{bn}$. (default=16)
  • --tfc_tdf_bias: (default=False)

1.4.E. TDC_RNN Net
  • '--n_internal_layers' : number of 1-d CNNs in a block (default=5)

  • '--kernel_size_f' : size of kernel of frequency-dimension (default=3)

  • '--bn_factor_rnn' : (default=16)

  • '--num_layers_rnn' : (default=1)

  • '--bias_rnn' : bool, (default=False)

  • '--min_bn_units_rnn' : (default=16)

  • '--bn_factor_tdf' : (default=16)

  • '--bias_tdf' : bool, (default=False)

  • '--tdc_rnn_activation' : (default='relu')

current bug - cuda error occurs when tdc_rnn net with precision 16

Reproducible Experimental Results

  • TFC_TDF_large
    • parameters
    --musdb_root ../repos/musdb18_wav
    --musdb_is_wav True
    --filed_mode True
    
    --gpus 4
    --distributed_backend ddp
    --sync_batchnorm True
    
    --num_workers 72
    --train_loss spec_mse
    --val_loss raw_l1
    --batch_size 12
    --precision 16
    --pin_memory True
    --num_worker 72         
    --save_top_k 3
    --patience 200
    --run_id debug_large
    --log wandb
    --min_epochs 2000
    --max_epochs 3000
    
    --optimizer adam
    --lr 0.001
    
    --model tfc_tdf_net
    --n_fft 4096
    --hop_length 1024
    --num_frame 128
    --spec_type complex
    --spec_est_mode mapping
    --last_activation identity
    --n_blocks 9
    --internal_channels 24
    --n_internal_layers 5
    --kernel_size_t 3 
    --kernel_size_f 3 
    --tfc_tdf_bias True
    --seed 2020
    
    
    • training
    python main.py --musdb_root ../repos/musdb18_wav --musdb_is_wav True --filed_mode True --gpus 4 --distributed_backend ddp --sync_batchnorm True --num_workers 72 --train_loss spec_mse --val_loss raw_l1 --batch_size 24 --precision 16 --pin_memory True --num_worker 72 --save_top_k 3 --patience 200 --run_id debug_large --log wandb --min_epochs 2000 --max_epochs 3000 --optimizer adam --lr 0.001 --model tfc_tdf_net --n_fft 4096 --hop_length 1024 --num_frame 128 --spec_type complex --spec_est_mode mapping --last_activation identity --n_blocks 9 --internal_channels 24 --n_internal_layers 5 --kernel_size_t 3 --kernel_size_f 3 --tfc_tdf_bias True --seed 2020
    • evaluation result (epoch 2007)
      • SDR 8.029
      • ISR 13.708
      • SIR 16.409
      • SAR 7.533

Interactive Report (wandb)

wandb report

You can cite this paper as follows:

@inproceedings{choi_2020, Author = {Choi, Woosung and Kim, Minseok and Chung, Jaehwa and Lee, Daewon and Jung, Soonyoung}, Booktitle = {21th International Society for Music Information Retrieval Conference}, Editor = {ISMIR}, Month = {OCTOBER}, Title = {Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation.}, Year = {2020}}

Reference

[1] Woosung Choi, Minseok Kim, Jaehwa Chung, DaewonLee, and Soonyoung Jung, “Investigating u-nets with various intermediate blocks for spectrogram-based singingvoice separation.,” in 21th International Society for Music Information Retrieval Conference, ISMIR, Ed., OCTOBER 2020.

Owner
Woosung Choi
WooSung Choi Ph.d candidate @IELab-AT-KOREA-UNIV Seoul, Korea
Woosung Choi
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Website • Key Features • How To Use • Docs •

Pytorch Lightning 21.1k Dec 29, 2022
🤖 Project template for your next awesome AI project. 🦾

🤖 AI Awesome Project Template 👋 Template author You may want to adjust badge links in a README.md file. 💎 Installation with pip Installation is as

Wiktor Łazarski 18 Nov 23, 2022
Deep High-Resolution Representation Learning for Human Pose Estimation

Deep High-Resolution Representation Learning for Human Pose Estimation (accepted to CVPR2019) News If you are interested in internship or research pos

HRNet 167 Dec 27, 2022
Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically

NB Workflows Description If SQL is a lingua franca for querying data, Jupyter sh

Xavier Petit 6 Aug 18, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 03, 2023
Project page for End-to-end Recovery of Human Shape and Pose

End-to-end Recovery of Human Shape and Pose Angjoo Kanazawa, Michael J. Black, David W. Jacobs, Jitendra Malik CVPR 2018 Project Page Requirements Pyt

1.4k Dec 29, 2022
PyTorch implementations of Generative Adversarial Networks.

This repository has gone stale as I unfortunately do not have the time to maintain it anymore. If you would like to continue the development of it as

Erik Linder-Norén 13.4k Jan 08, 2023
MacroTools provides a library of tools for working with Julia code and expressions.

MacroTools.jl MacroTools provides a library of tools for working with Julia code and expressions. This includes a powerful template-matching system an

FluxML 278 Dec 11, 2022
Repo público onde postarei meus estudos de Python, buscando aprender por meio do compartilhamento do aprendizado!

Seja bem vindo à minha repo de Estudos em Python 3! Este é um repositório criado por um programador amador que estuda tópicos de finanças, estatística

32 Dec 24, 2022
【Arxiv】Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution

SANet Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution Dependencies numpy==1.18.5 scikit_image==0.16.2 torchvision==0.8.1 to

36 Jan 05, 2023
This repository is for EMNLP 2021 paper: It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data

InterpretationData This repository is for our EMNLP 2021 paper: It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpr

4 Apr 21, 2022
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

MediumVC MediumVC is an utterance-level method towards any-to-any VC. Before that, we propose SingleVC to perform A2O tasks(Xi → Ŷi) , Xi means utter

谷下雨 47 Dec 25, 2022
Implementation of Segnet, FCN, UNet , PSPNet and other models in Keras.

Image Segmentation Keras : Implementation of Segnet, FCN, UNet, PSPNet and other models in Keras. Implementation of various Deep Image Segmentation mo

Divam Gupta 2.6k Jan 05, 2023
Vehicle direction identification consists of three module detection , tracking and direction recognization.

Vehicle-direction-identification Vehicle direction identification consists of three module detection , tracking and direction recognization. Algorithm

5 Nov 15, 2022
Project page of the paper 'Analyzing Perception-Distortion Tradeoff using Enhanced Perceptual Super-resolution Network' (ECCVW 2018)

EPSR (Enhanced Perceptual Super-resolution Network) paper This repo provides the test code, pretrained models, and results on benchmark datasets of ou

Subeesh Vasu 78 Nov 19, 2022
The final project of "Applying AI to EHR Data" of "AI for Healthcare" nanodegree - Udacity.

Patient Selection for Diabetes Drug Testing Project Overview EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical ind

Omar Laham 1 Jan 14, 2022
Mememoji - A facial expression classification system that recognizes 6 basic emotions: happy, sad, surprise, fear, anger and neutral.

a project built with deep convolutional neural network and ❤️ Table of Contents Motivation The Database The Model 3.1 Input Layer 3.2 Convolutional La

Jostine Ho 761 Dec 05, 2022
Java and SHACL code commented in the paper "Towards compliance checking in reified I/O logic via SHACL" submitted to ICAIL 2021

shRIOL The subfolder shRIOL contains Java files to execute the SHACL files on the OWL ontology. To compile the Java files: "javac -cp ./src/;./lib/* -

1 Dec 06, 2022
Framework for training options with different attention mechanism and using them to solve downstream tasks.

Using Attention in HRL Framework for training options with different attention mechanism and using them to solve downstream tasks. Requirements GPU re

5 Nov 03, 2022
Use graph-based analysis to re-classify stocks and to improve Markowitz portfolio optimization

Dynamic Stock Industrial Classification Use graph-based analysis to re-classify stocks and experiment different re-classification methodologies to imp

Sheng Yang 10 Dec 05, 2022