Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution

Overview

Deep Learning: Architectures & Methods Project: Deep Learning for Audio Super-Resolution

Figure: Example visualization of the method and baseline as a spectogram

This is the implementation of our Project for the course "Deep Learning: Architectures and Methods" by Prof. Christian Kersting from the Artificial Intelligence and Machine Learning Lab at the Technical University of Darmstadt in the summer semester 2021.

In the field of audio signal processing, Super-Resolution is one of the most relevant topics. The motivation is to reconstruct high- quality audio from low-quality signals. From a practical perspective, the technique has applications in telephony or generally in applications in which audio is transmitted and has to be compressed accordingly. Other applications are the processing of ancient recordings, for example old sound recordings of music, speech or videos. First approaches of the combination of machine learning and audio signal processing lead to promising results and outperform standard techniques. Accordingly the scope of the project was to reimplement the paper Temporal FiLM: Capturing Long-Range SequenceDependencies with Feature-Wise Modulation by Birnbaum et al. in PyTorch, reproduce the results and extend them further to the music domain.

This repository contains everything needed to prepare the data sets, train the model and create final evaluation and visualization of the results. We also provide the weights of the models to reproduce our reported results.

Installation

This project was originally developed with Python 3.8, PyTorch 1.7, and CUDA 11.0. The training requires at least one NVIDIA GeForce GTX 980 (4GB memory).

  • Create conda environment:
conda create --name audiosr
source activate audiosr
conda install PYTORCH torchvision cudatoolkit=11.0 -c pytorch
  • Install the dependencies:
pip install -r requirements.txt

Dataset preparation

To reproduce the results shown below tha datasets have to be prepared. This repo includes scripts to prepare the following dataset:

VCTK preparation

  • run prep_dataset.py from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python prep_dataset.py \
  --file-list vctk/speaker1/speaker1-train-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-train.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass
python prep_dataset.py \
  --file-list vctk/speaker1/speaker1-val-files.txt \
  --in-dir ./VCTK-Corpus/wav48/p225/ \
  --out vctk-speaker1-val.4.16000.8192.4096.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 4096 \
  --interpolate \
  --low-pass

GTZAN preparation

  • run prep_dataset.py from ./datasets to create a h5 container of a specified input.
  • to reproduce results prepare the following h5 files:
python prep_dataset.py \
  --file-list gtzan/blues_wav_list_train.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-train.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate \
  --low-pass
python prep_dataset.py \
  --file-list gtzan/blues_wav_list_val.txt \
  --in-dir gtzan/data/genres/blues/ \
  --out blues-val.4.22000.8192.16384.h5 \
  --scale 4 \
  --sr 22000 \
  --dimension 8192 \
  --stride 16384 \
  --interpolate \
  --low-pass

Piano dataset preparation

python prep_piano.py \
  --file-list data/music_train.npy \
  --out piano-train.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate \
  --low-pass
python prep_piano.py \
  --file-list data/music_valid.npy \
  --out piano-val.4.16000.8192.131072.h5 \
  --scale 4 \
  --sr 16000 \
  --dimension 8192 \
  --stride 131072 \
  --interpolate \
  --low-pass

Notes:

  • the --in-dir argument has to be adapted to the respective dataset location
  • The dimension parameter and sampling rate define the absolute length of a patch (dim/sr = length patch)

Model

Generally, there are three main models in this implementation.

Baseline

On the one hand the b-spline interpolation which serves as the baseline and can be found in the data loader in prep_dataset.py.

Model

On the other hand two neural networks whose implementation can be found in the /models/ folder. In a first step a model was implemented which uses a batchnorm layer instead of the later used TFILM layer. This is implemented in audiounet.py. The final model, which is also used in the paper, can be found in tfilmunet.py.

Train Model

To run the trainings use the following commands and change the dataset root the corresponding domain.

python train.py \
  --dataset-root hereroottodataset! \
  --epochs 50 \
  --lr 3*10e-4 \
  --batch-size 16 

Evaluation

Save examples from inference

It is possible to evaluate any given wav-file with the inference.py script by invoking the --save-example flag and saving the results as wav-files and spectrogram plots. The script performs the following steps:

  • prepares all files in a provided list (--wave-file-list) and creates a low-res version and the baseline reconstruction
  • runs inference on the prepared files to create a super-resolution output
  • saves all results to the "examples" folder with the respective file names
  • saves spectrogram plots of all versions as pdf-files

Notes:

It is important to adapt the sampling parameter (--sr) which is set to 16000 by default. The sampling rate has to be the one of the original wav file. The scale (--scale) defines the down sampling factor which is set to 4 by default. Depending on which trained model is used for the inference the parameters --checkpoints-root and --checkpoint have to be specified accordingly.

To reproduce an example from our plot run the following command from the repo root directory (modify --checkpoints-root if necessary):

python inference.py \
  --save-example \
  --wave-file-list assets/save_wav_list.txt \
  --scale 4 \
  --sr 16000 \
  --checkpoint pretrained/vctk_speaker1_pretrained.pth

Results

Training Dataset Ratio BASELINE SNR (dB) BASELINE LSD (dB) METHOD SNR (dB) METHOD LSD (dB) Checkpoint
VTCK SingleSpeaker r = 4 15.6 5.4 16.6 3.2 Checkpoint
Piano r = 4 19.7 2.9 20.4 2.2 Checkpoint
GTZAN (Genre: Blues) r = 4 13.3 7.8 13.8 3.8 Checkpoint

Qualitative Examples

Here we provide a qualitative example per Dataset. These can be generated using inference.py

VTCK SingleSpeaker Piano GTZAN (Genre: Blues)
Low Resolution Low Resolution Low Resolution
Baseline Baseline Baseline
Method Method Method
High Resolution High Resolution High Resolution
Owner
Oliver Hahn
Master Thesis @VIsual Inference Lab | Grad Student @Technical University of Darmstadt
Oliver Hahn
A curated list of references for MLOps

A curated list of references for MLOps

Larysa Visengeriyeva 9.3k Jan 07, 2023
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models This repository is the official implementation of the fol

DistributedML 41 Dec 06, 2022
Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

TCMR: Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video Qualtitative result Paper teaser video Introduction This r

Hongsuk Choi 215 Jan 06, 2023
Local-Global Stratified Transformer for Efficient Video Recognition

DualFormer This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model

Sea AI Lab 19 Dec 07, 2022
View model summaries in PyTorch!

torchinfo (formerly torch-summary) Torchinfo provides information complementary to what is provided by print(your_model) in PyTorch, similar to Tensor

Tyler Yep 1.5k Jan 05, 2023
This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm and CNN.

Vietnamese sign lagnuage recognition using MHI and CNN This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm

Phat Pham 3 Feb 24, 2022
A port of muP to JAX/Haiku

MUP for Haiku This is a (very preliminary) port of Yang and Hu et al.'s μP repo to Haiku and JAX. It's not feature complete, and I'm very open to sugg

18 Dec 30, 2022
Fantasy Points Prediction and Dream Team Formation

Fantasy-Points-Prediction-and-Dream-Team-Formation Collected Data from open source resources that have over 100 Parameters for predicting cricket play

Akarsh Singh 2 Sep 13, 2022
Python Library for learning (Structure and Parameter) and inference (Statistical and Causal) in Bayesian Networks.

pgmpy pgmpy is a python library for working with Probabilistic Graphical Models. Documentation and list of algorithms supported is at our official sit

pgmpy 2.2k Jan 03, 2023
Udacity Suse Cloud Native Foundations Scholarship Course Walkthrough

SUSE Cloud Native Foundations Scholarship Udacity is collaborating with SUSE, a global leader in true open source solutions, to empower developers and

Shivansh Srivastava 34 Oct 18, 2022
TensorFlow 101: Introduction to Deep Learning for Python Within TensorFlow

TensorFlow 101: Introduction to Deep Learning I have worked all my life in Machine Learning, and I've never seen one algorithm knock over its benchmar

Sefik Ilkin Serengil 896 Jan 04, 2023
TorchX: A PyTorch Extension Library for More Efficient Deep Learning

TorchX TorchX: A PyTorch Extension Library for More Efficient Deep Learning. @misc{torchx, author = {Ansheng You and Changxu Wang}, title = {T

Donny You 8 May 28, 2022
Basit bir burç modülü.

Bu modulu burclar hakkinda gundelik bir sekilde bilgi alin diye yaptim ve sizler icin kullanima sunuyorum. Modulun kullanimi asiri basit: Ornek Kullan

Special 17 Jun 08, 2022
Filtering variational quantum algorithms for combinatorial optimization

Current gate-based quantum computers have the potential to provide a computational advantage if algorithms use quantum hardware efficiently.

1 Feb 09, 2022
Autoregressive Predictive Coding: An unsupervised autoregressive model for speech representation learning

Autoregressive Predictive Coding This repository contains the official implementation (in PyTorch) of Autoregressive Predictive Coding (APC) proposed

iamyuanchung 173 Dec 18, 2022
Official implementation of Densely connected normalizing flows

Densely connected normalizing flows This repository is the official implementation of NeurIPS 2021 paper Densely connected normalizing flows. Poster a

Matej Grcić 31 Dec 12, 2022
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR 2022)

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR2022)[paper] Authors: Chenhang He, Ruihuang Li, Shuai Li, L

Billy HE 141 Dec 30, 2022
This project aims to be a handler for input creation and running of multiple RICEWQ simulations.

What is autoRICEWQ? This project aims to be a handler for input creation and running of multiple RICEWQ simulations. What is RICEWQ? From the descript

Yass Fuentes 1 Feb 01, 2022
[NeurIPS 2021] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods

Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods Large Scale Learning on Non-Homophilous Graphs: New Benchmark

60 Jan 03, 2023
[NeurIPS 2021] Well-tuned Simple Nets Excel on Tabular Datasets

[NeurIPS 2021] Well-tuned Simple Nets Excel on Tabular Datasets Introduction This repo contains the source code accompanying the paper: Well-tuned Sim

52 Jan 04, 2023