Evaluation and Benchmarking of Speech Super-resolution Methods

Overview

Speech Super-resolution Evaluation and Benchmarking

What this repo do:

  • A toolbox for the evaluation of speech super-resolution algorithms.
  • Unify the evaluation pipline of speech super-resolution algorithms for a easier comparison between different systems.
  • Benchmarking speech super-resolution methods (pull request is welcome). Encouraging reproducible research.

I build this repo while I'm writing my paper for INTERSPEECH 2022: Neural Vocoder is All You Need for Speech Super-resolution. The model mentioned in this paper, NVSR, will also be open-sourced here.

Installation

Install via pip:

pip3 install ssr_eval

Please make sure you have already installed sox.

Quick Example

A basic example: Evaluate on a system that do nothing:

from ssr_eval import test 
test()
  • The evaluation result json file will be stored in the ./results directory: Example file
  • The code will automatically handle stuffs like downloading test sets.
  • You will find a field "averaged" at the bottom of the json file that looks like below. This field mark the performance of the system.
"averaged": {
        "proc_fft_24000_44100": {
            "lsd": 5.152331300436993,
            "log_sispec": 5.8051057146229095,
            "sispec": 30.23394207533686,
            "ssim": 0.8484425044157442
        }
    }

Here we report four metrics:

  1. Log spectral distance(LSD).
  2. Log scale invariant spectral distance [1] (log-sispec).
  3. Scale invariant spectral distance [1] (sispec).
  4. Structral similarity (SSIM).

⚠️ LSD is the most widely used metric for super-resolution. And I include another three metrics just in case you need them.


main_idea

Below is the code of test()

from ssr_eval import SSR_Eval_Helper, BasicTestee

# You need to implement a class for the model to be evaluated.
class MyTestee(BasicTestee):
    def __init__(self) -> None:
        super().__init__()

    # You need to implement this function
    def infer(self, x):
        """A testee that do nothing

        Args:
            x (np.array): [sample,], with model_input_sr sample rate
            target (np.array): [sample,], with model_output_sr sample rate

        Returns:
            np.array: [sample,]
        """
        return x

def test():
    testee = MyTestee()
    # Initialize a evaluation helper
    helper = SSR_Eval_Helper(
        testee,
        test_name="unprocessed",  # Test name for storing the result
        input_sr=44100,  # The sampling rate of the input x in the 'infer' function
        output_sr=44100,  # The sampling rate of the output x in the 'infer' function
        evaluation_sr=48000,  # The sampling rate to calculate evaluation metrics.
        setting_fft={
            "cutoff_freq": [
                12000
            ],  # The cutoff frequency of the input x in the 'infer' function
        },
        save_processed_result=True
    )
    # Perform evaluation
    ## Use all eight speakers in the test set for evaluation (limit_test_speaker=-1) 
    ## Evaluate on 10 utterance for each speaker (limit_test_nums=10)
    helper.evaluate(limit_test_nums=10, limit_test_speaker=-1)

The code will automatically handle stuffs like downloading test sets. The evaluation result will be saved in the ./results directory.

Baselines

We provide several pretrained baselines. For example, to run the NVSR baseline, you can click the link in the following table for more details.


Table.1 Log-spectral distance (LSD) on different input sampling-rate (Evaluated on 44.1kHz).

Method One for all Params 2kHz 4kHz 8kHz 12kHz 16kHz 24kHz 32kHz AVG
NVSR [Pretrained Model] Yes 99.0M 1.04 0.98 0.91 0.85 0.79 0.70 0.60 0.84
WSRGlow(24kHz→48kHz) No 229.9M - - - - - 0.79 - -
WSRGlow(12kHz→48kHz) No 229.9M - - - 0.87 - - - -
WSRGlow(8kHz→48kHz) No 229.9M - - 0.98 - - - - -
WSRGlow(4kHz→48kHz) No 229.9M - 1.12 - - - - - -
Nu-wave(24kHz→48kHz) No 3.0M - - - - - 1.22 - -
Nu-wave(12kHz→48kHz) No 3.0M - - - 1.40 - - - -
Nu-wave(8kHz→48kHz) No 3.0M - - 1.42 - - - - -
Nu-wave(4kHz→48kHz) No 3.0M - 1.42 - - - - - -
Unprocessed - - 5.69 5.50 5.15 4.85 4.54 3.84 2.95 4.65

Click the link of the model for more details.

Here "one for all" means model can process flexible input sampling rate.

Features

The following code demonstrate the full options in the SSR_Eval_Helper:

testee = MyTestee()
helper = SSR_Eval_Helper(testee, # Your testsee object with 'infer' function implemented
                        test_name="unprocess",  # The name of this test. Used for saving the log file in the ./results directory
                        test_data_root="./your_path/vctk_test", # The directory to store the test data, which will be automatically downloaded.
                        input_sr=44100, # The sampling rate of the input x in the 'infer' function
                        output_sr=44100, # The sampling rate of the output x in the 'infer' function
                        evaluation_sr=48000, # The sampling rate to calculate evaluation metrics. 
                        save_processed_result=False, # If True, save model output in the dataset directory.
                        # (Recommend/Default) Use fourier method to simulate low-resolution effect
                        setting_fft = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], # The cutoff frequency of the input x in the 'infer' function
                        }, 
                        # Use lowpass filtering to simulate low-resolution effect. All possible combinations will be evaluated. 
                        setting_lowpass_filtering = {
                            "filter":["cheby","butter","bessel","ellip"], # The type of filter 
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], 
                            "filter_order": [3,6,9] # Filter orders
                        }, 
                        # Use subsampling method to simulate low-resolution effect
                        setting_subsampling = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
                        }, 
                        # Use mp3 compression method to simulate low-resolution effect
                        setting_mp3_compression = {
                            "low_kbps": [32, 48, 64, 96, 128],
                        },
)

helper.evaluate(limit_test_nums=10, # For each speaker, only evaluate on 10 utterances.
                limit_test_speaker=-1 # Evaluate on all the speakers. 
                )

⚠️ I recommand all the users to use fourier method (setting_fft) to simulate low-resolution effect for the convinence of comparing between different system.

Dataset Details

We build the test sets using VCTK (version 0.92), a multi-speaker English corpus that contains 110 speakers with different accents.

  • Speakers used for the test set: p360, p361, p362, p363, p364, p374, p376, s5
  • For the remaining 100 speakers, p280 and p315 are omitted for the technical issues.
  • Other 98 speakers are used for training.

Citation

If you find this repo useful for your research, please consider citing:

@misc{liu2022neural,
      title={Neural Vocoder is All You Need for Speech Super-resolution}, 
      author={Haohe Liu and Woosung Choi and Xubo Liu and Qiuqiang Kong and Qiao Tian and DeLiang Wang},
      year={2022},
      eprint={2203.14941},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Reference

[1] Liu, Haohe, et al. "VoiceFixer: Toward General Speech Restoration with Neural Vocoder." arXiv preprint arXiv:2109.13731 (2021).

Owner
Haohe Liu (刘濠赫)
Haohe Liu (刘濠赫)
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]

piglet PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021] This repo contains code and data for PIGLeT. If you like

Rowan Zellers 51 Oct 08, 2022
Reinforcement-learning - Repository of the class assignment questions for the course on reinforcement learning

DSE 314/614: Reinforcement Learning This repository containing reinforcement lea

Manav Mishra 4 Apr 15, 2022
Project Aquarium is a SUSE-sponsored open source project aiming at becoming an easy to use, rock solid storage appliance based on Ceph.

Project Aquarium Project Aquarium is a SUSE-sponsored open source project aiming at becoming an easy to use, rock solid storage appliance based on Cep

Aquarist Labs 73 Jul 21, 2022
FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection

FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection FCOSR: A Simple Anchor-free Rotated Detector for Aerial Object Detection arXi

59 Nov 29, 2022
Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks

pix2vox [Demonstration video] Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks. Generated samples Single-category generation M

Takumi Moriya 232 Nov 14, 2022
Code release for "Transferable Semantic Augmentation for Domain Adaptation" (CVPR 2021)

Transferable Semantic Augmentation for Domain Adaptation Code release for "Transferable Semantic Augmentation for Domain Adaptation" (CVPR 2021) Paper

66 Dec 16, 2022
Multi Task Vision and Language

12-in-1: Multi-Task Vision and Language Representation Learning Please cite the following if you use this code. Code and pre-trained models for 12-in-

Facebook Research 712 Dec 19, 2022
Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs

Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs This repository contains code to accompany the paper "Hierarchical Clustering: O

3 Sep 25, 2022
✅ How Robust are Fact Checking Systems on Colloquial Claims?. In NAACL-HLT, 2021.

How Robust are Fact Checking Systems on Colloquial Claims? Official PyTorch implementation of our NAACL paper: Byeongchang Kim*, Hyunwoo Kim*, Seokhee

Byeongchang Kim 19 Mar 15, 2022
Robust Instance Segmentation through Reasoning about Multi-Object Occlusion [CVPR 2021]

Robust Instance Segmentation through Reasoning about Multi-Object Occlusion [CVPR 2021] Abstract Analyzing complex scenes with DNN is a challenging ta

Irene Yuan 24 Jun 27, 2022
Official implementation for the paper "Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection"

Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection PyTorch code release of the paper "Attentive Prototypes for Sour

Deepti Hegde 23 Oct 17, 2022
Project NII pytorch scripts

project-NII-pytorch-scripts By Xin Wang, National Institute of Informatics, since 2021 I am a new pytorch user. If you have any suggestions or questio

Yamagishi and Echizen Laboratories, National Institute of Informatics 184 Dec 23, 2022
Revisiting Weakly Supervised Pre-Training of Visual Perception Models

SWAG: Supervised Weakly from hashtAGs This repository contains SWAG models from the paper Revisiting Weakly Supervised Pre-Training of Visual Percepti

Meta Research 134 Jan 05, 2023
Usable Implementation of "Bootstrap Your Own Latent" self-supervised learning, from Deepmind, in Pytorch

Bootstrap Your Own Latent (BYOL), in Pytorch Practical implementation of an astoundingly simple method for self-supervised learning that achieves a ne

Phil Wang 1.4k Dec 29, 2022
This repository contains code to train and render Mixture of Volumetric Primitives (MVP) models

Mixture of Volumetric Primitives -- Training and Evaluation This repository contains code to train and render Mixture of Volumetric Primitives (MVP) m

Meta Research 125 Dec 29, 2022
Tensorflow/Keras Plug-N-Play Deep Learning Models Compilation

DeepBay This project was created with the objective of compile Machine Learning Architectures created using Tensorflow or Keras. The architectures mus

Whitman Bohorquez 4 Sep 26, 2022
CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY

M-BERT-Study CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY Motivation Multilingual BERT (M-BERT) has shown surprising cross lingual a

CogComp 1 Feb 28, 2022
RNN Predict Street Commercial Vitality

RNN-for-Predicting-Street-Vitality Code and dataset for Predicting the Vitality of Stores along the Street based on Business Type Sequence via Recurre

Zidong LIU 1 Dec 15, 2021
An implementation of paper `Real-time Convolutional Neural Networks for Emotion and Gender Classification` with PaddlePaddle.

简介 通过PaddlePaddle框架复现了论文 Real-time Convolutional Neural Networks for Emotion and Gender Classification 中提出的两个模型,分别是SimpleCNN和MiniXception。利用 imdb_crop

8 Mar 11, 2022
PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)

About PyTorch 1.2.0 Now the master branch supports PyTorch 1.2.0 by default. Due to the serious version problem (especially torch.utils.data.dataloade

Sanghyun Son 2.1k Jan 01, 2023