The codebase for Data-driven general-purpose voice activity detection.

Overview

Data driven GPVAD

Repository for the work in TASLP 2021 Voice activity detection in the wild: A data-driven approach using teacher-student training.

Framework

Sample predictions against other methods

Samples_1

Samples_2

Samples_3

Samples_4

Noise robustness

Speech

Background

Speech

Results

Our best model trained on the SRE (V3) dataset obtains the following results:

Precision Recall F1 AUC FER Event-F1
aurora_clean 96.844 95.102 95.93 98.66 3.06 74.8
aurora_noisy 90.435 92.871 91.544 97.63 6.68 54.45
dcase18 89.202 88.362 88.717 95.2 10.82 57.85

Usage

We provide most of our pretrained models in this repository, including:

  1. Both teachers (T_1, T_2)
  2. Unbalanced audioset pretrained model
  3. Voxceleb 2 pretrained model
  4. Our best submission (SRE V3 trained)

To download and run evaluation just do:

git clone https://github.com/RicherMans/Datadriven-VAD
cd Datadriven-VAD
pip3 install -r requirements.txt
python3 forward.py -w example/example.wav

Running this will print:

|   index | event_label   |   onset |   offset | filename            |
|--------:|:--------------|--------:|---------:|:--------------------|
|       0 | Speech        |    0.28 |     0.94 | example/example.wav |
|       1 | Speech        |    1.04 |     2.22 | example/example.wav |

Predicting voice activity

We support single file and filelist-batching in our script. Obtaining VAD predictions is easy:

python3 forward.py -w example/example.wav

Or if one prefers to do that batch_wise, first prepare a filelist: find . -type f -name *.wav > wavlist.txt' And then just run:

python3 forward.py -l wavlist

Extra parameters

  • -model adjusts the pretrained model. Can be one of t1,t2,v2,a2,a2_v2,sre. Refer to the paper for each respective model. By default we use sre.
  • -soft instead of predicting human-readable timestamps, the model is now outputting the raw probabilities.
  • -hard instead of predicting human-readable timestamps, the model is now outputting the post-processed 0-1 flags indicating speech. Please note this is different from the paper, which thresholded the soft probabilities without post-processing.
  • -th adjusts the threshold. If a single threshold is passed (e.g., -th 0.5), we utilize simple binearization. Otherwise use the default double threshold with -th 0.5 0.1.
  • -o outputs the results into a new folder.

Training from scratch

If you intend to rerun our work, prepare some data and extract log-Mel spectrogram features. Say, you have downloaded the balanced subset of AudioSet and stored all files in a folder data/balanced/. Then:

cd data;
mkdir hdf5 csv_labels;
find balanced -type f > wavs.txt;
python3 extract_features.py wavs.txt -o hdf5/balanced.h5
h5ls -r hdf5/balanced.h5 | awk -F[/' '] 'BEGIN{print "filename","hdf5path"}NR>1{print $2,"hdf5/balanced.h5"}'> csv_labels/balanced.csv

The input for our label prediction script is a csv file with exactly two columns, filename and hdf5path.

An example csv_labels/balanced.csv would be:

filename hdf5path
--PJHxphWEs_30.000.wav hdf5/balanced.h5                                                                                          
--ZhevVpy1s_50.000.wav hdf5/balanced.h5                                                                                          
--aE2O5G5WE_0.000.wav hdf5/balanced.h5                                                                                           
--aO5cdqSAg_30.000.wav hdf5/balanced.h5                                                                                          

After feature extraction, proceed to predict labels:

mkdir -p softlabels/{hdf5,csv};
python3 prepare_labels.py --pre ../pretrained_models/teacher1/model.pth csv_labels/balanced.csv softlabels/hdf5/balanced.h5 softlabels/csv/balanced.csv

Lastly, just train:

cd ../; #Go to project root
# Change config accoringly with input data
python3 run.py train configs/example.yaml

Citation

If youre using this work, please cite it in your publications.

@article{Dinkel2021,
author = {Dinkel, Heinrich and Wang, Shuai and Xu, Xuenan and Wu, Mengyue and Yu, Kai},
doi = {10.1109/TASLP.2021.3073596},
issn = {2329-9290},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
pages = {1542--1555},
title = {{Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training}},
url = {https://ieeexplore.ieee.org/document/9405474/},
volume = {29},
year = {2021}
}

and

@inproceedings{Dinkel2020,
  author={Heinrich Dinkel and Yefei Chen and Mengyue Wu and Kai Yu},
  title={{Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3665--3669},
  doi={10.21437/Interspeech.2020-0995},
  url={http://dx.doi.org/10.21437/Interspeech.2020-0995}
}
Owner
Heinrich Dinkel
日新月异
Heinrich Dinkel
House3D: A Rich and Realistic 3D Environment

House3D: A Rich and Realistic 3D Environment Yi Wu, Yuxin Wu, Georgia Gkioxari and Yuandong Tian House3D is a virtual 3D environment which consists of

Meta Research 1.1k Dec 14, 2022
Official and maintained implementation of the paper "OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data" [BMVC 2021].

OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data Christoph Reich, Tim Prangemeier, Özdemir Cetin & Heinz Koeppl | Pr

Christoph Reich 23 Sep 21, 2022
[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links

LinkBERT: A Knowledgeable Language Model Pretrained with Document Links This repo provides the model, code & data of our paper: LinkBERT: Pretraining

Michihiro Yasunaga 264 Jan 01, 2023
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences forImage-Text Retrieval

NSGDC Some codes in this repo are copied/modified from opensource implementations made available by UNITER, PyTorch, HuggingFace, OpenNMT, and Nvidia.

Zhihao Fan 2 Nov 07, 2022
Code for Paper: Self-supervised Learning of Motion Capture

Self-supervised Learning of Motion Capture This is code for the paper: Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, Katerina Fragkiadaki, Self-sup

Hsiao-Yu Fish Tung 87 Jul 25, 2022
A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

2 Jul 25, 2022
QuadTree Attention for Vision Transformers (ICLR2022)

This repository contains codes for quadtree attention. This repo contains codes for feature matching, image classficiation, object detection and seman

tangshitao 222 Dec 28, 2022
PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud, CVPR 2019.

PointRCNN PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud Code release for the paper PointRCNN:3D Object Proposal Generation a

Shaoshuai Shi 1.5k Dec 27, 2022
Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

PLOME:Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021) This repository provides the code and data of the work in ACL20

197 Nov 26, 2022
Source code for our Paper "Learning in High-Dimensional Feature Spaces Using ANOVA-Based Matrix-Vector Multiplication"

NFFT4ANOVA Source code for our Paper "Learning in High-Dimensional Feature Spaces Using ANOVA-Based Matrix-Vector Multiplication" This package uses th

Theresa Wagner 1 Aug 10, 2022
Repository for RNNs using TensorFlow and Keras - LSTM and GRU Implementation from Scratch - Simple Classification and Regression Problem using RNNs

RNN 01- RNN_Classification Simple RNN training for classification task of 3 signal: Sine, Square, Triangle. 02- RNN_Regression Simple RNN training for

Nahid Ebrahimian 13 Dec 13, 2022
Python Auto-ML Package for Tabular Datasets

Tabular-AutoML AutoML Package for tabular datasets Tabular dataset tuning is now hassle free! Run one liner command and get best tuning and processed

Sagnik Roy 18 Nov 20, 2022
2D&3D human pose estimation

Human Pose Estimation Papers [CVPR 2016] - 201511 [IJCAI 2016] - 201602 Other Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors

133 Jan 02, 2023
Bringing Computer Vision and Flutter together , to build an awesome app !!

Bringing Computer Vision and Flutter together , to build an awesome app !! Explore the Directories Flutter · Machine Learning Table of Contents About

Padmanabha Banerjee 14 Apr 07, 2022
RoMA: Robust Model Adaptation for Offline Model-based Optimization

RoMA: Robust Model Adaptation for Offline Model-based Optimization Implementation of RoMA: Robust Model Adaptation for Offline Model-based Optimizatio

9 Oct 31, 2022
Simulation-based inference for the Galactic Center Excess

Simulation-based inference for the Galactic Center Excess Siddharth Mishra-Sharma and Kyle Cranmer Abstract The nature of the Fermi gamma-ray Galactic

Siddharth Mishra-Sharma 3 Jan 21, 2022
Si Adek Keras is software VR dangerous object detection.

Si Adek Python Keras Sistem Informasi Deteksi Benda Berbahaya Keras Python. Version 1.0 Developed by Ananda Rauf Maududi. Developed date: 24 November

Ananda Rauf 1 Dec 21, 2021
Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms applied on Continuous Control Tasks

Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms applied on Continuous Control Tasks This is the master thesi

Giacomo Arcieri 1 Mar 21, 2022
Definition of a business problem according to Wilson Lower Bound Score and Time Based Average Rating

Wilson Lower Bound Score, Time Based Rating Average In this study I tried to calculate the product rating and sorting reviews more accurately. I have

3 Sep 30, 2021
Caffe: a fast open framework for deep learning.

Caffe Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR)/The Berke

Berkeley Vision and Learning Center 33k Dec 28, 2022