A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

Overview

About

This repository provides data and code for the paper:

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (submitted to NeurIPS 2021 Track on Datasets and Benchmarks Round2)

Authors: Mingkuan Liu, Chi Zhang, Hua Xing, Chao Feng, Monchu Chen, Judith Bishop, Grace Ngapo

Keywords: speech processing, speech dataset, human in the loop, annotation pipeline, quality assurance, speech annotation

Abstract

This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrated the HITL pipeline can improve annotation speed and capacity by at least 80% and quality is comparable to or higher than manual double pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper. Code and data are available in https://github.com/Appen/UHV-OTS-Speech

HITL speech corpora development system pipeline for UHV-OTS corpora

Reproduce the automated machine pre-labeling results reported in the paper

0. Experiment environment setup

We use Docker to run all the experiments and data processing for the corpora construction. To illustrate the algorithms used in the automatic modules of our pipeline, we built this Docker environment, which contains the testing or demo scripts of each module. After cloning this repo, run the docker build command as shown below.

cd UHV-OTS-Speech
docker build -t uhv-ots-speech-demo:cpu ./

After the image has been built, run it in a container.

docker run -it uhv-ots-speech-demo:cpu /bin/bash

Inside the container, /opt/scripts contains several subfolders, each holding the testing/demo scripts of one module.

1. Data pre-filtering: synthetic speech detection

We utilized the algorithm proposed in Towards End-to-End Synthetic Speech Detection and adopted the library and pre-trained models from the authors' GitHub repo. The original work achieved a synthetic speech detection EER as low as 2.16% on in-domain testing data and 1.95% on cross-domain data. We developed a simple demo script that runs on a part of ASVspoof2019 and outputs the detection results and likelihoods.

If full testing is needed, please run the code in the original authors' repo. To download the ASVspoof 2019 and 2015 data, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./download.sh

If you only want to see how the module works, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./run_demo.sh 
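
The EER figures quoted in this section can be understood with a small sketch. The function below is an illustration only and is not the evaluation code of the original authors: it assumes each trial has a single score where a higher value means "more likely genuine", and the names equal_error_rate, genuine_scores, and spoof_scores are hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'more likely genuine'."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

The EER is reported at the threshold where the false acceptance and false rejection rates are closest, so a lower value means the genuine and synthetic score distributions overlap less.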

2. Data pre-processing: music/vocal source separation

We utilized the well-performing Spleeter library for source separation. Spleeter is Deezer's source separation library and was introduced in "Spleeter: a fast and efficient music source separation tool with pre-trained models". We provide a script to run this tool on web-scraped audio files. To run the tool on the sample file, run the following commands inside the container:

cd /opt/scripts/source_separation
./run_demo.sh

The script will try to separate each audio file in the ./sample_aduio folder into two files, one *_bgm.wav and one *_speech.wav, both in mono 16 kHz 16-bit linear PCM WAV format. The rest of the automatic processing is performed on the *_speech.wav file, which is considered to be the speech channel of the original audio.
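
To confirm that a separated file matches the format described above, a check along the following lines can be used. This is a minimal sketch based on Python's standard wave module; the function name is_mono_16k_16bit is hypothetical and is not part of the repo's scripts.

```python
import wave


def is_mono_16k_16bit(path):
    """Return True if the WAV file is mono, 16 kHz, 16-bit linear PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1              # mono
            and wav.getframerate() == 16000      # 16 kHz sampling rate
            and wav.getsampwidth() == 2          # 2 bytes per sample = 16 bit
            and wav.getcomptype() == "NONE"      # uncompressed PCM
        )
```

Running this over every *_speech.wav file before the later pre-tagging steps is one way to catch a file that was written in an unexpected format.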

3. Data pre-filtering: language/accent identification

We apply language identification to pre-filter the raw audio data and ensure that the data is correctly routed to the corresponding language data processing pipeline. We trained a language ID system based on the x-vector, which was introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The x-vector model was trained with the VoxLingua107 dataset, and the language ID algorithm achieved 93% accuracy on the VoxLingua107 dev set.

The language ID module was developed based on the Kaldi recipe. The model and x-vectors have been prepared and stored in this folder. To run the test and get the EER, run the following commands inside the container:
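
The routing step can be pictured with the sketch below. It is an assumption made for illustration, not the logic used in the Appen pipeline: the function name route_by_language, the min_confidence threshold, and the "manual_review" fallback are all hypothetical.

```python
def route_by_language(scores, supported, min_confidence=0.8):
    """Pick a destination pipeline from per-language posterior scores.

    scores: dict mapping a language code to its posterior probability.
    supported: set of language codes that have a processing pipeline.
    Returns the language code to route to, or "manual_review" (hypothetical
    fallback) when the top language is unsupported or not confident enough.
    """
    language, confidence = max(scores.items(), key=lambda item: item[1])
    if language in supported and confidence >= min_confidence:
        return language
    return "manual_review"
```

For example, route_by_language({"en": 0.93, "de": 0.05}, {"en", "fr"}) selects the "en" pipeline, while a low-confidence or unsupported top language falls through to the fallback.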

cd /opt/scripts/language_id
./run_test.sh

Accent identification is more challenging than language identification. We've adopted the x-vector plus LDA/PLDA framework to detect twenty-two different English accents using proprietary data. Our current accent detection accuracy is 75%. The x-vector model, the x-vectors of the training and testing data, and the LDA/PLDA classifier model have been prepared and stored in this folder. To check the performance, run the following commands inside the container:

cd /opt/scripts/accent_id
./run_test.sh

4. Data pre-tagging: speech detection

This folder contains the demo scripts for speech segmentation. The speech segmentation is adopted from inaSpeechSegmenter, which was introduced in AN OPEN-SOURCE SPEAKER GENDER DETECTION FRAMEWORK FOR MONITORING GENDER EQUALITY. We only used its speech detection module and its pretrained model, which can be found in the original authors' repo.

The inaSpeechSegmenter system won first place in the Music and/or Speech Detection task of the Music Information Retrieval Evaluation eXchange 2018 (MIREX 2018). This module also achieved 97.5% detection accuracy with an average boundary mismatch of 97 ms on Appen's proprietary test set. To run the demo of this module, run the following commands inside the container:

cd /opt/scripts/speech_detection
./run_demo.sh

You can check the output CSV file in the ./output folder.
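
The average boundary mismatch quoted in this section can be illustrated as follows. This is only one plausible way to compute such a figure and is not confirmed to be the method used on Appen's test set: it assumes segments are (start, end) tuples in seconds and matches each reference boundary to the nearest hypothesis boundary of the same kind. The function name is hypothetical.

```python
def average_boundary_mismatch(reference, hypothesis):
    """Mean absolute offset, in seconds, between matched segment boundaries.

    Both arguments are non-empty lists of (start, end) tuples in seconds.
    """
    hyp_starts = [start for start, _ in hypothesis]
    hyp_ends = [end for _, end in hypothesis]
    offsets = []
    for start, end in reference:
        # Distance from each reference boundary to the closest detected one.
        offsets.append(min(abs(start - h) for h in hyp_starts))
        offsets.append(min(abs(end - h) for h in hyp_ends))
    return sum(offsets) / len(offsets)
```

A result of 0.097 from this function would correspond to the 97 ms figure above, under the stated matching assumption.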

5. Data pre-tagging: speaker diarization

This is the speaker diarization system developed based on BUT's diarization system introduced in Analysis of the BUT Diarization System for VoxConverse Challenge.

The speaker diarization framework generally involves an embedding stage followed by a clustering stage.
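
The two-stage idea can be sketched with a toy clustering step. The code below is a simplified illustration, not the BUT system: it assumes the embedding stage has already produced one non-zero vector per speech segment, and it uses plain agglomerative clustering with average linkage on cosine similarity. The function names and the threshold value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cluster_segments(embeddings, threshold=0.5):
    """Return one integer speaker label per segment embedding."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best_score, best_pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise similarity between the two clusters.
                score = sum(
                    cosine(embeddings[a], embeddings[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        # Stop merging once the closest pair is no longer similar enough.
        if best_score < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels
```

Segments that receive the same label are attributed to the same speaker, and the number of speakers falls out of the stopping threshold rather than being fixed in advance.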

We tested the pipeline with the VoxConverse corpus, which is an audio-visual diarization dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from videos collected on the internet. The DER achieved on VoxConverse using the BUT system is 4.41%, which is consistent with the result in BUT's report.

To download the dataset, run the following commands inside the container:

cd /opt/scripts/speaker_diarization
./download.sh

After the data has been downloaded, run the test on the VoxConverse data with the following commands inside the container:

cd /opt/scripts/speaker_diarization
./run_test.sh

6. Data pre-tagging: speaker clustering & identification

We utilized the ECAPA-TDNN embedding algorithm introduced in ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification to generate speaker embeddings, which are used for speaker identification. A pre-trained embedding model from the SpeechBrain toolkit is adopted in our pipeline, which produces an EER of 0.7% on the VoxCeleb1 dataset.

Please download the VoxCeleb1 data and then run the test to check the system's performance inside the container:

cd /opt/scripts/SpeakerSec/
./download.sh
./run_test.sh
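
Once embeddings are available, identification can be reduced to comparing a new embedding against enrolled ones. The sketch below is an assumption for illustration and is not the repo's scoring code: it uses cosine similarity with a hypothetical acceptance threshold, and the names identify_speaker and enrolled are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def identify_speaker(embedding, enrolled, threshold=0.6):
    """Return the best-matching enrolled speaker ID, or None.

    enrolled maps a speaker ID to its reference embedding. None is returned
    when no enrolled speaker scores at or above the threshold, which can be
    read as "unknown speaker".
    """
    best_id, best_score = None, threshold
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id
```

In practice the threshold would be tuned on held-out trials, for example at the operating point that yields the EER.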

7. Data pre-tagging: gender detection

An x-vector embedding model plus Multi-layer Perceptron (MLP) classifier framework is implemented in the gender_detection folder. We used the x-vector model introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The pretrained x-vector model was used to extract the x-vectors of the training and test data for the MLP. Our gender detection model achieved 99.85% accuracy on the VoxCeleb1 test set introduced in VoxCeleb: a large-scale speaker identification dataset. To run the gender detection test and check the results, run the following commands inside the container:

cd /opt/scripts/gender_detection
./run_test.sh

8. Data pre-tagging: speech recognition/transcription

To run the experiments on the LibriSpeech test-clean and test-other data with our own chain model, first run the following commands inside the container to download the LibriSpeech data:

cd /opt/scripts/asr_kaldichain
./download_prepare_extract.sh

The test-clean and test-other data will be downloaded inside the container.

In this module, we trained our own ASR model using the Kaldi toolkit introduced in "The kaldi speech recognition toolkit", specifically using the chain model recipe introduced in "Purely sequence-trained neural networks for ASR based on lattice-free MMI", which can be found in Kaldi's repo. We trained our model on 11 corpora at hand, including free public corpora, purchased corpora, and self-owned corpora.

To run the test on the LibriSpeech test-other and test-clean data with our trained model, run the following commands inside the container:

cd /opt/scripts/asr_kaldichain
./run_test.sh
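
Kaldi test scripts conventionally report word error rate (WER). As a rough illustration of what that number means, the sketch below computes WER for one reference/hypothesis pair with a word-level edit distance. It is not Kaldi's scoring code, the function name is hypothetical, and it assumes a non-empty, whitespace-tokenized reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the processed prefix of ref and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A corpus-level WER is normally obtained by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.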

9. Data pre-tagging: domain/topic detection

So far, we have adopted a topic detection pipeline from Multi-label Text Classification using BERT, introduced in a webpage. It was developed by the original author and applies BERT to the problem of multi-label text classification. We assembled the original scripts from the repo to replicate Kaggle’s Toxic Comment Classification Challenge and benchmark BERT’s performance on multi-label text classification.

To run the benchmark test, run the following commands inside the container:

cd /opt/scripts/topic_detection
./run_test.sh
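
What makes a classifier multi-label rather than multi-class is the decision rule applied to the model outputs. The sketch below illustrates that rule under stated assumptions: it takes one raw logit per label (as a BERT classification head would typically produce), applies an independent sigmoid to each, and keeps every label above a threshold. The function name, the label names, and the 0.5 threshold are hypothetical.

```python
import math


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    selected = []
    for name, logit in zip(label_names, logits):
        # Each label is scored independently, so several can be active at once.
        probability = 1.0 / (1.0 + math.exp(-logit))
        if probability >= threshold:
            selected.append(name)
    return selected
```

Because the labels are scored independently, one transcript can be tagged with several topics, or with none at all.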

UHV-OTS dataset format

A detailed explanation of the UHV-OTS dataset format is attached here.

Sample code to parse the UHV-OTS dataset into Kaldi-style format

The script generate_kaldi_file.py is provided to generate the Kaldi-format files needed to run a Kaldi experiment. After you have acquired a batch of UHV-OTS-Speech data, you can run this script as follows:

./generate_kaldi_file.py path-to-batch-data

In this repo, we have prepared a sample batch of data in ./sample_dataset. You can try the conversion script on that folder to check the generated Kaldi files.
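
For readers unfamiliar with Kaldi data directories, the sketch below shows the kind of files such a conversion produces. It is not the code of generate_kaldi_file.py and does not read the UHV-OTS format: the input is a list of dicts with hypothetical field names (utt_id, spk_id, wav_path, transcript), and it assumes one audio file per utterance, so no segments file is written.

```python
import os


def write_kaldi_dir(utterances, out_dir):
    """Write wav.scp, text, and utt2spk for a list of utterance dicts."""
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects each file to be sorted by its first column (the utterance ID).
    rows = sorted(utterances, key=lambda u: u["utt_id"])
    # Each output file is "<utt_id> <value>" with a different value column.
    columns = {"wav.scp": "wav_path", "text": "transcript", "utt2spk": "spk_id"}
    for filename, field in columns.items():
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(f"{row['utt_id']} {row[field]}\n")
```

Prefixing each utterance ID with its speaker ID, as in spk1-utt1, keeps the utterance and speaker orderings consistent, which Kaldi's data validation relies on.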

Speech Annotation Instruction

A detailed annotation guideline is attached here.

License

Software license

The code and pre-trained models of our speech data pre-processing and pre-tagging pipeline are under the Apache 2.0 license to allow reproduction of the results reported in the paper.

Dataset license

The UHV-OTS speech corpora development is an ongoing, long-term Appen project to support commercial and academic research data needs for tasks related to speech processing.

Dataset consumers can visit https://appen.com/off-the-shelf-datasets/ to order existing datasets or contact us to discuss their specific dataset needs. Appen will consolidate those needs and adjust our UHV-OTS delivery pipeline accordingly, to deliver the datasets in highest demand.

Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. These free datasets will be downloadable from Appen's https://appen.com/open-source-datasets/ website. The first batch of freely available datasets will be released in late 2021.

