Research code for the paper "Fine-tuning wav2vec2 for speaker recognition"

Overview

Fine-tuning wav2vec2 for speaker recognition

This is the code used to run the experiments in https://arxiv.org/abs/2109.15053. Detailed logs of each training run can be found here:

Installing dependencies

If poetry is not installed, see https://python-poetry.org/docs/. We also expect at least python 3.8 on the system. If this is not the case, look into https://github.com/pyenv/pyenv for an easy tool to install a specific python version on your system.
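
As a quick sanity check before installing anything, the version requirement above can be tested from Python itself. This is only a minimal sketch; the helper name `python_is_supported` is made up for illustration and is not part of this repository.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = sys.version.split()[0]
    if python_is_supported():
        print("python version OK:", found)
    else:
        print("python >= 3.8 is required, found", found)
```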

The python dependencies can be installed (in a project-specific virtual environment) by:

$ poetry shell  # enter project-specific virtual environment

From now on, every command that should be run inside the virtual environment is shown with the prompt (xxx) $ , which is short for the full prompt of the environment (it looks like (wav2vec-speaker-identification- -py ) $ ).

Then install all required python packages:

(xxx) $ pip install -U pip
(xxx) $ poetry update # install dependencies 

Because PyTorch is currently serving its packages on PyPI incorrectly, we need to use pip to install the specific PyTorch versions we need.

(xxx) $ pip install -r requirements/requirements_cuda101.txt # if CUDA 10.1
(xxx) $ pip install -r requirements/requirements_cuda110.txt # if CUDA 11.0

Make sure to modify/create a requirements file for your operating system and CUDA version.

Finally, install the local package in the virtual environment by running

(xxx) $ poetry install

Setting up the environment

Copy the example environment variables:

$ cp .env.example .env 

You can then fill in .env accordingly.
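
To check that .env was filled in, a small parser like the one below can help. It is a sketch under assumptions: it only understands plain KEY=VALUE lines (optionally prefixed with `export`), and the only variable it checks is DATA_FOLDER, which is the one the rest of this README relies on. The function names are hypothetical; other keys in .env.example are not covered here.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=("DATA_FOLDER",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("missing or empty keys:", missing_keys(values))
```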

Downloading and using voxceleb1 and 2

In my experience, the download links for voxceleb1/2 can be unstable. I recommend manually downloading the dataset from the google drive link displayed on https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html.

You should end up with 4 zip files, which should be placed in $DATA_FOLDER/voxceleb_archives.

  1. vox1_dev_wav.zip
  2. vox1_test_wav.zip
  3. vox2_dev_aac.zip
  4. vox2_test_aac.zip
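
Before starting the long conversion step, it can be worth confirming that all four archives listed above are in place. The following is a minimal sketch; `missing_archives` is a hypothetical helper, and it only checks that the files exist, not that the downloads are complete.

```python
import os
from pathlib import Path

# The four archive names listed in this README.
EXPECTED_ARCHIVES = (
    "vox1_dev_wav.zip",
    "vox1_test_wav.zip",
    "vox2_dev_aac.zip",
    "vox2_test_aac.zip",
)


def missing_archives(data_folder):
    """Return the expected archives not found in <data_folder>/voxceleb_archives."""
    archive_dir = Path(data_folder) / "voxceleb_archives"
    return [name for name in EXPECTED_ARCHIVES if not (archive_dir / name).is_file()]


if __name__ == "__main__":
    # DATA_FOLDER is expected to be exported, e.g. via `source .env`.
    folder = os.environ.get("DATA_FOLDER", ".")
    print("missing archives:", missing_archives(folder))
```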

You should also download the meta files of voxceleb. You can use preparation_scripts/download_pretrained_models.sh to download them to the expected location $DATA_FOLDER/voxceleb_meta.

Converting voxceleb2 data from .m4a to .wav

This requires ffmpeg to be installed on the machine. Check with ffmpeg -version. Assuming the voxceleb2 data is placed at $DATA_FOLDER/voxceleb_archives/vox2_dev_aac.zip and $DATA_FOLDER/voxceleb_archives/vox2_test_aac.zip, run the following commands, starting from the root project directory.

source .env

PDIR=$PWD # folder where this README is located
D=$DATA_FOLDER # location of data - should be set in .env file 
WORKERS=$(nproc --all) # number of CPUs available 

# extract voxceleb 2 data
cd $D
mkdir -p convert_tmp/train convert_tmp/test

unzip voxceleb_archives/vox2_dev_aac.zip -d convert_tmp/train
unzip voxceleb_archives/vox2_test_aac.zip -d convert_tmp/test

# run the conversion script
cd $PDIR
poetry run python preparation_scripts/voxceleb2_convert_to_wav.py $D/convert_tmp --num_workers $WORKERS

# rezip the converted data
cd $D/convert_tmp/train
zip $D/voxceleb_archives/vox2_dev_wav.zip wav -r

cd $D/convert_tmp/test
zip $D/voxceleb_archives/vox2_test_wav.zip wav -r

# delete the unzipped .m4a files
cd $D
rm -r convert_tmp

Note that this process can take a few hours on a fast machine and day(s) on a single (slow) cpu. Make sure to save the vox2_dev_wav.zip and vox2_test_wav.zip files somewhere secure, so you don't have to redo this process :).
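
For readers curious what a per-file conversion might look like, here is a rough sketch of calling ffmpeg from Python. It is not the code of preparation_scripts/voxceleb2_convert_to_wav.py, which is not shown here. The 16 kHz mono output and the helper names are assumptions made for illustration; `-y`, `-loglevel`, `-i`, `-ac` and `-ar` are standard ffmpeg options.

```python
import subprocess
from pathlib import Path


def wav_path_for(m4a_path):
    """Place the .wav next to the source file, with the same stem."""
    return Path(m4a_path).with_suffix(".wav")


def ffmpeg_command(m4a_path, sample_rate=16000):
    """Build the ffmpeg call; 16 kHz mono output is an assumption."""
    src = Path(m4a_path)
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", str(src),
        "-ac", "1",                # one audio channel
        "-ar", str(sample_rate),   # output sample rate in Hz
        str(wav_path_for(src)),
    ]


def convert(m4a_path):
    """Run ffmpeg for one file; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_command(m4a_path), check=True)
```

Converting many files in parallel (as --num_workers suggests) could then be done by mapping `convert` over all .m4a paths with a process pool.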

Downloading pre-trained models

You can run ./preparation_scripts/download_pretrained_models.sh to download the pre-trained models of wav2vec2 to the required $DATA_FOLDER/pretrained_models directory.

Running the experiments

Below we show all the commands for training the specified network. They should reproduce the results in the paper. Note that we used a SLURM GPU cluster, and each command therefore includes hydra/launcher=slurm. If you want to reproduce the results locally, remove the hydra/launcher=slurm and hydra.launcher.* arguments from each command.
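
Stripping the cluster-specific arguments by hand gets tedious, so here is a small sketch that does it for a command given as a list of arguments. The helper name `strip_slurm_overrides` is hypothetical. Without a launcher override, Hydra falls back to its default launcher, which runs the jobs of a multirun (-m) one after another on the local machine.

```python
def strip_slurm_overrides(args):
    """Drop the launcher selection and all hydra.launcher.* settings."""
    return [
        arg for arg in args
        if not arg.startswith("hydra/launcher=")
        and not arg.startswith("hydra.launcher.")
    ]


if __name__ == "__main__":
    cluster_cmd = [
        "python", "run.py", "-m", "+experiment=speaker_wav2vec2_ce",
        "data.dataloader.train_batch_size=66", "optim.algo.lr=9e-5",
        "seed=26160,79927,90537",
        "hydra/launcher=slurm", "hydra.launcher.exclude=cn104",
        "hydra.launcher.array_parallelism=3",
    ]
    print(" ".join(strip_slurm_overrides(cluster_cmd)))
```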

wav2vec2-sv-ce

auto_lr_find

python run.py +experiment=speaker_wav2vec2_ce \
tune_model=True data/module=voxceleb1 \
trainer.auto_lr_find=auto_lr_find tune_iterations=5000

5k iters, visually around 1e-4

grid search

grid = 1e-5, 5e-5, 9e-5, 1e-4, 2e-4, 5e-4, 1e-3

python run.py -m +experiment=speaker_wav2vec2_ce \
data.dataloader.train_batch_size=66 \
optim.algo.lr=1e-5,5e-5,9e-5,1e-4,2e-4,5e-4,1e-3 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=7

best performance n=3

python run.py -m +experiment=speaker_wav2vec2_ce \
data.dataloader.train_batch_size=66 optim.algo.lr=9e-5 \
seed=26160,79927,90537 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=3

best pooling n=3

python run.py -m +experiment=speaker_wav2vec2_ce \
data.dataloader.train_batch_size=66 optim.algo.lr=9e-5 \
seed=168621,597558,440108 \
network.stat_pooling_type=mean,mean+std,attentive,quantile,first,first+cls,last,middle,random,max \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=4
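
With -m, comma-separated values are swept as a Cartesian product, so the pooling command above launches one job per (seed, pooling type) pair. The sketch below reproduces that expansion in plain Python to make the job count explicit. It is a simplification that only handles plain comma-separated lists, not Hydra's richer sweep syntax, and `expand_sweep` is a made-up name.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand {'key': 'a,b,c', ...} into one dict of overrides per job."""
    keys = list(overrides)
    value_lists = [overrides[key].split(",") for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    jobs = expand_sweep({
        "seed": "168621,597558,440108",
        "network.stat_pooling_type":
            "mean,mean+std,attentive,quantile,first,first+cls,last,middle,random,max",
    })
    print(len(jobs), "jobs, first:", jobs[0])
```

For this command that is 3 seeds times 10 pooling types, so 30 jobs in total.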

wav2vec2-sv-aam

aam with m=0.2 and s=30
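
For reference, the additive angular margin (AAM) softmax adds the margin m to the angle between an embedding and its target-class weight, then scales all cosines by s before the usual cross-entropy. The sketch below shows that computation for a single sample in plain Python. It is not the loss implementation of this repository: it omits the special handling that practical implementations often add for angles near pi, and the function name is hypothetical.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Cross-entropy over scale*cos(theta), with `margin` added to the target angle.

    `cosines` holds the cosine similarity between one embedding and each
    class weight vector; `target` is the index of the true class.
    """
    logits = []
    for index, cosine in enumerate(cosines):
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding
        if index == target:
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    # numerically stable log-sum-exp
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


if __name__ == "__main__":
    cosines = [0.7, 0.1, -0.2]
    print("with margin:   ", aam_softmax_loss(cosines, target=0))
    print("without margin:", aam_softmax_loss(cosines, target=0, margin=0.0))
```

The margin lowers the target logit, so the same embedding incurs a higher loss than with plain softmax, which pushes classes further apart in angle.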

auto_lr_find

python run.py +experiment=speaker_wav2vec2_ce \
tune_model=True data/module=voxceleb1 \
trainer.auto_lr_find=auto_lr_find tune_iterations=5000 \
optim/loss=aam_softmax

grid search

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 \
optim.algo.lr=1e-5,5e-5,9e-5,1e-4,2e-4,5e-4,1e-3 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=7

same grid

best performance n=3

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=29587,14352,70814 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=3

best pooling n=3

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=392401,39265,62634  \
network.stat_pooling_type=mean,mean+std,attentive,quantile,first,first+cls,last,middle,random,max \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=4

wav2vec2-sv-bce

auto_lr_find

python run.py +experiment=speaker_wav2vec2_pairs \
tune_model=True data/module=voxceleb1_pairs \
trainer.auto_lr_find=auto_lr_find tune_iterations=5000

grid search

5e-6,7e-6,9e-6,1e-5,2e-5,3e-5,4e-5,1e-4

python run.py -m +experiment=speaker_wav2vec2_pairs \
optim.algo.lr=5e-6,7e-6,9e-6,1e-5,2e-5,3e-5,4e-5,1e-4 \
data.dataloader.train_batch_size=32 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=8

best performance n=4

python run.py -m +experiment=speaker_wav2vec2_pairs \
optim.algo.lr=0.00003 data.dataloader.train_batch_size=32 \
seed=154233,979426,971817,931201 \
hydra/launcher=slurm hydra.launcher.exclude=cn104 hydra.launcher.array_parallelism=4 

xvector

auto_lr_find

python run.py +experiment=speaker_xvector \
tune_model=True data/module=voxceleb1 \
trainer.auto_lr_find=auto_lr_find tune_iterations=5000

grid search

1e-5,6e-5,1e-4,2e-4,3e-4,4e-4,8e-4,1e-3

python run.py -m +experiment=speaker_xvector \
optim.algo.lr=1e-5,6e-5,1e-4,2e-4,3e-4,4e-4,8e-4,1e-3 \
data.dataloader.train_batch_size=66 \
hydra/launcher=slurm hydra.launcher.exclude=cn105 hydra.launcher.array_parallelism=8

best performance n=3

python run.py -m +experiment=speaker_xvector \
optim.algo.lr=0.0004 trainer.max_steps=100_000 \
data.dataloader.train_batch_size=66 \
seed=82713,479728,979292 \
hydra/launcher=slurm hydra.launcher.exclude=cn105 hydra.launcher.array_parallelism=6

ecapa-tdnn

auto_lr_find

python run.py +experiment=speaker_ecapa_tdnn \
tune_model=True data/module=voxceleb1 \
trainer.auto_lr_find=auto_lr_find tune_iterations=5000

grid search

5e-6,1e-5,5e-4,1e-4,5e-3,7e-4,9e-4,1e-3

python run.py -m +experiment=speaker_ecapa_tdnn \
optim.algo.lr=5e-6,1e-5,5e-4,1e-4,5e-3,7e-4,9e-4,1e-3 \
data.dataloader.train_batch_size=66 \
hydra/launcher=slurm hydra.launcher.exclude=cn105 hydra.launcher.array_parallelism=8

best performance n=3

python run.py -m +experiment=speaker_ecapa_tdnn \
optim.algo.lr=0.001 trainer.max_steps=100_000 \
data.dataloader.train_batch_size=66 \
seed=494671,196126,492116 \
hydra/launcher=slurm hydra.launcher.exclude=cn105 hydra.launcher.array_parallelism=6

Ablation

baseline

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=392401,39265,62634 network.stat_pooling_type=first+cls \
hydra/launcher=slurm hydra.launcher.array_parallelism=3

unfrozen feature extractor

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=914305,386390,865459 network.stat_pooling_type=first+cls \
network.completely_freeze_feature_extractor=False tag=no_freeze \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 hydra.launcher.exclude=cn104

no pre-trained weights

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=517646,414321,137524 network.stat_pooling_type=first+cls \
network.completely_freeze_feature_extractor=False network.reset_weights=True tag=no_pretrain \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 hydra.launcher.exclude=cn104

no layerdrop

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=15249,728106,821754 network.stat_pooling_type=first+cls \
network.layerdrop=0.0 tag=no_layer \
hydra/launcher=slurm hydra.launcher.array_parallelism=3

no dropout

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=627687,883727,154405 network.stat_pooling_type=first+cls \
network.layerdrop=0.0 network.attention_dropout=0 \
network.feat_proj_dropout=0 network.hidden_dropout=0 tag=no_drop \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 

no time masking

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=602400,553540,419322 network.stat_pooling_type=first+cls \
network.layerdrop=0.0 network.attention_dropout=0 network.feat_proj_dropout=0 \
network.hidden_dropout=0 network.mask_time_prob=0 tag=no_mask \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 

batch size 32

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=32 trainer.max_steps=200_000 \
optim.algo.lr=0.00005 network.stat_pooling_type=first+cls \
tag=bs_32 seed=308966,753370,519822 \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 

batch size 128

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=128 trainer.max_steps=50_000 \
optim.algo.lr=0.00005 seed=54375,585956,637400 \
network.stat_pooling_type=first+cls tag=bs_128 \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 hydra.launcher.exclude=cn104
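
The two batch-size ablations change trainer.max_steps together with the batch size. Multiplying the two values from the commands above suggests that the intent is to keep the number of training samples seen equal across the two runs; this is an inference from the numbers, not something stated elsewhere in this README. A quick check:

```python
def samples_seen(batch_size, max_steps):
    """Number of training samples processed over a full run."""
    return batch_size * max_steps


# (train_batch_size, max_steps) taken from the two ablation commands
RUNS = {
    "bs_32": (32, 200_000),
    "bs_128": (128, 50_000),
}

if __name__ == "__main__":
    for tag, (batch_size, max_steps) in RUNS.items():
        print(tag, samples_seen(batch_size, max_steps))
```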

constant lr=3e-6

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=3e-6 \
seed=549686,190215,637679 network.stat_pooling_type=first+cls \
optim/schedule=constant tag=lr_low \
hydra/launcher=slurm hydra.launcher.array_parallelism=3 

constant lr=5e-5

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=419703,980724,124995 network.stat_pooling_type=first+cls \
optim/schedule=constant tag=lr_same \
hydra/launcher=slurm hydra.launcher.array_parallelism=3  

tri_stage

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 \
seed=856797,952324,89841 network.stat_pooling_type=first+cls \
optim/schedule=tri_stage tag=lr_3stage \
optim.schedule.scheduler.lr_lambda.initial_lr=1e-7 optim.schedule.scheduler.lr_lambda.final_lr=1e-7 \
hydra/launcher=slurm hydra.launcher.array_parallelism=3
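
A tri-stage schedule is commonly understood as a linear warm-up to the peak learning rate, a constant phase, and a decay down to a final value. The sketch below illustrates that shape with the initial_lr, final_lr and peak lr values from the command above. Everything else is an assumption for illustration: the 10%/40%/50% split of the steps, the exponential form of the decay, and the total of 100,000 steps are not taken from this repository's configuration.

```python
def tri_stage_lr(step, total_steps, peak_lr=5e-5, initial_lr=1e-7,
                 final_lr=1e-7, warmup_ratio=0.1, constant_ratio=0.4):
    """Learning rate at `step`: linear warm-up, constant hold, exponential decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    constant_steps = int(total_steps * constant_ratio)
    decay_steps = max(1, total_steps - warmup_steps - constant_steps)

    if step < warmup_steps:
        # stage 1: linear increase from initial_lr to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    if step < warmup_steps + constant_steps:
        # stage 2: hold the peak learning rate
        return peak_lr
    # stage 3: exponential decay from peak_lr to final_lr
    progress = min(1.0, (step - warmup_steps - constant_steps) / decay_steps)
    return peak_lr * (final_lr / peak_lr) ** progress


if __name__ == "__main__":
    total = 100_000  # hypothetical number of training steps
    for step in (0, 5_000, 10_000, 50_000, 75_000, 100_000):
        print(step, tri_stage_lr(step, total))
```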

exp decay

python run.py -m +experiment=speaker_wav2vec2_aam \
data.dataloader.train_batch_size=66 optim.algo.lr=0.00005 seed=962764,682423,707761 \
network.stat_pooling_type=first+cls optim/schedule=exp_decay tag=lr_exp_decay \
optim.schedule.scheduler.lr_lambda.final_lr=1e-7 \
hydra/launcher=slurm hydra.launcher.array_parallelism=3  