Code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition"

Last update: Dec 01, 2022

Overview

SEW (Squeezed and Efficient Wav2vec)

The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition" by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q Weinberger, and Yoav Artzi.

Model Checkpoints

Pre-trained without supervision on LibriSpeech 960h

Model	Pre-training updates	Dataset	Checkpoint
W2V2-tiny	100K	LibriSpeech 960h	download
W2V2-small	100K	LibriSpeech 960h	download
W2V2-mid	100K	LibriSpeech 960h	download
W2V2-base	100K	LibriSpeech 960h	download
SEW-tiny	100K	LibriSpeech 960h	download
SEW-small	100K	LibriSpeech 960h	download
SEW-mid	100K	LibriSpeech 960h	download
SEW-D-tiny	100K	LibriSpeech 960h	download
SEW-D-small	100K	LibriSpeech 960h	download
SEW-D-mid	100K	LibriSpeech 960h	download
SEW-D-mid (k127)	100K	LibriSpeech 960h	download
SEW-D-base	100K	LibriSpeech 960h	download
SEW-D-base+	100K	LibriSpeech 960h	download
SEW-D-mid	400K	LibriSpeech 960h	download
SEW-D-mid (k127)	400K	LibriSpeech 960h	download
SEW-D-base+	400K	LibriSpeech 960h	download

Usage

Dependencies

The code is tested with fairseq commit 05255f9, DeBERTa commit bf17ca4, and the following packages:

torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1
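
To compare an existing environment against the pins above, a small check along these lines may help. This is only a sketch and not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the comparison is a plain string match after dropping any local build suffix such as `+cu111`.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "torch": "1.8.0",
    "torchaudio": "0.8.0",
    "tqdm": "4.49.0",
    "Hydra": "2.5",
    "hydra-core": "1.0.4",
    "fvcore": "0.1.5.post20210330",
    "omegaconf": "2.0.5",
    "einops": "0.3.0",
    "fire": "0.2.1",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        # Drop a local build suffix (for example "+cu111") before comparing.
        return metadata.version(name).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version. The fairseq and DeBERTa commits are not covered by this check.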

Apex

Please install NVIDIA's apex with the following commands:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

wav2letter decoder

Currently, we decode with the wav2letter v0.2 Python binding at commit 96f5f9d. Please install the Python binding from https://github.com/flashlight/wav2letter/tree/96f5f9d3b41e01af0a031ee0d2604acd9ef3b1b0/bindings/python. The newest commit in the v0.2 branch, d5a93f0, leads to worse WER for the wav2vec 2.0 baselines.

Installation

git clone https://github.com/asappresearch/sew.git
cd sew 
pip install -e .

Pre-training

Pre-training SEW models

Run the following command, where $model_size can be tiny, small, or mid, and $ngpu is the number of GPUs you want to use.

bash scripts/pt-sew.sh $model_size $ngpu

Pre-training SEW-D models

bash scripts/pt-sew-d.sh $model_size $ngpu

where $model_size can be tiny, small, mid, mid-k127, base, or base+.
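
When launching these scripts from Python, it can help to validate the arguments first, since the two model families accept different sizes. The sketch below is an illustration only: the function `pretrain_command` and the family keys `"sew"` and `"sew-d"` are hypothetical names, while the script paths and size lists are taken from the text above.

```python
import shlex

# Model sizes accepted by each pre-training script, as listed above.
SCRIPTS = {
    "sew": ("scripts/pt-sew.sh", ("tiny", "small", "mid")),
    "sew-d": (
        "scripts/pt-sew-d.sh",
        ("tiny", "small", "mid", "mid-k127", "base", "base+"),
    ),
}


def pretrain_command(family, model_size, ngpu):
    """Build the argument list for a pre-training run, validating inputs."""
    if family not in SCRIPTS:
        raise ValueError(f"unknown model family: {family!r}")
    script, sizes = SCRIPTS[family]
    if model_size not in sizes:
        raise ValueError(f"{family} supports sizes {sizes}, got {model_size!r}")
    if ngpu < 1:
        raise ValueError("ngpu must be at least 1")
    return ["bash", script, model_size, str(ngpu)]


if __name__ == "__main__":
    # Print the shell command instead of running it.
    print(shlex.join(pretrain_command("sew-d", "base+", 8)))
```

The returned list can be passed to `subprocess.run` from the repository root. A request such as `pretrain_command("sew", "base", 8)` raises `ValueError`, because the SEW script only covers tiny, small, and mid.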

Fine-tuning

Run the following script to fine-tune a model with the hyperparameters from wav2vec 2.0.

bash scripts/ft-model.sh $pre_trained_model $split $ngpu

where $pre_trained_model can be a W2V2, SEW, or SEW-D model checkpoint, and $split can be 10m, 1h, 10h, or 100h.

We also provide a set of hyperparameters that keeps all dropout rates the same as in the pre-training stage. We found this setting to be more stable.

bash scripts/ft-model-stable.sh $pre_trained_model $split $ngpu

If you hit a GPU out-of-memory error, scale down dataset.max_tokens and scale up optimization.update_freq in scripts/ft-model.sh. For example, change these lines

  dataset.max_tokens=3200000 \
  optimization.update_freq="[$((8 / $ngpu))]" \

to

  dataset.max_tokens=1600000 \
  optimization.update_freq="[$((16 / $ngpu))]" \

This halves the per-step batch size and doubles the number of gradient accumulation steps, which lowers GPU memory use.
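
The arithmetic behind this change can be checked with a short sketch. It assumes, as in fairseq, that dataset.max_tokens is a per-GPU limit, so the product below is an upper bound on the audio samples consumed per optimizer update. The function names are hypothetical.

```python
def update_freq(numerator, ngpu):
    """Mirror the shell expression $((numerator / $ngpu)), which truncates."""
    return numerator // ngpu


def max_tokens_per_update(max_tokens, freq, ngpu):
    """Upper bound on tokens per optimizer update across all GPUs."""
    return max_tokens * freq * ngpu


if __name__ == "__main__":
    ngpu = 4
    original = max_tokens_per_update(3200000, update_freq(8, ngpu), ngpu)
    reduced = max_tokens_per_update(1600000, update_freq(16, ngpu), ngpu)
    # Both settings give the same bound when ngpu divides 8 evenly.
    print(original, reduced)
```

With 4 GPUs both settings give a bound of 25,600,000 tokens per update, so only the per-step memory footprint changes. If $ngpu does not divide 8 or 16 evenly, the integer division truncates and the two bounds can differ.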

Evaluation

Please run this script to prepare the official LibriSpeech 4-gram language model.

bash scripts/prepare_librispeech_lm.sh $kenlm_build_bin

where $kenlm_build_bin is the folder that contains the KenLM build_binary executable file (e.g. /home/user/kenlm/build/bin).
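
Before running the preparation script, one could confirm that the folder really contains the build_binary executable. The check below is a minimal sketch; the function name `has_build_binary` is hypothetical and it only inspects the file system.

```python
import os


def has_build_binary(kenlm_build_bin):
    """Return True if the folder holds an executable named build_binary."""
    path = os.path.join(kenlm_build_bin, "build_binary")
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Example folder from the text above; adjust to your own KenLM build.
    print(has_build_binary("/home/user/kenlm/build/bin"))
```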

Then run this script to evaluate a pre-trained ASR model:

python tools/eval_w2v.py tunelm --subsets '["dev-clean", "dev-other", "test-clean", "test-other"]' --model $asr_checkpoint
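
The --subsets argument is a list literal that must reach the script as a single shell argument, which is why it is wrapped in single quotes above. When building the command programmatically, the quoting can be left to the standard library, as in this sketch. The function name `eval_command` and the checkpoint path `ckpt.pt` are hypothetical; the script path, subcommand, and flags are the ones shown above.

```python
import json
import shlex


def eval_command(asr_checkpoint, subsets):
    """Build the argument list for tools/eval_w2v.py tunelm."""
    return [
        "python",
        "tools/eval_w2v.py",
        "tunelm",
        "--subsets",
        json.dumps(list(subsets)),  # one argument holding the whole list
        "--model",
        asr_checkpoint,
    ]


if __name__ == "__main__":
    cmd = eval_command(
        "ckpt.pt", ["dev-clean", "dev-other", "test-clean", "test-other"]
    )
    # shlex.join adds the single quotes needed to paste this into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run` directly avoids shell quoting altogether, while `shlex.join` is useful for logging the exact command.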

Comments

8000 sample rate audio

Hello there,

I'm trying to train on an audio dataset with an 8000 Hz sample rate. Is it enough to simply add task.sample_rate=8000 to the fairseq command, or are there additional config changes that I should make?

I would much appreciate any advice

Thank you

opened by Mega4alik 0

How to train using not English Languages

Hi! Thank you for the awesome model!

We are very interested in your project and are trying to use SEW for Japanese. When we train the model, should we use these scripts? Thanks! https://github.com/asappresearch/sew/tree/master/scripts

opened by jigenji 1

:bug: Fix padding mask calculation

This PR updates the padding mask calculation to be the same as the one in the reference Wav2Vec2 implementation (same commit as listed in SEW's README): https://github.com/pytorch/fairseq/blob/05255f96410e5b1eaf3bf59b767d5b4b7e2c3a35/fairseq/models/wav2vec/wav2vec2.py#L477

For more details on how and why it was fixed in fairseq, check out this PR by @patrickvonplaten https://github.com/pytorch/fairseq/pull/3228

opened by anton-l 0

Releases(v0.0.1)

v0.0.1(Sep 15, 2021)

First release.

Owner

ASAPP Research
