Small repo describing how to use Hugging Face's Wav2Vec2 with PyCTCDecode

Overview

🤗 Transformers Wav2Vec2 + PyCTCDecode

Introduction

This repo shows how 🤗 Transformers can be used in combination with kensho-technologies's PyCTCDecode & KenLM ngram as a simple way to boost word error rate (WER).

Included is a file to create an ngram with KenLM as well as a simple evaluation script to compare the results of using Wav2Vec2 with PyCTCDecode + KenLM vs. without using any language model.

Note: The scripts are written to be used on GPU. If you want to use a CPU instead, simply remove all .to("cuda") occurances in eval.py.

Installation

In a first step, one should install KenLM. For Ubuntu, it should be enough to follow the installation steps described here. The installed kenlm folder should be move into this repo for ./create_ngram.py to function correctly. Alternatively, one can also link the lmplz binary file to a lmplz bash command to directly run lmplz instead of ./kenlm/build/bin/lmplz.

Next, some Python dependencies should be installed. Assuming PyTorch is installed, it should be sufficient to run pip install -r requirements.txt.

Run evaluation

Create ngram

In a first step on should create a ngram. E.g. for polish the command would be:

./create_ngram.py --language polish --path_to_ngram polish.arpa

After the language model is created, one should open the file. one should add a </s> The file should have a structure which looks more or less as follows:

\data\        
ngram 1=86586
ngram 2=546387
ngram 3=796581           
ngram 4=843999             
ngram 5=850874              
                                                  
\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356                                                                            
-3.4645514      drugi   -0.2088903
...

Now it is very important also add a </s> token to the n-gram so that it can be correctly loaded. You can simple copy the line:

0 <s> -0.06677356

and change <s> to </s>. When doing this you should also inclease ngram by 1. The new ngram should look as follows:

\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356
0       </s>     -0.06677356
-3.4645514      drugi   -0.2088903
...

Now the ngram can be correctly used with pyctcdecode

Run eval

Having created the ngram, one can run:

./eval.py --language polish --path_to_ngram polish.arpa

To compare Wav2Vec2 + LM vs. Wav2Vec2 + No LM on polish.

Results

Without tuning any hyperparameters, the following results were obtained:

Comparison of Wav2Vec2 without Language model vs. Wav2Vec2 with `pyctcdecode` + KenLM 5gram.
Fine-tuned Wav2Vec2 models were used and evaluated on MLS datasets.
Take a closer look at `./eval.py` for comparison

==================================================portuguese==================================================
polish - No LM - | WER: 0.3069742867206763 | CER: 0.06054530156286364 | Time: 58.04590034484863
polish - With LM - | WER: 0.2291299753434308 | CER: 0.06211174564528545 | Time: 191.65409898757935

==================================================spanish==================================================
portuguese - No LM - | WER: 0.18208286674132138 | CER: 0.05016682956422096 | Time: 114.61633825302124
portuguese - With LM - | WER: 0.1487761958086706 | CER: 0.04489231909945738 | Time: 429.78511357307434

==================================================polish==================================================
spanish - No LM - | WER: 0.2581272104769545 | CER: 0.0703088156033147 | Time: 147.8634352684021
spanish - With LM - | WER: 0.14927852292116295 | CER: 0.052034208044195916 | Time: 563.0732748508453

It can be seen that the word error rate (WER) is significantly improved when using PyCTCDecode + KenLM. However, the character error rate (CER) does not improve as much or not at all. This is expected since using a language model will make sure that words that are predicted are words that exist in the language's vocabulary. Wav2Vec2 without a LM produces many words that are more or less correct but contain a couple of spelling errors, thus not contributing to a good WER. Those words are likely to be "corrected" by Wav2Vec2 + LM leading to an improved WER. However a Wav2Vec2 already has a good character error rate as its vocabulary is composed of characters meaning that a "word-based" language model doesn't really help in this case.

Overall WER is probably the more important metric though, so it might make a lot of sense to add a LM to Wav2Vec2.

In terms of speed, adding a LM significantly reduces speed. However, the script is not at all optimized for speed so using multi-processing and batched inference would significantly speed up both Wav2Vec2 without LM and with LM.

Owner
Patrick von Platen
Patrick von Platen
Deploy recommendation engines with Edge Computing

RecoEdge: Bringing Recommendations to the Edge A one stop solution to build your recommendation models, train them and, deploy them in a privacy prese

NimbleEdge 131 Jan 02, 2023
Official implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis https://arxiv.org/abs/2011.13775

CIPS -- Official Pytorch Implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis Requirements pip install -r requi

Multimodal Lab @ Samsung AI Center Moscow 201 Dec 21, 2022
Least Square Calibration for Peer Reviews

Least Square Calibration for Peer Reviews Requirements gurobipy - for solving convex programs GPy - for Bayesian baseline numpy pandas To generate p

Sigma <a href=[email protected]"> 1 Nov 01, 2021
Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

ASUTOSH GHANTO 1 Dec 16, 2021
StorSeismic: An approach to pre-train a neural network to store seismic data features

StorSeismic: An approach to pre-train a neural network to store seismic data features This repository contains codes and resources to reproduce experi

Seismic Wave Analysis Group 11 Dec 05, 2022
Official PyTorch implementation of the paper: DeepSIM: Image Shape Manipulation from a Single Augmented Training Sample

DeepSIM: Image Shape Manipulation from a Single Augmented Training Sample (ICCV 2021 Oral) Project | Paper Official PyTorch implementation of the pape

Eliahu Horwitz 393 Dec 22, 2022
xitorch: differentiable scientific computing library

xitorch is a PyTorch-based library of differentiable functions and functionals that can be widely used in scientific computing applications as well as deep learning.

24 Apr 15, 2021
CVPR 2022 "Online Convolutional Re-parameterization"

OREPA: Online Convolutional Re-parameterization This repo is the PyTorch implementation of our paper to appear in CVPR2022 on "Online Convolutional Re

Mu Hu 121 Dec 21, 2022
Collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and related datasets

The repository collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and related datasets

Jun Chen 139 Dec 21, 2022
Existing Literature about Machine Unlearning

Machine Unlearning Papers 2021 Brophy and Lowd. Machine Unlearning for Random Forests. In ICML 2021. Bourtoule et al. Machine Unlearning. In IEEE Symp

Jonathan Brophy 213 Jan 08, 2023
Codes for the ICCV'21 paper "FREE: Feature Refinement for Generalized Zero-Shot Learning"

FREE This repository contains the reference code for the paper "FREE: Feature Refinement for Generalized Zero-Shot Learning". [arXiv][Paper] 1. Prepar

Shiming Chen 28 Jul 29, 2022
A minimal yet resourceful implementation of diffusion models (along with pretrained models + synthetic images for nine datasets)

A minimal yet resourceful implementation of diffusion models (along with pretrained models + synthetic images for nine datasets)

Vikash Sehwag 65 Dec 19, 2022
Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Descript 150 Dec 06, 2022
Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [Paper]

Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [Paper] Downloads [Downloads] Trained ckpt files for NYU Depth V2 and

98 Jan 01, 2023
Compute descriptors for 3D point cloud registration using a multi scale sparse voxel architecture

MS-SVConv : 3D Point Cloud Registration with Multi-Scale Architecture and Self-supervised Fine-tuning Compute features for 3D point cloud registration

42 Jul 25, 2022
Multi Camera Calibration

Multi Camera Calibration 'modules/camera_calibration/app/camera_calibration.cpp' is for calculating extrinsic parameter of each individual cameras. 'm

7 Dec 01, 2022
The code for our paper submitted to RAL/IROS 2022: OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition.

OverlapTransformer The code for our paper submitted to RAL/IROS 2022: OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for

HAOMO.AI 136 Jan 03, 2023
The Self-Supervised Learner can be used to train a classifier with fewer labeled examples needed using self-supervised learning.

Published by SpaceML • About SpaceML • Quick Colab Example Self-Supervised Learner The Self-Supervised Learner can be used to train a classifier with

SpaceML 92 Nov 30, 2022
A hybrid framework (neural mass model + ML) for SC-to-FC prediction

The current workflow simulates brain functional connectivity (FC) from structural connectivity (SC) with a neural mass model. Gradient descent is applied to optimize the parameters in the neural mass

Yilin Liu 1 Jan 26, 2022
Algorithm to texture 3D reconstructions from multi-view stereo images

MVS-Texturing Welcome to our project that textures 3D reconstructions from images. This project focuses on 3D reconstructions generated using structur

Nils Moehrle 766 Jan 04, 2023