End-to-end speech secognition toolkit

Overview

End-to-end speech secognition toolkit

This is an E2E ASR toolkit modified from Espnet1 (version 0.9.9).
This is the official implementation of paper:
Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI
This is also the official implementation of paper:
Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model
We achieve state-of-the-art results on two of the most popular results in Aishell-1 and AIshell-2 Mandarin datasets.
Please feel free to change / modify the code as you like. :)

Update

  • 2021/12/29: Release the first version, which contains all MMI-related features, including MMI training criteria, MMI Prefix Score (for attention-based encoder-decoder, AED) and MMI Alignment Score (For neural transducer, NT).
  • 2022/1/6: Release the word-level N-gram LM scorer.

Environment:

The main dependencies of this code can be divided into three part: kaldi, espnet and k2.

  1. kaldi is mainly used for feature extraction. To install kaldi, please follow the instructions here.
  2. Espnet is a open-source end-to-end speech recognition toolkit. please follow the instructions here to install its environment.
    2.1. Pytorch, cudatoolkit, along with many other dependencies will be install automatically during this process. 2.2. If you are going to use NT models, you are recommend to install a RNN-T warpper. Please run ${ESPNET_ROOT}/tools/installer/install_warp-transducer.sh
    2.3. Once you have installed the espnet envrionment successfully, please run pip uninstall espnet to remove the espnet library. So our code will be used.
    2.4. Also link the kaldi in ${ESPNET_ROOT}: ln -s ${KALDI-ROOT} ${ESPNET_ROOT}
  3. k2 is a python-based FST library. Please follow the instructions here to install it. GPU version is required.
    3.1. To use word N-gram LM, please also install kaldilm
  4. There might be some dependency conflicts during building the environment. We report ours below as a reference:
    4.1 OS: CentOS 7; GCC 7.3.1; Python 3.8.10; CUDA 10.1; Pytorch 1.7.1; k2-fsa 1.2 (very old for now)
    4.2 Other python libraries are in requirement.txt (It is not recommend to use this file to build the environment directly).

Results

Currently we have released examples on Aishell-1 and Aishell-2 datasets.

With MMI training & decoding methods and the word-level N-gram LM. We achieve results on Aishell-1 and Aishell-2 as below. All results are in CER%

Test set Aishell-1-dev Aishell-1-test Aishell-2-ios Aishell-2-android Aishell-2-mic
AED 4.73 5.32 5.73 6.56 6.53
AED + MMI + Word Ngram 4.08 4.45 5.26 6.22 5.92
NT 4.41 4.81 5.70 6.75 6.58
NT + MMI + Word Ngram 3.86 4.18 5.06 6.08 5.98

(example on Librispeech is not fully prepared)

Get Start

Take Aishell-1 as an example. Working process for other examples are very similar.
Prepare data and LMs

cd ${ESPNET_ROOT}/egs/aishell1
source path.sh
bash prepare.sh # prepare the data

split the json file of training data for each GPU. (we use 8GPUs)

python3 espnet_utils/splitjson.py -p 
   
     dump/train_sp/deltafalse/data.json

   

Training and decoding for NT model:

bash nt.sh      # to train the nueal transducer model

Training and decoding for AED model:

bash aed.sh     # or to train the attention-based encoder-decoder model

Several Hint:

  1. Please change the paths in path.sh accordingly before you start
  2. Please change the data to config your data path in prepare.sh
  3. Our code runs in DDP style. Before you start, you need to set them manually. We assume Pytorch distributed API works well on your machine.
export HOST_GPU_NUM=x       # number of GPUs on each host
export HOST_NUM=x           # number of hosts
export NODE_NUM=x           # number of GPUs in total (on all hosts)
export INDEX=x              # index of this host
export CHIEF_IP=xx.xx.xx.xx # IP of the master host
  1. Multiple choices are available during decoding (we take aed.sh as an example, but the usage of nt.sh is the same).
    To use the MMI-related scorers, you need train the model with MMI auxiliary criterion;

To use MMI Prefix Score (in AED) or MMI Alignment score (in NT):

bash aed.sh --stage 2 --mmi-weight 0.2

To use any external LM, you need to train them in advance (as implemented in prepare.sh)

To use word-level N-gram LM:

bash aed.sh --stage 2 --word-ngram-weight 0.4

To use character-level N-gram LM:

bash aed.sh --stage 2 --ngram-weight 1.0

To use neural network LM:

bash aed.sh --stage 2 --lm-weight 1.0

Reference

kaldi: https://github.com/kaldi-asr/kaldi
Espent: https://github.com/espnet/espnet
k2-fsa: https://github.com/k2-fsa/k2

Citations

@article{tian2021consistent,  
  title={Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI},  
  author={Tian, Jinchuan and Yu, Jianwei and Weng, Chao and Zhang, Shi-Xiong and Su, Dan and Yu, Dong and Zou, Yuexian},  
  journal={arXiv preprint arXiv:2112.02498},  
  year={2021}  
}  

@misc{tian2022improving,
      title={Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model}, 
      author={Jinchuan Tian and Jianwei Yu and Chao Weng and Yuexian Zou and Dong Yu},
      year={2022},
      eprint={2201.01995},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Authorship

Jinchuan Tian; [email protected] or [email protected]
Jianwei Yu; [email protected] (supervisor)
Chao Weng; [email protected]
Yuexian Zou; [email protected]

Owner
Jinchuan Tian
Graduate student @ Peking University, Shenzhen; Research intern @ Tencent AI LAB;
This repo is about implementing different approaches of pose estimation and also is a sub-task of the smart hospital bed project :smile:

Pose-Estimation This repo is a sub-task of the smart hospital bed project which is about implementing the task of pose estimation 😄 Many thanks to th

Max 11 Oct 17, 2022
Paaster is a secure by default end-to-end encrypted pastebin built with the objective of simplicity.

Follow the development of our desktop client here Paaster Paaster is a secure by default end-to-end encrypted pastebin built with the objective of sim

Ward 211 Dec 25, 2022
A PyTorch implementation for PyramidNets (Deep Pyramidal Residual Networks)

A PyTorch implementation for PyramidNets (Deep Pyramidal Residual Networks) This repository contains a PyTorch implementation for the paper: Deep Pyra

Greg Dongyoon Han 262 Jan 03, 2023
Replication Code for "Self-Supervised Bug Detection and Repair" NeurIPS 2021

Self-Supervised Bug Detection and Repair This is the reference code to replicate the research in Self-Supervised Bug Detection and Repair in NeurIPS 2

Microsoft 85 Dec 24, 2022
Replication Package for AequeVox:Automated Fariness Testing for Speech Recognition Systems

AequeVox Replication Package for AequeVox:Automated Fariness Testing for Speech Recognition Systems README under development. Python Packages Required

Sai Sathiesh 2 Aug 28, 2022
HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electronic Health Records

HiPAL Code for KDD'22 Applied Data Science Track submission -- HiPAL: A Deep Framework for Physician Burnout Prediction Using Activity Logs in Electro

Hanyang Liu 4 Aug 08, 2022
Leaf: Multiple-Choice Question Generation

Leaf: Multiple-Choice Question Generation Easy to use and understand multiple-choice question generation algorithm using T5 Transformers. The applicat

Kristiyan Vachev 62 Dec 20, 2022
Learning to Initialize Neural Networks for Stable and Efficient Training

GradInit This repository hosts the code for experiments in the paper, GradInit: Learning to Initialize Neural Networks for Stable and Efficient Traini

Chen Zhu 124 Dec 30, 2022
Auxiliary data to the CHIIR paper Searching to Learn with Instructional Scaffolding

Searching to Learn with Instructional Scaffolding This is the data and analysis code for the paper "Searching to Learn with Instructional Scaffolding"

Arthur Câmara 2 Mar 02, 2022
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers.

TransMVSNet This repository contains the official implementation of the paper: "TransMVSNet: Global Context-aware Multi-view Stereo Network with Trans

旷视研究院 3D 组 155 Dec 29, 2022
D-NeRF: Neural Radiance Fields for Dynamic Scenes

D-NeRF: Neural Radiance Fields for Dynamic Scenes [Project] [Paper] D-NeRF is a method for synthesizing novel views, at an arbitrary point in time, of

Albert Pumarola 291 Jan 02, 2023
Unofficial TensorFlow implementation of the Keyword Spotting Transformer model

Keyword Spotting Transformer This is the unofficial TensorFlow implementation of the Keyword Spotting Transformer model. This model is used to train o

Intelligent Machines Limited 8 May 11, 2022
Metrics to evaluate quality and efficacy of synthetic datasets.

An Open Source Project from the Data to AI Lab, at MIT Metrics for Synthetic Data Generation Projects Website: https://sdv.dev Documentation: https://

The Synthetic Data Vault Project 129 Jan 03, 2023
CaFM-pytorch ICCV ACCEPT Introduction of dataset VSD4K

CaFM-pytorch ICCV ACCEPT Introduction of dataset VSD4K Our dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

96 Jul 05, 2022
Self-supervised Deep LiDAR Odometry for Robotic Applications

DeLORA: Self-supervised Deep LiDAR Odometry for Robotic Applications Overview Paper: link Video: link ICRA Presentation: link This is the correspondin

Robotic Systems Lab - Legged Robotics at ETH Zürich 181 Dec 29, 2022
ByteTrack(Multi-Object Tracking by Associating Every Detection Box)のPythonでのONNX推論サンプル

ByteTrack-ONNX-Sample ByteTrack(Multi-Object Tracking by Associating Every Detection Box)のPythonでのONNX推論サンプルです。 ONNXに変換したモデルも同梱しています。 変換自体を試したい方はByteT

KazuhitoTakahashi 16 Oct 26, 2022
(ICCV'21) Official PyTorch implementation of Relational Embedding for Few-Shot Classification

Relational Embedding for Few-Shot Classification (ICCV 2021) Dahyun Kang, Heeseung Kwon, Juhong Min, Minsu Cho [paper], [project hompage] We propose t

Dahyun Kang 82 Dec 24, 2022
Official codebase for ICLR oral paper Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling

CLIORA This is the official codebase for ICLR oral paper: Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling. We introduce

Bo Wan 32 Dec 23, 2022
A bare-bones Python library for quality diversity optimization.

pyribs Website Source PyPI Conda CI/CD Docs Docs Status Twitter pyribs.org GitHub docs.pyribs.org A bare-bones Python library for quality diversity op

ICAROS 127 Jan 06, 2023
This repository contains the code for the paper "PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization"

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization News: [2020/05/04] Added EGL rendering option for training data g

Shunsuke Saito 1.5k Jan 03, 2023