SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

Overview

The SpeechBrain Toolkit

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing and many others.

SpeechBrain is currently in beta.

News: the call for new sponsors (2022) is open. Take a look here if you are interested!

| Discourse | Tutorials | Website | Documentation | Contributing | HuggingFace |

Key features

SpeechBrain provides various useful tools to speed up and facilitate research on speech technologies:

  • Various pretrained models integrated with HuggingFace in our official organization account. These models come with an interface for running inference, which makes them easy to integrate. If a HuggingFace model isn't available, we usually provide at least a Google Drive folder containing all the corresponding experimental results.
  • The Brain class, a fully-customizable tool for managing training and evaluation loops over data. The annoying details of training loops are handled for you while retaining complete flexibility to override any part of the process when needed.
  • A YAML-based hyperparameter specification language that describes all types of hyperparameters, from individual numbers (e.g. learning rate) to complete objects (e.g. custom models). This dramatically simplifies recipe code by distilling basic algorithmic components.
  • Multi-GPU training and inference with PyTorch Data-Parallel or Distributed Data-Parallel.
  • Mixed-precision for faster training.
  • A transparent and entirely customizable data input and output pipeline. SpeechBrain follows the PyTorch data loader and dataset style and lets users customize the I/O pipelines (e.g., adding on-the-fly downsampling, BPE tokenization, sorting, thresholding, ...).
  • Integration of sharded data with WebDataset, optimized for very large datasets on Network File Systems (NFS).
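
The division of labour described for the Brain class (the loop is handled for you, the model-specific parts are overridden) can be sketched in plain Python. This is a toy stand-in under stated assumptions, not SpeechBrain's implementation: `ToyBrain`, `LinearBrain`, and the simplified method signatures are hypothetical, and gradients are computed by hand instead of with PyTorch. The real Brain class works on PyTorch tensors and its methods also receive a stage argument.

```python
# Toy stand-in for the Brain idea; NOT SpeechBrain's implementation.
class ToyBrain:
    """Owns the training loop; subclasses fill in the model-specific parts."""

    def __init__(self, weight=0.0, lr=0.1):
        self.weight = weight  # single trainable parameter
        self.lr = lr          # learning rate

    def compute_forward(self, batch):
        """Map inputs to predictions (overridden by subclasses)."""
        raise NotImplementedError

    def compute_objectives(self, predictions, batch):
        """Return (loss, gradient w.r.t. weight) (overridden by subclasses)."""
        raise NotImplementedError

    def fit(self, epochs, train_set):
        """Generic loop: forward pass, loss, parameter update."""
        history = []
        for _ in range(epochs):
            total = 0.0
            for batch in train_set:
                predictions = self.compute_forward(batch)
                loss, grad = self.compute_objectives(predictions, batch)
                self.weight -= self.lr * grad
                total += loss
            history.append(total / len(train_set))
        return history


class LinearBrain(ToyBrain):
    """Fits y = weight * x with a mean squared error objective."""

    def compute_forward(self, batch):
        xs, _ = batch
        return [self.weight * x for x in xs]

    def compute_objectives(self, predictions, batch):
        xs, ys = batch
        n = len(xs)
        loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / n
        grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / n
        return loss, grad


# Two small batches of (inputs, targets) drawn from y = 2x.
train_set = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])]
brain = LinearBrain()
losses = brain.fit(epochs=50, train_set=train_set)
```

Only the two overridden methods know anything about the model; the loop in `fit` stays generic, which is the property the bullet above describes.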

Speech recognition

SpeechBrain supports state-of-the-art methods for end-to-end speech recognition:

  • Support of wav2vec 2.0 pretrained model with finetuning.
  • State-of-the-art performance, or performance comparable with other existing toolkits, on several ASR benchmarks.
  • Easily customizable neural language models, including RNNLM and TransformerLM. We also provide a few pre-trained models to save you computation (more to come!). We support Hugging Face datasets to facilitate training on large text datasets.
  • Hybrid CTC/Attention end-to-end ASR:
    • Many available encoders: CRDNN (VGG + {LSTM,GRU,LiGRU} + DNN), ResNet, SincNet, vanilla transformers, contextnet-based transformers or conformers. Thanks to the flexibility of SpeechBrain, any fully customized encoder can be connected to the CTC/attention decoder and trained with a few hours of work. The decoder is fully customizable as well: LSTM, GRU, LiGRU, transformer, or your own neural network!
    • Optimised and fast beam search on both CPUs and GPUs.
  • Transducer end-to-end ASR with a custom Numba loss to accelerate the training. Any encoder or decoder can be plugged into the transducer ranging from VGG+RNN+DNN to conformers.
  • Pre-trained ASR models for transcribing an audio file or extracting features for a downstream task.

Feature extraction and augmentation

SpeechBrain provides efficient and GPU-friendly speech augmentation pipelines and acoustic feature extraction:

  • On-the-fly and fully-differentiable acoustic feature extraction: filter banks can be learned. This simplifies the training pipeline (you don't have to dump features on disk).
  • On-the-fly feature normalization (global, sentence, batch, or speaker level).
  • On-the-fly environmental corruptions based on noise, reverberation, and babble for robust model training.
  • On-the-fly frequency and time domain SpecAugment.
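
As an illustration of the sentence-level case of feature normalization, the sketch below normalizes each feature dimension over a single utterance using only the standard library. The function name `normalize_sentence` and the toy feature values are assumptions made for this example; SpeechBrain's own normalization operates on batched PyTorch tensors.

```python
from statistics import fmean, pstdev


def normalize_sentence(features, eps=1e-10):
    """Mean/variance-normalize each feature dimension over one utterance.

    features: list of frames, each frame a list of floats.
    """
    dims = list(zip(*features))          # transpose: one sequence per dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]     # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


# Three frames with two feature dimensions on very different scales.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = normalize_sentence(utterance)
```

After normalization both dimensions have (approximately) zero mean and unit variance within the utterance, regardless of their original scale.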

Speaker recognition, identification and diarization

SpeechBrain provides different models for speaker recognition, identification, and diarization on different datasets:

  • State-of-the-art performance on speaker recognition and diarization based on ECAPA-TDNN models.
  • Original Xvectors implementation (inspired by Kaldi) with PLDA.
  • Spectral clustering for speaker diarization (combined with speaker embeddings).
  • Libraries to extract speaker embeddings with a pre-trained model on your data.
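
Once speaker embeddings have been extracted, a simple way to compare two recordings is cosine similarity against a threshold. The sketch below is a minimal illustration under stated assumptions: the three-dimensional embeddings are made up, the threshold of 0.5 is arbitrary, and the helper names are hypothetical. Real systems use much larger embeddings and tune the threshold on held-out trials, or use a PLDA backend as mentioned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the embeddings are similar enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Made-up embeddings, for illustration only.
enrolled = [0.9, 0.1, 0.4]
test_same = [0.8, 0.2, 0.5]
test_other = [-0.3, 0.9, -0.2]

accept = same_speaker(enrolled, test_same)
reject = same_speaker(enrolled, test_other)
```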

Speech Translation

  • Recipes for transformer and conformer-based end-to-end speech translation.
  • Possibility to choose between normal training (Attention), multi-objective training (CTC+Attention), and multitask training (ST + ASR).

Speech enhancement and separation

  • Recipes for spectral masking, spectral mapping, and time-domain speech enhancement.
  • Multiple sophisticated enhancement losses, including differentiable STOI loss, MetricGAN, and mimic loss.
  • State-of-the-art performance on speech separation with Conv-TasNet, DualPath RNN, and SepFormer.

Multi-microphone processing

Combining multiple microphones is a powerful approach to achieve robustness in adverse acoustic environments:

  • Delay-and-sum, MVDR, and GEV beamforming.
  • Speaker localization.
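
To make the simplest of these beamformers concrete, here is a minimal time-domain delay-and-sum sketch. It assumes the per-microphone delays are already known and are whole numbers of samples; the function name and the toy signals are hypothetical. Practical implementations estimate the delays from the signals and usually handle fractional delays.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays: per-channel delay in samples (how late the source arrives).
    """
    n = len(channels[0])
    output = []
    for t in range(n):
        acc = 0.0
        for signal, d in zip(channels, delays):
            idx = t + d  # read ahead to undo the arrival delay
            if 0 <= idx < n:
                acc += signal[idx]
        output.append(acc / len(channels))
    return output


source = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic0 = source                  # source arrives with no delay
mic1 = [0.0] + source[:-1]     # same source, one sample later
enhanced = delay_and_sum([mic0, mic1], delays=[0, 1])
```

Because both channels are aligned before averaging, the source adds coherently, while uncorrelated noise on each microphone would be attenuated by the averaging.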

Performance

The recipes released with SpeechBrain implement speech processing systems with competitive or state-of-the-art performance. Below, we report the best performance achieved on some popular benchmarks:

| Dataset | Task | System | Performance |
| --- | --- | --- | --- |
| LibriSpeech | Speech Recognition | CNN + Transformer | WER=2.46% (test-clean) |
| TIMIT | Speech Recognition | CRDNN + distillation | PER=13.1% (test) |
| TIMIT | Speech Recognition | wav2vec2 + CTC/Att. | PER=8.04% (test) |
| CommonVoice (English) | Speech Recognition | wav2vec2 + CTC | WER=15.69% (test) |
| CommonVoice (French) | Speech Recognition | wav2vec2 + CTC | WER=9.96% (test) |
| CommonVoice (Italian) | Speech Recognition | wav2vec2 + seq2seq | WER=9.86% (test) |
| CommonVoice (Kinyarwanda) | Speech Recognition | wav2vec2 + seq2seq | WER=18.91% (test) |
| AISHELL (Mandarin) | Speech Recognition | wav2vec2 + seq2seq | CER=5.58% (test) |
| Fisher-Callhome (Spanish) | Speech Translation | conformer (ST + ASR) | BLEU=48.04 (test) |
| VoxCeleb2 | Speaker Verification | ECAPA-TDNN | EER=0.69% (vox1-test) |
| AMI | Speaker Diarization | ECAPA-TDNN | DER=3.01% (eval) |
| VoiceBank | Speech Enhancement | MetricGAN+ | PESQ=3.08 (test) |
| WSJ2MIX | Speech Separation | SepFormer | SDRi=22.6 dB (test) |
| WSJ3MIX | Speech Separation | SepFormer | SDRi=20.0 dB (test) |
| WHAM! | Speech Separation | SepFormer | SDRi=16.4 dB (test) |
| WHAMR! | Speech Separation | SepFormer | SDRi=14.0 dB (test) |
| Libri2Mix | Speech Separation | SepFormer | SDRi=20.6 dB (test-clean) |
| Libri3Mix | Speech Separation | SepFormer | SDRi=18.7 dB (test-clean) |
| LibriParty | Voice Activity Detection | CRDNN | F-score=0.9477 (test) |
| IEMOCAP | Emotion Recognition | wav2vec | Accuracy=79.8% (test) |
| CommonLanguage | Language Recognition | ECAPA-TDNN | Accuracy=84.9% (test) |
| Timers and Such | Spoken Language Understanding | CRDNN | Sentence Accuracy=89.2% (test) |

For more details, take a look at the corresponding implementation in recipes/dataset/.
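
Several entries in the table are word error rates (WER). As a reminder of what that metric measures, the sketch below computes it as the word-level edit distance divided by the number of reference words. It is a minimal illustration, not the scoring code used by the recipes; the function name and the example sentences are made up, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat mat")
wer_insertions = word_error_rate("a", "a a a")
```

Because insertions count as errors but do not lengthen the reference, WER can exceed 100% when a hypothesis contains many extra words.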

Pretrained Models

Beyond providing recipes for training the models from scratch, SpeechBrain shares several pre-trained models (coupled with easy-inference functions) on HuggingFace. In the following, we report some of them:

| Task | Dataset | Model |
| --- | --- | --- |
| Speech Recognition | LibriSpeech | CNN + Transformer |
| Speech Recognition | LibriSpeech | CRDNN |
| Speech Recognition | CommonVoice (English) | wav2vec + CTC |
| Speech Recognition | CommonVoice (French) | wav2vec + CTC |
| Speech Recognition | CommonVoice (Italian) | wav2vec + CTC |
| Speech Recognition | CommonVoice (Kinyarwanda) | wav2vec + CTC |
| Speech Recognition | AISHELL (Mandarin) | wav2vec + CTC |
| Speaker Recognition | Voxceleb | ECAPA-TDNN |
| Speech Separation | WHAMR! | SepFormer |
| Speech Enhancement | Voicebank | MetricGAN+ |
| Spoken Language Understanding | Timers and Such | CRDNN |
| Language Identification | CommonLanguage | ECAPA-TDNN |

Documentation & Tutorials

SpeechBrain is designed to speed up research and development of speech technologies. Hence, our code is backed up with three different levels of documentation:

  • Low-level: during the review of pull requests, we pay close attention to the quality of code comments. Hence, any complex functionality or long pipeline is supported with helpful comments that make the code easier to customize.
  • Functional-level: all classes in SpeechBrain contain a detailed docstring that describes the input and output formats, the different arguments, the usage of the function, any associated bibliography, and an example that is also run as a test during pull requests. These examples can also be used to experiment with a class or function and understand exactly what it does.
  • Educational-level: we provide various Google Colab (i.e. interactive) tutorials describing all the building-blocks of SpeechBrain ranging from the core of the toolkit to a specific model designed for a particular task. The number of available tutorials is expected to increase over time.

Under development

We are currently working towards integrating DNN-HMM for speech recognition and machine translation.

Quick installation

SpeechBrain is constantly evolving. New features, tutorials, and documentation will appear over time. SpeechBrain can be installed via PyPI to rapidly use the standard library. Moreover, a local installation can be used by those who want to run experiments and modify/customize the toolkit. SpeechBrain supports both CPU and GPU computations. For most recipes, however, a GPU is necessary during training. Please note that CUDA must be properly installed to use GPUs.

Install via PyPI

Once you have created your Python environment (Python 3.8+) you can simply type:

pip install speechbrain

Then you can access SpeechBrain with:

import speechbrain as sb

Install with GitHub

Once you have created your Python environment (Python 3.8+) you can simply type:

git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Then you can access SpeechBrain with:

import speechbrain as sb

Any modification made to the speechbrain package will be picked up automatically, because we installed it with the --editable flag.

Test Installation

Please run the following commands to make sure your installation is working:

pytest tests
pytest --doctest-modules speechbrain
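
Before running the full test suite, it can be useful to confirm that the package is visible to the active Python environment at all. The snippet below is a small standard-library check; the helper name `is_installed` is an assumption made for this example and is not part of SpeechBrain.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found."""
    return importlib.util.find_spec(package_name) is not None


print("speechbrain importable:", is_installed("speechbrain"))
```

If this prints False, the environment used for the installation is probably not the one that is currently active.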

Running an experiment

In SpeechBrain, you can run experiments in this way:

> cd recipes/<dataset>/<task>/
> python experiment.py params.yaml

The results will be saved in the output_folder specified in the yaml file. The folder is created by calling sb.core.create_experiment_directory() in experiment.py. Both detailed logs and experiment outputs are saved there. Furthermore, less verbose logs are output to stdout.

SpeechBrain Roadmap

As a community-based and open-source project, SpeechBrain needs the help of its community to grow in the right direction. Opening the roadmap to our users enables the toolkit to benefit from new ideas, new research axes, or even new technologies. The roadmap, available on our Discourse, lists all the changes and updates that need to be made in the current version of SpeechBrain. Users are more than welcome to propose new items via new Discourse topics!

Learning SpeechBrain

Instead of a long and boring README, we prefer to provide you with different resources that can be used to learn how to customize SpeechBrain to adapt it to your needs:

  • General information can be found on the website.
  • We offer many tutorials. You can start with the basic ones, which cover SpeechBrain's basic functionalities and building blocks. We also provide more advanced tutorials (e.g., SpeechBrain advanced, signal processing, ...). You can browse them via the Tutorials drop-down menu in the upper right of the SpeechBrain website.
  • Details on the SpeechBrain API, how to contribute, and the code are given in the documentation.

License

SpeechBrain is released under the Apache License, version 2.0. The Apache license is a popular BSD-like license. SpeechBrain can be redistributed for free, even for commercial purposes, although you cannot remove the license headers (and under some circumstances, you may have to distribute a license document). Apache is not a viral license like the GPL, which forces you to release your modifications to the source code. Also note that this project has no connection to the Apache Foundation, other than that we use the same license terms.

Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
Comments
  • Add Transducer recipe

    Hello @mravanelli , @TParcollet , @jjery2243542 ,

    This is a work in progress transducer recipe, the following tasks are addressed:

    • [x] add transducer joint module
    • [x] REMOVED:add seq2seq bool in Brain class to handle the [x,y] input for the compute_forward function
    • [x] add embedding for the Prediction Network
    • [x] add greedy decoding
    • [x] Transducer minimal recipe
    • [x] add Transducer seq2seq recipe for TIMIT
    • [x] add comments to explain the greedy search over the transducer
    • [x] Add transducer recipe for Librispeech
    • [x] Find the good architecture with 14 % wer
    enhancement refactor ready to review 
    opened by aheba 73
  • use sentencepiece lib from google

    Add BPE tokenizer:

    • [x] add the BPE training
    • [x] use the BPE trained model for the token generation for Librispeech recipe
    • [x] Design the way of adding the BPE on the params (yaml file)
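
    For readers unfamiliar with BPE training, the core step is to repeatedly merge the most frequent pair of adjacent symbols. The sketch below shows a single merge on a made-up corpus. It is a toy illustration of the general idea, not the sentencepiece implementation this PR relies on, and the function names and corpus are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Made-up corpus: each word is split into characters, with a frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
```

    Repeating these two steps a fixed number of times yields the merge table; the number of merges controls the final vocabulary size.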
    enhancement ready to review 
    opened by aheba 52
  • Switchboard Recipe

    Hey everybody,

    I made a recipe for the Switchboard corpus. The data preparation steps mostly follow Kaldi's s5c recipe.

    The recipe includes the following models:

    ASR

    • CTC: Wav2Vec2 Encoder + CTC Decoder (adapted from the Commonvoice recipes)
    • seq2seq: CRDNN encoder + GRU Decoder + Attention (adapted from the LibriSpeech recipe)
      • Note: Unlike the Librispeech recipe, this system does not include any LM. In fact, every LM I tried (pretrained, finetuned or trained from scratch) seemed to make the performance much worse
    • transformer: Transformer model + LM (adapted from the LibriSpeech recipe)

    LM

    • There are two hparams files for finetuning existing LibriSpeech LMs on Switchboard and Fisher data, one for an RNNLM and the other for a Transformer LM

    Tokenizer

    • Basic Sentencepiece Tokenizer training on Switchboard and Fisher data

    Performance

    The model performance is as follows:

    | Model | Swbd WER | Callhome WER | Eval2000 WER |
    |:---:|:---:|:---:|:---:|
    | CTC | 21.35 | 28.32 | 24.91 |
    | seq2seq | 25.37 | 36.87 | 29.33 |
    | Transformer (LibriSpeech LM) | 22.00 | 30.12 | 26.14 |
    | Transformer (Finetuned LM) | 21.11 | 29.43 | 25.36 |

    As you can see, the performance is currently comparable to Kaldi's chain systems without i-vectors. However, the models need some refinement to be on par with the best Kaldi systems available (WER should be around 18 on the full eval2000 test set).

    If you have any suggestions for improvements, I'd be happy to implement them.

    I can also provide the trained models in case you are interested (I might need some help with this whole Huggingface thing though).

    Best, Dominik

    ps Thanks for all the great work you've done here! :)

    enhancement 
    opened by dwgnr 50
  • handle the use of multigpu_{count,backend}

    Hey @pplantinga, @mravanelli, here is a PR fixing issue #395. As discussed, multigpu_{count, backend} are not used in our ddp.py; currently, they are used in the hyperparams file only with data_parallel. This PR handles the use of multigpu_{count, backend} in DDP.py. If the user sets these params on the command line, the values in the yaml file are ignored.

    help wanted work in progress ready to review 
    opened by aheba 50
  • add noise and reverberance version for BinauralWSJ0Mix

    Hi there, I have created a noisy and reverberant version of the BinauralWSJ0Mix datasets and trained on it with the convtasnet-parallel structure. The recipes are included here and do not conflict with the clean version of the datasets. Also, I have trained convtasnet-parallel.yaml again and got better results, which I can share with you via Google Drive. Thanks.

    opened by huangzj421 43
  • Aishell1Mix

    This branch adds a new task named Aishell1Mix to the recipes, which is similar to LibriMix but applied to the Mandarin AISHELL-1 dataset. I hope to receive your reply. Many thanks.

    enhancement 
    opened by huangzj421 42
  • training on voxceleb1+2 is very slow?

    Dear all, I noticed that training on VoxCeleb1+2 takes up to 25 hours for a single epoch, and even with DDP on 4 GPUs the training time does not decrease at all. I guess the CPU is the bottleneck? Has anyone seen the same behavior? Thank you.

    7%|████████▎                                        | 16569/241547 [1:45:07<25:09:56,  2.48it/s, train_loss=13
    
    question 
    opened by dragen1860 35
  • Insertion problem when decoding with pre-trained ASR model.

    Thanks for the clear example in the folder templates/speech_recognition/ASR/ for training an ASR model on the mini-librispeech dataset. However, when I used the librispeech-pretrained model (ASR model, language model and tokenizer) to decode some waveforms from the librispeech test dataset, the decoding result repeats some of the words many times and causes severe insertion errors. Below are several examples:

    1221-135766-0014, %WER 2436.36 [ 268 / 11, 268 ins, 0 del, 0 sub ]
    PEARL ; SAW ; AND ; GAZED ; INTENTLY ; BUT ; NEVER ; SOUGHT ; TO ; MAKE ; ACQUAINTANCE ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; 
<eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps>

    PEARL ; SAW ; AND ; GAZED ; INTENTLY ; BUT ; NEVER ; SOUGHT ; TO ; MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; 
GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED
    
    121-123859-0001, %WER 869.81 [ 461 / 53, 454 ins, 0 del, 7 sub ]
    O  ; TIS ; THE ; FIRST  ; TIS ; FLATTERY ; IN ; MY ; SEEING ; AND ; MY ; GREAT ; MIND ; MOST ; KINGLY ; DRINKS ; IT ; UP ; MINE ; EYE ; WELL ; KNOWS ; WHAT ; WITH ; HIS ; GUST ; IS ; GREEING ; AND ; TO ; HIS ; PALATE ; DOTH ; PREPARE ; THE ; CUP ; IF ; IT ; BE ; POISON'D ; TIS ; THE ; LESSER ; SIN ; THAT ; MINE ; EYE ; LOVES ; IT ; AND ; DOTH ; <eps>  ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; FIRST ; BEGIN ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> 
; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>

    OH ; TIS ; THE ; THIRST ; TIS ; FLATTERY ; IN ; MY ; SEEING ; AND ; MY ; GREAT ; MIND ; MOST ; KEENLY ; DRINKS ; IT ; UP ; MINE ; EYE ; WELL ; KNOWS ; WHAT ; WITH ; HIS ; GUST ; IS ;  GREEN  ; AND ; TO ; HIS ; PALATE ; DOTH ; PREPARE ; THE ; CUP ; IF ; IT ; BE ; POISONED ; TIS ; THE ; LESSER ; SIN ; THAT ; MINE ;  I  ;  LOVE ; IT ; AND ; DOTH ; THIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  
;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE
    
    1284-134647-0001, %WER 707.41 [ 191 / 27, 191 ins, 0 del, 0 sub ]
    THE ; EDICT ; OF ; MILAN ; THE ; GREAT ; CHARTER ; OF ; TOLERATION ; HAD ; CONFIRMED ; TO ; EACH ; INDIVIDUAL ; OF ; THE ; ROMAN ; WORLD ; THE ; PRIVILEGE ; OF ; CHOOSING ; AND ; PROFESSING ; HIS ; OWN ; RELIGION ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps>


    

    The dataset I tested on is part of the LibriSpeech test-clean dataset (reader IDs beginning with 1, 2, and 3; 1074 files in total), and the average WER on this dataset is 20.3%. Below are the hparams I used for the search:

     test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
        embedding: !ref <embedding>
        decoder: !ref <decoder>
        linear: !ref <seq_lin>
        ctc_linear: !ref <ctc_lin>
        language_model: !ref <lm_model>
        bos_index: 0
        eos_index: 0
        blank_index: 0
        min_decode_ratio: 0.0
        max_decode_ratio: 1.0
        beam_size: 80
        eos_threshold: 1.5
        using_max_attn_shift: true
        max_attn_shift: 240
        coverage_penalty: 1.5
        lm_weight: 0.5
        ctc_weight: 0.0
        temperature: 1.25
        temperature_lm: 1.25
    

    I also found that if I change the testing batch_size from 8 to 1, the WER can be reduced from 20.3% to 2.8%, which I believe should be the normal result. I am thus wondering whether the padding might be the main reason for this problem.
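
    For reference, the %WER figure in the log line above follows directly from its counts: the number of insertions, deletions, and substitutions divided by the number of reference words. The short sketch below only redoes that arithmetic; the function name `word_error_rate` is made up for illustration and is not a SpeechBrain API.

```python
def word_error_rate(insertions, deletions, substitutions, ref_words):
    """Return WER in percent: total edit errors divided by reference length."""
    if ref_words <= 0:
        raise ValueError("reference must contain at least one word")
    errors = insertions + deletions + substitutions
    return 100.0 * errors / ref_words


# Counts copied from the log line: 191 ins, 0 del, 0 sub, 27 reference words.
wer = word_error_rate(insertions=191, deletions=0, substitutions=0, ref_words=27)
print(f"%WER {wer:.2f}")
```

    Because every error here is an insertion, the WER exceeds 100%: the hypothesis is far longer than the 27-word reference.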

    opened by Kuray107 31
  • LM decoder and training for TIMIT

    Modifications:

    1. Add length normalization for beam search.
    2. Rename length penalty to length rewarding (beam search).
    3. Integrate LM in the decoder.
    4. Add recipe for LM and ASR with LM decoding.
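
    As a rough illustration of item 1, the sketch below shows one common form of length normalization: ranking hypotheses by their average per-token log-probability instead of the raw sum. This is an assumption about what "length normalization" means here, not the code from this pull request, and `length_normalized_score` is a hypothetical helper.

```python
def length_normalized_score(log_probs):
    """Average per-token log-probability of one hypothesis."""
    if not log_probs:
        raise ValueError("hypothesis must contain at least one token")
    return sum(log_probs) / len(log_probs)


# Two toy hypotheses given as per-token log-probabilities.
short_hyp = [-0.9, -0.9]              # 2 tokens, total -1.8
long_hyp = [-0.5, -0.5, -0.5, -0.5]   # 4 tokens, total -2.0

# Ranking by the raw sum favours the shorter hypothesis.
best_raw = max([short_hyp, long_hyp], key=sum)

# Ranking by the per-token average favours the longer, more confident one.
best_norm = max([short_hyp, long_hyp], key=length_normalized_score)
```

    Without normalization, beam search tends to prefer short outputs because each extra token adds a negative term to the score.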
    work in progress ready to review 
    opened by jjery2243542 31
  • Can't train a model with multi NVIDIA RTX 3090 GPUs.

    OS: Ubuntu 20.04
    Python: tested with both 3.7 and 3.8
    SpeechBrain: tested with 0.5.8 and 0.5.9
    PyTorch: 1.7.0 for SpeechBrain 0.5.8 and 1.9.0 for SpeechBrain 0.5.9, both compiled with CUDA 11.1
    Recipe: speechbrain/recipes/LibriSpeech/ASR/transformer

    command: python train.py hparams/transformer.yaml --data_folder xxx --data_parallel_backend

    My server has eight RTX 3090 GPUs, but nvidia-smi showed only one process running on a single GPU while the other seven stayed idle. How can I fix this problem? Thank you.

    opened by Xinghui-Wu 28
  • MultiGPU + Librispeech

    Adding Multi-GPU training to the Librispeech recipe.

    1. Change the logging to info on the libri preparation. Without that, the user has NO feedback on what is happening, and it's actually weird.
    2. Add multi GPU with data parallel to experiment.py
    3. Add a multigpu param to the yaml file

    To do: [x] Test the recipe on 1-2 GPU [x] Test that the checkpointing doesn't break due to DataParallel when going from one to two and two to one
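
    One general reason checkpoints can break when switching between one and several GPUs is that `torch.nn.DataParallel` keeps the wrapped model under `.module`, so saved parameter names gain a `module.` prefix. Whether this pull request handles it this way is not stated above; the sketch below is only a plain-Python illustration with a hypothetical helper, and lists stand in for tensors.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DataParallel prefix removed from keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy state dict as it would look when saved from a DataParallel-wrapped model.
saved_with_data_parallel = {
    "module.encoder.weight": [0.1, 0.2],
    "module.encoder.bias": [0.0],
}

# Keys now match what an unwrapped, single-GPU model expects.
single_gpu_state = strip_module_prefix(saved_with_data_parallel)
print(sorted(single_gpu_state))
```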

    enhancement ready to review 
    opened by TParcollet 27
  • [WIP] Streamable Voice Activity Detection

    Integrate streamable Voice Activity Detection with script to run on laptop via ffmpeg.

    Missing:

    • [ ] choose model, train and deploy on HF Hub;
    • [ ] test VAD_stream and perform last consistency checks;
    • [ ] update README.md
    opened by fpaissan 0
  • [Bug]: Training hifigan on ljspeech results in FileNotFoundError for train.json

    Describe the bug

    When I start the hifigan training on ljspeech I get the error FileNotFoundError: [Errno 2] No such file or directory: './results/hifi_gan/1234/save/train.json'

    I looked for the train.json and could not find it. I guess it should be created by the ljspeech_prepare.py script but it is not.

    Expected behaviour

    I expect the train.json to be created automatically when I start the training.

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    No response

    Additional context

    No response

    bug 
    opened by padmalcom 4
  • [Bug]: Exporting Tacotron2 into onnx file

    Describe the bug

    Hello,

    I am trying to export Tacotron2 to an ONNX file. Following the PyTorch documentation, I chose to use the script() function. Unfortunately, this does not work and raises an error.

    I am working with Python 3.9.15 using conda environment.

    Please, can you tell if I am doing something wrong or if some calculations are not compatible with onnx exporting?

    Best regards, Mathias.

    Expected behaviour

    I am expecting the generation of an onnx file when I am using torch.onnx.export.

    To Reproduce

    import torch
    from speechbrain.pretrained import Tacotron2
    
    tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
    scriptModule = torch.jit.script(tacotron2)
    torch.onnx.export(scriptModule, ["hello"], "tacotron2.onnx", verbose=True)
    

    Versions

    huggingface-hub==0.11.1 numpy==1.24.0 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 PyYAML==6.0 scipy==1.9.3 speechbrain==0.5.13 torch==1.13.1 torchaudio==0.13.1

    Relevant log output

    Traceback (most recent call last):
      File "/home/mquillot/TTS_experiment/sb_experiment.py", line 26, in <module>
        scriptModule = torch.jit.script(tacotron2)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_script.py", line 1286, in script
        return torch.jit._recursive.create_script_module(
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 476, in create_script_module
        return create_script_module_impl(nn_module, concrete_type, stubs_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 538, in create_script_module_impl
        script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_script.py", line 615, in _construct
        init_fn(script_module)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 516, in init_fn
        scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 538, in create_script_module_impl
        script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_script.py", line 615, in _construct
        init_fn(script_module)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 516, in init_fn
        scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 538, in create_script_module_impl
        script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_script.py", line 615, in _construct
        init_fn(script_module)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 516, in init_fn
        scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
        create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
      File "/home/mquillot/speechbrain/lib/python3.9/site-packages/torch/jit/_recursive.py", line 393, in create_methods_and_properties_from_stubs
        concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
    RuntimeError: Unsupported value kind: Tensor
    

    Additional context

    No response

    bug 
    opened by mquillot 1
  • [Bug]: Implementation of CumulativeLayerNorm

    Describe the bug

    The implementation of CumulativeLayerNorm seems to be channel (or time-frame)-wise normalization instead of accumulating the information on past frames.

    Expected behaviour

    "ChannelwiseLayerNorm (cLN)", as in ESPnet, might be a more accurate name.
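
    To make the distinction concrete, the sketch below contrasts the two behaviours on a toy input of shape [time][channel]. Neither function is the SpeechBrain or ESPnet implementation, and learnable gain and bias are omitted. `normalize_per_frame` computes statistics over the channels of each frame separately, while `normalize_cumulative` is one possible reading of "cumulative", using statistics gathered over all frames seen so far.

```python
import math

EPS = 1e-8  # small constant to avoid division by zero


def normalize_per_frame(frames):
    """Normalize each frame with the mean/variance of its own channels."""
    out = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        var = sum((x - mean) ** 2 for x in frame) / len(frame)
        out.append([(x - mean) / math.sqrt(var + EPS) for x in frame])
    return out


def normalize_cumulative(frames):
    """Normalize each frame with statistics accumulated over frames 0..t."""
    out = []
    count, total, total_sq = 0, 0.0, 0.0
    for frame in frames:
        count += len(frame)
        total += sum(frame)
        total_sq += sum(x * x for x in frame)
        mean = total / count
        var = max(total_sq / count - mean ** 2, 0.0)
        out.append([(x - mean) / math.sqrt(var + EPS) for x in frame])
    return out


frames = [[1.0, 3.0], [10.0, 30.0]]  # 2 time steps, 2 channels
per_frame = normalize_per_frame(frames)
cumulative = normalize_cumulative(frames)
```

    With per-frame statistics both frames normalize to roughly [-1, 1], so the change in scale between frames is lost. With accumulated statistics the second frame is normalized against the history, and its values differ from the first frame's.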

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    No response

    Additional context

    No response

    bug 
    opened by YoshikiMas 0
  • add whisper normalization on training

    Hi :D

    In the present Whisper fine-tuning implementation, we train on raw text (no normalisation) and then test and validate using Whisper normalisation. This is a small adjustment to fine-tune the model on Whisper-normalised text, since encode only performs tokenisation, not normalisation.

    
    from transformers.models.whisper.tokenization_whisper import WhisperTokenizer
    
    test = "hello i have fifty two dollars"
    
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
    print(tokenizer._normalize(test))
    print(tokenizer.decode(tokenizer.encode(test)))
    print(tokenizer.decode(tokenizer.encode(tokenizer._normalize(test))))
    
    #print outputs
    hello i have $52
    <|startoftranscript|><|notimestamps|>hello i have fifty two dollars<|endoftext|>
    <|startoftranscript|><|notimestamps|>hello i have $52<|endoftext|>
    
    opened by Moumeneb1 3
Releases(v0.5.13)
  • v0.5.13(Aug 29, 2022)

    This is a minor release with better dependency version specification. We note that SpeechBrain is compatible with PyTorch 1.12, and the updated package reflects this. See the issue linked next to each commit for more details about the corresponding changes.

    Commit summary

    • [edb7714]: Adding no_sync and on_fit_batch_end method to core (Rudolf Arseni Braun) #1449
    • [07155e9]: G2P fixes (flexthink) #1473
    • [6602dab]: fix for #1469, minimal testing for profiling (anautsch) #1476
    • [abbfab9]: test clean-ups: passes linters; doctests; unit & integration tests; load-yaml on cpu (anautsch) #1487
    • [1a16b41]: fix ddp incorrect command (=) #1498
    • [0b0ec9d]: using no_sync() in fit_batch() of core.py (Rudolf Arseni Braun) #1449
    • [5c9b833]: Remove torch maximum compatible version (Peter Plantinga) #1504
    • [d0f4352]: remove limit for HF hub as it does not work with colab (Titouan) #1508
    • [b78f6f8]: Add revision to hub (Titouan) #1510
    • [2c491a4]: fix transducer loss inputs devices (Adel Moumen) #1511
    • [4972f76]: missing space in install command (pehonnet) #1512
    • [6bc72af]: Fixing shuffle argument for distributed sampler in core.py (Rudolf Arseni Braun) #1518
    • [df7acd9]: Added the link for example results (cem) #1523
    • [5bae6df]: add LinearWarmupScheduler (Ge Li) #1537
    • [2edd7ee]: updating scipy version in requirements.txt. (Nauman Dawalatabad) #1546
  • v0.5.12(Jun 26, 2022)

    Release Notes - SpeechBrain v0.5.12

    We worked very hard and we are very happy to announce the new version of SpeechBrain!

    SpeechBrain 0.5.12 significantly expands the toolkit without introducing any major interface changes. I would like to warmly thank the many contributors that made this possible.

    The main changes are the following:

    A) Text-to-Speech: We developed the first TTS system of SpeechBrain. You can find it here. The system relies on Tacotron2 + HiFiGAN (as vocoder). The models coupled with an easy-inference interface are available on HuggingFace.

    B) Grapheme-to-Phoneme (G2P): We developed an advanced Grapheme-to-Phoneme. You can find the code here. The current version significantly outperforms our previous model.

    C) Speech Separation:

    1. We developed a novel version of the SepFormer called Resource-Efficient SepFormer (RE-Sepformer). The code is available here and the pre-trained model (with an easy inference interface) here.
    2. We released a recipe for Binaural speech separation with WSJMix. See the code here.
    3. We released a new recipe with the AIShell mix dataset. You can see the code here.

    D) Speech Enhancement:

    1. We released the SepFormer model for speech enhancement. The code is here, while the pre-trained model (with easy-inference interface) is here.
    2. We implemented the WideResNet for speech enhancement and use it to mimic loss-based speech enhancement. The code is here and the pretrained model (with easy-inference interface) is here.

    E) Feature Front-ends:

    1. We now support LEAF filter banks. The code is here. You can find an example of a recipe using it here.
    2. We now support SincConv multichannel (see code here).

    F) Recipe Refactors:

    1. We refactored the VoxCeleb recipe and fixed the normalization issues. See the new code here. We also made the EER computation method less memory demanding (see here).
    2. We refactored the IEMOCAP recipe for emotion recognition. See the new code here.

    G) Models for African Languages: We now have recipes for the DVoice dataset. We currently support Darija, Swahili, Wolof, Fongbe, and Amharic. The code is available here. The pretrained model (coupled with an easy-inference interface) can be found on SpeechBrain-HuggingFace.

    H) Profiler: We implemented a model profiler that helps users while developing new models with SpeechBrain. The profiler outputs a bunch of potentially useful information, such as the real-time factors and many other details. A tutorial is available here.

    I) Tests: We significantly improved the tests. In particular, we introduced the following tests: HF_repo tests, docstring checks, yaml-script consistency, recipe tests, and URL checks. This will help us scale up the project.

    L) Other improvements:

    1. We now support the torchaudio RNNT loss*.
    2. We improved the relative attention mechanism of the Conformer.
    3. We updated the transformer for LibriSpeech. This improves the WER from 2.46% to 2.26% on test-clean. See the code here.
    4. The Environmental corruption module can now support different sampling rates.
    5. Minor fixes.
  • v0.5.11(Dec 20, 2021)

    Dear users, We worked very hard, and we are very happy to announce the new version of SpeechBrain. SpeechBrain 0.5.11 further expands the toolkit without introducing any major interface change.

    The main changes are the following:

    1. We implemented new recipes, such as:
    1. Support for Dynamic batching with a Tutorial to help users familiarize themselves with it.

    2. Support for wav2vec training within SpeechBrain.

    3. Developed an interface with Orion for hyperparameter tuning with a Tutorial to help users familiarize themselves with it.

    4. The torchaudio transducer loss is now supported. We also kept our numba implementation to help users customize the transducer loss part if needed.

    5. Improved CTC-Segmentation

    6. Fixed minor bugs and issues (e.g., fixed MVDR beamformer ).

    Let me thank all the amazing contributors for this achievement. Please add a star to our project if you appreciate our effort for the community. Together, we are growing very fast, and we have big plans for the future.

    Stay Tuned!

  • 0.5.10(Sep 11, 2021)

    This version mainly expands the functionalities of SpeechBrain without adding any backward incompatibilities.

    New Recipes:

    • Language Identification with CommonLanguage
    • EEG signal processing with ERPCore
    • Speech translation with Fisher-Call Home
    • Emotion Recognition with IEMOCAP
    • Voice Activity Detection with LibriParty
    • ASR with LibriSpeech wav2vec (WER=1.9 on test-clean)
    • SpeechEnhancement with CoopNet
    • SpeechEnhancement with SEGAN
    • Speech Separation with LibriMix, WHAM, and WHAMR
    • Support for guided attention
    • Spoken Language Understanding with SLURP

    Beyond that, we fixed some minor bugs and issues.

  • v0.5.9(Jun 17, 2021)

    The main differences from the previous version are the following:

    • Added Wham/whamr/librimix for speech separation
    • Compatibility with PyTorch 1.9
    • Fixed minor bugs
    • Added SpeechBrain paper
  • v0.5.8(Jun 6, 2021)

    SpeechBrain 0.5.8 improves the previous version in the following way:

    • Added wav2vec support in TIMIT, CommonVoice, AISHELL-1
    • Improved Fluent Speech Command Recipe
    • Improved SLU recipes
    • Recipe for UrbanSound8k
    • Fix small bugs
    • Fix typos
  • 0.5.7(Apr 29, 2021)

    SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains. The current version (v0.5.7) supports:

    • E2E Speech Recognition
    • Speaker Recognition (Identification and Verification)
    • Spoken Language Understanding (e.g., Intent recognition)
    • Speaker Diarization
    • Speech Enhancement
    • Speech Separation
    • Multi-microphone signal processing (beamforming, localization)

    Many other tasks will be supported soon. Take a look at our roadmap on Discourse. Your contribution is welcome! Please star our project to help us grow.

    For more info and tutorials: https://speechbrain.github.io/
