SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

Overview

The SpeechBrain Toolkit

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing and many others.

SpeechBrain is currently in beta.

Key features

SpeechBrain provides various useful tools to speed up and facilitate research on speech technologies:

  • Various pretrained models integrated with HuggingFace in our official organization account. These models come with an interface that makes it easy to run inference, facilitating integration. If a HuggingFace model isn't available, we usually provide at least a Google Drive folder containing all the corresponding experimental results.
  • The Brain class, a fully-customizable tool for managing training and evaluation loops over data. The annoying details of training loops are handled for you while retaining complete flexibility to override any part of the process when needed.
  • A YAML-based hyperparameter specification language that describes all types of hyperparameters, from individual numbers (e.g. learning rate) to complete objects (e.g. custom models). This dramatically simplifies recipe code by distilling basic algorithmic components.
  • Multi-GPU training and inference with PyTorch Data-Parallel or Distributed Data-Parallel.
  • Mixed-precision for faster training.
  • A transparent and entirely customizable data input and output pipeline. SpeechBrain follows the PyTorch data loader and dataset style and enables users to customize the I/O pipelines (e.g., adding on-the-fly downsampling, BPE tokenization, sorting, threshold ...).
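
To make the Brain idea more concrete, here is a minimal, dependency-free sketch of the pattern: a base class owns the loop, and a subclass overrides only the task-specific steps. `MiniBrain`, `MeanEstimator`, and the `update` hook are hypothetical illustrations written for this example. Only the method names `compute_forward`, `compute_objectives`, and `fit` mirror the Brain class, and their signatures are simplified here.

```python
class MiniBrain:
    """Toy stand-in for a training-loop manager (not the SpeechBrain class)."""

    def compute_forward(self, batch):
        raise NotImplementedError

    def compute_objectives(self, predictions, batch):
        raise NotImplementedError

    def update(self, batch, loss):
        """Hook for the parameter update; a no-op unless overridden."""

    def fit(self, epochs, train_set):
        """Generic loop: forward, loss, update. Returns the mean loss per epoch."""
        history = []
        for _ in range(epochs):
            total = 0.0
            for batch in train_set:
                predictions = self.compute_forward(batch)
                loss = self.compute_objectives(predictions, batch)
                self.update(batch, loss)
                total += loss
            history.append(total / len(train_set))
        return history


class MeanEstimator(MiniBrain):
    """Learns a single scalar that approximates the targets in each batch."""

    def __init__(self, lr=0.5):
        self.w = 0.0
        self.lr = lr

    def compute_forward(self, batch):
        return [self.w for _ in batch]

    def compute_objectives(self, predictions, batch):
        # Mean squared error between the predictions and the targets.
        return sum((p - y) ** 2 for p, y in zip(predictions, batch)) / len(batch)

    def update(self, batch, loss):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (self.w - y) for y in batch) / len(batch)
        self.w -= self.lr * grad


brain = MeanEstimator()
losses = brain.fit(epochs=5, train_set=[[1.0, 3.0], [2.0, 2.0]])
print(brain.w, losses)
```

The subclass never touches the loop itself, which is the property the Brain class is described as providing: the loop details are handled once, and each step remains overridable.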

Speech recognition

SpeechBrain supports state-of-the-art methods for end-to-end speech recognition:

  • Performance that is state-of-the-art or comparable to other existing toolkits on several ASR benchmarks.
  • Easily customizable neural language models, including RNNLM and TransformerLM. We also provide a few pre-trained models to save you computation (more to come!). We support Hugging Face datasets to facilitate training on large text datasets.
  • Hybrid CTC/Attention end-to-end ASR:
    • Many available encoders: CRDNN (VGG + {LSTM,GRU,LiGRU} + DNN), ResNet, SincNet, vanilla transformers, contextnet-based transformers or conformers. Thanks to the flexibility of SpeechBrain, any fully customized encoder can be connected to the CTC/attention decoder and trained with a few hours of work. The decoder is fully customizable as well: LSTM, GRU, LiGRU, transformer, or your own neural network!
    • Optimised and fast beam search on both CPUs and GPUs.
  • Transducer end-to-end ASR with a custom Numba loss to accelerate the training. Any encoder or decoder can be plugged into the transducer ranging from VGG+RNN+DNN to conformers.
  • Pre-trained ASR models for transcribing an audio file or extracting features for a downstream task.

Feature extraction and augmentation

SpeechBrain provides efficient and GPU-friendly speech augmentation pipelines and acoustic feature extraction:

  • On-the-fly and fully-differentiable acoustic feature extraction: filter banks can be learned. This simplifies the training pipeline (you don't have to dump features on disk).
  • On-the-fly feature normalization (global, sentence, batch, or speaker level).
  • On-the-fly environmental corruptions based on noise, reverberation, and babble for robust model training.
  • On-the-fly frequency and time domain SpecAugment.
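
As an illustration of what sentence-level normalization means, the sketch below computes the mean and standard deviation over the frames of a single utterance and normalizes each feature dimension with them. It is a plain-Python sketch under that assumption; the function name `normalize_sentence` and the toy features are made up for this example and are not part of the SpeechBrain API, which operates on tensors.

```python
import math


def normalize_sentence(features, eps=1e-10):
    """Normalize each feature dimension of one utterance to zero mean, unit variance.

    `features` is a list of frames, each frame being a list of floats.
    """
    n_frames = len(features)
    n_dims = len(features[0])
    means = [sum(frame[d] for frame in features) / n_frames for d in range(n_dims)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in features) / n_frames)
        for d in range(n_dims)
    ]
    # eps avoids a division by zero when a dimension is constant.
    return [
        [(frame[d] - means[d]) / (stds[d] + eps) for d in range(n_dims)]
        for frame in features
    ]


utterance = [[1.0, 10.0], [3.0, 30.0]]  # two frames, two feature dimensions
normalized = normalize_sentence(utterance)
print(normalized)
```

Global, batch, or speaker-level normalization would differ only in which frames contribute to the statistics.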

Speaker recognition, identification and diarization

SpeechBrain provides different models for speaker recognition, identification, and diarization on different datasets:

  • State-of-the-art performance on speaker recognition and diarization based on ECAPA-TDNN models.
  • Original Xvectors implementation (inspired by Kaldi) with PLDA.
  • Spectral clustering for speaker diarization (combined with speaker embeddings).
  • Libraries to extract speaker embeddings with a pre-trained model on your data.
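
Once embeddings have been extracted, two utterances can be compared by scoring their embeddings. The sketch below uses cosine similarity, which is one common choice (PLDA, mentioned above, is another). The embedding values and the decision threshold are invented for illustration, and the helper names are hypothetical rather than SpeechBrain functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial when the similarity exceeds a (hypothetical) threshold."""
    return cosine_similarity(emb_a, emb_b) > threshold


# Toy 3-dimensional embeddings; real embeddings have many more dimensions.
enrolled = [0.2, 0.9, 0.1]
test_same = [0.25, 0.85, 0.05]
test_other = [0.9, -0.1, 0.3]

print(same_speaker(enrolled, test_same))
print(same_speaker(enrolled, test_other))
```

In a real verification setup the threshold would be tuned on held-out trials, for instance at the operating point where false accepts and false rejects are equal (the EER).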

Speech enhancement and separation

  • Recipes for spectral masking, spectral mapping, and time-domain speech enhancement.
  • Multiple sophisticated enhancement losses, including differentiable STOI loss, MetricGAN, and mimic loss.
  • State-of-the-art performance on speech separation with Conv-TasNet, DualPath RNN, and SepFormer.

Multi-microphone processing

Combining multiple microphones is a powerful approach to achieve robustness in adverse acoustic environments:

  • Delay-and-sum, MVDR, and GEV beamforming.
  • Speaker localization.
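
The simplest of these beamformers, delay-and-sum, can be sketched in a few lines: each channel is shifted to compensate for its arrival delay, and the aligned channels are averaged. The sketch below assumes the delays are already known, non-negative, and whole numbers of samples; in practice they have to be estimated and are often fractional. The function name and signals are made up for this example.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    `channels` is a list of equal-length sample lists; `delays[i]` is how many
    samples later the source arrives at microphone i than at the reference.
    """
    n_samples = len(channels[0])
    out_len = n_samples - max(delays)
    output = []
    for t in range(out_len):
        # Read each channel at the time the source sample t reached it.
        aligned = [ch[t + d] for ch, d in zip(channels, delays)]
        output.append(sum(aligned) / len(channels))
    return output


source = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic0 = source                    # reference microphone
mic1 = [0.0, 0.0] + source[:-2]  # same signal, arriving 2 samples later
enhanced = delay_and_sum([mic0, mic1], delays=[0, 2])
print(enhanced)
```

With noise added independently to each microphone, the averaging step would keep the aligned source intact while attenuating the uncorrelated noise, which is where the robustness comes from.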

Performance

The recipes released with SpeechBrain implement speech processing systems with competitive or state-of-the-art performance. In the following, we report the best performance achieved on some popular benchmarks:

| Dataset | Task | System | Performance |
|:---:|:---:|:---:|:---:|
| LibriSpeech | Speech Recognition | CNN + Transformer | WER=2.50% (test-clean) |
| TIMIT | Speech Recognition | CRDNN + distillation | PER=13.1% (test) |
| CommonVoice (French) | Speech Recognition | CRDNN | WER=17.7% (test) |
| VoxCeleb2 | Speaker Verification | ECAPA-TDNN | EER=0.69% (vox1-test) |
| AMI | Speaker Diarization | ECAPA-TDNN | DER=2.13% (lapel-mix) |
| VoiceBank | Speech Enhancement | MetricGAN+ | PESQ=3.08 (test) |
| WSJ2MIX | Speech Separation | SepFormer | SDRi=22.6 dB (test) |
| WSJ3MIX | Speech Separation | SepFormer | SDRi=20.0 dB (test) |

For more details, take a look at the corresponding implementation in recipes/dataset/.
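
For readers unfamiliar with the WER metric reported above: it is the number of word insertions, deletions, and substitutions needed to turn the reference transcript into the hypothesis, divided by the number of reference words (PER is the same idea at the phoneme level). The sketch below computes it with a standard edit-distance table. It is an illustrative implementation written for this README, not the scoring code used by the recipes.

```python
def error_counts(reference, hypothesis):
    """Return (insertions, deletions, substitutions) of a minimum edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total_errors, ins, dels, subs) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, j, 0, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no extra cost
                continue
            t, n_ins, n_del, n_sub = table[i - 1][j - 1]
            substitution = (t + 1, n_ins, n_del, n_sub + 1)
            t, n_ins, n_del, n_sub = table[i][j - 1]
            insertion = (t + 1, n_ins + 1, n_del, n_sub)
            t, n_ins, n_del, n_sub = table[i - 1][j]
            deletion = (t + 1, n_ins, n_del + 1, n_sub)
            table[i][j] = min(substitution, insertion, deletion)
    return table[-1][-1][1:]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    n_ins, n_del, n_sub = error_counts(reference, hypothesis)
    return 100.0 * (n_ins + n_del + n_sub) / len(reference.split())


print(error_counts("the cat sat", "the cat sat sat sat"))
print(wer("the cat sat", "the cat sat sat sat"))
```

Because insertions are counted against the reference length, a hypothesis that keeps repeating words can push the WER well above 100%.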

Documentation & Tutorials

SpeechBrain is designed to speed up research and development of speech technologies. Hence, our code is backed by three different levels of documentation:

  • Low-level: during the review of pull requests, we pay close attention to the quality of code comments. Hence, any complex functionality or long pipeline is supported with helpful comments that let users customize the code easily.
  • Functional-level: all classes in SpeechBrain contain a detailed docstring that describes the input and output formats, the different arguments, the usage of the function, the potentially associated bibliography, and a function example that is used for test integration during pull requests. Such examples can also be used to experiment with a class or a function to understand exactly what is happening.
  • Educational-level: we provide various Google Colab (i.e. interactive) tutorials describing all the building blocks of SpeechBrain, ranging from the core of the toolkit to a specific model designed for a particular task. The number of available tutorials is expected to increase over time.

Under development

We are currently working towards integrating DNN-HMM for speech recognition and machine translation.

Quick installation

SpeechBrain is constantly evolving. New features, tutorials, and documentation will appear over time. SpeechBrain can be installed via PyPI to rapidly use the standard library. Moreover, a local installation can be used by those who want to run experiments and modify/customize the toolkit. SpeechBrain supports both CPU and GPU computations. For most of the recipes, however, a GPU is necessary during training. Please note that CUDA must be properly installed to use GPUs.

Install via PyPI

Once you have created your Python environment (Python 3.8+) you can simply type:

pip install speechbrain

Then you can access SpeechBrain with:

import speechbrain as sb

Install with GitHub

Once you have created your Python environment (Python 3.8+) you can simply type:

git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Then you can access SpeechBrain with:

import speechbrain as sb

Any modification made to the speechbrain package will be picked up automatically, because we installed it with the --editable flag.

Test Installation

Please run the following commands to make sure your installation is working:

pytest tests
pytest --doctest-modules speechbrain

Running an experiment

In SpeechBrain, you can run experiments in this way:

> cd recipes///
> python experiment.py params.yaml

The results will be saved in the output_folder specified in the yaml file. The folder is created by calling sb.core.create_experiment_directory() in experiment.py. Both detailed logs and experiment outputs are saved there. Furthermore, less verbose logs are output to stdout.
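
The two-level logging behaviour described above (detailed logs in the output folder, less verbose logs on stdout) can be pictured with Python's standard logging module. The sketch below is only an illustration of that idea under our own assumptions: `setup_experiment`, the folder layout, and the file name `log.txt` are hypothetical, and the code does not call or reproduce sb.core.create_experiment_directory().

```python
import logging
import os
import sys
import tempfile


def setup_experiment(output_folder, log_name="log.txt"):
    """Create the output folder and a logger that writes DEBUG and above to a
    file and INFO and above to stdout. Names used here are hypothetical."""
    os.makedirs(output_folder, exist_ok=True)
    logger = logging.getLogger("experiment_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep messages out of the root logger

    file_handler = logging.FileHandler(os.path.join(output_folder, log_name))
    file_handler.setLevel(logging.DEBUG)     # detailed log
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setLevel(logging.INFO)    # less verbose log

    logger.addHandler(file_handler)
    logger.addHandler(stream_handler)
    return logger


output_folder = os.path.join(tempfile.mkdtemp(), "results", "demo")
logger = setup_experiment(output_folder)
logger.debug("batch 1: loss=2.31")  # written to the file only
logger.info("epoch 1 finished")     # written to the file and to stdout
```

Keeping the per-batch detail in the file and only milestones on stdout makes long runs easier to follow in a terminal while preserving the full record for later inspection.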

Learning SpeechBrain

Instead of a long and boring README, we prefer to provide you with different resources that can be used to learn how to customize SpeechBrain to adapt it to your needs:

  • General information can be found on the website.
  • We offer many tutorials; you can start with the basic ones about SpeechBrain's basic functionalities and building blocks. We also provide more advanced tutorials (e.g., SpeechBrain advanced, signal processing ...). You can browse them via the Tutorials drop-down menu in the upper right of the SpeechBrain website.
  • Details on the SpeechBrain API, how to contribute, and the code are given in the documentation.

License

SpeechBrain is released under the Apache License, version 2.0. The Apache license is a popular BSD-like license. SpeechBrain can be redistributed for free, even for commercial purposes, although you cannot remove the license headers (and under some circumstances, you may have to distribute a license document). Apache is not a viral license like the GPL, which forces you to release your modifications to the source code. Also note that this project has no connection to the Apache Foundation, other than that we use the same license terms.

Comments
  • Add Transducer recipe

    Hello @mravanelli , @TParcollet , @jjery2243542 ,

    This is a work in progress transducer recipe, the following tasks are addressed:

    • [x] add transducer joint module
    • [x] REMOVED:add seq2seq bool in Brain class to handle the [x,y] input for the compute_forward function
    • [x] add embedding for the Prediction Network
    • [x] add greedy decoding
    • [x] Transducer minimal recipe
    • [x] add Transducer seq2seq recipe for TIMIT
    • [x] add comments to explain the greedy search over the transducer
    • [x] Add transducer recipe for Librispeech
    • [x] Find the good architecture with 14 % wer
    enhancement refactor ready to review 
    opened by aheba 73
  • use sentencepiece lib from google

    Add BPE tokenizer:

    • [x] add the BPE training
    • [x] use the BPE trained model for the token generation for Librispeech recipe
    • [x] Design the way of adding the BPE on the params (yaml file)
    enhancement ready to review 
    opened by aheba 52
  • Switchboard Recipe

    Hey everybody,

    I made a recipe for the Switchboard corpus. The data preparation steps mostly follow Kaldi's s5c recipe.

    The recipe includes the following models:

    ASR

    • CTC: Wav2Vec2 Encoder + CTC Decoder (adapted from the Commonvoice recipes)
    • seq2seq: CRDNN encoder + GRU Decoder + Attention (adapted from the LibriSpeech recipe)
      • Note: Unlike the Librispeech recipe, this system does not include any LM. In fact, every LM I tried (pretrained, finetuned or trained from scratch) seemed to make the performance much worse
    • transformer: Transformer model + LM (adapted from the LibriSpeech recipe)

    LM

    • There are two hparams files for finetuning existing LibriSpeech LMs on Switchboard and Fisher data, one for an RNNLM and the other for a Transformer LM

    Tokenizer

    • Basic Sentencepiece Tokenizer training on Switchboard and Fisher data

    Performance

    The model performance is as follows:

    | Model | Swbd WER | Callhome WER | Eval2000 WER |
    |:---:|:---:|:---:|:---:|
    | CTC | 21.35 | 28.32 | 24.91 |
    | seq2seq | 25.37 | 36.87 | 29.33 |
    | Transformer (LibriSpeech LM) | 22.00 | 30.12 | 26.14 |
    | Transformer (Finetuned LM) | 21.11 | 29.43 | 25.36 |

    As you can see, the performance is currently comparable to Kaldi's chain systems without i-vectors. However, they need some refinement to be on par with the best Kaldi systems available (WER should be around 18 on the full eval2000 testset).

    If you have any suggestions for improvements, I'd be happy to implement them.

    I can also provide the trained models in case you are interested (I might need some help with this whole Huggingface thing though).

    Best, Dominik

    ps Thanks for all the great work you've done here! :)

    enhancement 
    opened by dwgnr 50
  • handle the use of multigpu_{count,backend}

    Hey @pplantinga , @mravanelli , here is a PR fixing issue #395. As discussed, multigpu_{count, backend} are not used in our ddp.py; currently, multigpu_{count, backend} are used in the hyperparams file only with data_parallel. This PR handles the use of multigpu_{count, backend} in ddp.py. If the user sets these params on the command line, the params in the yaml file are ignored.

    help wanted work in progress ready to review 
    opened by aheba 50
  • add noise and reverberance version for BinauralWSJ0Mix

    Hi there, I have created a noise and reverberance version of the BinauralWSJ0Mix datasets and trained on it with the convtasnet-parallel structure. Here are the recipes; they do not conflict with the clean version of the datasets. Also, I have trained convtasnet-parallel.yaml again and got better results, which I could share with you via Google Drive. Thanks.

    opened by huangzj421 43
  • Aishell1Mix

    This branch adds a new task named Aishell1Mix to the recipes, which is similar to LibriMix but applied to the Mandarin AISHELL-1 dataset. Hope to receive your reply. Many thanks.

    enhancement 
    opened by huangzj421 42
  • training on voxceleb1+2 is very slow?

    Dear all: I noticed that when training on voxceleb1+2, a single epoch takes up to 25 hours, and even with DDP on 4 GPU cards, the training speed does not improve at all. I guess the CPU is the bottleneck? Has anyone seen the same behavior? Thank you.

    7%|████████▎                                        | 16569/241547 [1:45:07<25:09:56,  2.48it/s, train_loss=13
    
    question 
    opened by dragen1860 35
  • Insertion problem when decoding with pre-trained ASR model.

    Thanks for the clear example in the folder templates/speech_recognition/ASR/ to train an ASR model on the mini-librispeech dataset. However, when I used the librispeech-pretrained model (ASR model, language model and tokenizer) to decode some waveforms in the librispeech test dataset, the decoding result repeats some of the words many times and causes severe insertion errors. Below are several examples:

    1221-135766-0014, %WER 2436.36 [ 268 / 11, 268 ins, 0 del, 0 sub ]
    PEARL ; SAW ; AND ; GAZED ; INTENTLY ; BUT ; NEVER ; SOUGHT ; TO ; MAKE ; ACQUAINTANCE ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; 
<eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ;    <eps>     ; <eps> ; <eps> ; <eps> ; <eps>

    PEARL ; SAW ; AND ; GAZED ; INTENTLY ; BUT ; NEVER ; SOUGHT ; TO ; MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; 
GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED ; INTENTLY ;  BUT  ; NEVER ; SOUGHT ;   TO  ;  MAKE ; ACQUAINTANCE ; PEARL ;  SAW  ;  AND  ; GAZED
    
    121-123859-0001, %WER 869.81 [ 461 / 53, 454 ins, 0 del, 7 sub ]
    O  ; TIS ; THE ; FIRST  ; TIS ; FLATTERY ; IN ; MY ; SEEING ; AND ; MY ; GREAT ; MIND ; MOST ; KINGLY ; DRINKS ; IT ; UP ; MINE ; EYE ; WELL ; KNOWS ; WHAT ; WITH ; HIS ; GUST ; IS ; GREEING ; AND ; TO ; HIS ; PALATE ; DOTH ; PREPARE ; THE ; CUP ; IF ; IT ; BE ; POISON'D ; TIS ; THE ; LESSER ; SIN ; THAT ; MINE ; EYE ; LOVES ; IT ; AND ; DOTH ; <eps>  ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; FIRST ; BEGIN ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> 
; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>  ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; <eps>

    OH ; TIS ; THE ; THIRST ; TIS ; FLATTERY ; IN ; MY ; SEEING ; AND ; MY ; GREAT ; MIND ; MOST ; KEENLY ; DRINKS ; IT ; UP ; MINE ; EYE ; WELL ; KNOWS ; WHAT ; WITH ; HIS ; GUST ; IS ;  GREEN  ; AND ; TO ; HIS ; PALATE ; DOTH ; PREPARE ; THE ; CUP ; IF ; IT ; BE ; POISONED ; TIS ; THE ; LESSER ; SIN ; THAT ; MINE ;  I  ;  LOVE ; IT ; AND ; DOTH ; THIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  
;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGINS ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE ;   IT  ;  AND  ;  DOTH ; FIRST ; BEGIN ;  THAT ;  MINE ;   I   ;  LOVE
    
    1284-134647-0001, %WER 707.41 [ 191 / 27, 191 ins, 0 del, 0 sub ]
    THE ; EDICT ; OF ; MILAN ; THE ; GREAT ; CHARTER ; OF ; TOLERATION ; HAD ; CONFIRMED ; TO ; EACH ; INDIVIDUAL ; OF ; THE ; ROMAN ; WORLD ; THE ; PRIVILEGE ; OF ; CHOOSING ; AND ; PROFESSING ; HIS ; OWN ; RELIGION ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps> ; <eps> ;   <eps>   ; <eps> ; <eps> ;   <eps>    ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ;   <eps>   ; <eps> ;  <eps>   ; <eps> ;   <eps>    ; <eps> ; <eps> ;  <eps>   ; <eps>


    

    The dataset I tested on is part of the librispeech test-clean dataset (reader IDs beginning with 1, 2 and 3; 1074 files in total), and the average WER on this dataset is 20.3%. Below are the hparams I used for searching:

     test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
        embedding: !ref <embedding>
        decoder: !ref <decoder>
        linear: !ref <seq_lin>
        ctc_linear: !ref <ctc_lin>
        language_model: !ref <lm_model>
        bos_index: 0
        eos_index: 0
        blank_index: 0
        min_decode_ratio: 0.0
        max_decode_ratio: 1.0
        beam_size: 80
        eos_threshold: 1.5
        using_max_attn_shift: true
        max_attn_shift: 240
        coverage_penalty: 1.5
        lm_weight: 0.5
        ctc_weight: 0.0
        temperature: 1.25
        temperature_lm: 1.25
    

    I also found that if I change the testing batch_size from 8 to 1, the WER can be reduced from 20.3% to 2.8%, which I believe should be the normal result. I am thus wondering whether the padding might be the main reason for this problem.

    opened by Kuray107 31
  • LM decoder and training for TIMIT

    Modifications:

    1. Add length normalization for beam search.
    2. Rename length penalty to length rewarding (beam search).
    3. Integrate LM in the decoder.
    4. Add recipe for LM and ASR with LM decoding.
    work in progress ready to review 
    opened by jjery2243542 31
  • Can't train a model with multi NVIDIA RTX 3090 GPUs.

    OS: Ubuntu 20.04
    Python: I tested both 3.7 and 3.8
    SpeechBrain: I tested 0.5.8 and 0.5.9
    PyTorch: 1.7.0 for SpeechBrain 0.5.8 and 1.9.0 for SpeechBrain 0.5.9, both compiled on CUDA 11.1
    Recipe: speechbrain/recipes/LibriSpeech/ASR/transformer

    command: python train.py hparams/transformer.yaml --data_folder xxx --data_parallel_backend

    I have 8 3090 GPUs on my server. But when I watched nvidia-smi, there was only one GPU process running on one GPU, the rest of the 7 GPUs were idle. So how can I fix this problem? Thank you.

    opened by Xinghui-Wu 28
  • MultiGPU + Librispeech

    Adding Multi-GPU training to the Librispeech recipe.

    1. Change the logging to info on the libri preparation. Without that, the user has NO feedback on what is happening, and it's actually weird.
    2. Add multi GPU with data parallel to experiment.py
    3. Add a multigpu param to the yaml file

    To do:
    [x] Test the recipe on 1-2 GPUs
    [x] Test that the checkpointing doesn't break due to DataParallel when going from one to two and two to one

    enhancement ready to review 
    opened by TParcollet 27
  • [Bug]: Speaker Classification Inference KeyError

    Issue

    This is similar to the #1049 issue, the only difference being that it was for language identification.

    I trained a model using some audio from CommonVoice. The model completed training, and I am now running inference.

    This is my Inference code:

    from speechbrain.pretrained import EncoderClassifier
    import os
    import sys
    import torch
    import torchaudio

    classifier = EncoderClassifier.from_hparams(source="./content/best_model/", hparams_file='hparams_inference.yaml', savedir="./content/best_model/")

    # Classification
    audio_file = 'data/common_voice_de_27022043.wav'
    signal, fs = torchaudio.load(audio_file)  # test_speaker: 5789
    output_probs, score, index, text_lab = classifier.classify_batch(signal)
    print('Target: 000fc181c938978e23ec7c066dddc246ca2b3160b50e3bfee829c02f5db753b3b8b955ad0f1b3effc954e5b10b474e6e93386f2b7925e7195abd84a164477851, Predicted: ' + text_lab[0])

    # Speaker 2
    audio_file = 'data/common_voice_de_18351596.wav'
    signal, fs = torchaudio.load(audio_file)  # test_speaker: 460
    output_probs, score, index, text_lab = classifier.classify_batch(signal)
    print('Target: 0e156ff8b3bdd99355fd7f99e2259c47bb78e7dcac346a9966181b9e5e265960ddccc5f73b036948d3586a03e8b482b01d908da93026737ac60a9996d88d6881, Predicted: ' + text_lab[0])

    I get the error KeyError: 0

    I have added the pre-trainer link in the YAML. This is what that part looks like:

    modules:
        compute_features: !ref <compute_features>
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        mean_var_norm: !ref <mean_var_norm>

    pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
        loadables:
            embedding_model: !ref <embedding_model>
            classifier: !ref <classifier>
            label_encoder: !ref <label_encoder>
        paths:
            embedding_model: !ref <pretrained_path>/embedding_model.ckpt
            classifier: !ref <pretrained_path>/classifier.ckpt
            label_encoder: !ref <pretrained_path>/label_encoder.txt

    I have attached my label encoder in the comments.

    Could you tell me what I am missing here?

    Expected behaviour

    Successful Inference

    To Reproduce

    Complete YAML file below:

    # pretrain folders:
    pretrained_path: /content/best_model/

    # Model parameters
    n_mels: 23
    sample_rate: 16000
    n_classes: 29 # In this case, we have 28 speakers
    emb_dim: 512 # dimensionality of the embeddings

    # Feature extraction
    compute_features: !new:speechbrain.lobes.features.Fbank
        n_mels: !ref <n_mels>

    # Mean and std normalization of the input features
    mean_var_norm: !new:speechbrain.processing.features.InputNormalization
        norm_type: sentence
        std_norm: False

    embedding_model: !new:custom_model.Xvector
        in_channels: !ref <n_mels>
        activation: !name:torch.nn.LeakyReLU
        tdnn_blocks: 5
        tdnn_channels: [512, 512, 512, 512, 1500]
        tdnn_kernel_sizes: [5, 3, 3, 1, 1]
        tdnn_dilations: [1, 2, 3, 1, 1]
        lin_neurons: !ref <emb_dim>

    classifier: !new:custom_model.Classifier
        input_shape: [null, null, !ref <emb_dim>]
        activation: !name:torch.nn.LeakyReLU
        lin_blocks: 1
        lin_neurons: !ref <emb_dim>
        out_neurons: !ref <n_classes>

    label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder

    modules:
        compute_features: !ref <compute_features>
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        mean_var_norm: !ref <mean_var_norm>

    pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
        loadables:
            embedding_model: !ref <embedding_model>
            classifier: !ref <classifier>
            label_encoder: !ref <label_encoder>
        paths:
            embedding_model: !ref <pretrained_path>/embedding_model.ckpt
            classifier: !ref <pretrained_path>/classifier.ckpt
            label_encoder: !ref <pretrained_path>/label_encoder.txt

    Versions

    No response

    Relevant log output

    No response

    Additional context

    No response

    bug 
    opened by praveenmathew93 1
  • [Bug]: M1 GPU (mps) support

    Describe the bug

    It looks like the Speechbrain library does not support the M1 GPU (mps backend). The error is raised when trying to use the MPS backend on a pre-trained model (at least, this is the case I found, I don't know if it happens also in other situations, but I guess it does) and in particular the error is:

    {ValueError}invalid type: 'torch.mps.FloatTensor'

    The error is caused by this line in dual_path.py (file in the Speechbrain library, line 1066):

            if gap > 0:
                pad = torch.Tensor(torch.zeros(B, N, gap)).type(input.type())
    

    This happens because input.type() returns torch.mps.FloatTensor, which is not a valid Tensor type.

    This problem has already been reported in PyTorch (here: https://github.com/pytorch/pytorch/issues/82296) and looks like it is on its way to being fixed.

    However, it looks like Speechbrain will need to upgrade its PyTorch dependency (from the PyTorch discussion it looks like they're gonna include the fix in Torch 2.0) or find a workaround with the datatype in the meanwhile 🤔
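
One possible workaround is to pass `dtype` and `device` directly to `torch.zeros` instead of going through `input.type()`. The sketch below assumes that the padding tensor only needs to follow the input's dtype and device. The function `pad_to_multiple` is a hypothetical stand-alone simplification that reuses the `gap` formula shown in the traceback; it is not the code in `dual_path.py`, and it is only exercised on a CPU tensor here.

```python
import torch


def pad_to_multiple(x: torch.Tensor, K: int):
    """Right-pad the last dimension of a [B, N, L] tensor by `gap` zeros,
    using the same gap formula as the traceback, without calling x.type()."""
    B, N, L = x.shape
    P = K // 2
    gap = K - (P + L % K) % K
    if gap > 0:
        # dtype and device come from the input, so no type string is involved.
        pad = torch.zeros(B, N, gap, dtype=x.dtype, device=x.device)
        x = torch.cat([x, pad], dim=2)
    return x, gap


x = torch.ones(2, 3, 10)
padded, gap = pad_to_multiple(x, K=4)
print(padded.shape, gap)
```

Because the new tensor is created on the input's device, the same lines should not depend on which backend the model runs on, although this has not been verified on MPS here.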

    Expected behaviour

    Being able to use the MPS backend on an M1 Mac to run Speechbrain models

    To Reproduce

    from speechbrain.pretrained.interfaces import SepformerSeparation
    import torchaudio
    import torch
    
    separator = SepformerSeparation.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir="./pretrained-sepformer-wsj02mix",  run_opts={"device": "mps"})
    
    s1, fs = torchaudio.load('./my_file.wav') # Just insert here any wav file you want
    resampler = torchaudio.transforms.Resample(fs, 8000)
    
    s1 = resampler(s1)
    
    est_sources = separator.separate_batch(s1)
    

    Versions

    0.5.13

    Relevant log output

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In[49], line 1
    ----> 1 est_sources = separator.separate_batch(s1)
    
    File ~/Desktop/personal_git/voice-assistant/.venv/lib/python3.10/site-packages/speechbrain/pretrained/interfaces.py:1976, in SepformerSeparation.separate_batch(self, mix)
       1974 mix = mix.to(self.device)
       1975 mix_w = self.mods.encoder(mix)
    -> 1976 est_mask = self.mods.masknet(mix_w)
       1977 mix_w = torch.stack([mix_w] * self.hparams.num_spks)
       1978 sep_h = mix_w * est_mask
    
    File ~/Desktop/personal_git/voice-assistant/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
       1190 # If we don't have any hooks, we want to skip the rest of the logic in
       1191 # this function, and just call forward.
       1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1193         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1194     return forward_call(*input, **kwargs)
       1195 # Do not call functions when jit is used
       1196 full_backward_hooks, non_full_backward_hooks = [], []
    
    File ~/Desktop/personal_git/voice-assistant/.venv/lib/python3.10/site-packages/speechbrain/lobes/models/dual_path.py:1017, in Dual_Path_Model.forward(self, x)
       1012     x = self.pos_enc(x.transpose(1, -1)).transpose(1, -1) + x * (
       1013         x.size(1) ** 0.5
       1014     )
       1016 # [B, N, K, S]
    -> 1017 x, gap = self._Segmentation(x, self.K)
       1019 # [B, N, K, S]
       1020 for i in range(self.num_layers):
    
    File ~/Desktop/personal_git/voice-assistant/.venv/lib/python3.10/site-packages/speechbrain/lobes/models/dual_path.py:1097, in Dual_Path_Model._Segmentation(self, input, K)
       1095 B, N, L = input.shape
       1096 P = K // 2
    -> 1097 input, gap = self._padding(input, K)
       1098 # [B, N, K, S]
       1099 input1 = input[:, :, :-P].contiguous().view(B, N, -1, K)
    
    File ~/Desktop/personal_git/voice-assistant/.venv/lib/python3.10/site-packages/speechbrain/lobes/models/dual_path.py:1067, in Dual_Path_Model._padding(self, input, K)
       1065 gap = K - (P + L % K) % K
       1066 if gap > 0:
    -> 1067     pad = torch.Tensor(torch.zeros(B, N, gap)).type(input.type())
       1068     input = torch.cat([input, pad], dim=2)
       1070 _pad = torch.Tensor(torch.zeros(B, N, P)).type(input.type())
    
    ValueError: invalid type: 'torch.mps.FloatTensor'
    

    Additional context

    No response

    bug 
    opened by mattiasu96 0
  • [Bug]: SLURP/direct malformed node or string

    Describe the bug

    While testing recipes, one of the SLURP recipes ran into an error.

    ValueError: malformed node or string: <ast.Name object at 0x7ff2cbf7c4f0>

    Expected behaviour

    No error

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    File "speechbrain/recipes/SLURP/direct/train.py", line 365, in <module>
        slu_brain.evaluate(test_set, test_loader_kwargs=hparams["dataloader_opts"])
    ...
      File "speechbrain/recipes/SLURP/direct/train.py", line 128, in compute_objectives
        _dict = ast.literal_eval(
    ...
      File "python3.9/ast.py", line 66, in _raise_malformed_node
        raise ValueError(f'malformed node or string: {node!r}')
    ValueError: malformed node or string: <ast.Name object at 0x7ff2cbf7c4f0>
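
For context, `ast.literal_eval` accepts only Python literals, so a bare identifier anywhere in the string (for example an unquoted value in generated text) raises this `ValueError`. The sketch below reproduces the failure in isolation and shows one defensive pattern. `safe_literal_eval` and the sample strings are hypothetical and are not taken from the recipe.

```python
import ast


def safe_literal_eval(text, default=None):
    """Parse a Python literal, returning `default` if the text is not one."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return default


good = "{'scenario': 'alarm', 'action': 'set'}"
bad = "{'scenario': alarm}"  # `alarm` is a bare name, not a literal

print(safe_literal_eval(good))
print(safe_literal_eval(bad, default={}))
```

Whether a failed parse should count as an empty prediction or be handled differently is a decision for the recipe; the fallback here only illustrates the mechanism.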
    

    Additional context

    No response

    bug 
    opened by anautsch 0
  • [Bug]: LJSpeech & LibriTTS - audio_pipeline error

    Describe the bug

    While testing recipes, _get_spec_norms threw an error. Since I last ran the tests, part of this may already have been fixed for LJSpeech in #1740, but the issue might still be present for LibriTTS.

    Expected behaviour

    No error

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    File "speechbrain/recipes/LibriTTS/vocoder/hifigan/train.py", line 330, in audio_pipeline
        mel = hparams["mel_spectogram"](audio=audio.squeeze(0))
    ...
      File "torchaudio/transforms/_transforms.py", line 108, in forward
        return F.spectrogram(
      File "torchaudio/functional/functional.py", line 114, in spectrogram
        frame_length_norm, window_norm = _get_spec_norms(normalized)
      File "torchaudio/functional/functional.py", line 239, in _get_spec_norms
        raise TypeError("Input type not supported")
    TypeError: Input type not supported
    

    Additional context

    No response

    bug 
    opened by anautsch 0
  • [Bug]: CommonVoice/self-supervised-learning/wav2vec2 - not implemented for NumPy arrays

    Describe the bug

    While testing recipes, an error occurred.

    TypeError: Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.

    Expected behaviour

    Either a clearer restriction of when this recipe can be used, or an adjustment of dependencies (so there is no error).

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    File "speechbrain/recipes/CommonVoice/self-supervised-learning/wav2vec2/train_hf_wav2vec2.py", line 111, in fit_batch
        predictions = self.compute_forward(batch, sb.Stage.TRAIN)
    ...
      File "transformers/models/wav2vec2/modeling_wav2vec2.py", line 285, in _sample_negative_indices
        sampled_negative_indices[batch_idx] += batch_idx * sequence_length
    TypeError: Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
    

    Additional context

    No response

    bug 
    opened by anautsch 0
  • [Bug]: Voicebank/*/*MetricGAN* torch.multinomial error

    Describe the bug

    While testing recipes, an error was thrown for all MetricGAN recipes. This report points to one log only, but the error is the same for all three MetricGAN recipes.

    RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement

    Expected behaviour

    No error

    To Reproduce

    No response

    Versions

    No response

    Relevant log output

    File "speechbrain/recipes/Voicebank/enhance/MetricGAN/train.py", line 371, in train_discriminator
        self.fit(
    ...
      File "torch/utils/data/sampler.py", line 203, in __iter__
        rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
    RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement
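
The message comes from `torch.multinomial` being asked for more samples than there are weights while `replacement` is false. The following sketch reproduces it in isolation and shows one way to avoid it by capping the request. The helper `sample_indices` and its sizes are hypothetical; whether capping or sampling with replacement is the right fix for the MetricGAN recipes is left open.

```python
import torch


def sample_indices(weights: torch.Tensor, num_samples: int, replacement: bool = False):
    """Draw indices from `weights`, capping the request when sampling
    without replacement so the size check in torch.multinomial passes."""
    if not replacement:
        num_samples = min(num_samples, weights.numel())
    return torch.multinomial(weights, num_samples, replacement)


weights = torch.ones(3)

try:
    torch.multinomial(weights, 5, False)  # asks for 5 of only 3 items
except RuntimeError as err:
    print("reproduced:", err)

# Capped request: returns a permutation of the three available indices.
print(sample_indices(weights, 5).tolist())
```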
    

    Additional context

    No response

    bug 
    opened by anautsch 0
Releases(v0.5.13)
  • v0.5.13(Aug 29, 2022)

    This is a minor release with better dependency version specification. We note that SpeechBrain is compatible with PyTorch 1.12, and the updated package reflects this. See the issue linked next to each commit for more details about the corresponding changes.

    Commit summary

    • [edb7714]: Adding no_sync and on_fit_batch_end method to core (Rudolf Arseni Braun) #1449
    • [07155e9]: G2P fixes (flexthink) #1473
    • [6602dab]: fix for #1469, minimal testing for profiling (anautsch) #1476
    • [abbfab9]: test clean-ups: passes linters; doctests; unit & integration tests; load-yaml on cpu (anautsch) #1487
    • [1a16b41]: fix ddp incorrect command (=) #1498
    • [0b0ec9d]: using no_sync() in fit_batch() of core.py (Rudolf Arseni Braun) #1449
    • [5c9b833]: Remove torch maximum compatible version (Peter Plantinga) #1504
    • [d0f4352]: remove limit for HF hub as it does not work with colab (Titouan) #1508
    • [b78f6f8]: Add revision to hub (Titouan) #1510
    • [2c491a4]: fix transducer loss inputs devices (Adel Moumen) #1511
    • [4972f76]: missing space in install command (pehonnet) #1512
    • [6bc72af]: Fixing shuffle argument for distributed sampler in core.py (Rudolf Arseni Braun) #1518
    • [df7acd9]: Added the link for example results (cem) #1523
    • [5bae6df]: add LinearWarmupScheduler (Ge Li) #1537
    • [2edd7ee]: updating scipy version in requirements.txt. (Nauman Dawalatabad) #1546
  • v0.5.12(Jun 26, 2022)

    Release Notes - SpeechBrain v0.5.12

    We worked very hard and we are very happy to announce the new version of SpeechBrain!

    SpeechBrain 0.5.12 significantly expands the toolkit without introducing any major interface changes. I would like to warmly thank the many contributors that made this possible.

    The main changes are the following:

    A) Text-to-Speech: We developed the first TTS system of SpeechBrain. You can find it here. The system relies on Tacotron2 + HiFiGAN (as vocoder). The models coupled with an easy-inference interface are available on HuggingFace.

    B) Grapheme-to-Phoneme (G2P): We developed an advanced Grapheme-to-Phoneme. You can find the code here. The current version significantly outperforms our previous model.

    C) Speech Separation:

    1. We developed a novel version of the SepFormer called Resource-Efficient SepFormer (RE-Sepformer). The code is available here and the pre-trained model (with an easy inference interface) here.
    2. We released a recipe for Binaural speech separation with WSJMix. See the code here.
    3. We released a new recipe with the AIShell mix dataset. You can see the code here.

    D) Speech Enhancement:

    1. We released the SepFormer model for speech enhancement. The code is here, while the pre-trained model (with easy-inference interface) is here.
    2. We implemented the WideResNet for speech enhancement and used it for mimic-loss-based speech enhancement. The code is here and the pretrained model (with easy-inference interface) is here.

    E) Feature Front-ends:

    1. We now support LEAF filter banks. The code is here. You can find an example of a recipe using it here.
    2. We now support SincConv multichannel (see code here).

    F) Recipe Refactors:

    1. We refactored the VoxCeleb recipe and fixed the normalization issues. See the new code here. We also made the EER computation method less memory demanding (see here).
    2. We refactored the IEMOCAP recipe for emotion recognition. See the new code here.

    G) Models for African Languages: We now have recipes for the DVoice dataset. We currently support Darija, Swahili, Wolof, Fongbe, and Amharic. The code is available here. The pretrained model (coupled with an easy-inference interface) can be found on SpeechBrain-HuggingFace.

    H) Profiler: We implemented a model profiler that helps users while developing new models with SpeechBrain. The profiler outputs a bunch of potentially useful information, such as the real-time factors and many other details. A tutorial is available here.

    I) Tests: We significantly improved the tests. In particular, we introduced the following tests: HF_repo tests, docstring checks, yaml-script consistency, recipe tests, and check URLs. This will help us scale up the project.

    L) Other improvements:

    1. We now support the torchaudio RNNT loss*.
    2. We improved the relative attention mechanism of the Conformer.
    3. We updated the transformer for LibriSpeech. This improves the performance from WER=2.46% to 2.26% on test-clean. See the code here.
    4. The Environmental corruption module can now support different sampling rates.
    5. Minor fixes.
  • v0.5.11(Dec 20, 2021)

    Dear users, We worked very hard, and we are very happy to announce the new version of SpeechBrain. SpeechBrain 0.5.11 further expands the toolkit without introducing any major interface change.

    The main changes are the following:

    1. We implemented new recipes, such as:
    2. Support for Dynamic batching with a Tutorial to help users familiarize themselves with it.

    3. Support for wav2vec training within SpeechBrain.

    4. Developed an interface with Orion for hyperparameter tuning with a Tutorial to help users familiarize themselves with it.

    5. The torchaudio transducer loss is now supported. We also kept our numba implementation to help users customize the transducer loss part if needed.

    6. Improved CTC-Segmentation

    7. Fixed minor bugs and issues (e.g., fixed MVDR beamformer).

    Let me thank all the amazing contributors for this achievement. Please add a star to our project if you appreciate our effort for the community. Together, we are growing very fast, and we have big plans for the future.

    Stay Tuned!

  • 0.5.10(Sep 11, 2021)

    This version mainly expands the functionalities of SpeechBrain without adding any backward incompatibilities.

    New Recipes:

    • Language Identification with CommonLanguage
    • EEG signal processing with ERPCore
    • Speech translation with Fisher-Call Home
    • Emotion Recognition with IEMOCAP
    • Voice Activity Detection with LibriParty
    • ASR with LibriSpeech wav2vec (WER=1.9 on test-clean)
    • SpeechEnhancement with CoopNet
    • SpeechEnhancement with SEGAN
    • Speech Separation with LibriMix, WHAM, and WHAMR
    • Support for guided attention
    • Spoken Language Understanding with SLURP

    Beyond that, we fixed some minor bugs and issues.

  • v0.5.9(Jun 17, 2021)

    The main differences from the previous version are the following:

    • Added Wham/whamr/librimix for speech separation
    • Compatibility with PyTorch 1.9
    • Fixed minor bugs
    • Added SpeechBrain paper
  • v0.5.8(Jun 6, 2021)

    SpeechBrain 0.5.8 improves on the previous version in the following ways:

    • Added wav2vec support in TIMIT, CommonVoice, AISHELL-1
    • Improved Fluent Speech Command Recipe
    • Improved SLU recipes
    • Recipe for UrbanSound8k
    • Fix small bugs
    • Fix typos
  • 0.5.7(Apr 29, 2021)

    SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains. The current version (v0.5.7) supports:

    • E2E Speech Recognition
    • Speaker Recognition (Identification and Verification)
    • Spoken Language Understanding (e.g., Intent recognition)
    • Speaker Diarization
    • Speech Enhancement
    • Speech Separation
    • Multi-microphone signal processing (beamforming, localization)

    Many other tasks will be supported soon. Take a look at our roadmap on Discourse. Your contribution is welcome! Please star our project to help us grow.

    For more info and tutorials: https://speechbrain.github.io/
