kapre: Keras Audio Preprocessors

Overview

Kapre

Keras Audio Preprocessors - compute STFT, ISTFT, Melspectrogram, and others on GPU real-time.

Tested on Python 3.6 and 3.7

Why Kapre?

vs. Pre-computation

  • You can optimize DSP parameters
  • Your model deployment becomes much simpler and consistent.
  • Your code and model has less dependencies

vs. Your own implementation

  • Quick and easy!
  • Consistent with 1D/2D tensorflow batch shapes
  • Data format agnostic (channels_first and channels_last)
  • Less error prone - Kapre layers are tested against Librosa (stft, decibel, etc) - which is (trust me) trickier than you think.
  • Kapre layers have some extended APIs from the default tf.signals implementation such as..
    • A perfectly invertible STFT and InverseSTFT pair
    • Mel-spectrogram with more options
  • Reproducibility - Kapre is available on pip with versioning

Workflow with Kapre

  1. Preprocess your audio dataset. Resample the audio to the right sampling rate and store the audio signals (waveforms).
  2. In your ML model, add Kapre layer e.g. kapre.time_frequency.STFT() as the first layer of the model.
  3. The data loader simply loads audio signals and feed them into the model
  4. In your hyperparameter search, include DSP parameters like n_fft to boost the performance.
  5. When deploying the final model, all you need to remember is the sampling rate of the signal. No dependency or preprocessing!

Installation

pip install kapre

API Documentation

Please refer to Kapre API Documentation at https://kapre.readthedocs.io

One-shot example

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax
from kapre import STFT, Magnitude, MagnitudeToDecibel
from kapre.composed import get_melspectrogram_layer, get_log_frequency_spectrogram_layer

# 6 channels (!), maybe 1-sec audio signal, for an example.
input_shape = (44100, 6)
sr = 44100
model = Sequential()
# A STFT layer
model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,
               window_name=None, pad_end=False,
               input_data_format='channels_last', output_data_format='channels_last',
               input_shape=input_shape))
model.add(Magnitude())
model.add(MagnitudeToDecibel())  # these three layers can be replaced with get_stft_magnitude_layer()
# Alternatively, you may want to use a melspectrogram layer
# melgram_layer = get_melspectrogram_layer()
# or log-frequency layer
# log_stft_layer = get_log_frequency_spectrogram_layer() 

# add more layers as you want
model.add(Conv2D(32, (3, 3), strides=(2, 2)))
model.add(BatchNormalization())
model.add(ReLU())
model.add(GlobalAveragePooling2D())
model.add(Dense(10))
model.add(Softmax())

# Compile the model
model.compile('adam', 'categorical_crossentropy') # if single-label classification

# train it with raw audio sample inputs
# for example, you may have functions that load your data as below.
x = load_x() # e.g., x.shape = (10000, 6, 44100)
y = load_y() # e.g., y.shape = (10000, 10) if it's 10-class classification
# then..
model.fit(x, y)
# Done!

Citation

Please cite this paper if you use Kapre for your work.

@inproceedings{choi2017kapre,
  title={Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras},
  author={Choi, Keunwoo and Joo, Deokjin and Kim, Juho},
  booktitle={Machine Learning for Music Discovery Workshop at 34th International Conference on Machine Learning},
  year={2017},
  organization={ICML}
}
Comments
  • Migrated functions to tf.keras

    Migrated functions to tf.keras

    This PR addresses #52 by removing the dependency on keras and switching to tensorflow.keras

    Proposed version is 0.1.6 because of pull request #56

    In particular, #56 keeps the dependency on keras with from keras.utils.conv_utils import conv_output_length

    opened by douglas125 27
  • Spectrogram?

    Spectrogram?

    I have an older version of Kapre that has time_frequency.Spectrogram, which is trainable.

    However, the new version of Kapre doesn't have Spectrogram anymore. Why?

    opened by turian 8
  • Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False',after I set trainable_fb=False,trainable_kernel=False,but seems like it doesnot work.It is still trainable.

    opened by zhh6 8
  • htk=true for mel frequencies

    htk=true for mel frequencies

    We noticed the current implenetation of the mel_frequencies function (based on Librosa) doesn't include the htk=True option, which is handy when training CNNs because then the frequency scale is fully logarithmic which, in principle, makes more sense for frequency invariant convolutional filters.

    What was the motivation for removing this? Any chance it can be added?

    opened by justinsalamon 8
  • Added parallel STFT implementation

    Added parallel STFT implementation

    Hi!

    As I comented in #98, I added a parallel STFT implementation based on the map_fn function following the indications of @zaccharieramzi here.

    I've added a use_parallel_stft param (disabled by default) that allows to use this function. I've put this param in as many functions as I can. I also added test cases for every function I can, including an specific test that checks that the result of the tf.signal.stft is equals to the result of the parallel_stft function.

    Hope this could serve us well meanwhile tensorflow resolves its issues with the fft implementation.

    opened by JPery 7
  • Amplitude-to-decibel conversion produces different results on different batches

    Amplitude-to-decibel conversion produces different results on different batches

    Related to #16, I found another issue that contributes to different prediction results depending on batch size (and the batches themselves). In particular, it occurs when using converting spectrograms to decibels.

    https://github.com/keunwoochoi/kapre/blob/master/kapre/backend_keras.py#L17

    The maximum is taken over the entire tensor, instead of per example in the batch. This results in different normalization when the examples in a batch are different.

    bug 
    opened by auroracramer 7
  • Inverse Spectrogram and Mel-Spectrogram Layer?

    Inverse Spectrogram and Mel-Spectrogram Layer?

    Namaste!

    kapre has become an integral part of all my audio Deep Learning experiments. Powerful! Thanks for providing such a great software!

    I was thinking... I guess it would make sense to have layers for inverse spectrogram and inverse mel-spectrogram. Thinking about Autoencoders, this would be even more powerful. I know that reconstructing samples from spectrograms is not the best, but it is possible to a certain degree.

    What do you think about that feature request?

    Best, Tristan

    opened by AI-Guru 7
  • Hey! The input is too short!

    Hey! The input is too short!

    Hi,

    I'm encountering an assertion problem when calling your code with a Tensorflow backend.

    input_shape = (44100,1)

    Could this be a be a problem with "channels_first" / "channels_last"?

    Best, Alex

    opened by slychief 6
  • Pip?

    Pip?

    It seems you were on pip, but are no longer. Is there anything I could do to help get kapre back on there? We want to use this library in a commercial application, and for our process pip packages are easier to support than a git repository.

    opened by ff-rfeather 6
  • trainable_stft error

    trainable_stft error

    Following your example but missing layer definition trainable_stft or something, can you provide example with error resolution?

    `# 6 channels (!), maybe 1-sec audio signal
    input_shape = (6, 44100) 
    sr = 44100
    model = Sequential()
    model.add(Melspectrogram(n_dft=512, n_hop=256, input_shape=src_shape,
                             border_mode='same', sr=sr, n_mels=128,
                             fmin=0.0, fmax=sr/2, power=1.0,
                             return_decibel=False, trainable_fb=False,
                             trainable_kernel=False
                             name='trainable_stft'))`
    
      File "<ipython-input-24-cea5588ddf1e>", line 13
        name='trainable_stft'))
           ^
    SyntaxError: invalid syntax
    
    opened by sildeag 6
  • `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    Use Case

    I want to convert a STFT layer in my model to a STFTTflite to deploy it to my mobile device. In the documentation I found that another dimension is added to account for complex numbers. But I also encountered a behaviour that is not documented.

    Expected Behaviour

    input_shape = (2048, 1)  # mono signal
    
    model = keras.models.Sequential()  # TFLite incompatible model
    model.add(kapre.STFT(n_fft=1024, hop_length=512, input_shape=input_shape))
    
    tflite_model = keras.models.Sequential()  # TFLite compatible model
    tflite_model.add(kapre.STFTTflite(n_fft=1024, hop_length=512, input_shape=input_shape))
    

    model has the output shape (None, 3, 513, 1). Therefore, tflite_model should have the output shape (None, 3, 513, 1, 2).

    Observed Behaviour

    The output shape of tflite_model is (1, 3, 513, 1, 2) instead of (None, 3, 513, 1, 2).

    Problem Solution

    • If this behaviour is unwanted:
      • Change the model output format so that the batch dimension is correctly shaped.
    • Otherwise:
      • Explain in the documentation why the batch dimension is shaped to 1.
      • Explain in the documentation how to include this layer into models which expect the batch dimension to be shaped None.
    opened by PhilippMatthes 5
  • Problem incorporating SpecAugument in the training process

    Problem incorporating SpecAugument in the training process

    Hi,

    I'm trying to add a SpecAug layer in the training process of a CNN using the code below:

    
    CLIP_DURATION = 5 
    SAMPLING_RATE = 41000
    NUM_CHANNELS = 1
    
    INPUT_SHAPE = ((CLIP_DURATION * SAMPLING_RATE), NUM_CHANNELS)
    
    melgram = get_melspectrogram_layer(input_shape = INPUT_SHAPE,
                              n_fft = 2048,
                              hop_length = 512,
                              return_decibel=True,
                              n_mels = 40,
                              mel_f_min = 500,
                              mel_f_max = 15000,
                              input_data_format='channels_last',
                              output_data_format='channels_last')
    
    spec_augment = SpecAugment(freq_mask_param=5,
                              time_mask_param=10,
                              n_freq_masks=2,
                              n_time_masks=3,
                              mask_value=-100)   
    
    model = Sequential()
    model.add(melgram)
    model.add(spec_augment)
    

    The CNN summary looks like this:

    Model: "sequential_2"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     melspectrogram (Sequential)  (None, 397, 40, 1)       0         
                                                                     
     spec_augment_1 (SpecAugment  (None, 397, 40, 1)       0         
     )                                                               
                                                                     
    =================================================================
    Total params: 0
    Trainable params: 0
    Non-trainable params: 0
    _________________________________________________________________
    

    Compiling and fitting the model

    model.compile(loss = 'sparse_categorical_crossentropy', optimizer='adam', metrics = 'accuracy')
    
    early_stop = EarlyStopping(monitor='loss', patience=5)
    
    reduce_LR = ReduceLROnPlateau(monitor="val_loss",factor=0.1,patience=4)
    
    checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
    
    model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    

    Then I get the following error:

    Epoch 1/50
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-35-e58a056ab523>](https://localhost:8080/#) in <module>
          7 checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
          8 
    ----> 9 model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    
    6 frames
    [/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py](https://localhost:8080/#) in tf___apply_masks_to_axis(self, x, axis, mask_param, n_masks)
         78                 try:
         79                     do_return = True
    ---> 80                     retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
         81                 except:
         82                     do_return = False
    
    TypeError: in user code:
    
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1051, in train_function  *
            return step_function(self, iterator)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1040, in step_function  **
            outputs = model.distribute_strategy.run(run_step, args=(data,))
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1030, in run_step  **
            outputs = model.train_step(data)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 889, in train_step
            y_pred = self(x, training=True)
        File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
            raise e.with_traceback(filtered_tb) from None
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 63, in tf__call
            ag__.if_stmt((ag__.ld(training) in (None, False)), if_body_2, else_body_2, get_state_2, set_state_2, ('do_return', 'retval_'), 2)
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 58, in else_body_2
            retval_ = ag__.converted_call(ag__.ld(tf).map_fn, (), dict(elems=ag__.ld(x), fn=ag__.ld(self)._apply_spec_augment, dtype=ag__.ld(tf).float32, fn_output_signature=ag__.ld(tf).float32), fscope)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 44, in tf___apply_spec_augment
            ag__.if_stmt((ag__.ld(self).n_time_masks >= 1), if_body_1, else_body_1, get_state_1, set_state_1, ('x',), 1)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 39, in if_body_1
            x = ag__.converted_call(ag__.ld(self)._apply_masks_to_axis, (ag__.ld(x),), dict(axis=ag__.ld(time_axis), mask_param=ag__.ld(self).time_mask_param, n_masks=ag__.ld(self).n_time_masks), fscope)
        File "/tmp/__autograph_generated_file3vip8w4x.py", line 80, in tf___apply_masks_to_axis
            retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
    
        TypeError: Exception encountered when calling layer "spec_augment_1" (type SpecAugment).
        
        in user code:
        
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 299, in call  *
                elems=x, fn=self._apply_spec_augment, dtype=tf.float32, fn_output_signature=tf.float32
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 273, in _apply_spec_augment  *
                x = self._apply_masks_to_axis(
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 254, in _apply_masks_to_axis  *
                return tf.where(mask, self.mask_value, x)
        
            TypeError: Input 'e' of 'SelectV2' Op has type float32 that does not match type int32 of argument 't'.
        
        
        Call arguments received by layer "spec_augment_1" (type SpecAugment):
          • x=tf.Tensor(shape=(None, 397, 40, 1), dtype=float32)
          • training=True
          • kwargs=<class 'inspect._empty'>
    

    The shape of X_train is

    (2182, 205000, 1)
    

    I'm using Tensorflow 2.9.2, and Python 3.7.15

    When I remove the SpecAug layer everything runs fine. I've tested using only the melspec + a mobile net at the end and it runs smooth. The problem is apparently related to SpecAug layer.

    Do you have any idea what could be going wrong here? I appreciate any guidance related to the problem. Best regards.

    opened by nnbuainain 2
  • Full-integer quantization and kapre layers

    Full-integer quantization and kapre layers

    I am training a model which includes the mel-spectrogram block from get_melspectrogram_layer() right after the input layer. Training goes well, and I am able to change the specific mel-spec-layers to their TFLite-counterparts (STFTTflite, MagnitudeTflite) afterwards. I have checked also that the model performs as well as before.

    The model also perfoms as expected when converting the model to .tflite using dynamic range quantization. However, when using full-integer quantization, the model loses its accuracy (see (https://www.tensorflow.org/lite/performance/post_training_quantization#integer_only).

    I suppose the mel-spec starts to significantly differ as in full-integer quantization, the input values are projected to new range (int8). Is there any way to make it work with full-integer quantization?

    I guess I need to separate the mel-spec-layer from the model as a preprocessing step in order to succeed with full-integer quantization, i.e., apply the input quantization to the output values of mel-spec layer. But then I would have to deploy two models to the edge device, where the input goes first to the mel-spec-block and then to the rest of the model (?).

    I am using TensorFlow 2.7.0 and kapre 0.3.7.

    Here is my code for testing the tflite-model:

    preds = []
    # Test and evaluate the TFLite-converted model on unseen test data
    for i, sample in enumerate(X_test_full_scaled):
        X = sample
        
        if input_details['dtype'] == np.int8:
            input_scale, input_zero_point = input_details["quantization"]
            X = sample / input_scale + input_zero_point
        
        X = X.reshape((1, 8000, 1)).astype(input_details["dtype"])
        
        interpreter.set_tensor(input_index, X)
        interpreter.invoke()
        pred = interpreter.get_tensor(output_index)
        
        output_scale, output_zero_point = output_details['quantization']
        if output_details['dtype'] == np.int8:
            pred = pred.astype(np.float32)
            pred = (pred - output_zero_point) * output_scale
        
        pred = np.argmax(pred, axis=1)[0]
        preds.append(pred)
    
    preds = np.array(preds)
    
    opened by eppane 3
  • Calling Magnitude() and Phase() simultaneously

    Calling Magnitude() and Phase() simultaneously

    Hi,

    I am looking to call Magnitude() and Phase() simultaneously for the same STFT input and concatenate the magnitude and phase before feeding into the convolution layers in my CNN sequential Keras model.

    Is this possible?

    Best,

    Yang

    opened by HsuanYang-Wang 1
  • about kapre.utiils

    about kapre.utiils

    Hi, when i used "from kapre.utils import Normalization2D", I met this error which said No module named 'kapre.utils'. I see your package, and found that there is surely no utils.py. I am wondering how to slove it.

    Best wishes, Daisy

    opened by YiningWang2 1
  • Function missing in updated version

    Function missing in updated version

    I noticed there is a functon "kapre.utils.Normalization2D" in the old version, while I cannot find it in the updated version. Why? Is there have any alternative functions?

    opened by v3551G 1
  • trainable DSP parameters

    trainable DSP parameters

    hello contributers and community.

    I love your repo! It's eases so much for me! Although having the precomputation in the model is already great I'd like to know how you can optimize DSP parameters. It looks like that this is a feature from old versions (e.g. 0.2) and by default I dont see any trainable params in this layer.

    Could you please state if this is still available and how to use it?

    happy hacking Paul

    opened by bytosaur 2
Releases(Kapre-0.3.7)
Owner
Keunwoo Choi
MIR, machine learning, music recommendation.
Keunwoo Choi
Generating a structured library of .wav samples with Python.

sample-library Scripts for generating a structured sample library with Python Requires Docker about Samples are written to wave files in lib/. Differe

Ben Mangold 1 Nov 11, 2021
A useful tool to generate chord progressions according to melody MIDIs

Auto chord generator, pure python package that generate chord progressions according to given melodies

Billy Yi 53 Dec 30, 2022
F.R.I.D.A.Y. ----- Female Replacement Intelligent Digital Assistant Youth

F.R.I.D.A.Y. Female Replacement Intelligent Digital Assistant Youth--Jarvis-- the virtual assistant made by python Overview This is a virtual assistan

JIB - Just Innovative Bro 4 Feb 26, 2022
C++ library for audio and music analysis, description and synthesis, including Python bindings

Essentia Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval released under the Affero GPL license.

Music Technology Group - Universitat Pompeu Fabra 2.3k Jan 03, 2023
Muzic: Music Understanding and Generation with Artificial Intelligence

Muzic is a research project on AI music that empowers music understanding and generation with deep learning and artificial intelligence.

Microsoft 2.6k Dec 30, 2022
A simple music player, powered by Python, utilising various libraries such as Tkinter and Pygame

A simple music player, powered by Python, utilising various libraries such as Tkinter and Pygame

PotentialCoding 2 May 12, 2022
spafe: Simplified Python Audio-Features Extraction

spafe aims to simplify features extractions from mono audio files. The library can extract of the following features: BFCC, LFCC, LPC, LPCC, MFCC, IMFCC, MSRCC, NGCC, PNCC, PSRCC, PLP, RPLP, Frequenc

Ayoub Malek 310 Jan 01, 2023
Gammatone-based spectrograms, using gammatone filterbanks or Fourier transform weightings.

Gammatone Filterbank Toolkit Utilities for analysing sound using perceptual models of human hearing. Jason Heeris, 2013 Summary This is a port of Malc

Jason Heeris 188 Dec 14, 2022
ianZiPu is a way to write notation for Guqin (古琴) music.

PyBetween Wrapper for Between - 비트윈을 위한 파이썬 라이브러리 Legal Disclaimer 오직 교육적 목적으로만 사용할수 있으며, 비트윈은 VCNC의 자산입니다. 악의적 공격에 이용할시 처벌 받을수 있습니다. 사용에 따른 책임은 사용자가

Nancy Yi Liang 8 Nov 25, 2022
A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning.

Audiomentations A Python library for audio data augmentation. Inspired by albumentations. Useful for deep learning. Runs on CPU. Supports mono audio a

Iver Jordal 1.2k Jan 07, 2023
:speech_balloon: SpeechPy - A Library for Speech Processing and Recognition: http://speechpy.readthedocs.io/en/latest/

SpeechPy Official Project Documentation Table of Contents Documentation Which Python versions are supported Citation How to Install? Local Installatio

Amirsina Torfi 870 Dec 27, 2022
Delta TTA(Text To Audio) SoftWare

Text-To-Audio-Windows Delta TTA(Text To Audio) SoftWare Info You Can Use It For Convert Your Text To Audio File You Just Write Your Text And Your End

Delta Inc. 2 Dec 14, 2021
Python module for handling audio metadata

Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey's Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg The

Quod Libet 1.1k Dec 31, 2022
Mousai is a simple application that can identify song like Shazam

Mousai is a simple application that can identify song like Shazam. It saves the artist, album, and title of the identified song in a JSON file.

Dave Patrick 662 Jan 07, 2023
XA Music Player - Telegram Music Bot

XA Music Player Requirements 📝 FFmpeg (Latest) NodeJS nodesource.com (NodeJS 17+) Python (3.10+) PyTgCalls (Lastest) MongoDB (3.12.1) 2nd Telegram Ac

RexAshh 3 Jun 30, 2022
Code for paper 'Audio-Driven Emotional Video Portraits'.

Audio-Driven Emotional Video Portraits [CVPR2021] Xinya Ji, Zhou Hang, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu [Project] [Paper] G

197 Dec 31, 2022
IDing the songs played on the do you radio show

IDing the songs played on the do you radio show

Rasmus Jones 36 Nov 15, 2022
📺Headless全自动B站直播录播、切片、上传一体工具

DDRecorder Headless全自动B站直播录播、切片、上传一体工具 感谢 FortuneDayssss/BilibiliUploader 安装指南(Windows) 在Release下载zip包解压。 修改配置文件config.json 双击运行DDRecorder.exe (这将使用co

322 Dec 27, 2022
A rofi-blocks script that searches youtube and plays the selected audio on mpv.

rofi-ytm A rofi-blocks script that searches youtube and plays the selected audio on mpv. To use the script, run the following command rofi -modi block

Cliford 26 Dec 21, 2022
Audio features extraction

Yaafe Yet Another Audio Feature Extractor Build status Branch master : Branch dev : Anaconda : Install Conda Yaafe can be easily install with conda. T

Yaafe 231 Dec 26, 2022