Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"

Overview

Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"


Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
Hua Zhang, Ruoyun Gou, Jili Shang, Fangyao Shen, Yifan Wu and Guojun Dai

Abstract: Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.

link to paper

Requirements

The project has been tested on a python=3.7 on Ubuntu 20.04 with the following packages:

tensorflow=2.7.0
librosa=0.8.1
scikit-learn=1.0.1

Uses librosa to read files, which needs sndfile.
Use sudo apt-get install libsndfile1 to install sndfile library

Usage

This repository can be used in the following ways:

  1. Using train.py.
    i. Download the RAVDESS dataset (only this dataset is supported as of now) and extract it within the dataset directory. Then run the commands below to move all files into the .dataset/ directory from indivisual sub folders like .dataset/Actor-xx. Run these from within the dataset directory. Make sure to be in the .dataset/ directory before running these comannds from a linux terminal.
    find . -mindepth 2 -type f -print -exec mv {} . \;  
    rm -r Actor_*
    Then 
    
    ii. Run train.py with required options. Use python train.py -h to check all options available. A saved_model will be put in the saved_model directory.
    iii. Use infer.py to run inference on a set of files.
  2. Using SpeechModel.py to get a Keras Model into your code. This model follows specifications mentioned in the paper. You may write your own dataset code.
    Example:
    # Your own dataset architecture
    from SpeechModel import SpeechModel
    SP = SpeechModel
    model = SP.create_model()
    # Rest of model training code
    
  3. Using just the load_wav and get_framed_log_melspectrogram functions from utils.py, you can write your own dataset funcion, as well as your own model. This function returns a (num_frames, 64, 64, 3) shaped array that can be fed to a TimeDistributed network of your choice.

Model Description

The model uses a TimeDistributed layer to feed all segments of a audio file that have been converted into 3 channel images to a pretrained CNN network (in this case, resnet50_v2, trained on imagenet). Following this, we have bi-lstm layers and attention layers. Then, there are Fully Connected Layers with dropout and finally, classification with 8 nodes.


(Image credits: Paper cited below)

Example usage

Training

  • Use python train.py -h to see a list of arguments.
  • python train.py 30 to train the model for 30 epochs

Inference

(Still to be implemented)

Limitations

  1. Currently this repo only supports dataset preparation for the RAVDESS model since different datasets describe their labels differently. You can still use this repo with other datasets by defining your own function to load the datasets and using the get_framed_log_melspectrograms function from utils.py.
    Then you may use SpeechModel.py to create a model based on specifications form the paper and train.
  2. Since I couldn't find a pretrained AlexNet model and didn't have the resources myself to train one from scratch, I used a pretrained ResNet 50 model. This may affect the performance and not match the results given by the authors.

Credits and acknowledgements:

I did this work for a hackathon. This method did not produce the best results for my use case. I suspect this was due to the dataset being very noisy.

Citation

AUTHOR=Zhang Hua, Gou Ruoyun, Shang Jili, Shen Fangyao, Wu Yifan, Dai Guojun
    
TITLE=Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition  
    
JOURNAL=Frontiers in Physiology     
    
VOLUME=12      
    
YEAR=2021
    
PAGES=177   
        
URL=https://www.frontiersin.org/article/10.3389/fphys.2021.643202     
    
DOI=10.3389/fphys.2021.643202    
    
ISSN=1664-042X   

ABSTRACT=Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.
Owner
Ankush Malaker
Result driven, deep learning engineer with a passion to solve problems using computers and deep learning.
Ankush Malaker
DA2Lite is an automated model compression toolkit for PyTorch.

DA2Lite (Deep Architecture to Lite) is a toolkit to compress and accelerate deep network models. ⭐ Star us on GitHub — it helps!! Frameworks & Librari

Sinhan Kang 7 Mar 22, 2022
Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

[TCSVT] Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization LPN [Paper] NEWs Prerequisites Python 3.6 GPU Memory = 8G Numpy 1.

46 Dec 14, 2022
This is a project based on retinaface face detection, including ghostnet and mobilenetv3

English | 简体中文 RetinaFace in PyTorch Chinese detailed blog:https://zhuanlan.zhihu.com/p/379730820 Face recognition with masks is still robust---------

pogg 59 Dec 21, 2022
Implementation for paper LadderNet: Multi-path networks based on U-Net for medical image segmentation

Implementation for paper LadderNet: Multi-path networks based on U-Net for medical image segmentation This implementation is based on orobix implement

Juntang Zhuang 116 Sep 06, 2022
A list of multi-task learning papers and projects.

This page contains a list of papers on multi-task learning for computer vision. Please create a pull request if you wish to add anything. If you are interested, consider reading our recent survey pap

svandenh 297 Dec 17, 2022
Drone-based Joint Density Map Estimation, Localization and Tracking with Space-Time Multi-Scale Attention Network

DroneCrowd Paper Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark. Introduction This paper proposes a space-time multi-scale atte

VisDrone 98 Nov 16, 2022
deep learning model that learns to code with drawing in the Processing language

sketchnet sketchnet - processing code generator can we teach a computer to draw pictures with code. We use Processing and java/jruby code paired with

41 Dec 12, 2022
This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH" submitted to ICASSP 2022

CPC_DeepCluster This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEEC

LEAP Lab 2 Sep 15, 2022
✨风纪委员会自动投票脚本,利用Github Action帮你进行裁决操作(为了让其他风纪委员有案件可判,本程序从中午12点才开始运行,有需要请自己修改运行时间)

风纪委员会自动投票 本脚本通过使用Github Action来实现B站风纪委员的自动投票功能,喜欢请给我点个STAR吧! 如果你不是风纪委员,在符合风纪委员申请条件的情况下,本脚本会自动帮你申请 投票时间是早上八点,如果有需要请自行修改.github/workflows/Judge.yml中的时间,

Pesy Wu 25 Feb 17, 2021
Linear Variational State Space Filters

Linear Variational State Space Filters To set up the environment, use the provided scripts in the docker/ folder to build and run the codebase inside

0 Dec 13, 2021
PyTorch implementation of DARDet: A Dense Anchor-free Rotated Object Detector in Aerial Images

DARDet PyTorch implementation of "DARDet: A Dense Anchor-free Rotated Object Detector in Aerial Images", [pdf]. Highlights: 1. We develop a new dense

41 Oct 23, 2022
Industrial knn-based anomaly detection for images. Visit streamlit link to check out the demo.

Industrial KNN-based Anomaly Detection ⭐ Now has streamlit support! ⭐ Run $ streamlit run streamlit_app.py This repo aims to reproduce the results of

aventau 102 Dec 26, 2022
exponential adaptive pooling for PyTorch

AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling Abstract Pooling layers are essential building blocks of Convolutional Ne

Alexandros Stergiou 55 Jan 04, 2023
Tensorflow2.0 🍎🍊 is delicious, just eat it! 😋😋

How to eat TensorFlow2 in 30 days ? 🔥 🔥 Click here for Chinese Version(中文版) 《10天吃掉那只pyspark》 🚀 github项目地址: https://github.com/lyhue1991/eat_pyspark

lyhue1991 9.7k Jan 01, 2023
Plugin adapted from Ultralytics to bring YOLOv5 into Napari

napari-yolov5 Plugin adapted from Ultralytics to bring YOLOv5 into Napari. Training and detection can be done using the GUI. Training dataset must be

2 May 05, 2022
Graph Self-Attention Network for Learning Spatial-Temporal Interaction Representation in Autonomous Driving

GSAN Introduction Code for paper GSAN: Graph Self-Attention Network for Learning Spatial-Temporal Interaction Representation in Autonomous Driving, wh

YE Luyao 6 Oct 27, 2022
"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

SOLQ: Segmenting Objects by Learning Queries This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

MEGVII Research 179 Jan 02, 2023
Physical Anomalous Trajectory or Motion (PHANTOM) Dataset

Physical Anomalous Trajectory or Motion (PHANTOM) Dataset Description This dataset contains the six different classes as described in our paper[]. The

0 Dec 16, 2021
Using fully convolutional networks for semantic segmentation with caffe for the cityscapes dataset

Using fully convolutional networks for semantic segmentation (Shelhamer et al.) with caffe for the cityscapes dataset How to get started Download the

Simon Guist 27 Jun 06, 2022