A CROSS-MODAL FUSION NETWORK BASED ON SELF-ATTENTION AND RESIDUAL STRUCTURE FOR MULTIMODAL EMOTION RECOGNITION

Last update: Sep 26, 2022

CFN-SR

Audio-video multimodal emotion recognition has attracted considerable attention because of its robust performance. Most existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy into the features of the different modalities without fully exploiting the complementary information between them, and they do not guarantee that the original semantic information is preserved during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. First, we perform representation learning for the video and audio modalities, obtaining their semantic features with an efficient ResNeXt and a 1D CNN, respectively. Second, we feed the features of the two modalities into the cross-modal blocks separately, where the self-attention mechanism and the residual structure keep the fused information complementary and complete. Finally, we obtain the emotion prediction by concatenating the fused representation with the original representation. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that CFN-SR achieves state-of-the-art performance, obtaining 75.76% accuracy with 26.30M parameters.
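The two ingredients named above, self-attention and a residual connection, can be illustrated with a minimal pure-Python sketch. This is generic scaled dot-product attention with the input added back to the output; it is not the CFN-SR cross-modal block itself. The function names, the toy feature values, and the absence of learned projection matrices are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def attention_with_residual(target, source):
    """Attend from `target` to `source`, then add `target` back (residual)."""
    attended = attention(target, source, source)
    return [[t + a for t, a in zip(t_vec, a_vec)]
            for t_vec, a_vec in zip(target, attended)]

# Toy features (made up): 2 video time steps, each of width 2.
video = [[1.0, 0.0], [0.0, 1.0]]

# Self-attention with a residual connection: target and source are the same.
fused = attention_with_residual(video, video)
```

Because the input is added back after attention, the original features still contribute directly to the output, which is the intuition behind using a residual structure to avoid losing the original semantic information. Passing a different sequence as `source` turns the same function into cross-attention between two modalities.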

Setup

Install dependencies

pip install opencv-python moviepy librosa scikit-learn

Download the RAVDESS dataset using the bash script

bash scripts/download_ravdess.sh <path/to/RAVDESS>

Or download the files manually and arrange them in the folder structure below, with the .csv files in landmarks/ (do not modify file names)

RAVDESS/
    landmarks/
        .csv landmark files
    Actor_01/
    ...
    Actor_24/
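Before preprocessing, it can help to confirm that a manual download matches the layout above. The following is a minimal sketch of such a check; `check_ravdess_layout` is a hypothetical helper written for this example and is not part of the repository.

```python
from pathlib import Path

def check_ravdess_layout(root):
    """Return a list of problems found under `root`; an empty list means OK."""
    root = Path(root)
    problems = []

    # landmarks/ must exist and contain at least one .csv file.
    landmarks = root / "landmarks"
    if not landmarks.is_dir():
        problems.append("missing folder: landmarks/")
    elif not any(landmarks.glob("*.csv")):
        problems.append("no .csv files in landmarks/")

    # Actor_01/ through Actor_24/ must all be present.
    for actor_id in range(1, 25):
        name = f"Actor_{actor_id:02d}"
        if not (root / name).is_dir():
            problems.append(f"missing folder: {name}/")

    return problems
```

Calling `check_ravdess_layout("<path/to/RAVDESS>")` returns one message per missing piece, so an empty list suggests the folder is ready for the preprocessing step.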

Preprocess the dataset with the following command

python dataset_prep.py --datadir <path/to/RAVDESS>

Generated folder structure (do not modify file names)

RAVDESS/
    landmarks/
        .csv landmark files
    Actor_01/
    ...
    Actor_24/
    preprocessed/
        Actor_01/
        ...
        Actor_24/
            01-01-01-01-01-01-24.mp4/
                frames/
                    .jpg frames
                audios/
                    .wav raw audio
                    .npy MFCC features
            ...
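The file names in this tree, such as 01-01-01-01-01-01-24.mp4, follow the RAVDESS naming convention of seven two-digit fields: modality, vocal channel, emotion, intensity, statement, repetition, and actor. That convention comes from the dataset's own documentation rather than from this repository. The sketch below decodes a name into those fields; `parse_ravdess_name` is a hypothetical helper for illustration.

```python
# Emotion codes as defined by the RAVDESS naming convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

def parse_ravdess_name(filename):
    """Split a RAVDESS file name into its seven two-digit fields."""
    stem = filename.split(".")[0]          # drop the extension
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {filename}")
    info = dict(zip(FIELDS, parts))
    # Add a readable label next to the raw emotion code.
    info["emotion_label"] = EMOTIONS.get(info["emotion"], "unknown")
    return info

info = parse_ravdess_name("01-01-01-01-01-01-24.mp4")
```

Here `info["actor"]` is `"24"` and `info["emotion_label"]` is `"neutral"`, which matches the Actor_24 folder the example file sits in. This is one reason the file names must not be modified: the label is encoded in the name.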

Download the checkpoints folder from Google Drive. The following script downloads all pretrained models (unimodal and MSAF) for all 6 folds.

bash scripts/download_checkpoints.sh

Train

python main_msaf.py --datadir <path/to/RAVDESS/preprocessed> --checkpointdir checkpoints --train

All parameters

usage: main_msaf.py [-h] [--datadir DATADIR] [--k_fold K_FOLD] [--lr LR]
                    [--batch_size BATCH_SIZE] [--num_workers NUM_WORKERS]
                    [--epochs EPOCHS] [--checkpointdir CHECKPOINTDIR] [--no_verbose]
                    [--log_interval LOG_INTERVAL] [--no_save] [--train]
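To make the usage text easier to read, here is a sketch of an `argparse` parser that accepts the same flags. The flag names are copied from the usage text; the argument types are assumptions, no defaults are set because the usage text does not state any, and the actual parser in main_msaf.py may differ.

```python
import argparse

def build_parser():
    """Rebuild a parser matching the usage text (types are assumptions)."""
    parser = argparse.ArgumentParser(prog="main_msaf.py")
    # Flags that take a value.
    parser.add_argument("--datadir", type=str)
    parser.add_argument("--k_fold", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--num_workers", type=int)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--checkpointdir", type=str)
    parser.add_argument("--log_interval", type=int)
    # Flags shown without a value in the usage text act as on/off switches.
    parser.add_argument("--no_verbose", action="store_true")
    parser.add_argument("--no_save", action="store_true")
    parser.add_argument("--train", action="store_true")
    return parser

# Parse the same arguments as the training command above.
args = build_parser().parse_args(
    ["--datadir", "RAVDESS/preprocessed", "--checkpointdir", "checkpoints", "--train"]
)
```

With these arguments, `args.train` is `True`, while switches that were not passed, such as `args.no_save`, are `False`, and value flags that were not passed are `None` in this sketch.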

Result

Model	Fusion Stage	Accuracy (%)	#Params
Averaging	Late	68.82	25.92M
Multiplicative	Late	70.35	25.92M
Multiplication	Late	70.56	25.92M
Concat + FC	Early	71.04	26.87M
MCBP	Early	71.32	51.03M
MMTM	Model	73.12	31.97M
MSAF	Model	74.86	25.94M
ERANNs	Model	74.80
CFN-SR(Ours)	Model	75.76	26.30M

Reference

Note that some of the code in this repository is based on MSAF.

Owner

skeleton
