Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks

Overview

This repository contains the TensorFlow implementation for the paper "3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition" (IEEE Access, 2017).

The input pipeline must be prepared by the user. This code provides an implementation of Coupled 3D Convolutional Neural Networks for audio-visual matching. Lip reading is one specific application of this work.

If you use this code, please consider citing the following paper:

@article{torfi20173d,
  title={3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition},
  author={Torfi, Amirsina and Iranmanesh, Seyed Mehdi and Nasrabadi, Nasser and Dawson, Jeremy},
  journal={IEEE Access},
  year={2017},
  publisher={IEEE}
  }

DEMO

Training/Evaluation DEMO

training

Lip Tracking DEMO

liptrackingdemo

General View

Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information.

The Problem and the Approach

The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose a coupled 3D Convolutional Neural Network (CNN) architecture that maps both modalities into a common representation space, where the correspondence of audio-visual streams is evaluated using the learned multimodal features.

How to leverage 3D Convolutional Neural Networks?

The proposed architecture incorporates spatial and temporal information jointly in order to find the correlation between the temporal information of the two modalities. Using a relatively small network architecture and a much smaller dataset, the proposed method surpasses the performance of existing similar methods for audio-visual matching that use CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase performance.

Code Implementation

The input pipeline must be provided by the user. The rest of the implementation assumes a dataset that contains the utterance-based extracted features.

Lip Tracking

For lip tracking, the desired video must be fed as the input. First, cd to the corresponding directory:

cd code/lip_tracking

Then run the dedicated Python file as below:

python VisualizeLip.py --input input_video_file_name.ext --output output_video_file_name.ext

Running this script extracts the lip motion by saving the mouth area of each frame, and creates an output video with a rectangle drawn around the mouth area for better visualization.

The required arguments are defined by the following Python snippet from the VisualizeLip.py file:

import argparse

# Command-line interface of VisualizeLip.py: the input and output paths are
# required, while the FPS and codec fall back to their defaults.
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
                help="path to input video file")
ap.add_argument("-o", "--output", required=True,
                help="path to output video file")
ap.add_argument("-f", "--fps", type=int, default=30,
                help="FPS of output video")
ap.add_argument("-c", "--codec", type=str, default="MJPG",
                help="codec of output video")
args = vars(ap.parse_args())  # parsed arguments as a plain dict

Some of the arguments have default values, so no further action is required for them.
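As a rough illustration of the mouth-area step described above, the sketch below computes a padded bounding rectangle from a set of facial landmarks. It assumes the 68-point landmark layout used by the shape_predictor_68_face_landmarks.dat model, in which indices 48 to 67 cover the mouth. The function name mouth_box and the pad value are hypothetical and are not taken from VisualizeLip.py.

```python
import numpy as np

# In the 68-point landmark layout, points 48..67 outline the mouth.
MOUTH_START, MOUTH_END = 48, 68


def mouth_box(landmarks, pad=10, frame_shape=None):
    """Return (x0, y0, x1, y1) of a padded rectangle around the mouth.

    landmarks:   array-like of shape (68, 2) holding (x, y) pixel coordinates.
    pad:         hypothetical margin in pixels added on every side.
    frame_shape: optional (height, width, ...) used to clip the rectangle.
    """
    pts = np.asarray(landmarks)[MOUTH_START:MOUTH_END]
    x0, y0 = pts.min(axis=0) - pad
    x1, y1 = pts.max(axis=0) + pad
    if frame_shape is not None:
        height, width = frame_shape[:2]
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width - 1), min(y1, height - 1)
    return int(x0), int(y0), int(x1), int(y1)


# Synthetic landmarks: only the mouth points matter for the rectangle.
demo_landmarks = np.zeros((68, 2), dtype=int)
demo_landmarks[MOUTH_START:MOUTH_END] = (120, 210)
demo_landmarks[48] = (100, 200)
demo_landmarks[67] = (160, 230)
demo_box = mouth_box(demo_landmarks, pad=10, frame_shape=(480, 640, 3))
print(demo_box)
```

The returned corners can be used both to crop the mouth region and to draw the visualization rectangle on the frame.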

Processing

In the visual section, the videos are post-processed to have an equal frame rate of 30 f/s. Then, face tracking and mouth area extraction are performed on the videos using the dlib library [dlib]. Finally, all mouth areas are resized to the same size and concatenated to form the input feature cube. The dataset does not contain any audio files, so the audio is extracted from the videos using the FFmpeg framework [ffmpeg]. The processing pipeline is shown in the figure below.

readme_images/processing.gif
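The two FFmpeg steps of this pipeline (frame-rate normalization and audio extraction) could be scripted roughly as follows. This is only a sketch: the helper names, the file names, and the 16 kHz mono audio format are assumptions made for illustration, and the repository may use different settings. The commands are built as argument lists and printed rather than executed.

```python
import subprocess


def fps_command(src, dst, fps=30):
    """Build an FFmpeg command that re-encodes src at a fixed frame rate."""
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), dst]


def audio_command(src, dst_wav, sample_rate=16000):
    """Build an FFmpeg command that drops the video and saves mono audio.

    The sample rate is an assumed value, not one stated by the project.
    """
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1",
            "-ar", str(sample_rate), dst_wav]


def run(command):
    """Execute a command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(command, check=True)


# Hypothetical file names, printed for inspection instead of being executed.
commands = [
    fps_command("input.mp4", "input_30fps.mp4"),
    audio_command("input_30fps.mp4", "input.wav"),
]
for command in commands:
    print(" ".join(command))
```

Passing each list to run() would execute it, provided FFmpeg is installed and the input file exists.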

Input Pipeline for this work

The proposed architecture utilizes two non-identical ConvNets that take a pair of speech and video streams as input. The network input is a pair of features that represent lip movement and speech, extracted from 0.3 seconds of a video clip. The main task is to determine whether a stream of audio corresponds to a lip motion clip within the desired stream duration. In the next two sub-sections, we explain the inputs for the speech and visual streams.

Speech Net

On the time axis, the temporal features are computed over non-overlapping 20 ms windows, which are used to generate spectrum features that possess a local characteristic. The input speech feature map, which is represented as an image cube, corresponds to the spectrogram as well as the first- and second-order derivatives of the MFEC features. These three channels correspond to the image depth. From a 0.3-second clip, 15 temporal feature sets (each consisting of 40 MFEC features) can be derived, which together form a speech feature cube. Each input feature map for a single audio stream therefore has a dimensionality of 15 × 40 × 3. This representation is depicted in the following figure:

readme_images/Speech_GIF.gif

The speech features have been extracted using the [SpeechPy] package.

Please refer to code/speech_input/input_feature.py to get an idea of how the input pipeline works.
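The shape arithmetic of the speech cube can be checked with a small NumPy sketch. It does not call SpeechPy: the static features are random placeholders standing in for real MFEC values, and the derivatives are approximated with np.gradient, which is an assumption and may differ from the derivative computation used in the project. The function name speech_cube is hypothetical.

```python
import numpy as np

CLIP_SECONDS = 0.3     # duration of one input clip
WINDOW_SECONDS = 0.02  # non-overlapping 20 ms windows
NUM_MFEC = 40          # MFEC features per window

# 0.3 s / 0.02 s = 15 temporal feature sets per clip.
NUM_FRAMES = int(round(CLIP_SECONDS / WINDOW_SECONDS))


def speech_cube(static_features):
    """Stack static features with first- and second-order time derivatives.

    static_features: array of shape (frames, features).
    Returns an array of shape (frames, features, 3).
    """
    static = np.asarray(static_features, dtype=np.float32)
    delta = np.gradient(static, axis=0)   # first-order derivative over time
    delta2 = np.gradient(delta, axis=0)   # second-order derivative over time
    return np.stack([static, delta, delta2], axis=-1)


# Placeholder features; real values would come from the feature extractor.
rng = np.random.default_rng(0)
static = rng.random((NUM_FRAMES, NUM_MFEC))
cube = speech_cube(static)
print(cube.shape)
```

The three channels along the last axis play the role of the image depth described above.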

Visual Net

The frame rate of each video clip used in this effort is 30 f/s. Consequently, 9 successive image frames form the 0.3-second visual stream. The input of the visual stream of the network is a cube of size 9 × 60 × 100, where 9 is the number of frames that represent the temporal information. Each channel is a 60 × 100 grayscale image of the mouth region.

readme_images/lip_motion.jpg
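A minimal sketch of how such a visual cube could be assembled is given below. It assumes the mouth crops have already been resized to 60 × 100 pixels and arrive as RGB arrays. The grayscale weights are the common luma coefficients, and the function names are hypothetical rather than taken from the repository.

```python
import numpy as np

FPS = 30
CLIP_SECONDS = 0.3
NUM_FRAMES = int(round(FPS * CLIP_SECONDS))  # 9 frames per clip
HEIGHT, WIDTH = 60, 100


def to_gray(frame_rgb):
    """Convert an (H, W, 3) RGB frame to an (H, W) grayscale image."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return frame_rgb.astype(np.float32) @ weights


def visual_cube(mouth_frames):
    """Stack successive mouth crops into a (frames, height, width) cube."""
    if len(mouth_frames) != NUM_FRAMES:
        raise ValueError("expected %d frames, got %d"
                         % (NUM_FRAMES, len(mouth_frames)))
    return np.stack([to_gray(frame) for frame in mouth_frames], axis=0)


# Random placeholder crops standing in for real resized mouth regions.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(HEIGHT, WIDTH, 3), dtype=np.uint8)
          for _ in range(NUM_FRAMES)]
lip_cube = visual_cube(frames)
print(lip_cube.shape)
```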

Architecture

The architecture is a coupled 3D convolutional neural network in which two different networks with different sets of weights must be trained. For the visual network, the spatial information of the lip motion is incorporated jointly with the temporal information, and the two are fused to exploit the temporal correlation. For the audio network, the extracted energy features are treated as a spatial dimension, and the stacked audio frames form the temporal dimension. In the proposed 3D CNN architecture, the convolutional operations are performed on successive temporal frames for both the audio and visual streams.

readme_images/DNN-Coupled.png
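To make the idea of convolving over successive temporal frames concrete, here is a single-channel "valid" 3D convolution (cross-correlation, as used in deep learning frameworks) written in plain NumPy. The 3 × 3 × 3 kernel is an arbitrary choice for illustration and does not reflect the kernel sizes of the actual network. The sketch needs NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_valid(volume, kernel):
    """Single-channel 3D cross-correlation without padding.

    volume: array of shape (time, height, width).
    kernel: array of shape (kt, kh, kw).
    Returns an array of shape (time-kt+1, height-kh+1, width-kw+1).
    """
    # windows has shape (out_t, out_h, out_w, kt, kh, kw).
    windows = sliding_window_view(volume, kernel.shape)
    # Each output value mixes kt successive frames with a kh x kw patch.
    return np.einsum("ijkabc,abc->ijk", windows, kernel)


# A visual cube of 9 frames of 60 x 100 pixels, filled with placeholders.
volume = np.ones((9, 60, 100), dtype=np.float32)
kernel = np.ones((3, 3, 3), dtype=np.float32)  # hypothetical kernel size
features = conv3d_valid(volume, kernel)
print(features.shape)
```

Because the kernel spans several frames, every output value depends on both the spatial neighborhood and the temporal neighborhood, which is the joint spatio-temporal behavior described above.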

Training / Evaluation

First, clone the repository. Then, cd to the dedicated directory:

cd code/training_evaluation

Finally, the train.py file must be executed:

python train.py

For the evaluation phase, a similar script must be executed:

python test.py
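The training log reports metrics such as EER and AUC for the pair distances. The sketch below shows one way these two quantities can be computed from distances and binary pair labels using NumPy only. It is an illustration under stated assumptions (label 1 marks a genuine pair, and a smaller distance means a better match) and is not the code in roc_curve/calculate_roc.py.

```python
import numpy as np


def auc_and_eer(distances, labels):
    """Return (AUC, EER) for pair distances with labels 1=genuine, 0=impostor."""
    d = np.asarray(distances, dtype=np.float64)
    genuine = np.asarray(labels).astype(bool)

    # AUC: probability that a genuine pair scores higher than an impostor
    # pair, using the negated distance as the similarity score.
    diff = (-d[genuine])[:, None] - (-d[~genuine])[None, :]
    auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

    # EER: sweep thresholds; a pair is accepted if distance <= threshold.
    thresholds = np.unique(d)
    far = np.array([(d[~genuine] <= t).mean() for t in thresholds])
    frr = np.array([(d[genuine] > t).mean() for t in thresholds])
    closest = np.argmin(np.abs(far - frr))
    eer = (far[closest] + frr[closest]) / 2.0
    return float(auc), float(eer)


# Toy example: genuine pairs are clearly closer than impostor pairs.
toy_auc, toy_eer = auc_and_eer([0.1, 0.2, 0.9, 0.8], [1, 1, 0, 0])
print(toy_auc, toy_eer)
```

Both quantities are defined for two classes only, so the labels must stay binary.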

Results

The results below demonstrate the effect of the proposed method on the accuracy and the speed of convergence.

accuracy

The best result, which is the right-most one, belongs to our proposed method.

converge

The effect of the proposed Online Pair Selection method is shown in the figure.

Disclaimer

The current version of the code does not contain the adaptive pair selection method proposed in the paper "3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition". Only a simple pair selection with hard thresholding is included at the moment.
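The hard-thresholding idea can be sketched as follows. The rule used here keeps every genuine pair plus the impostor pairs whose distance is below the largest genuine distance plus a margin, which is the condition imp_dis < (max_gen + margin) quoted in the issue discussion of this project. The function name, the label convention (1 = genuine, 0 = impostor), and the fallback for batches without genuine pairs are assumptions. This is not the code from train.py.

```python
import numpy as np


def select_pairs(distances, labels, margin):
    """Return a boolean mask of the pairs that contribute to the loss.

    distances: embedding distance of each audio-visual pair in the batch.
    labels:    1 for genuine (matching) pairs, 0 for impostor pairs.
    margin:    slack added to the largest genuine distance.
    """
    d = np.asarray(distances, dtype=np.float64)
    genuine = np.asarray(labels) == 1
    if not genuine.any():
        # Assumed fallback: with no genuine pair there is no threshold,
        # so every pair in the batch is kept.
        return np.ones_like(genuine, dtype=bool)
    max_gen = d[genuine].max()
    # Impostor pairs far beyond the threshold are dropped from the batch.
    hard_impostors = (~genuine) & (d < max_gen + margin)
    return genuine | hard_impostors


mask = select_pairs([0.2, 0.5, 0.6, 2.0], [1, 1, 0, 0], margin=0.3)
print(mask)
```

In this toy batch the threshold is 0.5 + 0.3 = 0.8, so the impostor pair at distance 0.6 is kept and the one at distance 2.0 is discarded.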

Contribution

We look forward to your feedback. Please help us improve the code and make our work better. To contribute, please create a pull request and we will review it promptly. Once again, we appreciate your feedback and code inspections.

References

[SpeechPy] @misc{amirsina_torfi_2017_810392, author = {Amirsina Torfi}, title = {astorfi/speech_feature_extraction: SpeechPy}, month = jun, year = 2017, doi = {10.5281/zenodo.810392}, url = {https://doi.org/10.5281/zenodo.810391}}
[dlib] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[ffmpeg] Developers. FFmpeg tool (version be1d324) [software], 2016.
Comments
  • Online pair selection?

    Hi, in your paper, the pair selection algorithm selects the main contributing impostor pairs as those with imp_dis < (max_gen + margin). Could you clarify why these are the main contributing impostor pairs?

    opened by bg193 13
  • Question about speech features

    As you mentioned before, "use of non-overlapping hamming windows for generating speech features". I'm not sure how to do that here. Could you describe the procedure in detail? Thanks a lot!

    opened by bg193 12
  • Unable to open dlib/shape_predictor_68_face_landmarks.dat

    Hello,

    How can I resolve the error Unable to open dlib/shape_predictor_68_face_landmarks.dat when I run the command python VisualizeLip.py --input sample_video.mp4 --output sample_video.mp4?

    Thank you!

    opened by mrpasquali 7
  • Contrastive loss

    Hi! Thank you for the excellent work and the publicly available code. I would like to know how to get a metric that shows how far out of sync the video and audio are for a sample. For example, in this paper, the authors display AV offset, Min dist and Confidence. http://www.robots.ox.ac.uk:5000/~vgg/publications/2016/Chung16a/chung16a.pdf

    opened by Maxfashko 5
  • Complete noob questions - 1) model purpose? 2) pre-trained weights? 3) other languages?

    Excuse my complete noob-ness

    1. is the model trying to accurately determine if the video (i.e. shape of lips) and audio are sync'ed?

    2. Any pre-trained weights I can download to run ?

    3. Assuming my Q1 is correct.. has anyone tested to see if this model can accurately detect audio/video synchronization on non-english languages?

    opened by taewookim 4
  • A problem of Multiclass Classification

    As far as I understood, your code supports only a binary classification problem. I could not find any information in the paper regarding the classes (are the "Words"/"Subjects" the classes?). I am trying to use this for a multi-class problem. Since pairing has been done for the frame sequences of each video (9 of them) with the corresponding speech spectrogram and MFEC features, I suppose there will be no problem if one changes the number of classes. When I change the number of classes from 2 to 6, I get this error. Can you help me?

    Epoch 1, Minibatch 1 of 15 , Minibatch Loss= 1056.706787, EER= 0.50000, AUC= 0.33333, AP= 0.69683, contrib = 8 pairs
    Epoch 1, Minibatch 2 of 15 , Minibatch Loss= 1793.572998, EER= 0.50000, AUC= 0.55000, AP= 0.61167, contrib = 9 pairs
    Epoch 1, Minibatch 3 of 15 , Minibatch Loss= 1273.130249, EER= 0.50000, AUC= 0.62500, AP= 0.80417, contrib = 6 pairs
    Epoch 1, Minibatch 4 of 15 , Minibatch Loss= 1280.513916, EER= 0.25000, AUC= 0.60714, AP= 0.81829, contrib = 11 pairs
    Epoch 1, Minibatch 5 of 15 , Minibatch Loss= 1651.882568, EER= 0.40000, AUC= 0.60000, AP= 0.67778, contrib = 9 pairs
    Epoch 1, Minibatch 6 of 15 , Minibatch Loss= 1395.890381, EER= 0.40000, AUC= 0.48000, AP= 0.53429, contrib = 10 pairs
    Epoch 1, Minibatch 7 of 15 , Minibatch Loss= 1423.493164, EER= 0.27273, AUC= 0.63636, AP= 0.58000, contrib = 16 pairs
    Epoch 1, Minibatch 8 of 15 , Minibatch Loss= 1248.631836, EER= 0.50000, AUC= 0.55000, AP= 0.61167, contrib = 9 pairs
    Epoch 1, Minibatch 9 of 15 , Minibatch Loss= 1377.684937, EER= 0.50000, AUC= 0.54167, AP= 0.74385, contrib = 10 pairs
    Epoch 1, Minibatch 10 of 15 , Minibatch Loss= 1460.154419, EER= 0.33333, AUC= 0.83333, AP= 0.88750, contrib = 7 pairs
    Epoch 1, Minibatch 11 of 15 , Minibatch Loss= 1794.762451, EER= 0.40000, AUC= 0.33333, AP= 0.67771, contrib = 17 pairs
    Epoch 1, Minibatch 12 of 15 , Minibatch Loss= 1140.301392, EER= 0.50000, AUC= 0.37500, AP= 0.36667, contrib = 6 pairs
    Epoch 1, Minibatch 13 of 15 , Minibatch Loss= 1273.781738, EER= 0.66667, AUC= 0.47619, AP= 0.51664, contrib = 16 pairs
    Epoch 1, Minibatch 14 of 15 , Minibatch Loss= 989.276489, EER= 0.50000, AUC= 0.58333, AP= 0.36667, contrib = 8 pairs
    Epoch 1, Minibatch 15 of 15 , Minibatch Loss= 1625.663696, EER= 0.33333, AUC= 0.83333, AP= 0.95028, contrib = 11 pairs
    TESTING: Epoch 1, Minibatch 1 of 3 
    TESTING: Epoch 1, Minibatch 2 of 3 
    Traceback (most recent call last):
      File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/train.py", line 667, in <module>
        tf.app.run()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
        _sys.exit(main(argv))
      File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/train.py", line 659, in main
        score_dissimilarity_vector[i * batch_k_validation:(i + 1) * batch_k_validation])
      File "/media/Data/Scripts/lip-reading-deeplearning/code/training_evaluation/roc_curve/calculate_roc.py", line 16, in calculate_eer_auc_ap
        AUC = metrics.roc_auc_score(label, -distance, average='macro', sample_weight=None)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/ranking.py", line 277, in roc_auc_score
        sample_weight=sample_weight)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/base.py", line 72, in _average_binary_score
    TESTING: Epoch 1, Minibatch 3 of 3 
        raise ValueError("{0} format is not supported".format(y_type))
    ValueError: multiclass format is not supported
    

    Is this because of ROC calculation (line 659 of train.py) for multi-class classification? Can you tell me which parts need modification, maybe I missed something.

    opened by ghost 3
  • error when changing the input size

    I changed the input image size into: 'mouth': np.random.random_sample(size=(num_training_samples, 9, 64, 64, 1))

    then the error Negative dimension size caused by subtracting 5 from 3 for 'tower_0/mouth_cnn/fc5/fc5_1/convolution' (op: 'Conv3D') with input shapes: [?,9,3,3,128], [1,2,5,128,256]

    How should I change the net structure to fix that? thank you

    opened by Perseus1993 3
  • local variable 'i' referenced before assignment

    When I run test.py:

    File "test.py", line 576, in <module>
      tf.app.run()
    File "/home/ligen/anaconda2/envs/python3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
      _sys.exit(main(_sys.argv[:1] + flags_passthrough))
    File "test.py", line 401, in main
      with tf.name_scope('%s_%d' % ('tower', i)) as scope:
    UnboundLocalError: local variable 'i' referenced before assignment

    opened by Perseus1993 3
  • ./results/mouth/frame_%*.png: could not find codec parameters

    Hi, I am glad to have found this source, but when I run ./run.sh or ./lip_tracking_demo.sh, I get this error:

    [image2 @ 0x1aa4100] Could not open file : ./results/mouth/frame_.png
    [image2 @ 0x1aa4100] Could not find codec parameters (Video: png)
    ./results/mouth/frame_%.png: could not find codec parameters

    This happens because results/mouth/frame_%.png cannot be found. Could you release this file, or give me an example of what form frame_%.png takes? Is frame_%*.png just the mouth region, as in 'lip-reading-deeplearning-master/readme_images/1.gif'?

    opened by xiaoyun4 3
  • This recording has been archived

    Hello!

    The video of the Training/Evaluation DEMO no longer exists. It shows "All unclaimed recordings (the ones not linked to any user account) are automatically archived 7 days after upload."

    opened by Vegetebird 2
  • The version of tensorflow.

    Could you please give some information about the required TensorFlow version? I get different errors with different versions of TensorFlow. Thank you.

    tensorflow:1.11.0

     File "/Users/apple/anaconda3/envs/venv/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution
        (conv_dims + 2, input_rank))
    ValueError: Convolution expects input with rank 4, got 5
    

    tensorflow:1.6.0

      File "/Users/apple/anaconda3/envs/tensorflow-1.0.0/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1751, in restore
        raise ValueError("Can't load save_path when it is None.")
    ValueError: Can't load save_path when it is None.
    
    opened by XiangYangAI 2
  • cannot set WRITEABLE flag to True of this array

    When I run VisualizeLip.py, I get the following error:

    Traceback (most recent call last):
      File "VisualizeLip.py", line 129, in <module>
        frame.setflags(write=True)
    ValueError: cannot set WRITEABLE flag to True of this array

    I looked it up online, and it appears to be a NumPy version problem. My TensorFlow version is 2.1.0, which conflicts with the NumPy library.

    opened by LeoJin1234 1
  • Need help about demo

    Is anyone currently working on this, or has anyone worked on it? I need help. I have run this project, and lip tracking with dlib works fine for me. I want to know how I can do lip reading from video. It only outputs the tracked lips, with no text output from lip reading.

    opened by mozillah 0
  • May I know if this is the correct model file?

    Hi,

    I found there is a model checkpoint under the results/TRAIN_CNN_3D folder: train_logs-62.*. May I know if this is the correct model file? Thank you very much!

    opened by ilovecv 0
Releases(1.2)
  • 1.2(Aug 8, 2017)

    This project is aimed to provide the implementation for Coupled 3D Convolutional Neural Networks for audio-visual matching. Lip-reading can be a specific application for this work.

  • 1.1(Jul 30, 2017)

  • 1.0(Jul 30, 2017)

Owner
Amirsina Torfi
PhD & Developer working on Deep Learning, Computer Vision & NLP