PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Last update: Nov 02, 2022

Overview

Lip to Speech Synthesis with Visual Context Attentional GAN

This repository contains the PyTorch implementation of the following paper:

Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, and Yong Man Ro
[Paper] [Demo Video]

Preparation

Requirements

python 3.7
pytorch 1.6 ~ 1.8
torchvision
torchaudio
ffmpeg
av
tensorboard
scikit-image
pillow
librosa
pystoi
pesq
scipy

Datasets

Download

The GRID dataset (video normal) can be downloaded from the link below.

http://spandh.dcs.shef.ac.uk/gridcorpus/

For data preprocessing, download the face landmarks of GRID from the link below.

https://drive.google.com/file/d/1MDLmREuqeWin6CituMn4Z_dhIIJAwDGo/view?usp=sharing

Preprocessing

After downloading the dataset, preprocess it with the following scripts in ./preprocess.
The scripts assume the data directory is structured as follows:

Data_dir
├── subject
|   ├── video
|   |   └── xxx.mpg
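
Before running the scripts, it can help to confirm that the data directory matches this layout. The following is a minimal sketch and not part of the repository: the function name list_grid_videos is hypothetical, and it assumes every subject folder keeps its clips under video/ with an .mpg extension, as in the tree above.

```python
from pathlib import Path


def list_grid_videos(data_dir):
    """Return {subject: sorted .mpg file names} for Data_dir/subject/video/xxx.mpg."""
    videos = {}
    for mpg in sorted(Path(data_dir).glob("*/video/*.mpg")):
        subject = mpg.parent.parent.name  # Data_dir/<subject>/video/<clip>.mpg
        videos.setdefault(subject, []).append(mpg.name)
    return videos
```

A subject that is missing from the returned dictionary has no .mpg files at the expected depth.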

Extract frames
Extract_frames.py extracts images and audio from the video.

python Extract_frames.py --Grid_dir "Data dir of GRID_corpus" --Out_dir "Output dir of images and audio of GRID_corpus"
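
For orientation, frame and audio extraction of this kind is commonly done with ffmpeg. The sketch below only builds the command lines for one clip and does not reproduce Extract_frames.py: the function name, the output naming (%03d.png, audio.wav), and the 16 kHz mono audio setting are all assumptions made for illustration.

```python
from pathlib import Path


def build_ffmpeg_commands(video_path, out_dir, sample_rate=16000):
    """Build (but do not run) ffmpeg commands that dump frames and mono audio for one clip."""
    video_path, out_dir = Path(video_path), Path(out_dir)
    clip_dir = out_dir / video_path.stem  # one output folder per clip (assumed layout)
    # Write every video frame as a numbered PNG image.
    frames_cmd = ["ffmpeg", "-y", "-i", str(video_path), str(clip_dir / "%03d.png")]
    # Drop the video stream (-vn), downmix to mono (-ac 1), and resample (-ar).
    audio_cmd = ["ffmpeg", "-y", "-i", str(video_path), "-vn", "-ac", "1",
                 "-ar", str(sample_rate), str(clip_dir / "audio.wav")]
    return frames_cmd, audio_cmd
```

Each command can be passed to subprocess.run(cmd, check=True) once the clip folder has been created.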

Align faces and audio processing
Preprocess.py aligns the faces and generates videos from them, so that lip-centered crops can be taken during training.

python Preprocess.py \
--Data_dir "Data dir of extracted images and audio of GRID_corpus" \
--Landmark "Downloaded landmark dir of GRID" \
--Output_dir "Output dir of processed data"

Training the Model

The speaker setting (which subjects are used) is selected with the --subject argument; see the examples below.
To train the model, run one of the following commands:

# Data Parallel training example using 4 GPUs for multi-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 88 \
--epochs 500 \
--subject 'overlap' \
--eval_step 720 \
--dataparallel \
--gpu 0,1,2,3

# Single-GPU training example for unseen-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 22 \
--epochs 500 \
--subject 'unseen' \
--eval_step 1000 \
--gpu 0
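
The two examples use different --batch_size values, but the load per GPU works out the same. The sketch below does the arithmetic, assuming the batch is divided evenly across the listed devices; the helper name per_gpu_batch is hypothetical and not part of train.py.

```python
def per_gpu_batch(batch_size, gpu_arg):
    """Return (samples per GPU, leftover samples) for a comma-separated GPU id string."""
    gpu_ids = [g.strip() for g in str(gpu_arg).split(",") if g.strip()]
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    return divmod(batch_size, len(gpu_ids))


print(per_gpu_batch(88, "0,1,2,3"))  # multi-speaker example: 88 samples over 4 GPUs
print(per_gpu_batch(22, "0"))        # unseen-speaker example: 22 samples on 1 GPU
```

Both calls give 22 samples per GPU with nothing left over, which is a reasonable starting point when adapting the batch size to a different number of GPUs.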

Descriptions of the training parameters are as follows:

--grid: dataset location (GRID)
--checkpoint_dir: directory for saving checkpoints
--checkpoint: saved checkpoint from which training is resumed
--batch_size: batch size
--epochs: number of epochs
--augmentations: whether to perform augmentation
--dataparallel: use DataParallel
--subject: speaker setting; s# for speaker-specific training, overlap for the multi-speaker setting, unseen for the unseen-speaker setting, four for four-speaker training
--gpu: GPU id(s) to use for training
--lr: learning rate
--eval_step: number of steps between evaluations
--window_size: number of frames to be used for training
Refer to train.py for the other training parameters.
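
As a reading aid, the flags described above could be declared with argparse roughly as follows. This is a sketch, not the parser in train.py: the types, the use of store_true for the boolean flags, and every default value are placeholders chosen for illustration.

```python
import argparse


def build_train_parser():
    """Parser mirroring the documented flags; all defaults are placeholders."""
    p = argparse.ArgumentParser(description="Training flags (illustrative sketch)")
    p.add_argument("--grid", required=True, help="dataset location (GRID)")
    p.add_argument("--checkpoint_dir", required=True, help="directory for saving checkpoints")
    p.add_argument("--checkpoint", default=None, help="checkpoint to resume from")
    p.add_argument("--batch_size", type=int, default=22)
    p.add_argument("--epochs", type=int, default=500)
    p.add_argument("--augmentations", action="store_true")  # assumed to be a switch
    p.add_argument("--dataparallel", action="store_true")
    p.add_argument("--subject", default="overlap", help="s#, overlap, unseen, or four")
    p.add_argument("--gpu", default="0", help="comma-separated GPU ids")
    p.add_argument("--lr", type=float, default=1e-4)        # placeholder value
    p.add_argument("--eval_step", type=int, default=1000)
    p.add_argument("--window_size", type=int, default=40)   # placeholder value
    return p
```

Check train.py for the actual defaults before relying on any value shown here.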

Evaluation during training is performed on a subset of the validation dataset, because converting the outputs to waveforms with Griffin-Lim is time-consuming.
To evaluate the trained model on the entire test set, run the test code (refer to the "Testing the Model" section).
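
How the subset is chosen is not described here. One simple way to get a small, reproducible subset is to sample a fixed number of validation indices with a seeded random generator, as sketched below; the function name, the seed, and the sampling strategy are assumptions, not the repository's method.

```python
import random


def pick_eval_subset(num_items, subset_size, seed=0):
    """Pick a reproducible subset of validation indices for quick in-training evaluation."""
    rng = random.Random(seed)                  # fixed seed: same subset at every evaluation
    subset_size = min(subset_size, num_items)  # never ask for more items than exist
    return sorted(rng.sample(range(num_items), subset_size))
```

Keeping the subset fixed across evaluations makes the logged metrics comparable from step to step.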

Check the training logs

tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'

TensorBoard shows the training and validation losses, evaluation metrics, generated mel-spectrograms, and audio.

Testing the Model

To test the model, run the following command:

# Dataparallel test example for multi-speaker setting in GRID
python test.py \
--grid 'enter_the_processed_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 100 \
--subject 'overlap' \
--save_mel \
--save_wav \
--dataparallel \
--gpu 0,1

Descriptions of the test parameters are as follows:

--grid: dataset location (GRID)
--checkpoint: saved checkpoint to load for testing
--batch_size: batch size
--dataparallel: use DataParallel
--subject: speaker setting; s# for speaker-specific training, overlap for the multi-speaker setting, unseen for the unseen-speaker setting, four for four-speaker training
--save_mel: whether to save the 'mel_spectrogram' and 'spectrogram' in .npz format
--save_wav: whether to save the 'waveform' in .wav format
--gpu: GPU id(s) to use for testing
Refer to test.py for the other parameters.

Test Automatic Speech Recognition (ASR) results of generated speech: WER

The ground-truth transcriptions of the GRID dataset can be downloaded from the link below.

https://drive.google.com/file/d/1q_v4acR_xsHb75P09jKAAtNONVo35ueR/view?usp=sharing

Move to the ASR_model/GRID directory:

cd ASR_model/GRID

To evaluate the WER, run the following command:

# test example for multi-speaker setting in GRID
python test.py \
--data 'enter_the_generated_data_dir (mel or wav) (ex. ./../../test/spec_mel)' \
--gtpath 'enter_the_downloaded_transcription_path' \
--subject 'overlap' \
--gpu 0

Descriptions of the evaluation parameters are as follows:

--data: data for evaluation (wav or mel (.npz))
--wav: whether the data is waveform
--batch_size: batch size
--subject: speaker setting; s# for speaker-specific training, overlap for the multi-speaker setting, unseen for the unseen-speaker setting, four for four-speaker training
--gpu: GPU id(s) to use for evaluation
Refer to ./ASR_model/GRID/test.py for the other parameters.
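
For reference, WER is the word-level edit distance between the ASR output and the ground-truth transcription, divided by the number of ground-truth words. The sketch below computes it for a single sentence pair. It illustrates the standard definition only; the script in ASR_model/GRID may tokenize or aggregate over the test set differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, a six-word GRID sentence recognized with one wrong word has a WER of 1/6.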

Pre-trained ASR model checkpoint

The pre-trained ASR models used to evaluate the generated speech are listed below.
WER is each model's original performance on ground-truth audio.

Setting	WER
GRID (constrained-speaker)	0.83 %
GRID (multi-speaker)	1.67 %
GRID (unseen-speaker)	0.37 %
LRW	1.54 %

Put the checkpoints in ./ASR_model/GRID/data for GRID, and in ./ASR_model/LRW/data for LRW.

Citation

If you find this work useful in your research, please cite the paper:

@article{kim2021vcagan,
  title={Lip to Speech Synthesis with Visual Context Attentional GAN},
  author={Kim, Minsu and Hong, Joanna and Ro, Yong Man},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
