Official PyTorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Overview

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

This repository contains a PyTorch implementation of "Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion".

report

This codebase provides:

  • train code
  • test code
  • dataset
  • pretrained motion models

The main sections are:

  • Overview
  • Installation
  • Download Data and Models
  • Training from Scratch
  • Testing with Pretrained Models

Please note, we will not be providing visualization code for the photorealistic rendering.

Overview:

We provide models and code to train and test our listener motion models.

See below for sections:

  • Installation: environment setup and installation for visualization
  • Download data and models: download annotations and pre-trained models
  • Training from scratch: scripts to get the training pipeline running from scratch
  • Testing with pretrained models: scripts to test pretrained models and save output motion parameters

Installation:

Tested with cuda/9.0, cudnn/v7.0-cuda.9.0, and python 3.6.11

git clone git@github.com:evonneng/learning2listen.git

cd learning2listen/src/
conda create -n venv_l2l python=3.6
conda activate venv_l2l
pip install -r requirements.txt

export L2L_PATH=`pwd`

IMPORTANT: After installing torch, please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d as shown here.

Download Data and Models:

Download Data:

Please first download the dataset for the corresponding individual from Google Drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/data/ (e.g. $L2L_PATH/data/conan_data.tar)

Then run the following script.

./scripts/unpack_data.sh

The downloaded data will unpack into the following directory structure as viewed from $L2L_PATH:

|-- data/
    |-- conan/
        |-- test/
            |-- p0_list_faces_clean_deca.npy
            |-- p0_speak_audio_clean_deca.npy
            |-- p0_speak_faces_clean_deca.npy
            |-- p0_speak_files_clean_deca.npy
            |-- p1_list_faces_clean_deca.npy
            |-- p1_speak_audio_clean_deca.npy
            |-- p1_speak_faces_clean_deca.npy
            |-- p1_speak_files_clean_deca.npy
        |-- train/
    |-- devi2/
    |-- fallon/
    |-- kimmel/
    |-- stephen/
    |-- trevor/

Our dataset consists of 6 different YouTube channels, with one directory named after each channel. Please see comments in $L2L_PATH/scripts/download_models.sh for more details.
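After unpacking, it can help to confirm that a split contains the files listed above. The following is a minimal sketch, not part of the repository: the helper name missing_files is hypothetical, and it assumes that every channel and split holds the same eight files shown for conan/test, which the listing above does not confirm.

```python
from pathlib import Path

# File name pieces taken from the conan/test listing above.
PERSONS = ("p0", "p1")
SUFFIXES = (
    "list_faces_clean_deca.npy",
    "speak_audio_clean_deca.npy",
    "speak_faces_clean_deca.npy",
    "speak_files_clean_deca.npy",
)


def missing_files(data_root, channel, split):
    """Return the expected .npy file names that are absent from data_root/channel/split.

    Assumption: every channel/split mirrors the conan/test listing.
    """
    split_dir = Path(data_root) / channel / split
    expected = [f"{person}_{suffix}" for person in PERSONS for suffix in SUFFIXES]
    return [name for name in expected if not (split_dir / name).is_file()]
```

For example, missing_files(os.path.join(os.environ["L2L_PATH"], "data"), "conan", "test") should return an empty list once the archive has been unpacked correctly.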

Data Format:

The data format is as described below:

We denote p0 as the person on the left side of the video, and p1 as the right side.

  • p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
    • N sequences of length 64. Each feature has size 184: the DECA parameters for expression (50D), pose (6D), and details (128D).
  • p0_speak_audio_clean_deca.npy - audio features (N x 256 x 128) for when p0 is speaking
    • N sequences of length 256, each with 128 mel features.
  • p0_speak_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is speaking
  • p0_speak_files_clean_deca.npy - file names of the format (N x 64 x 3) for when p0 is speaking
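To make the 184-dimensional layout concrete, here is a short sketch that splits a face feature array into its three parts. The function name split_face_features is hypothetical, and the ordering (expression, then pose, then details) is an assumption based on the order in which the parameters are listed above.

```python
import numpy as np

# Sizes from the data format description above.
EXPR_DIM, POSE_DIM, DETAIL_DIM = 50, 6, 128


def split_face_features(faces):
    """Split (N, 64, 184) face features into expression, pose and detail arrays.

    Assumption: features are stored in the order expression, pose, details.
    """
    total = EXPR_DIM + POSE_DIM + DETAIL_DIM
    if faces.ndim != 3 or faces.shape[-1] != total:
        raise ValueError(f"expected shape (N, T, {total}), got {faces.shape}")
    expression = faces[..., :EXPR_DIM]
    pose = faces[..., EXPR_DIM:EXPR_DIM + POSE_DIM]
    details = faces[..., EXPR_DIM + POSE_DIM:]
    return expression, pose, details
```

With the real data, faces would come from np.load on one of the *_faces_clean_deca.npy files, for example data/conan/test/p0_list_faces_clean_deca.npy relative to $L2L_PATH.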

Using Your Own Data:

To train and test on your own videos, please follow this process to convert your data into a compatible format:

(Optional) In our paper, we ran preprocessing to determine when each person is speaking or listening. We used this information to segment our data into chunks. We then extracted speaker-only audio by removing listener back-channels.

  1. Run SyncNet on the video to determine who is speaking when.
  2. Then run Multi Sensory to obtain speaker's audio with all the listener backchannels removed.

For the main processing, we assume there are 2 people in the video: one speaker and one listener.

  1. Run DECA to extract the facial expression and pose details of the two faces for each frame in the video. For each person combine the extracted features across the video into a (1 x T x (50+6)) matrix and save to p0_list_faces_clean_deca.npy or p0_speak_faces_clean_deca.npy files respectively. Note, in concatenating the features, expression comes first.

  2. Use librosa.feature.melspectrogram(...) to process the speaker's audio into a (1 x 4T x 128) feature. Save to p0_speak_audio_clean_deca.npy.
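The two main processing steps above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' preprocessing code: the function names, the video frame rate, the hop length derived from it, and the trimming/padding to exactly 4T frames are all hypothetical, and the source does not say whether the mel features are log-scaled or which sample rate was used. Only librosa.load and librosa.feature.melspectrogram are real library calls.

```python
import numpy as np


def stack_face_features(expression, pose):
    """Combine per-frame DECA expression (T, 50) and pose (T, 6) into (1, T, 56).

    Expression comes first, as noted in step 1.
    """
    expression = np.asarray(expression, dtype=np.float32)
    pose = np.asarray(pose, dtype=np.float32)
    if expression.shape[0] != pose.shape[0]:
        raise ValueError("expression and pose must cover the same number of frames")
    return np.concatenate([expression, pose], axis=-1)[None]


def mel_hop_length(sample_rate, fps, mel_per_frame=4):
    """Hop length in samples that yields about mel_per_frame audio frames per video frame."""
    return max(1, round(sample_rate / (fps * mel_per_frame)))


def speaker_mel(wav_path, fps, num_video_frames):
    """Return a (1, 4T, 128) mel feature for the speaker's audio (T = num_video_frames)."""
    import librosa  # imported here so the helpers above work without librosa

    audio, sample_rate = librosa.load(wav_path, sr=None)  # keep the native sample rate
    hop = mel_hop_length(sample_rate, fps)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sample_rate, n_mels=128, hop_length=hop
    ).T  # (frames, 128)
    target = 4 * num_video_frames
    mel = mel[:target]
    if mel.shape[0] < target:
        # Assumption: zero-pad at the end when the audio is slightly short.
        mel = np.pad(mel, ((0, target - mel.shape[0]), (0, 0)))
    return mel[None].astype(np.float32)
```

The resulting arrays could then be written with np.save to the file names given in steps 1 and 2.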

Download Model:

Please first download the models for the corresponding individual from Google Drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/models/ (e.g. $L2L_PATH/models/conan_models.tar)

Once downloaded, you can run the following script to unpack all of the models.

cd $L2L_PATH
./scripts/unpack_models.sh

We provide person-specific models trained for Conan, Fallon, Stephen, and Trevor. Each person-specific model consists of 2 models: 1) VQ-VAE pre-trained codebook of motion in $L2L_PATH/vqgan/models/ and 2) predictor model for listener motion prediction in $L2L_PATH/models/. It is important that the models are paired correctly during test time.

In addition to the models, we also provide the corresponding config files that were used to define the models/listener training setup.

Please see comments in $L2L_PATH/scripts/unpack_models.sh for more details.

Training from Scratch:

Training a model from scratch follows a 2-step process.

  1. Train the VQ-VAE codebook of listener motion:
# --config: the config file associated with training the codebook
# Includes network setup information and listener information
# See provided config: configs/l2_32_smoothSS.json

cd $L2L_PATH/vqgan/
python train_vq_transformer.py --config <path_to_config_file>

Please note, during training of the codebook, it is normal for the loss to increase before decreasing. Typical training was ~2 days on 4 GPUs.

  2. After the VQ-VAE training has converged, train the predictor model that uses this codebook:
# --config: the config file associated with training the predictor
# Includes network setup information and codebook information
# Note, you will have to update this config to point to the correct codebook.
# See provided config: configs/vq/delta_v6.json

cd $L2L_PATH
python -u train_vq_decoder.py --config <path_to_config_file>

Training the predictor model should converge much faster. Typical training was ~half a day on 4 GPUs.
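The two stages run from different working directories, which is easy to get wrong when scripting them. A minimal driver sketch, assuming the config paths are resolved relative to each stage's working directory (the source does not state this); the function names training_commands and run_training are hypothetical, while the script names and flags are the ones shown above.

```python
import subprocess
from pathlib import Path


def training_commands(l2l_path, codebook_config, predictor_config):
    """Return (argv, working_directory) pairs for the two training stages."""
    root = Path(l2l_path)
    return [
        # Stage 1: VQ-VAE codebook, run from $L2L_PATH/vqgan/
        (["python", "train_vq_transformer.py", "--config", str(codebook_config)],
         root / "vqgan"),
        # Stage 2: predictor, run from $L2L_PATH
        (["python", "-u", "train_vq_decoder.py", "--config", str(predictor_config)],
         root),
    ]


def run_training(l2l_path, codebook_config, predictor_config):
    """Run both stages in order; check=True aborts if a stage exits with an error.

    The predictor config must already point to the codebook produced by stage 1.
    """
    for argv, cwd in training_commands(l2l_path, codebook_config, predictor_config):
        subprocess.run(argv, cwd=cwd, check=True)
```

Note that a zero exit code from stage 1 does not guarantee the codebook has converged, so it is still worth inspecting the loss before relying on the second stage.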

Testing with Pretrained Models:

# --config: the config file associated with training the predictor 
# --checkpoint: the path to the pretrained model
# --speaker: can specify which speaker you want to test on (conan, trevor, stephen, fallon, kimmel)

cd $L2L_PATH
python test_vq_decoder.py --config <path_to_config> --checkpoint <path_to_pretrained_model> --speaker <optional>

For our provided models and configs you can run:

python test_vq_decoder.py --config configs/vq/delta_v6.json --checkpoint models/delta_v6_er2er_best.pth --speaker 'conan'

Visualization

As part of responsible practices, we will not be releasing code for the photorealistic visualization pipeline. However, the raw 3D meshes can be rendered using the DECA renderer.

Potentially Coming Soon

  • Visualization of 3D meshes code from saved output