AudioDVP: Photorealistic Audio-driven Video Portraits

Last update: Jan 03, 2023

AudioDVP

This is the official implementation of Photorealistic Audio-driven Video Portraits.

Major Requirements

Ubuntu >= 18.04
PyTorch >= 1.2
GCC >= 7.5
NVCC >= 10.1
FFmpeg (with H.264 support)

FYI, the detailed environment setup is in enviroment.yml. (You don't have to install everything listed there; just install what you need when you encounter an import error.)
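
As a quick sanity check before installing anything else, the sketch below looks for the command-line tools listed above and for an H.264 encoder in FFmpeg. It is not part of the repository: the function names are hypothetical, the encoder names `libx264` and `h264_nvenc` are assumptions about how your FFmpeg was built, and it does not verify version numbers.

```python
import shutil
import subprocess

# Command-line tools named in the requirements list above.
REQUIRED_TOOLS = ("gcc", "nvcc", "ffmpeg")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_h264_encoder(encoders_listing):
    """Return True if an `ffmpeg -encoders` listing mentions an H.264 encoder.

    Assumption: libx264 (software) or h264_nvenc (NVIDIA) is the encoder in use.
    """
    return any(name in encoders_listing for name in ("libx264", "h264_nvenc"))


def ffmpeg_supports_h264():
    """Ask the installed ffmpeg for its encoder list and look for H.264."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True, text=True, check=False,
    )
    return has_h264_encoder(result.stdout)


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
    print("FFmpeg H.264 encoder found:", ffmpeg_supports_h264())
```

Run it with any Python 3.7+ interpreter; PyTorch itself is easiest to check with `python -c "import torch; print(torch.__version__)"`.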

Major implementation differences from the original paper

The geometry and texture parameters of the 3DMM are now initialized from zero and shared among all samples during fitting, since this is more reasonable.
OpenCV rather than PIL is used for image editing operations.
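
To illustrate what "initialized from zero and shared among all samples" means, here is a minimal data-structure sketch. It is not the repository's code: the class name, the coefficient sizes, and the choice to keep expression coefficients per frame are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed coefficient sizes (hypothetical; common in 3DMM setups,
# not confirmed by this README).
N_GEOMETRY, N_TEXTURE, N_EXPRESSION = 80, 80, 64


@dataclass
class FittingParameters:
    """Holds shared and per-frame 3DMM coefficients for one video."""

    num_frames: int
    geometry: list = field(init=False)
    texture: list = field(init=False)
    expression: list = field(init=False)

    def __post_init__(self):
        # One geometry vector and one texture vector for the whole video,
        # both starting at zero (i.e. the mean face of the model).
        self.geometry = [0.0] * N_GEOMETRY
        self.texture = [0.0] * N_TEXTURE
        # Assumption: expression varies over time, so it is kept per frame.
        self.expression = [[0.0] * N_EXPRESSION for _ in range(self.num_frames)]

    def for_frame(self, index):
        """Return (geometry, texture, expression) used to fit frame `index`."""
        return self.geometry, self.texture, self.expression[index]
```

Every frame receives the same geometry and texture objects, so an update made while fitting one frame is seen by all of them; only the expression vector differs from frame to frame.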

Usage

1. Download face model data

Download Basel Face Model 2009. (Register and get 01_MorphableModel.mat.)
Download expression basis from 3DFace. (There is an Exp_Pca.bin in CoarseData.)
Download auxiliary files from Deep3DFaceReconstruction.

Put the data in renderer/data like the structure below.

renderer/data
├── 01_MorphableModel.mat
├── Exp_Pca.bin
├── BFM_front_idx.mat
├── BFM_exp_idx.mat
├── facemodel_info.mat
├── select_vertex_id.mat
├── std_exp.txt
└── data.mat (generated by step 2 below)
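
Before moving on to step 2, it can help to confirm that every downloaded file is where the layout above expects it. The following is a small sketch, not a script shipped with the repository; the function name is hypothetical, and `data.mat` is deliberately left out because it is only produced by the build step.

```python
from pathlib import Path

# Files from the layout above that must be downloaded by hand.
DOWNLOADED_FILES = (
    "01_MorphableModel.mat",
    "Exp_Pca.bin",
    "BFM_front_idx.mat",
    "BFM_exp_idx.mat",
    "facemodel_info.mat",
    "select_vertex_id.mat",
    "std_exp.txt",
)


def missing_face_model_files(data_dir="renderer/data"):
    """Return the names of expected files that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in DOWNLOADED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_face_model_files()
    if missing:
        print("Missing from renderer/data:", ", ".join(missing))
    else:
        print("All face model files are in place.")
```

Run it from the repository root so that the relative path `renderer/data` resolves correctly.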

2. Build data

cd renderer/
python build_data.py

3. Download pretrained model of ATnet

The link is here.
Put atnet_lstm_18.pth in vendor/ATVGnet/model.

4. Download pretrained ResNet on VGGFace2

The link is here.
Put resnet50_ft_weight.pkl in weights.

5. Download Trump speech video

The link is here. (Video courtesy of The White House.)
Put it in data/video.

6. Compile CUDA rasterizer kernel

cd renderer/kernels
python setup.py build_ext --inplace
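
Because `build_ext --inplace` writes the compiled shared library next to the sources, one way to confirm the build succeeded is to look for it in `renderer/kernels`. This is a sketch under that assumption; the function name is hypothetical and the actual name of the extension file is not stated in this README.

```python
from pathlib import Path


def find_built_extensions(kernel_dir="renderer/kernels"):
    """Return the names of compiled extension files found in `kernel_dir`."""
    kernel_dir = Path(kernel_dir)
    if not kernel_dir.is_dir():
        return []
    # Compiled extensions end in .so on Linux and .pyd on Windows.
    return sorted(
        path.name for path in kernel_dir.iterdir() if path.suffix in {".so", ".pyd"}
    )


if __name__ == "__main__":
    built = find_built_extensions()
    if built:
        print("Found compiled extension(s):", ", ".join(built))
    else:
        print("No compiled extension found; check the build output for errors.")
```

Run it from the repository root. An empty result usually means the compiler or NVCC step failed, so the build log is the next place to look.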

7. Run demo script

# Explanation of every step is provided.
./scripts/demo.sh

Since we provide both training and inference code, we are not uploading a pretrained model at present, for brevity. The expected result is provided in data/sample_result.mp4, generated using the synthesized audio in data/test_audio.

Acknowledgment

This work is built upon a great deal of open-source code and data.

Many implementation details are learned from Deep3DFaceReconstruction.
ATVGnet in the vendor directory is directly borrowed from ATVGnet under MIT License.
neural-face-renderer in the vendor directory is heavily borrowed from CycleGAN and pix2pix in PyTorch under BSD License.
The pre-trained ResNet model on VGGFace2 dataset is from VGGFace2-pytorch under MIT License.
Basel2009 3D face dataset is from here.
The expression basis of 3DMM is from 3DFace under GPL License.
Our renderer is heavily borrowed from tf_mesh_renderer and inspired by pytorch_mesh_renderer.

Notification

Our method is built upon Deep Video Portraits.
Our method adopts a person-specific Audio2Expression module, which is less robust than a universal one trained on a large dataset such as Lip Reading Sentences in the Wild. A universal one is encouraged! Fortunately, our method works quite well on WaveNet-synthesized audio such as that provided in data/test_audio.
The code has NOT been fully tested on another clean machine.
There is a known bug in the rasterizer: in some corner cases, several pixels of the rendered face are black (not assigned any color) due to floating-point error, which I have not been able to fix.

Disclaimer

We made this code publicly available to benefit the graphics and vision community. Please DO NOT abuse the code for malicious purposes.

Citation

@article{wen2020audiodvp,
    author={Xin Wen and Miao Wang and Christian Richardt and Ze-Yin Chen and Shi-Min Hu},
    journal={IEEE Transactions on Visualization and Computer Graphics}, 
    title={Photorealistic Audio-driven Video Portraits}, 
    year={2020},
    volume={26},
    number={12},
    pages={3457-3466},
    doi={10.1109/TVCG.2020.3023573}
}

License

BSD
