Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Last update: Dec 31, 2022

Overview

This repository contains the implementation of the following paper:

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

Yuanxun Lu, Jinxiang Chai, Xun Cao (SIGGRAPH Asia 2021)

Abstract: To the best of our knowledge, we first present a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.

[Project Page] [Paper] [Arxiv]

Figure 1. Given an arbitrary input audio stream, our system generates personalized and photorealistic talking-head animation in real-time. Right: May and Obama are driven by the same utterance but present different speaking characteristics.

Requirements

This project was trained and tested on Windows 10 with PyTorch 1.7 (Python 3.6). Linux and earlier PyTorch versions should also work but have not been tested. We recommend creating a new environment:

conda create -n LSP python=3.6
conda activate LSP

Clone the repository:

git clone https://github.com/YuanxunLu/LiveSpeechPortraits.git
cd LiveSpeechPortraits

FFmpeg is required to combine the audio with the silent generated videos. See the FFmpeg website for installation instructions. On Linux, you can also install it with:

sudo apt-get install ffmpeg
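
To confirm that FFmpeg is reachable before running the demo, a small check like the one below may help. It is only a sketch: the file names are hypothetical placeholders, and the command it builds illustrates one common way to add an audio track to a silent video, not necessarily the exact invocation used inside this repository.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def build_mux_command(silent_video, audio, output):
    """Build an ffmpeg command that keeps the video stream and adds the audio.

    All three paths are hypothetical placeholders supplied by the caller.
    """
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it exists
        "-i", silent_video,  # input 0: video without sound
        "-i", audio,         # input 1: driving audio
        "-map", "0:v:0",     # take the video stream from input 0
        "-map", "1:a:0",     # take the audio stream from input 1
        "-c:v", "copy",      # assumption: the video codec fits the output container
        "-c:a", "aac",       # assumption: an MP4-style output container
        "-shortest",         # stop at the shorter of the two streams
        output,
    ]


if __name__ == "__main__":
    print("ffmpeg found:", ffmpeg_available())
    print(" ".join(build_mux_command("silent.avi", "speech.wav", "out.mp4")))
```

If the first line prints False, install FFmpeg or add it to PATH before running the demo.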

Install the dependencies:

pip install -r requirements.txt
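
After installation, a quick sanity check of the environment can catch a wrong interpreter or a missing PyTorch install early. The helper below is a hypothetical sketch and not part of the repository. It only reports the running Python version, compares it with the 3.6 mentioned above, and checks whether a `torch` package can be found.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 6)):
    """Report the running Python version and whether PyTorch is importable."""
    running = tuple(sys.version_info[:2])
    return {
        "python": "{}.{}".format(*running),
        # Other versions may still work; 3.6 is simply the tested one.
        "python_matches": running == tuple(expected_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, "=", value)
```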

Demo

Download the pre-trained models and data from Google Drive to the data folder. Data for five subjects are released (May, Obama1, Obama2, Nadella, and McStay).

Run the demo:

python demo.py --id May --driving_audio ./data/input/00083.wav

Results can be found under the results folder.
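
To try the same audio clip on every released subject, the demo command can be generated in a loop. This is a sketch under two assumptions: that each subject name listed above is accepted as an `--id` value exactly as written (only `May` is shown in the original command), and that the script is run from the repository root. The helper names are hypothetical.

```python
import sys

# Subject names as listed in the Demo section.
SUBJECTS = ["May", "Obama1", "Obama2", "Nadella", "McStay"]


def demo_command(subject, audio_path):
    """Build the demo invocation for one subject using the two flags shown above."""
    return [sys.executable, "demo.py", "--id", subject, "--driving_audio", audio_path]


def all_demo_commands(audio_path, subjects=SUBJECTS):
    """Build one demo command per subject for the same driving audio."""
    return [demo_command(subject, audio_path) for subject in subjects]


if __name__ == "__main__":
    # Only prints the commands; pass each list to subprocess.run(cmd, check=True)
    # to execute it once the models and data are in place.
    for cmd in all_demo_commands("./data/input/00083.wav"):
        print(" ".join(cmd))
```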

Citation

If you find this project useful for your research, please consider citing:

@inproceedings{LiveSpeechPortraits_SIGGRAPH_ASIA_2021,
 author = {Lu, Yuanxun and Chai, Jinxiang and Cao, Xun},
 title = {{Live Speech Portraits}: Real-Time Photorealistic Talking-Head Animation},
 journal = {ACM Transactions on Graphics},
 numpages = {17},
 volume={40},
 number={6},
 month = December,
 year = {2021},
 doi={10.1145/3478513.3480484}
}

Acknowledgment

This repo is built on the pix2pix-pytorch framework.
Thanks to the authors of MakeItTalk, ATVG, RhythmicHead, and Speech-Driven Animation for making their excellent work and code publicly available.
