TalkNet: Audio-visual active speaker detection Model

Overview

Is someone talking? TalkNet: Audio-visual active speaker detection Model

This repository contains the code for our ACM MM 2021 paper, TalkNet, an active speaker detection model that determines whether the face on screen is speaking. [Paper] [Video_English] [Video_Chinese]


  • Awesome ASD: Papers about active speaker detection from recent years.

  • TalkNet in AVA-Activespeaker dataset: The code to preprocess the AVA-ActiveSpeaker dataset, train TalkNet on the AVA train set, and evaluate it on the AVA val/test set.

  • TalkNet in TalkSet and Columbia ASD dataset: The code to generate TalkSet, an ASD dataset in the wild based on VoxCeleb2 and LRS3, train TalkNet on TalkSet, and evaluate it on the Columbia ASD dataset.

  • An ASD Demo with pretrained TalkNet model: An end-to-end script to detect and mark the speaking face with the pretrained TalkNet model.


Dependencies

Start by building the environment

conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt

Or start from an existing environment

pip install -r requirement.txt

TalkNet in AVA-Activespeaker dataset

Data preparation

The following script can be used to download and prepare the AVA dataset for training.

python trainTalkNet.py --dataPathAVA AVADataPath --download 

AVADataPath is the folder where you want to save the AVA dataset and its preprocessing outputs. The details can be found here; please read them carefully.

Training

You can then train TalkNet on AVA end-to-end with:

python trainTalkNet.py --dataPathAVA AVADataPath

Training outputs: exps/exp1/score.txt is the output score file, exps/exp1/model/model_00xx.model is the trained model, and exps/exp1/val_res.csv holds the predictions for the val set.

Pretrained model

Our pretrained model achieves an mAP of 92.3 on the validation set. You can verify this with:

python trainTalkNet.py --dataPathAVA AVADataPath --evaluation

The pretrained model will be downloaded automatically into TalkNet_ASD/pretrain_AVA.model. It achieves an mAP of 90.8 on the test set.
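For readers unfamiliar with the metric, here is a minimal sketch of how average precision can be computed for a binary speaking/not-speaking task. It is an illustration only: the function name `average_precision` is hypothetical, it is not the evaluation code used by this repository, and the official AVA evaluation tool may differ in details such as interpolation and tie handling.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision for binary labels (1 = speaking).

    Hypothetical helper for illustration; not part of the TalkNet code base.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("at least one positive label is required")

    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Accumulate the precision at the rank of each positive sample.
            precision_sum += true_positives / rank
    return precision_sum / num_positive


if __name__ == "__main__":
    # Toy example: positives ranked 1st and 3rd out of four predictions.
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the toy example the two positives sit at ranks 1 and 3, so the result is the mean of 1/1 and 2/3.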


TalkNet in TalkSet and Columbia ASD dataset

Data preparation

We found that it is challenging to apply the model trained on AVA to videos outside AVA (the reason is given here, Q1). We therefore built TalkSet, an active speaker detection dataset in the wild, based on VoxCeleb2 and LRS3.

We do not plan to upload this dataset, since we only modified existing data rather than building it from scratch. The TalkSet folder provides .txt files that describe which files we used to generate TalkSet and their ASD labels. You can generate TalkSet yourself if you are interested in training an ASD model in the wild.

We also provide our TalkNet model pretrained on TalkSet. You can evaluate it on the Columbia ASD dataset or on other raw videos in the wild.

Usage

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model when you run the following script:

python demoTalkNet.py --evalCol --colSavePath colDataPath

The Columbia ASD dataset and its labels will also be downloaded into colDataPath. You should then obtain the following F1 results:

Name Bell Boll Lieb Long Sick Avg.
F1 98.1 88.8 98.7 98.0 97.7 96.3

(These results differ from those in our paper because we retrained the model; the average F1 is very similar.)
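As a quick sanity check on the table, the sketch below recalls the standard F1 definition and recomputes the reported average from the per-speaker values. The helper `f1_score` and its count-based signature are assumptions made for illustration; the repository's Columbia evaluation may compute F1 differently.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall).

    Hypothetical helper for illustration; not part of the TalkNet code base.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Per-speaker F1 values copied from the table above.
reported_f1 = {"Bell": 98.1, "Boll": 88.8, "Lieb": 98.7, "Long": 98.0, "Sick": 97.7}

# The "Avg." column is the unweighted mean over the five speakers.
average_f1 = sum(reported_f1.values()) / len(reported_f1)
print(f"Average F1: {average_f1:.1f}")  # prints "Average F1: 96.3"
```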


An ASD Demo with pretrained TalkNet model

Data preparation

We built an end-to-end script that detects and extracts the active speaker from a raw video using our model pretrained on TalkSet.

Put the raw video (.mp4 and .avi are both fine) into the demo folder, for example 001.mp4.

Usage

python demoTalkNet.py --videoName 001

A model pretrained on TalkSet will be downloaded into TalkNet_ASD/pretrain_TalkSet.model. The structure of the output results is described here.

The output video is demo/001/pyavi/video_out.avi, in which the active speaker is marked with a green box and non-active speakers with a red box.
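If you have several videos in the demo folder, a small helper script can list the commands to run and where each result should appear. This is a sketch under stated assumptions: the helper names (`find_demo_videos`, `build_demo_command`, `expected_output`) are hypothetical, and the output location is inferred from the single 001 example above rather than confirmed for every input.

```python
from pathlib import Path
from typing import List

# Formats the README states are accepted for raw videos.
VIDEO_SUFFIXES = {".mp4", ".avi"}


def find_demo_videos(demo_dir: Path) -> List[str]:
    """Return the names (without extension) of raw videos found in demo_dir."""
    return sorted(
        path.stem
        for path in demo_dir.iterdir()
        if path.is_file() and path.suffix.lower() in VIDEO_SUFFIXES
    )


def build_demo_command(video_name: str) -> List[str]:
    """Build the demo command shown above for one video name."""
    return ["python", "demoTalkNet.py", "--videoName", video_name]


def expected_output(demo_dir: Path, video_name: str) -> Path:
    """Assumed output location, following the demo/001/pyavi/video_out.avi example."""
    return demo_dir / video_name / "pyavi" / "video_out.avi"


if __name__ == "__main__":
    demo_dir = Path("demo")
    if demo_dir.is_dir():
        for name in find_demo_videos(demo_dir):
            command = " ".join(build_demo_command(name))
            print(f"{command}  ->  {expected_output(demo_dir, name)}")
```

The script only prints the commands; run each one yourself, or pass the list returned by `build_demo_command` to `subprocess.run` if you want to automate the loop.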


Citation

Please cite the following if our paper or code is helpful to your research.

@article{tao2021TalkNet,
  title={Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
  author={Ruijie Tao and Zexu Pan and Rohan Kumar Das and Xinyuan Qian and Mike Zheng Shou and Haizhou Li},
  journal={ACM Multimedia (MM)},
  year={2021}
}

I have summarized some potential FAQs. This is my first open-source work, so please let me know how I can further improve this repository. Thanks for your support!
