[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

MMChat

This repo contains the code and data for the LREC 2022 paper MMChat: Multi-Modal Chat Dataset on Social Media.

Dataset

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details. The images in the dataset are hosted on Weibo's static image server. You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.
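
As a rough illustration of the download step, the sketch below saves a list of image URLs to a local folder. It is not the crawler shipped in data_processing/weibo_image_crawler: the function names `fetch_url` and `download_images`, the injectable `fetch` parameter, and the idea that each image is identified by a plain URL are all assumptions made for this example.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def fetch_url(url, timeout=10):
    """Return the raw bytes behind `url` (hypothetical helper, not from the repo)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(urls, out_dir, fetch=fetch_url):
    """Save each URL into `out_dir`, skipping files that already exist.

    Returns (saved_paths, failed_urls). The file name is taken from the
    last path component of the URL, which is an assumption about how the
    image links look.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        name = os.path.basename(urlsplit(url).path)
        if not name:
            # No usable file name in the URL: report it instead of guessing.
            failed.append(url)
            continue
        path = os.path.join(out_dir, name)
        if os.path.exists(path):
            # Already downloaded in an earlier run.
            saved.append(path)
            continue
        try:
            data = fetch(url)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses.
            failed.append(url)
            continue
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved, failed
```

Passing `fetch` as a parameter keeps the network call replaceable, so the same loop can be exercised offline or wrapped with rate limiting and retries when talking to a real image server.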

Two sample dialogues from MMChat are given below (translated from Chinese):

[Figure: A sample dialogue from MMChat]

MMChat is released in different versions:

Rule Filtered Raw MMChat

This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:

| Item Description | Count |
| --- | --- |
| Sessions | 4.257 M |
| Sessions with more than 4 utterances | 2.304 M |
| Utterances | 18.590 M |
| Images | 4.874 M |
| Avg. utterance per session | 4.367 |
| Avg. image per session | 1.670 |
| Avg. character per utterance | 14.104 |
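
The text-side statistics in the table can be reproduced with a few lines of Python. The sketch below assumes each session is available as a list of utterance strings; that in-memory layout and the function name `session_stats` are assumptions for illustration, not the released file format. It also counts every character of an utterance as it appears in the string, which may differ from how the paper counts characters.

```python
def session_stats(sessions):
    """Compute basic dialogue statistics.

    `sessions` is assumed to be a list of dialogues, each a non-empty
    list of utterance strings (hypothetical layout).
    """
    n_sessions = len(sessions)
    n_utterances = sum(len(session) for session in sessions)
    if n_sessions == 0 or n_utterances == 0:
        raise ValueError("need at least one session with one utterance")
    n_chars = sum(len(utt) for session in sessions for utt in session)
    return {
        "sessions": n_sessions,
        # "more than 4 utterances" is read as strictly greater than 4.
        "sessions_gt4": sum(1 for session in sessions if len(session) > 4),
        "utterances": n_utterances,
        "avg_utt_per_session": n_utterances / n_sessions,
        "avg_char_per_utt": n_chars / n_utterances,
    }


if __name__ == "__main__":
    toy = [["你好", "在吗"], ["a", "b", "c", "d", "e"]]
    print(session_stats(toy))
```

As a sanity check on the table itself, 18.590 M utterances divided by 4.257 M sessions gives about 4.367, which matches the reported average utterances per session.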

We divide the above dialogues into 9 splits to facilitate downloading:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

LCCC Filtered MMChat

This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset. Specifically, some dialogues in MMChat also appear in LCCC. We regard these dialogues as cleaner, since LCCC applies sophisticated schemes to filter out noise. This version of MMChat is obtained using the script data_processing/LCCC_filter.py. The following table shows some basic statistics:
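
The idea of keeping only the dialogues that also occur in a cleaner reference corpus can be sketched as a set-membership test. This is not the logic of data_processing/LCCC_filter.py, which may match dialogues differently; the exact-match criterion, the whitespace normalization, and the names `normalize`, `dialogue_key`, and `filter_by_reference` are assumptions made for this example.

```python
def normalize(utterance):
    """Drop all whitespace so differently tokenized copies compare equal (assumption)."""
    return "".join(utterance.split())


def dialogue_key(utterances):
    """Hashable key for a dialogue given as a list of utterance strings."""
    return tuple(normalize(u) for u in utterances)


def filter_by_reference(candidates, reference):
    """Keep the candidate dialogues whose text also appears in `reference`.

    Both arguments are assumed to be lists of dialogues, each a list of
    utterance strings. Matching is exact after whitespace removal.
    """
    reference_keys = {dialogue_key(d) for d in reference}
    return [d for d in candidates if dialogue_key(d) in reference_keys]


if __name__ == "__main__":
    raw = [["今天 天气 真好", "是啊"], ["求关注", "互粉"]]
    clean = [["今天天气真好", "是啊"]]
    print(filter_by_reference(raw, clean))
```

Building the key set once keeps the filter linear in the size of both corpora, which matters when the raw side holds millions of sessions.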

| Item Description | Count |
| --- | --- |
| Sessions | 492.6 K |
| Sessions with more than 4 utterances | 208.8 K |
| Utterances | 1.986 M |
| Images | 1.066 M |
| Avg. utterance per session | 4.031 |
| Avg. image per session | 2.514 |
| Avg. character per utterance | 11.336 |

We divide the above dialogues into 9 splits to facilitate downloading:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

MMChat

The MMChat dataset reported in our paper is given here. The Weibo content corresponding to these dialogues is always "分享图片" (i.e., "Share Images" in English). The following table shows some basic statistics:

| Item Description | Count |
| --- | --- |
| Sessions | 120.84 K |
| Sessions with more than 4 utterances | 17.32 K |
| Utterances | 314.13 K |
| Images | 198.82 K |
| Avg. utterance per session | 2.599 |
| Avg. image per session | 2.791 |
| Avg. character per utterance | 8.521 |

The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.

MMChat-hf

We perform human annotation on the sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table only shows the statistics for dialogues that are annotated as image-related.

| Item Description | Count |
| --- | --- |
| Sessions | 19.90 K |
| Sessions with more than 4 utterances | 8.91 K |
| Utterances | 81.06 K |
| Images | 52.66 K |
| Avg. utterance per session | 4.07 |
| Avg. image per session | 2.70 |
| Avg. character per utterance | 11.93 |

We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.

Code

We are also releasing all the code used for our experiments. You can use the script run_training.sh in each folder to launch distributed training.

For models that require image features, you can extract the image features using the scripts in data_processing/extract_image_features.

The model shown in our paper can be found in dialog_image.

Reference

Please cite our paper if you find our work useful ;)

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}