KuaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%)

Overview

KuaiRec is a real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou. To date, it is the first dataset that contains a fully observed user-item interaction matrix. By "fully observed", we mean that there are almost no missing values in the user-item matrix, i.e., each user has viewed each video and then left feedback.

The following figure illustrates the user-item matrices in traditional datasets and KuaiRec.

[Figure: user-item matrices in traditional datasets and in KuaiRec]

With all users' preferences known, KuaiRec can be used in offline evaluation (i.e., offline A/B tests) for recommendation models. It can benefit many research directions, such as unbiased recommendation, interactive/conversational recommendation, or reinforcement learning (RL) and off-policy evaluation (OPE) for recommendation.

If you use it in your work, please cite our paper:

@article{gao2022kuairec,
  title={KuaiRec: A Fully-observed Dataset for Recommender Systems}, 
  author={Chongming Gao and Shijun Li and Wenqiang Lei and Biao Li and Peng Jiang and Jiawei Chen and Xiangnan He and Jiaxin Mao and Tat-Seng Chua},
  journal={arXiv preprint arXiv:2202.10842},
  year={2022}
}

This repository contains the example code for evaluating conversational recommendation as described in the paper.

We provide some simple statistics of this dataset, generated by Statistics_KuaiRec.ipynb. You can also run the notebook online on Google Colab.


News

2022.05.16: We updated the dataset to version 2.0 with the following changes:

  • We removed the unused video ID=1225 from all tables that have the field video_id and reindexed the remaining videos, i.e., ID = ID - 1 if ID > 1225.
  • We added two tables to enhance the side information for videos and users, respectively. See 4. item_daily_feat.csv and 5. user_feat.csv under the data description section for details.
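
The reindexing rule above can be sketched in a few lines of Python. This is only an illustration for readers who need to map IDs from the previous release onto version 2.0; the names `REMOVED_VIDEO_ID` and `reindex_video_id` are hypothetical and do not come from this repository.

```python
REMOVED_VIDEO_ID = 1225  # the unused video ID dropped in version 2.0


def reindex_video_id(old_id: int) -> int:
    """Map a pre-2.0 video ID to its version 2.0 ID (hypothetical helper)."""
    if old_id == REMOVED_VIDEO_ID:
        # This ID no longer exists, so there is nothing to map it to.
        raise ValueError("video 1225 was removed in version 2.0")
    # IDs above the removed one shift down by one; the rest are unchanged.
    return old_id - 1 if old_id > REMOVED_VIDEO_ID else old_id


print(reindex_video_id(1224), reindex_video_id(1226))
```

IDs stored alongside older experiment results would need the same mapping before being joined with the version 2.0 tables.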

Download the data

We provide several options to download this dataset:

Option 1. Download via the "wget" command.

 wget https://chongming.myds.me:61364/data/KuaiRec.zip --no-check-certificate
 unzip KuaiRec.zip

Option 2. Download manually through the following links:

The script loaddata.py provides a simple way to load the data via Pandas in Python.


Data Descriptions

KuaiRec contains millions of user-item interactions as well as side information, including the item categories and a social network. Four files are included in the downloaded data:

KuaiRec
├── data
│   ├── big_matrix.csv          
│   ├── small_matrix.csv
│   ├── social_network.csv
│   └── item_categories.csv

The statistics of the small matrix and big matrix in KuaiRec.

| Matrix | #Users | #Items | #Interactions | Density |
|---|---|---|---|---|
| small matrix | 1,411 | 3,327 | 4,676,570 | 99.6% |
| big matrix | 7,176 | 10,728 | 12,530,806 | 16.3% |

Note that the density of the small matrix is 99.6% instead of 100% because some users have explicitly indicated that they are not willing to receive recommendations from certain authors, i.e., they blocked these videos.
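
As a quick sanity check, the density figures can be recomputed from the counts in the table. The sketch below assumes density is defined as the number of interactions divided by the number of user-item pairs; the function name `density` is only illustrative.

```python
def density(n_users: int, n_items: int, n_interactions: int) -> float:
    """Fraction of user-item pairs that have an observed interaction."""
    return n_interactions / (n_users * n_items)


# Counts copied from the statistics table above.
small = density(1411, 3327, 4676570)
big = density(7176, 10728, 12530806)

print(f"small matrix: {small:.1%}")  # 99.6%
print(f"big matrix:   {big:.1%}")    # 16.3%
```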

1. Descriptions of the fields in big_matrix.csv and small_matrix.csv.

| Field Name | Description | Type | Example |
|---|---|---|---|
| user_id | The ID of the user. | int64 | 0 |
| video_id | The ID of the viewed video. | int64 | 3650 |
| play_duration | Viewing time of the video in this interaction (in milliseconds). | int64 | 13838 |
| video_duration | Duration of this video (in milliseconds). | int64 | 10867 |
| time | Human-readable date and time of this interaction. | str | "2020-07-05 00:08:23.438" |
| date | Date of this interaction. | int64 | 20200705 |
| timestamp | Unix timestamp. | float64 | 1593878903.438 |
| watch_ratio | The video watching ratio (=play_duration/video_duration). | float64 | 1.273397 |
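
Several of these fields are derivable from one another. The sketch below recomputes `date`, `time`, and `watch_ratio` from the raw values of the example row. It assumes the human-readable `time` is expressed in UTC+8 (Beijing time); this is inferred from the example row, not stated in the field descriptions. `BEIJING` and `derive_fields` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "time" and "date" fields use UTC+8 (inferred from the example row).
BEIJING = timezone(timedelta(hours=8))


def derive_fields(play_duration: int, video_duration: int, timestamp: float) -> dict:
    """Recompute the derived fields of one interaction row."""
    local = datetime.fromtimestamp(timestamp, tz=BEIJING)
    return {
        # Millisecond precision, matching the format of the example value.
        "time": local.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3],
        "date": int(local.strftime("%Y%m%d")),
        "watch_ratio": play_duration / video_duration,
    }


# Values taken from the Example column above.
derived = derive_fields(13838, 10867, 1593878903.438)
print(derived)
```

For the example row, this gives the date 20200705 and a watch ratio of about 1.2734, in line with the table.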

The "watch_ratio" can be deemed as the label of the interaction. Note: there is no "like" signal for this dataset. If you need this binary signal in your scenarios, you can create it yourself. E.g., like = 1 if watch_ratio > 2.0.
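
A minimal sketch of such a derived label follows. The threshold of 2.0 is the one suggested above; the function name `like_label` is hypothetical, and other thresholds may suit other scenarios better.

```python
def like_label(watch_ratio: float, threshold: float = 2.0) -> int:
    """Binary 'like' signal derived from watch_ratio (1 = like, 0 = no like)."""
    return 1 if watch_ratio > threshold else 0


# The example interaction has watch_ratio 1.273397, so it is not counted as a like.
print(like_label(1.273397), like_label(2.5))
```

Note that the comparison is strict, so an interaction with a watch ratio of exactly 2.0 is labeled 0.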

2. Descriptions of the fields in social_network.csv

| Field Name | Description | Type | Example |
|---|---|---|---|
| user_id | The ID of the user. | int64 | 5352 |
| friend_list | The list of IDs of the friends of this user. | list | [4202,7126] |

3. Descriptions of the fields in item_categories.csv.

| Field Name | Description | Type | Example |
|---|---|---|---|
| video_id | The ID of the video. | int64 | 1 |
| feat | The list of tags of this video. | list | [27,9] |
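
The list-typed fields (friend_list and feat) need to be parsed after the CSV is read. The sketch below assumes each list is stored in the CSV as a quoted string in bracket form, like the example values; the in-memory sample and the helper `read_list_column` are hypothetical stand-ins, not part of the provided loaddata.py.

```python
import csv
import io
import json

# Hypothetical in-memory stand-in for social_network.csv, built from the example row.
sample = io.StringIO('user_id,friend_list\n5352,"[4202,7126]"\n')


def read_list_column(handle, key_col: str, list_col: str) -> dict:
    """Return {key: parsed list} for a CSV whose list column is a bracketed string."""
    reader = csv.DictReader(handle)
    # "[4202,7126]" is valid JSON, so json.loads turns it into a list of ints.
    return {int(row[key_col]): json.loads(row[list_col]) for row in reader}


friends = read_list_column(sample, "user_id", "friend_list")
print(friends)
```

The same helper would apply to item_categories.csv with `key_col="video_id"` and `list_col="feat"`, under the same storage assumption.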

4. Descriptions of the fields in item_daily_feat.csv. (Added on 2022.05.16)

| Field Name | Description | Type | Example |
|---|---|---|---|
| video_id | The ID of the video. | int64 | 3784 |
| date | Date of the statistics of this video. | int64 | 20200730 |
| author_id | The ID of the author of this video. | int64 | 441 |
| video_type | Type of this video (NORMAL or AD). | str | "NORMAL" |
| upload_dt | Upload date of this video. | str | "2020-07-08" |
| upload_type | The upload type of this video. | str | "ShortImport" |
| visible_status | The visible state of this video on the APP now. | str | "public" |
| video_duration | The time duration of this video (in milliseconds). | float64 | 17200.0 |
| video_width | The width of this video on the server. | int64 | 720 |
| video_height | The height of this video on the server. | int64 | 1280 |
| music_id | Background music ID of this video. | int64 | 989206467 |
| video_tag_id | The ID of the tag of this video. | int64 | 2522 |
| video_tag_name | The name of the tag of this video. | str | "祝福" |
| show_cnt | The number of shows of this video within this day (the same time window applies to all following fields). | int64 | 7716 |
| show_user_num | The number of users who received the recommendation of this video. | int64 | 5256 |
| play_cnt | The number of plays. | int64 | 7701 |
| play_user_num | The number of users who play this video. | int64 | 5034 |
| play_duration | The total time duration of playing this video (in milliseconds). | int64 | 138333346 |
| complete_play_cnt | The number of complete plays. complete play: finishing playing the whole video, i.e., #(play_duration >= video_duration). | int64 | 3446 |
| complete_play_user_num | The number of users who perform the complete play. | int64 | 2033 |
| valid_play_cnt | The number of valid plays. valid play: play_duration >= video_duration if video_duration <= 7s, or play_duration > 7s if video_duration > 7s. | int64 | 5099 |
| valid_play_user_num | The number of users who perform the valid play. | int64 | 3195 |
| long_time_play_cnt | The number of long time plays. long time play: play_duration >= video_duration if video_duration <= 18s, or play_duration >= 18s if video_duration > 18s. | int64 | 3299 |
| long_time_play_user_num | The number of users who perform the long time play. | int64 | 1940 |
| short_time_play_cnt | The number of short time plays. short time play: play_duration < min(3s, video_duration). | int64 | 1538 |
| short_time_play_user_num | The number of users who perform the short time play. | int64 | 1190 |
| play_progress | The average video playing ratio (=play_duration/video_duration). | float64 | 0.579695 |
| comment_stay_duration | Total time of staying in the comments section. | int64 | 467865 |
| like_cnt | Total likes. | int64 | 659 |
| like_user_num | The number of users who hit the "like" button. | int64 | 657 |
| click_like_cnt | The number of likes resulting from double clicks. | int64 | 496 |
| double_click_cnt | The number of users who double click the video. | int64 | 163 |
| cancel_like_cnt | The number of likes that are cancelled by users. | int64 | 15 |
| cancel_like_user_num | The number of users who cancel their like. | int64 | 15 |
| comment_cnt | The number of comments within this day. | int64 | 13 |
| comment_user_num | The number of users who comment on this video. | int64 | 12 |
| direct_comment_cnt | The number of direct comments (depth=1). | int64 | 13 |
| reply_comment_cnt | The number of reply comments (depth>1). | int64 | 0 |
| delete_comment_cnt | The number of deleted comments. | int64 | 0 |
| delete_comment_user_num | The number of users who delete their comments. | int64 | 0 |
| comment_like_cnt | The number of comment likes. | int64 | 2 |
| comment_like_user_num | The number of users who like the comments. | int64 | 2 |
| follow_cnt | The number of increased follows from this video. | int64 | 151 |
| follow_user_num | The number of users who follow the author of this video due to this video. | int64 | 151 |
| cancel_follow_cnt | The number of decreased follows from this video. | int64 | 0 |
| cancel_follow_user_num | The number of users who cancel their following of the author of this video due to this video. | int64 | 0 |
| share_cnt | The number of times this video is shared. | int64 | 1 |
| share_user_num | The number of users who share this video. | int64 | 1 |
| download_cnt | The number of times this video is downloaded. | int64 | 2 |
| download_user_num | The number of users who download this video. | int64 | 2 |
| report_cnt | The number of times this video is reported. | int64 | 0 |
| report_user_num | The number of users who report this video. | int64 | 0 |
| reduce_similar_cnt | The number of times users choose to reduce content similar to this video. | int64 | 2 |
| reduce_similar_user_num | The number of users who choose to reduce content similar to this video. | int64 | 2 |
| collect_cnt | The number of times this video is added to favorite videos. | int64 | 0 |
| collect_user_num | The number of users who add this video to their favorite videos. | int64 | 0 |
| cancel_collect_cnt | The number of times this video is removed from favorite videos. | int64 | 0 |
| cancel_collect_user_num | The number of users who remove this video from their favorite videos. | int64 | 0 |
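
The play categories defined in the table (complete, valid, long time, and short time play) can be expressed as simple predicates on a single interaction. The sketch below is an interpretation of those definitions, assuming both durations are given in milliseconds and converting the 3s, 7s, and 18s thresholds accordingly; `classify_play` is a hypothetical helper, not code used to produce the dataset.

```python
def classify_play(play_ms: float, video_ms: float) -> dict:
    """Apply the play definitions from the table to one play (durations in ms)."""
    return {
        # Complete play: the whole video was played.
        "complete": play_ms >= video_ms,
        # Valid play: full play for videos up to 7s, otherwise more than 7s played.
        "valid": play_ms >= video_ms if video_ms <= 7000 else play_ms > 7000,
        # Long time play: full play for videos up to 18s, otherwise at least 18s played.
        "long_time": play_ms >= video_ms if video_ms <= 18000 else play_ms >= 18000,
        # Short time play: less than min(3s, video duration) played.
        "short_time": play_ms < min(3000, video_ms),
    }


# A 13.8s play of a 10.9s video, and a 2s play of a 17.2s video.
print(classify_play(13838, 10867))
print(classify_play(2000, 17200))
```

Counting the interactions of one video and one day that satisfy each predicate would correspond to the `*_cnt` fields, under the assumptions above.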

5. Descriptions of the fields in user_feat.csv (Added on 2022.05.16)

| Field Name | Description | Type | Example |
|---|---|---|---|
| user_id | The ID of the user. | int64 | 0 |
| user_active_degree | In the set of {'high_active', 'full_active', 'middle_active', 'UNKNOWN'}. | str | "high_active" |
| is_lowactive_period | Is this user in its low active period? | int64 | 0 |
| is_live_streamer | Is this user a live streamer? | int64 | 0 |
| is_video_author | Has this user uploaded any video? | int64 | 0 |
| follow_user_num | The number of users that this user follows. | int64 | 5 |
| follow_user_num_range | The range of the number of users that this user follows. In the set of {'0', '(0,10]', '(10,50]', '(100,150]', '(150,250]', '(250,500]', '(50,100]', '500+'}. | str | "(0,10]" |
| fans_user_num | The number of the fans of this user. | int64 | 0 |
| fans_user_num_range | The range of the number of fans of this user. In the set of {'0', '[1,10)', '[10,100)', '[100,1k)', '[1k,5k)', '[5k,1w)', '[1w,10w)'}. | str | "0" |
| friend_user_num | The number of friends that this user has. | int64 | 0 |
| friend_user_num_range | The range of the number of friends that this user has. In the set of {'0', '[1,5)', '[5,30)', '[30,60)', '[60,120)', '[120,250)', '250+'}. | str | "0" |
| register_days | The number of days since this user registered. | int64 | 107 |
| register_days_range | The range of the registered days. In the set of {'15-30', '31-60', '61-90', '91-180', '181-365', '366-730', '730+'}. | str | "61-90" |
| onehot_feat0 | An encrypted feature of the user. Each value indicates the position of "1" in the one-hot vector. Range: {0, 1} | int64 | 0 |
| onehot_feat1 | An encrypted feature. Range: {0, 1, ..., 7} | int64 | 1 |
| onehot_feat2 | An encrypted feature. Range: {0, 1, ..., 29} | int64 | 17 |
| onehot_feat3 | An encrypted feature. Range: {0, 1, ..., 1075} | int64 | 638 |
| onehot_feat4 | An encrypted feature. Range: {0, 1, ..., 11} | int64 | 2 |
| onehot_feat5 | An encrypted feature. Range: {0, 1, ..., 9} | int64 | 0 |
| onehot_feat6 | An encrypted feature. Range: {0, 1, 2} | int64 | 1 |
| onehot_feat7 | An encrypted feature. Range: {0, 1, ..., 46} | int64 | 6 |
| onehot_feat8 | An encrypted feature. Range: {0, 1, ..., 339} | int64 | 184 |
| onehot_feat9 | An encrypted feature. Range: {0, 1, ..., 6} | int64 | 6 |
| onehot_feat10 | An encrypted feature. Range: {0, 1, ..., 4} | int64 | 3 |
| onehot_feat11 | An encrypted feature. Range: {0, 1, ..., 2} | int64 | 0 |
| onehot_feat12 | An encrypted feature. Range: {0, 1} | int64 | 0 |
| onehot_feat13 | An encrypted feature. Range: {0, 1} | int64 | 0 |
| onehot_feat14 | An encrypted feature. Range: {0, 1} | int64 | 0 |
| onehot_feat15 | An encrypted feature. Range: {0, 1} | int64 | 0 |
| onehot_feat16 | An encrypted feature. Range: {0, 1} | int64 | 0 |
| onehot_feat17 | An encrypted feature. Range: {0, 1} | int64 | 0 |
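
Since each onehot_feat value stores the position of the "1", the full one-hot vector can be rebuilt when a model expects it. The sketch below is one way to do so; `ONEHOT_SIZES` lists only a few features as an illustration, with sizes read off the Range column (largest value plus one), and both names are hypothetical.

```python
# Vector lengths for a few features, taken from the Range column (max value + 1).
ONEHOT_SIZES = {"onehot_feat0": 2, "onehot_feat1": 8, "onehot_feat6": 3}


def to_one_hot(index: int, size: int) -> list:
    """Expand a stored position into a one-hot vector of the given length."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside the range 0..{size - 1}")
    vector = [0] * size
    vector[index] = 1
    return vector


# The example user has onehot_feat1 = 1 and onehot_feat6 = 1.
print(to_one_hot(1, ONEHOT_SIZES["onehot_feat1"]))
print(to_one_hot(1, ONEHOT_SIZES["onehot_feat6"]))
```

For high-cardinality features such as onehot_feat3, feeding the index to an embedding layer is usually more practical than materializing the vector.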

Owner

Chongming GAO (高崇铭)
A Ph.D. student at Lab for Data Science, USTC. Research Interests: Recommender Systems.