KuaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%)

KuaiRec is a real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou. To date, it is the first dataset that contains a fully observed user-item interaction matrix. By "fully observed", we mean that there are almost no missing values in the user-item matrix, i.e., each user has viewed each video and then left feedback.

The following figure illustrates the user-item matrices in traditional datasets and KuaiRec.

With all users' preferences known, KuaiRec can be used for offline evaluation (i.e., offline A/B testing) of recommendation models. It can benefit many research directions, such as unbiased recommendation, interactive/conversational recommendation, or reinforcement learning (RL) and off-policy evaluation (OPE) for recommendation.

If you use it in your work, please cite our paper:

@article{gao2022kuairec,
  title={KuaiRec: A Fully-observed Dataset for Recommender Systems}, 
  author={Chongming Gao and Shijun Li and Wenqiang Lei and Biao Li and Peng Jiang and Jiawei Chen and Xiangnan He and Jiaxin Mao and Tat-Seng Chua},
  journal={arXiv preprint arXiv:2202.10842},
  year={2022}
}

This repository contains the example code for evaluating conversational recommendation as described in the paper.

We provide some simple statistics of this dataset here. They are generated by Statistics_KuaiRec.ipynb, which you can also run online on Google Colab.

News!

2022.05.16: We updated the dataset to version 2.0. We made the following changes:

We removed the unused video ID=1225 from all tables that have the field video_id and reindexed the remaining videos, i.e., ID = ID - 1 if ID > 1225.
We added two tables to enrich the side information for videos and users, respectively. See 4. item_daily_feet.csv and 5. user_feat.csv under the data description section for details.
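If you still hold video IDs from the earlier version, the reindexing rule above can be sketched as follows. This is only an illustration of the stated rule; the function name `reindex_video_id` is hypothetical and not part of the repository.

```python
REMOVED_VIDEO_ID = 1225  # the unused video ID dropped in version 2.0


def reindex_video_id(old_id: int) -> int:
    """Map a pre-2.0 video ID to its version 2.0 ID (ID = ID - 1 if ID > 1225)."""
    if old_id == REMOVED_VIDEO_ID:
        # The removed ID has no counterpart in version 2.0.
        raise ValueError(f"video ID {old_id} was removed in version 2.0")
    return old_id - 1 if old_id > REMOVED_VIDEO_ID else old_id


print([reindex_video_id(i) for i in (1224, 1226, 3650)])  # [1224, 1225, 3649]
```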

Download the data

We provide several options to download this dataset:

Option 1. Download via the "wget" command.

 wget https://chongming.myds.me:61364/data/KuaiRec.zip --no-check-certificate
 unzip KuaiRec.zip

Option 2. Download manually through the following links:

Optional link 1: Google Drive
Optional link 2: USTC Drive (中科大)

The script loaddata.py provides a simple way to load the data via Pandas in Python.
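As a rough idea of what loading looks like, here is a minimal sketch with `pandas.read_csv`. It is not the content of loaddata.py. The helper name `load_matrix` is hypothetical, and the in-memory sample row only mirrors the example values from the field descriptions; the column order in the real files is an assumption.

```python
import io

import pandas as pd

# One sample row built from the documented example values (assumed column order).
SAMPLE = (
    "user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio\n"
    "0,3650,13838,10867,2020-07-05 00:08:23.438,20200705,1593878903.438,1.273397\n"
)


def load_matrix(path_or_buffer):
    """Read an interaction table (a CSV path or file-like object) into a DataFrame."""
    return pd.read_csv(path_or_buffer)


sample_df = load_matrix(io.StringIO(SAMPLE))
print(sample_df.dtypes)

# With the downloaded data, assuming it was unzipped into the current directory:
# small_matrix = load_matrix("KuaiRec/data/small_matrix.csv")
```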

Data Descriptions

KuaiRec contains millions of user-item interactions as well as side information, including the item categories and a social network. Four files are included in the downloaded data:

KuaiRec
├── data
│   ├── big_matrix.csv          
│   ├── small_matrix.csv
│   ├── social_network.csv
│   └── item_categories.csv

The statistics of the small matrix and big matrix in KuaiRec.

	#Users	#Items	#Interactions	Density
small matrix	1,411	3,327	4,676,570	99.6%
big matrix	7,176	10,728	12,530,806	16.3%

Note that the density of the small matrix is 99.6% instead of 100% because some users have explicitly indicated that they are not willing to receive recommendations from certain authors, i.e., they blocked these videos.
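The density values in the table follow from the other three columns, taking density as the number of interactions divided by (users × items). A short check, where the function name `density` is just for illustration:

```python
def density(n_interactions: int, n_users: int, n_items: int) -> float:
    """Fraction of user-item pairs that have an observed interaction."""
    return n_interactions / (n_users * n_items)


small = density(4_676_570, 1_411, 3_327)
big = density(12_530_806, 7_176, 10_728)

print(f"small matrix: {small:.1%}")  # small matrix: 99.6%
print(f"big matrix:   {big:.1%}")    # big matrix:   16.3%
```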

1. Descriptions of the fields in `big_matrix.csv` and `small_matrix.csv`.

Field Name:	Description	Type	Example
user_id	The ID of the user.	int64	0
video_id	The ID of the viewed video.	int64	3650
play_duration	Time the user spent watching the video in this interaction (in milliseconds).	int64	13838
video_duration	Duration of this video (in milliseconds).	int64	10867
time	Human-readable date for this interaction	str	"2020-07-05 00:08:23.438"
date	Date of this interaction	int64	20200705
timestamp	Unix timestamp	float64	1593878903.438
watch_ratio	The video watching ratio (=play_duration/video_duration)	float64	1.273397
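The `time`, `date`, and `timestamp` fields describe the same moment. The sketch below converts the example timestamp back to the other two fields. It assumes the human-readable time is expressed in UTC+8, which is consistent with the example row but is not stated in the table; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))  # assumption: times are in UTC+8


def timestamp_to_fields(ts: float):
    """Return (time string to the second, integer date) for a Unix timestamp."""
    dt = datetime.fromtimestamp(ts, tz=UTC8)
    return dt.strftime("%Y-%m-%d %H:%M:%S"), int(dt.strftime("%Y%m%d"))


time_str, date_int = timestamp_to_fields(1593878903.438)
print(time_str, date_int)  # 2020-07-05 00:08:23 20200705
```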

The "watch_ratio" can be deemed the label of the interaction. Note: there is no "like" signal in this dataset. If you need such a binary signal in your scenario, you can create it yourself, e.g., like = 1 if watch_ratio > 2.0.
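A minimal sketch of deriving such a label is shown below. The function names are hypothetical, and the threshold of 2.0 is only the example value mentioned above, not a recommended setting.

```python
def watch_ratio(play_duration_ms: int, video_duration_ms: int) -> float:
    """watch_ratio = play_duration / video_duration (both in milliseconds)."""
    return play_duration_ms / video_duration_ms


def like_label(ratio: float, threshold: float = 2.0) -> int:
    """Binary label: 1 if the watch ratio exceeds the threshold, else 0."""
    return int(ratio > threshold)


ratio = watch_ratio(13838, 10867)  # example row from the field table
print(round(ratio, 4), like_label(ratio))  # 1.2734 0
```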

2. Descriptions of the fields in `social_network.csv`

Field Name:	Description	Type	Example
user_id	The ID of the user.	int64	5352
friend_list	The list of ID of the friends of this user.	list	[4202,7126]
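When a CSV file is read, a list-typed column such as `friend_list` typically arrives as a string. The sketch below parses it with the standard library, assuming each cell is stored as a literal like `[4202,7126]`; the helper name `parse_id_list` is hypothetical.

```python
import ast


def parse_id_list(cell: str) -> list:
    """Parse a cell such as "[4202,7126]" into a list of integer IDs."""
    return [int(i) for i in ast.literal_eval(cell)]


print(parse_id_list("[4202,7126]"))  # [4202, 7126]
```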

3. Descriptions of the fields in `item_categories.csv`.

Field Name:	Description	Type	Example
video_id	The ID of the video.	int64	1
feat	The list of tags of this video.	list	[27,9]
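For models that expect fixed-length inputs, a tag list can be turned into a multi-hot vector. This is a sketch under assumptions: the total number of tags is not given here, so `n_tags` is a parameter you must set yourself, and the value 28 below is chosen only because it is the smallest size that fits tag 27.

```python
def multi_hot(tags, n_tags: int) -> list:
    """Return a 0/1 vector of length n_tags with a 1 at each tag index."""
    vec = [0] * n_tags
    for tag in tags:
        vec[tag] = 1
    return vec


vec = multi_hot([27, 9], n_tags=28)  # n_tags=28 is illustrative only
print(sum(vec), vec[9], vec[27])  # 2 1 1
```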

4. Descriptions of the fields in `item_daily_feet.csv`. (Added on 2022.05.16)

Field Name:	Description	Type	Example
video_id	The ID of the video.	int64	3784
date	Date of the statistics of this video.	int64	20200730
author_id	The ID of the author of this video.	int64	441
video_type	Type of this video (NORMAL or AD).	str	"NORMAL"
upload_dt	Upload date of this video.	str	"2020-07-08"
upload_type	The upload type of this video.	str	"ShortImport"
visible_status	The visible state of this video on the APP now.	str	"public"
video_duration	The time duration of this video (in milliseconds).	float64	17200.0
video_width	The width of this video on the server.	int64	720
video_height	The height of this video on the server.	int64	1280
music_id	Background music ID of this video.	int64	989206467
video_tag_id	The ID of tag of this video.	int64	2522
video_tag_name	The name of the tag of this video.	str	"祝福"
show_cnt	The number of shows of this video within this day (the same applies to all following fields).	int64	7716
show_user_num	The number of users who received the recommendation of this video.	int64	5256
play_cnt	The number of plays.	int64	7701
play_user_num	The number of users who played this video.	int64	5034
play_duration	The total time duration of playing this video (in milliseconds).	int64	138333346
complete_play_cnt	The number of complete plays. complete play: finishing playing the whole video, i.e., `#(play_duration >= video_duration)`.	int64	3446
complete_play_user_num	The number of users who perform the complete play.	int64	2033
valid_play_cnt	The number of valid plays. valid play: `play_duration >= video_duration if video_duration <= 7s`, or `play_duration > 7s if video_duration > 7s`.	int64	5099
valid_play_user_num	The number of users who perform the valid play.	int64	3195
long_time_play_cnt	The number of long time plays. long time play: `play_duration >= video_duration if video_duration <= 18s`, or `play_duration >= 18s if video_duration > 18s`.	int64	3299
long_time_play_user_num	The number of users who perform the long time play.	int64	1940
short_time_play_cnt	The number of short time plays. short time play: `play_duration < min(3s, video_duration)`.	int64	1538
short_time_play_user_num	The number of users who perform the short time play.	int64	1190
play_progress	The average video playing ratio (`=play_duration/video_duration`)	float64	0.579695
comment_stay_duration	Total time of staying in the comments section	int64	467865
like_cnt	Total likes	int64	659
like_user_num	The number of users who hit the "like" button.	int64	657
click_like_cnt	The number of likes resulting from double clicks.	int64	496
double_click_cnt	The number of users who double-click the video.	int64	163
cancel_like_cnt	The number of likes that are cancelled by users.	int64	15
cancel_like_user_num	The number of users who cancel their like.	int64	15
comment_cnt	The number of comments within this day.	int64	13
comment_user_num	The number of users who comment on this video.	int64	12
direct_comment_cnt	The number of direct comments (depth=1).	int64	13
reply_comment_cnt	The number of reply comments (depth>1).	int64	0
delete_comment_cnt	The number of deleted comments.	int64	0
delete_comment_user_num	The number of users who delete their comments.	int64	0
comment_like_cnt	The number of comment likes.	int64	2
comment_like_user_num	The number of users who like the comments.	int64	2
follow_cnt	The number of increased follows from this video.	int64	151
follow_user_num	The number of users who follow the author of this video due to this video.	int64	151
cancel_follow_cnt	The number of decreased follows from this video.	int64	0
cancel_follow_user_num	The number of users who cancel their following of the author of this video due to this video.	int64	0
share_cnt	The times of sharing this video.	int64	1
share_user_num	The number of users who share this video.	int64	1
download_cnt	The times of downloading this video.	int64	2
download_user_num	The number of users who download this video.	int64	2
report_cnt	The times of reporting this video.	int64	0
report_user_num	The number of users who report this video.	int64	0
reduce_similar_cnt	The times of reducing similar content of this video.	int64	2
reduce_similar_user_num	The number of users who choose to reduce similar content of this video.	int64	2
collect_cnt	The times of adding this video to favorite videos.	int64	0
collect_user_num	The number of users who add this video to their favorite videos.	int64	0
cancel_collect_cnt	The times of removing this video from favorite videos.	int64	0
cancel_collect_user_num	The number of users who remove this video from their favorite videos	int64	0
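The play-type definitions in the table (complete, valid, long time, and short time play) can be written out for a single play as follows. This is a sketch that takes both durations in milliseconds and converts the 3s, 7s, and 18s thresholds accordingly; the function name `classify_play` is hypothetical.

```python
def classify_play(play_ms: float, video_ms: float) -> dict:
    """Apply the documented play-type definitions to a single play."""
    return {
        # complete play: play_duration >= video_duration
        "complete": play_ms >= video_ms,
        # valid play: full watch for videos of at most 7s, else more than 7s watched
        "valid": play_ms >= video_ms if video_ms <= 7_000 else play_ms > 7_000,
        # long time play: full watch for videos of at most 18s, else at least 18s watched
        "long_time": play_ms >= video_ms if video_ms <= 18_000 else play_ms >= 18_000,
        # short time play: play_duration < min(3s, video_duration)
        "short_time": play_ms < min(3_000, video_ms),
    }


print(classify_play(13838, 10867))
# {'complete': True, 'valid': True, 'long_time': True, 'short_time': False}
```

Note that the categories overlap: a single play can count as complete, valid, and long time at once.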

5. Descriptions of the fields in `user_feat.csv` (Added on 2022.05.16)

Field Name:	Description	Type	Example
user_id	The ID of the user.	int64	0
user_active_degree	In the set of {'high_active', 'full_active', 'middle_active', 'UNKNOWN'}.	str	"high_active"
is_lowactive_period	Is this user in a low-active period?	int64	0
is_live_streamer	Is this user a live streamer?	int64	0
is_video_author	Has this user uploaded any video?	int64	0
follow_user_num	The number of users that this user follows.	int64	5
follow_user_num_range	The range of the number of users that this user follows. In the set of {'0', '(0,10]', '(10,50]', '(100,150]', '(150,250]', '(250,500]', '(50,100]', '500+'}	str	"(0,10]"
fans_user_num	The number of the fans of this user.	int64	0
fans_user_num_range	The range of the number of fans of this user. In the set of {'0', '[1,10)', '[10,100)', '[100,1k)', '[1k,5k)', '[5k,1w)', '[1w,10w)'}	str	"0"
friend_user_num	The number of friends that this user has.	int64	0
friend_user_num_range	The range of the number of friends that this user has. In the set of {'0', '[1,5)', '[5,30)', '[30,60)', '[60,120)', '[120,250)', '250+'}	str	"0"
register_days	The days since this user has registered.	int64	107
register_days_range	The range of the registered days. In the set of {'15-30', '31-60', '61-90', '91-180', '181-365', '366-730', '730+'}.	str	"61-90"
onehot_feat0	An encrypted feature of the user. Each value indicates the position of "1" in the one-hot vector. Range: {0,1}	int64	0
onehot_feat1	An encrypted feature. Range: {0, 1, ..., 7}	int64	1
onehot_feat2	An encrypted feature. Range: {0, 1, ..., 29}	int64	17
onehot_feat3	An encrypted feature. Range: {0, 1, ..., 1075}	int64	638
onehot_feat4	An encrypted feature. Range: {0, 1, ..., 11}	int64	2
onehot_feat5	An encrypted feature. Range: {0, 1, ..., 9}	int64	0
onehot_feat6	An encrypted feature. Range: {0, 1, 2}	int64	1
onehot_feat7	An encrypted feature. Range: {0, 1, ..., 46}	int64	6
onehot_feat8	An encrypted feature. Range: {0, 1, ..., 339}	int64	184
onehot_feat9	An encrypted feature. Range: {0, 1, ..., 6}	int64	6
onehot_feat10	An encrypted feature. Range: {0, 1, ..., 4}	int64	3
onehot_feat11	An encrypted feature. Range: {0, 1, ..., 2}	int64	0
onehot_feat12	An encrypted feature. Range: {0, 1}	int64	0
onehot_feat13	An encrypted feature. Range: {0, 1}	int64	0
onehot_feat14	An encrypted feature. Range: {0, 1}	int64	0
onehot_feat15	An encrypted feature. Range: {0, 1}	int64	0
onehot_feat16	An encrypted feature. Range: {0, 1}	int64	0
onehot_feat17	An encrypted feature. Range: {0, 1}	int64	0
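Since each `onehot_feat` value is the position of the "1" in a one-hot vector, the vector can be rebuilt from the index and the size of the range. A minimal sketch, with a hypothetical helper name:

```python
def to_one_hot(index: int, size: int) -> list:
    """Expand an index into a one-hot vector of the given size."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside the range [0, {size})")
    vec = [0] * size
    vec[index] = 1
    return vec


# onehot_feat1 has range {0, 1, ..., 7}, i.e. size 8; the example value is 1.
print(to_one_hot(1, 8))  # [0, 1, 0, 0, 0, 0, 0, 0]
```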

Owner

Chongming GAO (高崇铭)
