This repo includes some graph-based CTR prediction models and other representative baselines.

Overview

Graph-based CTR prediction

This is a repository designed for graph-based CTR prediction methods, it includes our graph-based CTR prediction methods:

  • Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction paper
  • GraphFM: Graph Factorization Machines for Feature Interaction Modeling paper

and some other representative baselines:

  • HoAFM: A High-order Attentive Factorization Machine for CTR Prediction paper
  • AutoInt: AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks paper
  • InterHAt: Interpretable Click-Through Rate Prediction through Hierarchical Attention paper

Requirements:

  • Tensorflow 1.5.0
  • Python 3.6
  • CUDA 9.0+ (For GPU)

Usage

Our code is based on AutoInt.

Input Format

The required input data is in the following format:

  • train_x: matrix with shape (num_sample, num_field). train_x[s][t] is the feature value of feature field t of sample s in the dataset. The default value for categorical feature is 1.
  • train_i: matrix with shape (num_sample, num_field). train_i[s][t] is the feature index of feature field t of sample s in the dataset. The maximal value of train_i is the feature size.
  • train_y: label of each sample in the dataset.

If you want to know how to preprocess the data, please refer to data/Dataprocess/Criteo/preprocess.py

Example

There are four public real-world datasets(Avazu, Criteo, KDD12, MovieLens-1M) that you can use. You can run the code on MovieLens-1M dataset directly in /movielens. The other three datasets are super huge, and they can not be fit into the memory as a whole. Therefore, we split the whole dataset into 10 parts and we use the first file as test set and the second file as valid set. We provide the codes for preprocessing these three datasets in data/Dataprocess. If you want to reuse these codes, you should first run preprocess.py to generate train_x.txt, train_i.txt, train_y.txt as described in Input Format. Then you should run data/Dataprocesss/Kfold_split/StratifiedKfold.py to split the whole dataset into ten folds. Finally you can run scale.py to scale the numerical value(optional).

To help test the correctness of the code and familarize yourself with the code, we upload the first 10000 samples of Criteo dataset in train_examples.txt. And we provide the scripts for preprocessing and training.(Please refer to data/sample_preprocess.sh and run_criteo.sh, you may need to modify the path in config.py and run_criteo.sh).

After you run the data/sample_preprocess.sh, you should get a folder named Criteo which contains part*, feature_size.npy, fold_index.npy, train_*.txt. feature_size.npy contains the number of total features which will be used to initialize the model. train_*.txt is the whole dataset.

Here's how to run the preprocessing.

cd data
mkdir Criteo
python ./Dataprocess/Criteo/preprocess.py
python ./Dataprocess/Kfold_split/stratifiedKfold.py
python ./Dataprocess/Criteo/scale.py

Here's how to train GraphFM on Criteo dataset.

CUDA_VISIBLE_DEVICES=$GPU python -m code.train \
--model_type GraphFM \
                        --data_path $YOUR_DATA_PATH --data Criteo \
                        --blocks 3 --heads 2 --block_shape "[64, 64, 64]" \
                        --ks "[39, 20, 5]" \
                        --is_save --has_residual \
                        --save_path ./models/GraphFM/Criteo/b3h2_64x64x64/ \
                        --field_size 39  --run_times 1 \
                        --epoch 2 --batch_size 1024 \

Here's how to train GraphFM on Avazu dataset.

CUDA_VISIBLE_DEVICES=$GPU python -m code.train \
--model_type GraphFM \
                        --data_path $YOUR_DATA_PATH --data Avazu \
                        --blocks 3 --heads 2 --block_shape "[64, 64, 64]" \
                        --ks "[23, 10, 2]" \
                        --is_save --has_residual \
                        --save_path ./models/GraphFM/Avazu/b3h2_64x64x64/ \
                        --field_size 23  --run_times 1 \
                        --epoch 2 --batch_size 1024 \

You can run the training on the relatively small MovieLens dataset in /movielens.

You should see the output like this:

...
train logs
...
start testing!...
restored from ./models/Criteo/b3h2_64x64x64/1/
test-result = 0.8088, test-logloss = 0.4430
test_auc [0.8088305055534442]
test_log_loss [0.44297631300399626]
avg_auc 0.8088305055534442
avg_log_loss 0.44297631300399626

Citation

If you find this repo useful for your research, please consider citing the following paper:

@inproceedings{li2019fi,
  title={Fi-gnn: Modeling feature interactions via graph neural networks for ctr prediction},
  author={Li, Zekun and Cui, Zeyu and Wu, Shu and Zhang, Xiaoyu and Wang, Liang},
  booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  pages={539--548},
  year={2019}
}

@article{li2021graphfm,
  title={GraphFM: Graph Factorization Machines for Feature Interaction Modeling},
  author={Li, Zekun and Wu, Shu and Cui, Zeyu and Zhang, Xiaoyu},
  journal={arXiv preprint arXiv:2105.11866},
  year={2021}
}

Contact information

You can contact Zekun Li ([email protected]), if there are questions related to the code.

Acknowledgement

This implementation is based on Weiping Song and Chence Shi's AutoInt. Thanks for their sharing and contribution.

Owner
Big Data and Multi-modal Computing Group, CRIPAC
Big Data and Multi-modal Computing Group, Center for Research on Intelligent Perception and Computing
Big Data and Multi-modal Computing Group, CRIPAC
My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

kNN-vs-RFR My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data In many areas, rental bikes have been launched to

1 Oct 28, 2021
Free MLOps course from DataTalks.Club

MLOps Zoomcamp Our MLOps Zoomcamp course Sign up here: https://airtable.com/shrCb8y6eTbPKwSTL (it's not automated, you will not receive an email immed

DataTalksClub 4.6k Dec 31, 2022
PySurvival is an open source python package for Survival Analysis modeling

PySurvival What is Pysurvival ? PySurvival is an open source python package for Survival Analysis modeling - the modeling concept used to analyze or p

Square 265 Dec 27, 2022
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023
A linear regression model for house price prediction

Linear_Regression_Model A linear regression model for house price prediction. This code is using these packages, so please make sure your have install

ShawnWang 1 Nov 29, 2021
Distributed Evolutionary Algorithms in Python

DEAP DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data stru

Distributed Evolutionary Algorithms in Python 4.9k Jan 05, 2023
ML-powered Loan-Marketer Customer Filtering Engine

In Loan-Marketing business employees are required to call the user's to buy loans of several fields and in several magnitudes. If employees are calling everybody in the network it is also very length

Sagnik Roy 13 Jul 02, 2022
Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm.

Naive-Bayes Spam Classificator Module is created to build a spam filter using Python and the multinomial Naive Bayes algorithm. Main goal is to code a

Viktoria Maksymiuk 1 Jun 27, 2022
TorchDrug is a PyTorch-based machine learning toolbox designed for drug discovery

A powerful and flexible machine learning platform for drug discovery

MilaGraph 1.1k Jan 08, 2023
A Lucid Framework for Transparent and Interpretable Machine Learning Models.

Currently a Beta-Version lucidmode is an open-source, low-code and lightweight Python framework for transparent and interpretable machine learning mod

lucidmode 15 Aug 12, 2022
Magenta: Music and Art Generation with Machine Intelligence

Magenta is a research project exploring the role of machine learning in the process of creating art and music. Primarily this involves developing new

Magenta 18.1k Dec 30, 2022
A benchmark of data-centric tasks from across the machine learning lifecycle.

A benchmark of data-centric tasks from across the machine learning lifecycle.

61 Dec 28, 2022
Interactive Parallel Computing in Python

Interactive Parallel Computing with IPython ipyparallel is the new home of IPython.parallel. ipyparallel is a Python package and collection of CLI scr

IPython 2.3k Dec 30, 2022
End to End toy example of MLOps

churn_model MLOps Toy Example End to End You might find below links useful Connect VSCode to Git MLFlow Port Heroku App Project Organization β”œβ”€β”€ LICEN

Ashish Tele 6 Feb 06, 2022
Reproducibility and Replicability of Web Measurement Studies

Reproducibility and Replicability of Web Measurement Studies This repository holds additional material to the paper "Reproducibility and Replicability

6 Dec 31, 2022
Machine Learning approach for quantifying detector distortion fields

DistortionML Machine Learning approach for quantifying detector distortion fields. This project is a feasibility study for training a surrogate model

Joel Bernier 1 Nov 05, 2021
Katana project is a template for ASAP πŸš€ ML application deployment

Katana project is a FastAPI template for ASAP πŸš€ ML API deployment

Mohammad Shahebaz 100 Dec 26, 2022
MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. Th

Swiggy 66 Dec 06, 2022
Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python

Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python Overview Bank Jago has attracted investors' attention since the end

Najibulloh Asror 3 Feb 10, 2022
A visual dataflow programming language for sklearn

Persimmon What is it? Persimmon is a visual dataflow language for creating sklearn pipelines. It represents functions as blocks, inputs and outputs ar

Álvaro Bermejo 194 Jan 04, 2023