Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

Overview

Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

The performances of tree ensembles and neural networks on structured data are evaluated. In addition, the effectiveness of combining neural network and decision trees (such as random trees, histogram based gradient boosting, and xgboost) is investigated. Covariant shift, Random forest's inability to extrapolate, and data leakage are investigated.

A simple 2-layer Neural network outperformed xgboost, followed by random forests. The worst performance based on RMSE was obtained from the histogram based gradient boosting regressor.

Overall, the best rmse (0.220194)--about 4.04% improvement over the kaggle's leaderboard first place score -- was obtained by taking the average of the predictions by the neural network and xgboost regressor.

Key takeaways:

  1. Always start with a baseline

  2. Random forests are generally bad at extrapolating, hence, if there is a shift in the domain between the training input and the validation (or test) inputs, then the random forest model will perform rather poorly on the validation set(or test set).

rf_failure

The red portion of the plot above shows the extrapolation problem. The random forest was trained on the first 70% of the data and used to make predictions on thr full data including the last 30%. It fails because there is an obvious linear trend it was unable to properly capture. Moreover, the predictions by random forests are confined within the range of the training input labels, since random forests make predictions by taking the average of previously observed data. Hence, when the input for prediction is

  1. To improve the performance of random forests, you could attempt to find the columns or features on which the training and validation sets differ the most. You may drop the ones that least impacts the accuracy of the model. To achieve this, I trained a random forest that can tell if a given input is from a training set or validation set. This helped me determine if a validation set has the same or similar distribution as the training set. Lastly, I computed the feature importances. The feature importances for this model revealed the degree of dissimilarity of the features between the training and validation sets. The features with high feature importances are the most dissimilar between the sets. salesID and machineID were significantly different between the sets but impacts RMSE the least, hence they were dropped. Other common approaches taken to improve performance include: finding and removing the redundant features by making similarity plot (shown below), choosing more recent data for both the training and the validation sets.

similarity plot

  1. For forecasting tasks (time dependent targets), the validation set should not be arbitrarily chosen i.e train_test_split may not be your best option for splitting the data. Since you are looking to make predictions on future sales, your validation set should contain more recent data, so that if your model is able to do well on the validation set, then, you can be more confident about its predictions in the future.

  2. Data leakage should be investigated. Signs of data leakage include:

    • Unrealistically high level of performance on the test set
    • Apparently meaningless feature(s) scoring very high on feature importance
    • Partial dependence plots that do not make sense.

popularitypartial_dependence

Observations extracted from the notebook*

Towards the end of the productsize plot, we see an interesting trend. The auction price is at its lowest in the end. This group represent the missing values in our product size. Missing values constitute the greatest percentage in our ProductSize. However, recall that productsize is our third most important feature. So, how is it possible that a feature that is missing so often could be so important to the prediction? The answer may be tied to data leakage. We can theorize that the auctions with missing product size information were not really successful since they were sold at very low prices, as a resutlt, the size information were either removed or intentionally omitted. It is also possible that most of these data were collected after sales were made, and for the sales that were not great, the product size were simply left blank. The intention is completely debatable, it might be intended to provide clue as to the nature of the sale, however, such information can harm our model or even render it completely useless. Clearly, our model could be misled into thinking that missing product size is an indication of low price and as such will always predict a low price whenever the product size attribute is missing. A model afflicted with data leakage will not perform well in production.

  1. An histogram based gradient boosting regressor may not be the best for forecasting on time dependent data. It showed the least peroformance with an RMSE of 0.239826

  2. A simple Neural network can show superior performance on structured data. A 2-layer neural network in which the categorical variables (i.e features with cardinality < 1000) were handled using embeddings showed a 1.93% improvement in RMSE compared to the best random forest model. It also outperformed the xgboost regressor even after the hyperparameters were tuned.

  3. There is some benefit to be derived by using an ensemble of models. In this project, each time, the neural network was combined with any of the trees, a superior performance always ensues. The best performance was obtained from the combination of neural network and the xgboost model.

Owner
Mustapha Unubi Momoh
Python Developer| Data scientist
Mustapha Unubi Momoh
ICCV2021 - Mining Contextual Information Beyond Image for Semantic Segmentation

Introduction The official repository for "Mining Contextual Information Beyond Image for Semantic Segmentation". Our full code has been merged into ss

55 Nov 09, 2022
A motion detection system with RaspberryPi, OpenCV, Python

Human Detection System using Raspberry Pi Functionality Activates a relay on detecting motion. You may need following components to get the expected R

Omal Perera 55 Dec 04, 2022
Interpretable-contrastive-word-mover-s-embedding

Interpretable-contrastive-word-mover-s-embedding Paper Datasets Here is a Dropbox link to the datasets used in the paper: https://www.dropbox.com/sh/n

0 Nov 02, 2021
A PyTorch implementation of "Graph Classification Using Structural Attention" (KDD 2018).

GAM ⠀⠀ A PyTorch implementation of Graph Classification Using Structural Attention (KDD 2018). Abstract Graph classification is a problem with practic

Benedek Rozemberczki 259 Dec 05, 2022
[ICLR2021] Unlearnable Examples: Making Personal Data Unexploitable

Unlearnable Examples Code for ICLR2021 Spotlight Paper "Unlearnable Examples: Making Personal Data Unexploitable " by Hanxun Huang, Xingjun Ma, Sarah

Hanxun Huang 98 Dec 07, 2022
C3DPO - Canonical 3D Pose Networks for Non-rigid Structure From Motion.

C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion By: David Novotny, Nikhila Ravi, Benjamin Graham, Natalia Neverova, Andrea Vedal

Meta Research 309 Dec 16, 2022
GAN-STEM-Conv2MultiSlice - Exploring Generative Adversarial Networks for Image-to-Image Translation in STEM Simulation

GAN-STEM-Conv2MultiSlice GAN method to help covert lower resolution STEM images generated by convolution methods to higher resolution STEM images gene

UW-Madison Computational Materials Group 2 Feb 10, 2021
OpenGAN: Open-Set Recognition via Open Data Generation

OpenGAN: Open-Set Recognition via Open Data Generation ICCV 2021 (oral) Real-world machine learning systems need to analyze novel testing data that di

Shu Kong 90 Jan 06, 2023
Neural Magic Eye: Learning to See and Understand the Scene Behind an Autostereogram, arXiv:2012.15692.

Neural Magic Eye Preprint | Project Page | Colab Runtime Official PyTorch implementation of the preprint paper "NeuralMagicEye: Learning to See and Un

Zhengxia Zou 56 Jul 15, 2022
N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting

N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting Recent progress in neural forecasting instigated significant improvements in the

Cristian Challu 82 Jan 04, 2023
🚗 INGI Dakar 2K21 - Be the first one on the finish line ! 🚗

🚗 INGI Dakar 2K21 - Be the first one on the finish line ! 🚗 This year's first semester Club Info challenge will put you at the head of a car racing

ClubINFO INGI (UCLouvain) 6 Dec 10, 2021
Erpnext app for make employee salary on payroll entry based on one or more project with percentage for all project equal 100 %

Project Payroll this app for make payroll for employee based on projects like project on 30 % and project 2 70 % as account dimension it makes genral

Ibrahim Morghim 8 Jan 02, 2023
This is the official implement of paper "ActionCLIP: A New Paradigm for Action Recognition"

This is an official pytorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition [arXiv] Overview Content Prerequisites Data Prep

268 Jan 09, 2023
The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution.

WSRGlow The official implementation of the Interspeech 2021 paper WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution. Audio sa

Kexun Zhang 96 Jan 03, 2023
Algorithmic trading with deep learning experiments

Deep-Trading Algorithmic trading with deep learning experiments. Now released part one - simple time series forecasting. I plan to implement more soph

Alex Honchar 1.4k Jan 02, 2023
NNR conformation conditional and global probabilities estimation and analysis in peptides or proteins fragments

NNR and global probabilities estimation and analysis in peptides or protein fragments This module calculates global and NNR conformation dependent pro

0 Jul 15, 2021
Camview - A CLI-tool used to stream CCTV online footage based on URL params

CamView A CLI-tool used to stream CCTV online footage based on URL params Get St

Finn Lancaster 54 Dec 09, 2022
A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

Deep Reinforcement Learning Agents This repository contains a collection of reinforcement learning algorithms written in Tensorflow. The ipython noteb

Arthur Juliani 2.2k Jan 01, 2023
Pixel-level Crack Detection From Images Of Levee Systems : A Comparative Study

PIXEL-LEVEL CRACK DETECTION FROM IMAGES OF LEVEE SYSTEMS : A COMPARATIVE STUDY G

Manisha Panta 2 Jul 23, 2022
TorchMD-Net provides state-of-the-art graph neural networks and equivariant transformer neural networks potentials for learning molecular potentials

TorchMD-net TorchMD-Net provides state-of-the-art graph neural networks and equivariant transformer neural networks potentials for learning molecular

TorchMD 104 Jan 03, 2023