In this project we predict the forest cover type using the cartographic variables in the training/test datasets.

Overview

Kaggle Competition: Forest Cover Type Prediction

In this project we predict the forest cover type (the predominant kind of tree cover) using the cartographic variables given in the training/test datasets. You can find more about this project at Forest Cover Type Prediction.

This project and its detailed notebooks were created and published on Kaggle.

Project Objective

  • We are given raw unscaled data with both numerical and categorical variables.
  • First, we performed Exploratory Data Analysis in order to visualize the characteristics of our given variables.
  • We constructed various models and trained them on our data, using Optuna hyperparameter tuning to find parameters that maximize model accuracy.
  • Using feature engineering techniques, we built new variables to help improve the accuracy of our models.
  • Using the strategies above, we built our final model and generated forest cover type predictions for the test dataset.

Links to Detailed Notebooks

EDA Summary

The purpose of the EDA is to show how Python visualization tools can be used to understand a large, complex dataset. EDA is the first step in this workflow, where decisions about feature selection begin. Valuable insights can be obtained by looking at the distribution of the target, the relationship of each feature to the target, and the links between features.

Visualize Numerical Variables

  • Using histograms, we can visualize the spread and values of the 10 numeric variables.
  • The Slope, Vertical Distance to Hydrology, Horizontal Distance to Hydrology, Roadways and Firepoints are all skewed right.
  • Hillshade 9am, Noon, and 3pm are all skewed left.

[Figure: histograms of the numerical variables]
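The skew direction read off the histograms can also be checked numerically. The sketch below is a minimal illustration on synthetic data, not the competition data: the two columns are generated here only to mimic a right-skewed and a left-skewed variable, and the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two numeric columns (assumed shapes, not real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Gamma draws have a long right tail.
    "Slope": rng.gamma(shape=2.0, scale=7.0, size=5000),
    # Subtracting gamma draws from a ceiling gives a long left tail.
    "Hillshade_9am": 254 - rng.gamma(shape=2.0, scale=15.0, size=5000),
})

# Positive sample skewness means skewed right, negative means skewed left.
skew = df.skew()
direction = skew.apply(lambda s: "right" if s > 0 else "left")
print(pd.DataFrame({"skew": skew, "direction": direction}))
```

On the real training data, the same `df.skew()` call could be applied to the 10 numeric columns to confirm what the histograms suggest.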

Visualize Categorical Variables

  • The plots below show the number of observations for the different Wilderness Areas and Soil Types.
  • Wilderness Areas 3 and 4 have the most observations.
  • Wilderness Area 2 has the fewest observations.
  • Soil Type 10 has the most observations, followed by Soil Type 29.
  • The Soil Types with the fewest observations are Soil Type 7 and 15.

[Figure: number of observations per Wilderness Area]
[Figure: number of observations per Soil Type]
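Because the Wilderness Area and Soil Type variables are one-hot encoded, the number of observations per category is simply the column sum. The sketch below uses a tiny made-up frame; the values are illustrative only.

```python
import pandas as pd

# Toy one-hot encoded frame (made-up rows, not the competition data).
df = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 0, 1, 0],
    "Wilderness_Area2": [0, 0, 0, 1, 0, 0],
    "Wilderness_Area3": [0, 1, 1, 0, 0, 1],
})

# Each one-hot column sums to the number of rows in that category.
counts = (
    df.filter(like="Wilderness_Area")
      .sum()
      .sort_values(ascending=False)
)
print(counts)
```

Calling `counts.plot(kind="bar")` on the result would give a bar chart of the kind described above.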

Feature Correlation

A heatmap that excludes the binary variables helps us visualize the correlations between the features. We also plotted scatterplots for the four pairs of features that had a positive correlation greater than 0.5. These are among the many visualizations that helped us understand the characteristics of the features for later feature engineering and model selection.

[Figure: correlation heatmap]
[Figure: scatterplots of the correlated feature pairs]
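One way to list the feature pairs whose correlation exceeds 0.5 is to keep only the upper triangle of the correlation matrix and filter it. This is a sketch on synthetic data; the column names `feat_a`, `feat_b`, and `feat_c` are placeholders, not the actual correlated pairs from the notebooks.

```python
import numpy as np
import pandas as pd

# Synthetic data: feat_b is built from feat_a, feat_c is independent noise.
rng = np.random.default_rng(1)
n = 2000
a = rng.normal(size=n)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": 0.9 * a + rng.normal(scale=0.3, size=n),
    "feat_c": rng.normal(size=n),
})

corr = df.corr()

# Mask the diagonal and lower triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and keep the strong pairs.
pairs = upper.stack()
strong = pairs[pairs > 0.5]
print(strong)
```

The same `corr` frame can be passed to a heatmap function such as `seaborn.heatmap` for the visual version.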

Summary of Challenges

EDA Challenges

  • This project involves a large amount of data, with countless patterns and details to examine.
  • The training data was not a simple random sample of the entire dataset, but a stratified sample of the seven forest cover type classes which may not represent the final predictions well.
  • Creating a "story" to be easily incorporated into the corresponding notebooks such as Feature Engineering, Models, etc.
  • Manipulating the Wilderness_Area and Soil_Type variables (which are one-hot encoded) to visualize their distributions compared to Cover_Type.
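One possible way to handle the one-hot encoded variables mentioned above is to collapse them back into a single categorical column and cross-tabulate it against Cover_Type. The sketch below assumes every row has exactly one active Wilderness_Area column, and the data is made up for illustration.

```python
import pandas as pd

# Toy frame with one-hot Wilderness_Area columns and a target (made-up rows).
df = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 1],
    "Wilderness_Area2": [0, 1, 0, 0],
    "Wilderness_Area3": [0, 0, 1, 0],
    "Cover_Type":       [1, 2, 2, 1],
})

wild_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]

# idxmax along the row returns the name of the column holding the 1.
df["Wilderness_Area"] = df[wild_cols].idxmax(axis=1)

# Count observations for each (Wilderness_Area, Cover_Type) combination.
table = pd.crosstab(df["Wilderness_Area"], df["Cover_Type"])
print(table)
```

The resulting table can be plotted as a stacked bar chart with `table.plot(kind="bar", stacked=True)`.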

Feature Engineering Challenges

  • Adding new variables during feature engineering often produced lower accuracy.
  • Automated feature engineering using entities and transformations amongst existing columns from a single dataset created many new columns that did not positively contribute to the model's accuracy - even after feature selection.
  • Testing the new features produced was very time consuming, even with the GPU accelerator.
  • After experimenting with several different sets of new features, we found that including only the manually created features yielded the highest accuracy.
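To make "manually created features" concrete, here is a minimal sketch of one such feature. The feature itself is hypothetical: the list of features actually built in the Feature_Engineering notebook is not given here, so the straight-line distance to water below is only an example of the technique.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) with the two hydrology distance columns.
df = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 0.0],
    "Vertical_Distance_To_Hydrology": [4.0, -5.0],
})

# Hypothetical engineered feature: combine the horizontal and vertical
# distances into a single Euclidean distance.
df["Euclidean_Distance_To_Hydrology"] = np.hypot(
    df["Horizontal_Distance_To_Hydrology"],
    df["Vertical_Distance_To_Hydrology"],
)
print(df)
```

A feature like this would be kept only if it improved cross-validated accuracy, in line with the observation that many added variables lowered it.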

Modeling Challenges

  • Ensemble and stacking methods initially resulted in models yielding higher accuracy on the test set, but as we added features and refined the parameters for each individual model, an individual model yielded a better score on the test set.
  • Performing hyperparameter tuning and training for several of the models was computationally expensive. While we were able to enable GPU acceleration for the XGBoost model, activating the GPU accelerator seemed to increase the tuning and training time for the other models in the training notebook.
  • Optuna worked to reduce the time to process hyperparameter trials, but some of the hyperparameters identified through this method yielded weaker models than the hyperparameters identified through GridSearchCV. A balance between the two was needed.
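For reference, an exhaustive search with GridSearchCV can be sketched as follows. The dataset is synthetic and the parameter grid is an assumption chosen to keep the example fast; it is not the grid used in the ModelParams notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the training data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Illustrative grid only; every combination is evaluated with 3-fold CV.
grid = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

best_params = grid.best_params_
best_score = grid.best_score_
print(best_params, round(best_score, 4))
```

Grid search evaluates every combination, which is why it becomes expensive as the grid grows, whereas Optuna samples a fixed number of trials from the search space. That trade-off between coverage and run time is the balance described above.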

Summary of Modeling Techniques

We used several modeling techniques for this project. We began by training simple, standard models and applying the predictions to the test set. This resulted in models with only 50%-60% accuracy, necessitating more complex methods. The following process was used to develop the final model:

  • Scaling the training data to perform PCA and identify the most important features (see the Feature_Engineering Notebook for more detail).
  • Preprocessing the training data to add in new features.
  • Performing GridSearchCV and using the Optuna approach (see the ModelParams Notebook for more detail) for identifying optimal parameters for the following models with corresponding training set accuracy scores:
    • Logistic Regression (.7126)
    • Decision Tree (.9808)
    • Random Forest (1.0)
    • Extra Tree Classifier (1.0)
    • Gradient Boosting Classifier (1.0)
    • Extreme Gradient Boosting Classifier (using GPU acceleration; 1.0)
    • AdaBoost Classifier (.5123)
    • Light Gradient Boosting Classifier (.8923)
    • Ensemble/Voting Classifiers (assorted combinations of the above models; 1.0)
  • Saving and exporting the preprocessor/scaler and each version of the model with the highest accuracy on the training set and highest cross validation score (see the Training notebook for more detail).
  • Calculating each model's predictions for the test set and submitting to determine accuracy on the test set:
    • Logistic Regression (.6020)
    • Decision Tree (.7102)
    • Random Forest (.7465)
    • Extra Tree Classifier (.7962)
    • Gradient Boosting Classifier (.7905)
    • Extreme Gradient Boosting Classifier (using GPU acceleration; .7803)
    • AdaBoost Classifier (.1583)
    • Light Gradient Boosting Classifier (.6891)
    • Ensemble/Voting Classifier (assorted combinations of the above models; .7952)
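The gap between the training-set scores and the test-set scores above is typical of tree ensembles, which can fit the training data almost perfectly. The sketch below illustrates the effect on synthetic, deliberately noisy data by comparing training accuracy with cross-validated accuracy; the dataset and model settings are assumptions and do not reproduce the numbers listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data with 20% of labels flipped to add noise.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_classes=3, flip_y=0.2, random_state=0,
)

model = ExtraTreesClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the model was fitted on.
train_acc = model.fit(X, y).score(X, y)

# Mean accuracy on held-out folds (cross_val_score refits clones of the model).
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"train accuracy: {train_acc:.4f}")
print(f"5-fold CV accuracy: {cv_acc:.4f}")
```

A training accuracy near 1.0 therefore says little on its own; the cross-validated score is the better guide to how a model might perform on the test set.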

Summary of Final Results

The model with the highest accuracy on the out-of-sample (test set) data was selected as our final model. Note that the model with the highest accuracy according to 10-fold cross validation was not the most accurate model on the out-of-sample data, although it was close. The best model was the Extra Tree Classifier, with an accuracy of .7962 on the test set. It outperformed our Ensemble model (.7952), which had been our best model for several weeks. See the Submission Notebook and FinalModelEvaluation Notebook for additional detail.

Owner
Marianne Joy Leano
A recent graduate with a Master's in Data Science. Excited to explore data and create projects!