Mining the Stack Overflow Developer Survey

Overview

Mining the Stack Overflow Developer Survey

A prototype data mining application to compare the accuracy of decision tree and random forest regression models to predict annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run, download the repository and execute the file main.py in the src directory with your python path variable. For example, python3 main.py.

Dependencies

  • python 3.8.1 and up
  • pandas 1.3.4 and up
  • matplotlib 3.4.3 and up
  • numpy 1.21.0 and up
  • sklearn 1.0.1 and up

Methodology

Preprocessing

The original data set provided by Stack Overflow contained 48 attribute columns and 83439 data records. Due to the large size of the data set, we wanted to narrow our focus to a certain subset of the data. In the preprocessing of the original data file, we decided to discard any records that were not employed full-time in the technology industry. Any record that did not contain country, converted annual salary, or yeared coded was also discarded, as this data is vital to our model. We also discarded some of the columns from the original data set that were open-ended. Out of the records that fit our requirements, we exported them to two output csv files. Records of United States data were put together in one output file, and records of European countries were put in the other. Data from any other countries were discarded. Once we have the two cleaned files, we applied additional preprocessing techniques. Any missing attributes that remained were replaced with 'NA' if the attributes were nominal. Two special cases existed in the columns for years coded and years coded professionally. Most contained a numerical value for the years, but some had a string for 'Less than one year' and 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep these columns numerical. With these preprocessing steps complete, the data files are now ready to be processed to generate the models.

Models

We evaluated a variety of data mining models and algorithms to find the ones that would make the most sense for our data set and objectives. With our goal of predicting a numerical value for annual salary, we knew we needed to use a compatible regression model. We found regression models for decision trees and random forests and wanted to compare their accuracy. We wanted to see how the accuracy of a single decision tree compares to the accuracy of a random forest model, which is a number of trees together. The results are detailed in the results and analysis section. Below are the implementation details of each model.

Decision tree model

We selected the DecisionTreeRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we changed was the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Random forest model

We selected the RandomForestRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we changed were the number of trees to estimate with and the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Results and Analysis

Authors

A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

Peter Výboch 4 Sep 05, 2022
NFCDS Workshop Beginners Guide Bioinformatics Data Analysis

Genomics Workshop FIXME: overview of workshop Code of Conduct All participants s

Elizabeth Brooks 2 Jun 13, 2022
Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Amber Electric Usage Summary This is a command line tool that produces a summary CSV report of an Amber Electric customer's energy consumption and cos

Graham Lea 12 May 26, 2022
Bamboolib - a GUI for pandas DataFrames

Community repository of bamboolib bamboolib is joining forces with Databricks. For more information, please read our announcement. Please note that th

Tobias Krabel 863 Jan 08, 2023
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required)

Binomial Option Pricing Calculator Option Pricing Calculator using the Binomial Pricing Method (No Libraries Required) Background A derivative is a fi

sammuhrai 1 Nov 29, 2021
Churn prediction with PySpark

It is expected to develop a machine learning model that can predict customers who will leave the company.

3 Aug 13, 2021
Program that predicts the NBA mvp based on data from previous years.

NBA MVP Predictor A machine learning model using RandomForest Regression that predicts NBA MVP's using player data. Explore the docs » View Demo · Rep

Muhammad Rabee 1 Jan 21, 2022
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
Developed for analyzing the covariance for OrcVIO

about This repo is developed for analyzing the covariance for OrcVIO environment setup platform ubuntu 18.04 using conda conda env create --file envir

Sean 1 Dec 08, 2021
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow

ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and

Tsinghua Machine Learning Group 2.2k Dec 28, 2022
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology.

tree-SNE t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology. Building on recent advances in s

Isaac Robinson 61 Nov 21, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
PyPDC is a Python package for calculating asymptotic Partial Directed Coherence estimations for brain connectivity analysis.

Python asymptotic Partial Directed Coherence and Directed Coherence estimation package for brain connectivity analysis. Free software: MIT license Doc

Heitor Baldo 3 Nov 26, 2022
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages

Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot

Daniel Chen 102 Nov 16, 2022
Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data

16 Jun 09, 2022
Data Science Environment Setup in single line

datascienv is package that helps your to setup your environment in single line of code with all dependency and it is also include pyforest that provide single line of import all required ml libraries

Ashish Patel 55 Dec 16, 2022