Exploratory Data Analysis for Employee Retention Dataset

Overview

Exploratory Data Analysis for Employee Retention Dataset

  • Employee turn-over is a very costly problem for companies.
  • The cost of replacing an employee if often larger than 100K USD, taking into account the time spent to interview and find a replacement, placement fees, sign-on bonuses and the loss of productivity for several months.
  • It is only natural then that data science has started being applied to this area.
  • Understanding why and when employees are most likely to leave can lead to actions to improve employee retention as well as planning new hiring in advance. This application of DS is sometimes called people analytics or people data science
  • We got employee data from a few companies. We have data about all employees who joined from 2011/01/24 to 2015/12/13. For each employee, we also know if they are still at the company as of 2015/12/13 or they have quit.
  • Beside that, we have general info about the employee, such as avg salary during her tenure, dept, and yrs of experience.

Goal:

In this challenge, you have a data set with info about the employees and have to predict when employees are going to quit by understanding the main drivers of employee churn.

  • Assume, for each company, that the headcount starts from zero on 2011/01/23. Estimate employee headcount, for each company, on each day, from 2011/01/24 to 2015/12/13. That is, if by 2012/03/02 2000 people have joined company 1 and 1000 of them have already quit, then company headcount on 2012/03/02 for company 1 would be 1000.
  • You should create a table with 3 columns: day, employee_headcount, company_id. What are the main factors that drive employee churn? Do they make sense? Explain your findings.
  • If you could add to this data set just one variable that could help explain employee churn, what would that be?

Data: (data/employee_retention_data.csv)

Columns:

  • employee_id : id of the employee. Unique by employee per company
  • company_id : company id.
  • dept : employee dept
  • seniority : number of yrs of work experience when hired
  • salary: avg yearly salary of the employee during her tenure within the company
  • join_date: when the employee joined the company, it can only be between 2011/01/24 and 2015/12/13
  • quit_date: when the employee left her job (if she is still employed as of 2015/12/13, this field is NA)

Question 1

Function that returns a list of the names of categorical variables

  • Define a function with name get_categorical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all categorical fields available.

Question 2

Function that returns the list of the names of numeric variables

  • Define a function with name get_numerical_variables
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return list of all numerical fields available.

Question 3

Function that returns, for numeric variables, mean, median, 25, 50, 75th percentile

  • Define a function with name get_numerical_variables_percentile
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • variable name
    • mean
    • median
    • 25th percentile
    • 50th percentile
    • 75th percentile

Question 4

For categorical variables, get modes

  • Define a function with name get_categorical_variables_modes
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dict object with following keys:
    • converted
    • country
    • new_user
    • source

Question 5

For each column, list the count of missing values

  • Define a function with name get_missing_values_count
  • Pass dataframe as parameter (Read csv file and convert it into pandas dataframe)
  • Return dataframe with following columns:
    • var_name
    • missing_value_count

Question 6

Plot histograms using different subplots of all the numerical values in a single plot

  • Define a function with name plot_histogram_with_numerical_values
  • Pass dataframe and list of columns you want to plot as parameter
  • Plot the graph
  • Add column names as plot names (In case you dont understand this please connect with instructor)
  • Change the histogram colour to yellow
  • Fit a normal curve on those histograms (In case you dont understand this please connect with instructor)
Owner
kana sudheer reddy
curently studying in presidency university banglore
kana sudheer reddy
bigdata_analyse 大数据分析项目

bigdata_analyse 大数据分析项目 wish 采用不同的技术栈,通过对不同行业的数据集进行分析,期望达到以下目标: 了解不同领域的业务分析指标 深化数据处理、数据分析、数据可视化能力 增加大数据批处理、流处理的实践经验 增加数据挖掘的实践经验

Way 2.4k Dec 30, 2022
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022
Gathering data of likes on Tinder within the past 7 days

tinder_likes_data Gathering data of Likes Sent on Tinder within the past 7 days. Versions November 25th, 2021 - Functionality to get the name and age

Alex Carter 12 Jan 05, 2023
Yet Another Workflow Parser for SecurityHub

YAWPS Yet Another Workflow Parser for SecurityHub "Screaming pepper" by Rum Bucolic Ape is licensed with CC BY-ND 2.0. To view a copy of this license,

myoung34 8 Dec 22, 2022
Python package for analyzing behavioral data for Brain Observatory: Visual Behavior

Allen Institute Visual Behavior Analysis package This repository contains code for analyzing behavioral data from the Allen Brain Observatory: Visual

Allen Institute 16 Nov 04, 2022
Universal data analysis tools for atmospheric sciences

U_analysis Universal data analysis tools for atmospheric sciences Script written in python 3. This file defines multiple functions that can be used fo

Luis Ackermann 1 Oct 10, 2021
Bamboolib - a GUI for pandas DataFrames

Community repository of bamboolib bamboolib is joining forces with Databricks. For more information, please read our announcement. Please note that th

Tobias Krabel 863 Jan 08, 2023
4CAT: Capture and Analysis Toolkit

4CAT: Capture and Analysis Toolkit 4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to m

Digital Methods Initiative 147 Dec 20, 2022
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Kennedy Institute of Rheumatology University of Oxford Project November 2019

TradingBot6M Kennedy Institute of Rheumatology University of Oxford Project November 2019 Run Change api.txt to binance api key: https://www.binance.c

Kannan SAR 2 Nov 16, 2021
Analyzing Covid-19 Outbreaks in Ontario

My group and I took Covid-19 outbreak statistics from ontario, and analyzed them to find different patterns and future predictions for the virus

Vishwaajeeth Kamalakkannan 0 Jan 20, 2022
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

SeungHeonDoh 3 Jul 02, 2022
Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

Yongxian (Caroline) Lun 1 Dec 27, 2021
Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Cedric Zhuang 1.1k Dec 28, 2022
An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

Baidu 486 Dec 21, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
A pipeline that creates consensus sequences from a Nanopore reads. I

A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.

Ada Madejska 2 May 15, 2022
Data cleaning tools for Business analysis

Datacleaning datacleaning tools for Business analysis This program is made for Vicky's work. You can use it, too. 数据清洗 该数据清洗工具是为了商业分析 这个程序是为了Vicky的工作而

Lin Jian 3 Nov 16, 2021