Explorative Data Analysis Guidelines

Overview

Explorative Data Analysis

Get the data into a usable format!
Find out whether the subsequent predictive modeling phase is likely to succeed!

  • Combine everything into a single big table
    • Convert files to .csv
    • Merge files
    • Fix encoding issues
    • Clean column names (English, no whitespace, no special chars)
    • Are there duplicate columns?
    • Fix datatypes (datetime, int, float, string)
  • Look at the raw data
    • Sort data
    • Filter data by various criteria
  • Investigation
    • Nonsensical observations/artifacts?
    • Coding of categorical features?
    • Missing values?
    • Outliers?
    • Constant values (= zero importance)?
    • Low importance features?
    • Collinear, correlated or otherwise dependent features?
    • Highly skewed features?
    • Irrelevant features?
  • Univariate Analysis
    • Look at mean, median, min, max, std, IQR, quantiles (1%, 5%, 25%, 50%, 75%, 95%, 99%)
    • Draw boxplots, histograms
  • Multivariate Analysis
    • Draw scatter plots
    • Create correlation matrix
  • Time series? Plot variables over time
  • Fixing issues
    • Impute missing values (mode, median, mean)
    • Remove variables that have too many missing values
    • Remove observations that have too many missing values
    • Select appropriate time slice
  • Preparation
    • Clip values that are too small/too large
    • Scale to [0,1] or standardize (mean=0, std=1) or use robust / quantile scaling
    • One-hot encoding, Label Encoding (0,1,2,3)
    • Create log-transformed versions for highly skewed variables
    • Create binned versions for variables
    • Combine rare categories for highly imbalanced categorical variables
    • Create sum/difference/product/quotient of variables
    • Create polynomial features
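
A few of the checklist steps above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the toy table, its column names, the helper `clean_column_names`, and the 5%/95% clipping bounds are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
raw = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4, 5, 6],
    "Signup Date": ["2021-01-05", "2021-02-11", "2021-02-28",
                    "2021-03-15", "2021-04-02", "2021-04-20"],
    "Income (EUR)": [32000.0, 41000.0, np.nan, 38000.0, 1_000_000.0, 36000.0],
    "Segment": ["a", "b", "a", None, "b", "a"],
    "Country": ["DE"] * 6,  # constant column
})


def clean_column_names(df):
    """Lower-case names and replace runs of non-alphanumerics with '_'."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )
    return out


# Combine/clean: tidy column names and fix datatypes.
df = clean_column_names(raw)
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Investigation: share of missing values and constant columns.
missing_share = df.isna().mean()
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Univariate analysis: the quantiles named in the checklist.
income_quantiles = df["income_eur"].quantile(
    [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
)

# Multivariate analysis: correlation matrix of the numeric columns.
corr = df.select_dtypes("number").corr()

# Fixing issues: drop constant columns, impute with median / mode.
df = df.drop(columns=constant_cols)
df["income_eur"] = df["income_eur"].fillna(df["income_eur"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])

# Preparation: clip extremes (assumed 5%/95% bounds), scale to [0, 1],
# add a log-transformed version, and one-hot encode the category.
low, high = df["income_eur"].quantile([0.05, 0.95])
clipped = df["income_eur"].clip(lower=low, upper=high)
df["income_scaled"] = (clipped - clipped.min()) / (clipped.max() - clipped.min())
df["income_log"] = np.log1p(df["income_eur"])
df = pd.get_dummies(df, columns=["segment"])

print(df.head())
```

Which imputation, clipping bounds, and scaling method are appropriate depends on the data and the model that follows; the choices here only show where each step fits.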
Owner
Florian Rohrer