Explorative Data Analysis Guidelines

Overview

Explorative Data Analysis

Get the data into a usable format!
Find out whether the subsequent predictive modeling phase is likely to succeed!

  • Combine everything into a single big table
    • Convert files to .csv
    • Merge files
    • Fix encoding issues
    • Clean column names (English, no whitespace, no special characters)
    • Are there duplicate columns?
    • Fix datatypes (datetime, int, float, string)
  • Look at the raw data
    • Sort data
    • Filter data by various criteria
  • Investigation
    • Nonsensical observations/artifacts?
    • Coding of categorical features?
    • Missing values?
    • Outliers?
    • Constant values (= zero importance)?
    • Low importance features?
    • Collinear, correlated or otherwise dependent features?
    • Highly skewed features?
    • Irrelevant features?
  • Univariate Analysis
    • Look at mean, median, min, max, std, iqr, quantiles (1%, 5%, 25%, 50%, 75%, 95%, 99%)
    • Draw boxplots, histograms
  • Multivariate Analysis
    • Draw scatter plots
    • Create correlation matrix
  • Time Series? -> Plot variables over time
  • Fixing issues
    • Impute missing values (mode, median, mean)
    • Remove variables that have too many missing values
    • Remove observations that have too many missing values
    • Select appropriate time slice
  • Preparation
    • Clip values that are too small/too large
    • Scale to [0,1], standardize (mean=0, std=1), or use robust/quantile scaling
    • Encode categorical variables: one-hot encoding or label encoding (0, 1, 2, 3)
    • Create log-transformed versions for highly skewed variables
    • Create binned versions for variables
    • Combine rare categories for highly imbalanced categorical variables
    • Create sum/difference/product/quotient of variables
    • Create polynomial features
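
A few of the checklist items above can be sketched with pandas. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names (summarize, prepare), the thresholds (50% missing values, 1%/99% clipping quantiles, skewness above 1) and the sample table are all hypothetical and should be adapted to the data at hand.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Investigation: per-column dtype, share of missing values, constants."""
    n_unique = df.nunique(dropna=True)
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": n_unique,
        "is_constant": n_unique <= 1,
    })


def prepare(df: pd.DataFrame, max_missing: float = 0.5,
            clip_quantiles=(0.01, 0.99)) -> pd.DataFrame:
    """Fixing issues and preparation, with assumed (hypothetical) thresholds."""
    # Remove variables with too many missing values, then constant columns.
    out = df.loc[:, df.isna().mean() <= max_missing].copy()
    out = out.loc[:, out.nunique(dropna=True) > 1].copy()

    num_cols = list(out.select_dtypes(include="number").columns)
    cat_cols = [c for c in out.columns if c not in num_cols]

    # Impute missing values: median for numeric, mode for categorical columns.
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Clip values that are too small/too large.
    for col in num_cols:
        lo, hi = out[col].quantile(list(clip_quantiles))
        out[col] = out[col].clip(lo, hi)

    # Create log-transformed versions of highly skewed, non-negative variables.
    for col in list(num_cols):
        if out[col].min() >= 0 and out[col].skew() > 1:
            out[col + "_log"] = np.log1p(out[col])
            num_cols.append(col + "_log")

    # Scale numeric columns to [0, 1]; guard against a zero value range.
    for col in num_cols:
        value_range = out[col].max() - out[col].min()
        out[col] = (out[col] - out[col].min()) / value_range if value_range > 0 else 0.0

    # One-hot encode categorical columns.
    return pd.get_dummies(out, columns=cat_cols)


# Hypothetical sample table with typical problems.
raw = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29, 300],          # missing value and outlier
    "income": [2100.0, 3500.0, 2800.0, np.nan, 3100.0, 2900.0],
    "city": ["Berlin", "Munich", None, "Berlin", "Berlin", "Munich"],
    "source": ["crm"] * 6,                         # constant column
    "notes": [None, None, None, None, "vip", None],  # mostly missing
})

print(summarize(raw))
print(raw.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]))
clean = prepare(raw)
print(clean)
```

The order of steps matters in this sketch: columns are dropped before imputation so that mostly empty variables are not filled with a single value, and clipping happens before scaling so that one outlier does not compress all other values toward zero.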
Owner
Florian Rohrer