Semi-Automated Data Processing

Overview

Semi-Automated Data Processing

Preparing data for model learning is one of the most important steps in any project—and traditionally, one of the most time consuming. Data Analysis plays a very important role in the entire Data Science Workflow. In fact, this takes most of the time of the Data science Workflow. There’s a nice quote (not sure who said it)According to Wikipedia, In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.**“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.”*

This projects handles the task with minimal user interaction by analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the project in semi-interactive fashion, previewing the changes before they are made and accept or reject them as you want.

This project cover the 3 steps in any project workflow, comes before the model training:
1) Exploratory data analysis
2) Feature engineering
3) Feature selection


All these steps has to be carried out by the user by calling the several functions as follows:

1) identify_feature(data)=
This function identifies the categorical, continuous numerical and discrete numerical features in the datset. It also identifies datetime feature and extracts the relevant info from it.

Input:
data=Dataset

Output:
df=Dataset
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values

2) plot_nan_feature(data, continuous_features, discrete_features, categorical_features,dependent_var)=
It identifies the missing values in the dataset and visualize them their impact on dependent feature.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
categorical_features=List of features names associated containing categorical values
dependent_var= Dependent feature name in string format

Output:
df= Dataset
nan_features= List of feature names containing NaN values

3) visualize_imputation_impact(data,continuous_features, discrete_features, categorical_features,nan_features,dependent_var):
The function visualizes the impact of different NaN value impution on the distribution of values the feature.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
categorical_features=List of features names associated containing categorical values
nan_features= List of feature names containing NaN values
dependent_var= Dependent feature name in string format

Output:
None

4) nan_imputation(data,mean_feature,median_feature,mode_feature,random_feature,new_category):
The function imputes the NaN values in the feature as per the user input.

Input:
data=Dataset
mean_feature= List of feature names in which we have to carry out mean_imputation
median_feature=List of feature names in which we have to carry out median_imputation
mode_feature=List of feature names in which we have to carry out mode_imputation
random_feature=List of feature names in which we have to carry out random_imputation
new_category=List of feature names in which we we create a new category for the NaN values

Output:
None

5) cross_visualization(data,continuous_features,discrete_features, categorical_features,dt_features):
The function visualise the relationship between the different independent features.

Input:
df=Dataset
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values

Output:
continuous_features2=List of features names associated containing continuous numerical values, except the dependent feature

6) dependent_independent_visualization(data,continuous_features,discrete_features, categorical_features,dt_features,dependent_feature):
The function visualise the relationship between the different independent features.

Input:
data_cont_num_feature= List of features names associated containing continuous numerical values
data_dis_num_feature=List of features names associated containing discrete numerical values
data_cat_feature=List of features names associated containing categorical values
dt_feature=List of features names associated containing datetime values
dependent_var= Dependent feature name in string format

Output:
None

7) outlier_removal(data,continuous_features,discrete_features,dependent_var,dependent_var_type,action):
The function visualizes the outlliers using the boxplot and removes them.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
dependent_var= Dependent feature name in string format
dependent_var_type= Contain string tells if the problem is regression (than use 'Regression') or else
action= Give input as 'remove' to delete the rows associated with the outliers

Output:
df=Dataset

8) transformation_visualization(data,continuous_features,discrete_features,dependent_feature):
The function visualize the feature after performing various transormation techniques.

Input:
data=Dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
dependent_feature= Dependent feature name in string format

Output:
None

9) feature_transformation(train_data,continuous_features,discrete_features,transformation,dependent_feature):
The function performing the feature transormation technique as per the user input.

Input:
train_data=Training dataset
continuous_features= List of features names associated containing continuous numerical values
discrete_features=List of features names associated containing discrete numerical values
transformation=Type of transformation: none=No transformation, log=Log Transformation, sqrt= Square root Transformation, reciprocal= Reciprocal Transformation, exp= Exponential Transformation, boxcox=Boxcox Transformation
dependent_feature= Dependent feature name in string format

Output:
X_data=Training dataset

10) categorical_transformation(train_data,categorical_encoding):
This function transforms the categorical featres in the numerical ones using encoding techniques.

Input:
train_data=Training dataset
categorical_encoding={'one_hot_encoding':[],'frequency_encoding':[],'mean_encoding':[],'target_guided_ordinal_encoding':{}}

Output:
X_data=Training dataset

11a) feature_selection(Xtrain,ytrain, threshold, data_type, filter_type):
This function performs the feature selection based on the dependent and independent features in train dataset.

Input:
Xtrain=Training dataset
ytrain=dependent data in training dataset
threshold= Threshold for the correlation
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}
data_type= Data linear or non-linearly dependent on the output label
filter_type= If input data is numerical and output is numerical then --'in_num_out_num' as shown in the above dictionary

Output:
Xtrain= Training dataset
feature_df= Dataframe containig features with their pvalue

11b) feature_selection(Xtrain,ytrain,Xtest,ytest, threshold, data_type, filter_type):
This function performs the feature selection based on the dependent and independent features in train dataset.

Input:
Xtrain=Training dataset
ytrain=dependent data in training dataset
Xtest=Test dataset
ytest=dependent data in test dataset
threshold= Threshold for the correlation
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True},}
data_type= Data linear or non-linearly dependent on the output label
filter_type= If input data is numerical and output is numerical then --'in_num_out_num' as shown in the above dictionary

Output:
Xtrain= Training dataset
Xtest= Test dataset
feature_df= Dataframe containig features with their pvalue

12) convert_dtype(data,categorical_features):
This function converts the categorical fetaures containing the numeric values but presented as categorical into the int format.

Input:
data= Dataset
categorical_features=List of features names associated containing categorical values

Output:
df=Dataset

Note:
Use same paramters for both train and test dataset for better accuracy


We have implemented a bike sharing project to describe how the functions can be used for both the classification and regression problem statement.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.
Arun Singh Babal
COVID-19 deaths statistics around the world

COVID-19-Deaths-Dataset COVID-19 deaths statistics around the world This is a daily updated dataset of COVID-19 deaths around the world. The dataset c

Nisa Efendioğlu 4 Jul 10, 2022
Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Spark-DeltaLake-Demo Reliable, Scalable Machine Learning (2022) This project was completed in an attempt to become better acquainted with the latest b

8 Mar 21, 2022
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 150 Jan 06, 2023
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

Nikolas Petrou 1 Feb 13, 2022
Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt Labs 6.3k Jan 08, 2023
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab, at MIT Website: https://sdv.dev Documentation: https://sdv.dev/SDV User Guides Developer Guides Github

The Synthetic Data Vault Project 1.2k Jan 07, 2023
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

PB2 101 Dec 07, 2022
Maximum Covariance Analysis in Python

xMCA | Maximum Covariance Analysis in Python The aim of this package is to provide a flexible tool for the climate science community to perform Maximu

Niclas Rieger 39 Jan 03, 2023
A columnar data container that can be compressed.

Unmaintained Package Notice Unfortunately, and due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During

944 Dec 09, 2022
Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Machine Learning and Data Analytics Lab FAU 3 Dec 07, 2022
A 2-dimensional physics engine written in Cairo

A 2-dimensional physics engine written in Cairo

Topology 38 Nov 16, 2022
Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

weightedcalcs weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more. Features Plays we

Jeremy Singer-Vine 98 Dec 31, 2022
Cleaning and analysing aggregated UK political polling data.

Analysing aggregated UK polling data The tweet collection & storage pipeline used in email-service is used to also collect tweets from @britainelects.

Ajay Pethani 0 Dec 22, 2021
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
Pandas and Spark DataFrame comparison for humans

DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand

Capital One 259 Dec 24, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022