A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
This repository contains code for building education startup.

Learning Management System Overview It's the code for EssayBrain, a tool for teacher that automatically grades and validates essays. In order to valid

Shyam Das Shrestha 1 Nov 21, 2021
A platform for developers 👩‍💻 who wants to share their programs and projects.

Fest-Practice-2021 This project is excluded from Hacktoberfest 2021. Please use this as a testing repo/project. A platform for developers 👩‍💻 who wa

Mayank Choudhary 40 Nov 07, 2022
Aevsploit İçin Destekde Bulun Papara: 1427113016

Aevsploit İçin Destekde Bulun Papara: 1427113016 Toolu Geliştirmek İçin Fikirlerinizi Bekliyorum Telegram

9 Jun 07, 2022
Antchain-MPC is a library of MPC (Multi-Parties Computation)

Antchain-MPC Antchain-MPC is a library of MPC (Multi-Parties Computation). It include Morse-STF: A tool for machine learning using MPC. Others: Commin

Alipay 37 Nov 22, 2022
Control your gtps with gtps-tools!

Note Please give credit to me! Do not try to sell this app, because this app is 100% open source! Do not try to reupload and rename the creator app! S

Jesen N 6 Feb 16, 2022
Skull shaped MOSFET cells for the Efabless's 130nm process

SkullFET Skull shaped MOSFET cells for the Efabless's 130nm process List of cells Inverter Copyright (C) 2021 Uri Shaked

Wokwi 3 Dec 14, 2022
ThinkPHP全日志扫描工具,命令行版和BurpSuite插件版

ThinkPHP3和5日志扫描工具,提供命令行版和BurpSuite插件版,尽可能全的发掘网站日志信息 命令行版 安装 git clone https://github.com/r3change/TPLogScan.git cd TPLogScan/ pip install -r requireme

119 Dec 27, 2022
Python client library for the Databento API

Databento Python Library The Databento Python client library provides access to the Databento API for both live and historical data, from applications

Databento, Inc. 35 Dec 24, 2022
Algorand Python API examples

Algorand-Py Algorand Python API examples This repo will hold example scripts to monitor activities on Algorand main net. You can: Monitor your assets

Karthik Dutt 2 Jan 23, 2022
Blender-miHoYo-Shaders - Shaders for Blender attempting to replicate the shading of games developed by miHoYo

Blender-miHoYo-Shaders - Shaders for Blender attempting to replicate the shading of games developed by miHoYo

Matsuri 449 Jan 07, 2023
Mannaggia is a python application to praise or more likely to curse the saints

Mannaggia-py 👼 Remember Mannaggia? This is a Python remake of it, with new features. mannaggia is a python application to praise or more likely to cu

Christian Visintin 9 Aug 12, 2022
Basic code and description for GoBigger challenge 2021.

GoBigger Challenge 2021 en / 中文 Challenge Description 2021.11.13 We are holding a competition —— Go-Bigger: Multi-Agent Decision Intelligence Challeng

OpenDILab 183 Dec 29, 2022
Submission from Team OMR for the TRI-NIT Hackathon

Submission from Team OMR for the TRI-NIT Hackathon

0 Feb 01, 2022
Procscan is a quick and dirty python script used to look for potentially dangerous api call patterns in a Procmon PML file.

PROCSCAN Procscan is a quick and dirty python script used to look for potentially dangerous api call patterns in a Procmon PML file. Installation git

Daniel Santos 9 Sep 02, 2022
A collection of tips for using MISP.

MISP Tip of the Week A collection of tips for using MISP. Published via BelgoMISP (todo) and this repository. Available in MD and JSON. Do you want to

Koen Van Impe 52 Jan 07, 2023
pvaPy provides Python bindings for EPICS pvAccess

PvaPy - PvAccess for Python The PvaPy package is a Python API for EPICS7. It supports both PVA and CA providers, all standard EPICS7 types (structures

EPICS Base 25 Dec 05, 2022
Yakuake session management

yman is a python script used for saving/restoring yakuake sessions (currently running commands, working directories, environment variables, tab titles)

Szymon Borecki 6 Jun 25, 2022
A Python library to simulate a Zoom H6 recorder remote control

H6 A Python library to emulate a Zoom H6 recorder remote control Introduction This library allows you to control your Zoom H6 recorder from your compu

Matias Godoy 68 Nov 02, 2022
A Python application that helps users determine their calorie intake, and automatically generates customized weekly meal and workout plans based on metrics computed using their physical parameters

A Python application that helps users determine their calorie intake, and automatically generates customized weekly meal and workout plans based on metrics computed using their physical parameters

Anam Iqbal 1 Jan 13, 2022
Purge all transformation orientations addon for Blender 2.8 and newer versions

CTO Purge This add-on adds a new button to Blender's Transformation Orientation panel which empowers the user to purge all of his/her custom transform

MMMrqs 10 Dec 29, 2022