To effectively detect faulty wafers

Overview

wafer_fault_detection

Aim of the project:

In electronics, a wafer (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate for microelectronic devices built in and upon it. The project aims to identify the state of a provided wafer by classifying it into one of two classes: +1 (good, can be used as a substrate) or -1 (bad, the substrate needs to be replaced). To this end, a training dataset is provided to build a machine learning classification model that can predict wafer quality.

Data Description:

The columns of the provided data fall into three parts: wafer name, sensor values, and label. The wafer name contains the batch number of the wafer, whereas the sensor values are obtained from measurements carried out on the wafer. The label column contains two unique values, +1 and -1, which identify whether the wafer is good or needs to be replaced. Additionally, we also require a schema file, which contains all the relevant information about the training files, such as file names, length of the date value in the file name, length of the time value in the file name, number of columns, names of the columns, and their datatypes.
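
For illustration, a training schema of the kind described above might look like the following once loaded into Python. The key names, column names, and counts here are assumptions for the sketch, not taken verbatim from the repository.

```python
# Hypothetical illustration of a training schema file loaded as a dict;
# key names and values are assumptions, not taken from the repository.
schema_training = {
    "SampleFileName": "wafer_08012020_120000.csv",
    "LengthOfDateStampInFile": 8,   # digits in the date part of the file name
    "LengthOfTimeStampInFile": 6,   # digits in the time part of the file name
    "NumberofColumns": 592,         # wafer name + sensor columns + label (illustrative count)
    "ColName": {
        "Wafer": "varchar",
        "Sensor-1": "float",
        # ... one entry per sensor column ...
        "Good/Bad": "Integer",
    },
}
```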

Directory creation:

All the necessary folders are created to separate the files effectively so that the end user can access them easily.
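
A minimal sketch of how such directories might be created; the folder names below are illustrative, not the exact names used in the project.

```python
import os

# Illustrative directory layout; the actual folder names may differ.
for folder in [
    "Training_Raw_Files_Validated/Good_Raw",
    "Training_Raw_Files_Validated/Bad_Raw",
    "PredictionOutputFile",
    "models",
]:
    os.makedirs(folder, exist_ok=True)  # create the folder only if it does not already exist
```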

Data Validation:

In this step, we validate the dataset against the provided schema file, checking the file names, the number of columns, the column names, and their datatypes. If a file matches the schema values, it is considered a good file on which we can train or predict with our model; otherwise, the file is considered bad and moved to the bad folder. Moreover, we also identify columns with null values. If an entire column is missing, we consider the file bad as well; on the contrary, if only a fraction of the data in a column is missing, we initially fill it with NaN and treat it as good data.
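
A simplified sketch of the column-level validation step, assuming the schema has been loaded into a dictionary like the one shown earlier; the function name, key names, and folder handling are assumptions.

```python
import os
import shutil
import pandas as pd

def validate_columns(file_path, schema, good_dir, bad_dir):
    """Move a raw file to the good or bad folder based on the schema (simplified sketch)."""
    df = pd.read_csv(file_path)

    # Reject files whose column count does not match the schema.
    if df.shape[1] != schema["NumberofColumns"]:
        shutil.move(file_path, os.path.join(bad_dir, os.path.basename(file_path)))
        return False

    # Reject files where an entire column is missing (all values null).
    if (df.isnull().sum() == len(df)).any():
        shutil.move(file_path, os.path.join(bad_dir, os.path.basename(file_path)))
        return False

    # Partially missing values are left as NaN and imputed later during pre-processing.
    shutil.move(file_path, os.path.join(good_dir, os.path.basename(file_path)))
    return True
```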

Data Insertion in Database:

First, we create a database with the given name. If the database already exists, we open a connection to it. A table named "train_good_raw_dt" or "pred_good_raw_dt" is created in the database, depending on whether we are training or predicting, for inserting the good data files obtained in the data validation step. If the table is already present, a new table is not created and the new files are inserted into the existing table, since we want training to be done on new as well as old training files. In the end, the data stored in the database is exported as a CSV file to be used for model training.
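
A rough sketch of the insertion and export step, here using SQLite for simplicity; the project may use a different database, and the file names are assumptions.

```python
import sqlite3
from pathlib import Path
import pandas as pd

def insert_good_data(db_path, good_dir, table="train_good_raw_dt", export_csv="InputFile.csv"):
    """Insert all validated CSV files into one table and export it for training (sketch)."""
    conn = sqlite3.connect(db_path)  # creates the database file if it does not exist
    for csv_file in Path(good_dir).glob("*.csv"):
        df = pd.read_csv(csv_file)
        # Append to the table so both old and new training files are kept.
        df.to_sql(table, conn, if_exists="append", index=False)

    # Export the accumulated data as a single CSV file for model training.
    pd.read_sql(f"SELECT * FROM {table}", conn).to_csv(export_csv, index=False)
    conn.close()
```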

Data Pre-processing and Model Training:

In the training section, the data is first checked for NaN values in the columns. If present, the NaN values are imputed using the KNN imputer. Columns with zero standard deviation are also identified and removed, as they provide no information during model training. A prediction schema is created based on the remaining dataset columns. Afterwards, the KMeans algorithm is used to create clusters in the pre-processed data. The optimum number of clusters is selected from the elbow plot, and for dynamic selection of the number of clusters we use the "KneeLocator" function. The idea behind clustering is that the data in different clusters can be trained with different algorithms. The KMeans model is trained on the pre-processed data and saved for later use in prediction. After the clusters are created, we find the best model for each cluster. We are using four algorithms: "Random Forest", "K Neighbours", "Logistic Regression" and "XGBoost". For each cluster, every algorithm is trained with the best parameters derived from GridSearch. We calculate the AUC score for each model and select the model with the best score. In this way, the best model is selected for every cluster, and all the models are saved for use in prediction. In the end, the confusion matrix of the model associated with every cluster is also saved to give a glance at the performance of the models.
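
The training flow described above might look roughly like the condensed sketch below. File paths, column names, and parameter grids are assumptions, and only two of the four algorithms are shown in the grid search to keep the example short.

```python
import joblib
import pandas as pd
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("InputFile.csv")
X = df.drop(columns=["Wafer", "Good/Bad"])
y = df["Good/Bad"].map({-1: 0, 1: 1})  # map the -1/+1 labels to 0/1 for the classifiers

# Impute missing values with the KNN imputer and drop zero-variance columns.
X = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(X), columns=X.columns)
X = X.loc[:, X.std() != 0]

# Pick the number of clusters from the elbow curve using KneeLocator.
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1, 11)]
knee = KneeLocator(range(1, 11), inertias, curve="convex", direction="decreasing").knee
n_clusters = knee or 3  # fall back to 3 clusters if no clear elbow is found
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
joblib.dump(kmeans, "models/kmeans.pkl")

# For each cluster, tune the candidate models and keep the one with the best AUC score.
for cluster in range(n_clusters):
    mask = kmeans.labels_ == cluster
    X_tr, X_te, y_tr, y_te = train_test_split(X[mask], y[mask], test_size=0.3, random_state=42)
    best_auc, best_model = -1.0, None
    for estimator, grid in [
        (RandomForestClassifier(), {"n_estimators": [50, 100]}),
        (XGBClassifier(), {"max_depth": [3, 5]}),
    ]:
        search = GridSearchCV(estimator, grid, cv=3).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
        if auc > best_auc:
            best_auc, best_model = auc, search.best_estimator_
    joblib.dump(best_model, f"models/cluster_{cluster}_model.pkl")
```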

Prediction:

In data prediction, the essential directories are created first. The data validation, data insertion, and data pre-processing steps are similar to the training section. The KMeans model created during training is loaded, and the clusters for the pre-processed prediction data are predicted. Based on the cluster number, the respective model is loaded and used to predict the data for that cluster. Once predictions are made for all the clusters, the predictions along with the wafer names are saved in a CSV file at a given location.
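
A corresponding sketch of the prediction step, reusing the artefacts saved in the training sketch above and assuming the prediction data has already been validated and pre-processed the same way as the training data; paths and column names are assumptions.

```python
import joblib
import numpy as np
import pandas as pd

df = pd.read_csv("prediction_input.csv")
wafer_names = df["Wafer"]
X = df.drop(columns=["Wafer"])  # assumed already imputed, with the same columns as in training

# Assign each row to a cluster with the KMeans model saved during training.
kmeans = joblib.load("models/kmeans.pkl")
clusters = kmeans.predict(X)

# Predict every cluster's rows with the model trained for that cluster.
predictions = np.zeros(len(df), dtype=int)
for cluster in np.unique(clusters):
    model = joblib.load(f"models/cluster_{cluster}_model.pkl")
    mask = clusters == cluster
    predictions[mask] = model.predict(X[mask])

# Map the 0/1 model output back to the -1/+1 labels and save wafer names with their predictions.
out = pd.DataFrame({"Wafer": wafer_names, "Prediction": pd.Series(predictions).map({0: -1, 1: 1})})
out.to_csv("Predictions.csv", index=False)
```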

Deployment:

We will be deploying the model to Heroku Cloud.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiast | Machine Learning | Deep Learning | Advanced Computer Vision