To effectively detect the faulty wafers

Overview

wafer_fault_detection

Aim of the project:

In electronics, a wafer (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate for microelectronic devices built in and upon the wafer. The project aims to successfully identify the state of the provided wafer by classifying it between one of the two-class +1 (good, can be used as a substrate) or -1 (bad, the substrate need to be replaced). In this regard, a training dataset is provided to build a machine learning classification model, which can predict the wafer quality.

Data Description:

The columns of provided data can be classified into 3 parts: wafer name, sensor values and label. The wafer name contains the batch number of the wafer, whereas the sensor values obtained from the measurement carried out on the wafer. The label column contains two unique values +1 and -1 that identifies if the wafer is good or need to be replaced. Additionally, we also require a schema file, which contains all the relevant information about the training files such as file names, length of date value in the file name, length of time value in the file name, number of columns, name of the columns, and their datatype.

Directory creation:

All the necessary folders were created to effectively separate the files so that the end-user can get easy access to them.

Data Validation:

In this step, we matched our dataset with the provided schema file to match the file names, the number of columns it should contain, their names as well as their datatype. If the files matched with the schema values, then it is considered a good file on which we can train or predict our model, if not then the files are considered as bad and moved to the bad folder. Moreover, we also identify the columns with null values. If the whole column data is missing then we also consider the file as bad, on the contrary, if only a fraction of data in a column is missing then we initially fill it with NaN and consider it as good data.

Data Insertion in Database:

First, we create a database with the given name passed. If the database is already created, open the connection to the database. A table with the name- "train_good_raw_dt" or "pred_good_raw_dt" is created in the database, based on training or prediction, for inserting the good data files obtained from the data validation step. If the table is already present, then the new table is not created, and new files are inserted in the already present table as we want training to be done on new as well as old training files. In the end, the data in a stored database is exported as a CSV file to be used for model training.

Data Pre-processing and Model Training:

In the training section, first, the data is checked for the NaN values in the columns. If present, impute the NaN values using the KNN imputer. The column with zero standard deviation was also identified and removed as they don't give any information during model training. A prediction schema was created based on the remained dataset columns. Afterwards, the KMeans algorithm is used to create clusters in the pre-processed data. The optimum number of clusters is selected by plotting the elbow plot, and for the dynamic selection of the number of clusters, we are using the "KneeLocator" function. The idea behind clustering is to implement different algorithms to train data in different clusters. The Kmeans model is trained over pre-processed data and the model is saved for further use in prediction. After clusters are created, we find the best model for each cluster. We are using four algorithms, "Random Forest" “K Neighbours”, “Logistic Regression” and "XGBoost". For each cluster, both the algorithms are passed with the best parameters derived from GridSearch. We calculate the AUC scores for both models and select the model with the best score. Similarly, the best model is selected for each cluster. All the models for every cluster are saved for use in prediction. In the end, the confusion matrix of the model associated with every cluster is also saved to give a glance at the performance of the models.

Prediction:

In data prediction, first, the essential directories are created. The data validation, data insertion and data processing steps are similar to the training section. The KMeans model created during training is loaded, and clusters for the pre-processed prediction data is predicted. Based on the cluster number, the respective model is loaded and is used to predict the data for that cluster. Once the prediction is made for all the clusters, the predictions along with the Wafer names are saved in a CSV file at a given location.

Deployment:

We will be deploying the model to Heroku Cloud.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.
Arun Singh Babal
MODeflattener deobfuscates control flow flattened functions obfuscated by OLLVM using Miasm.

MODeflattener deobfuscates control flow flattened functions obfuscated by OLLVM using Miasm.

Suraj Malhotra 138 Jan 07, 2023
This is a Saleae Logic custom high level analyzer that allows you to search and mark specific packets.

SaleaePacketParser This is a Saleae Logic custom high level analyzer that allows you to search and mark specific packets. Field "Search For" is used f

1 Dec 16, 2021
Control your gtps with gtps-tools!

Note Please give credit to me! Do not try to sell this app, because this app is 100% open source! Do not try to reupload and rename the creator app! S

Jesen N 6 Feb 16, 2022
Got-book-6 - LSTM trained on the first five ASOIAF/GOT books

GOT Book 6 Generator Are you tired of waiting for the next GOT book to come out? I know that I am, which is why I decided to train a RNN on the first

Zack Thoutt 974 Oct 27, 2022
Reproducible nvim completion framework benchmarks.

Nvim.Bench Reproducible nvim completion framework benchmarks. Runs inside Docker. Fair and balanced Methodology Note: for all "randomness", they are g

i love my dog 14 Nov 20, 2022
Compress .dds file in ggpk to boost fps. This is a python rewrite of PoeTexureResizer.

PoeBooster Compress .dds file in ggpk to boost fps. This is a python rewrite of PoeTexureResizer. Setup Install ImageMagick-7.1.0. Download and unzip

3 Sep 30, 2022
Multtable is a collection of multiplication table generators in various languages.

Multtable Multtable is a collection of multiplication table generators in various languages. This project was created as a joke based on one of my bro

pollen__ 7 Mar 05, 2022
Scripts for BGC analysis in large MAGs and results of their application to soil metagenomes within Chernevaya Taiga RSF-funded project

Scripts for BGC analysis in large MAGs and results of their application to soil metagenomes within Chernevaya Taiga RSF-funded project

1 Dec 06, 2021
Animation retargeting tool for Autodesk Maya. Retargets mocap to a custom rig with a few clicks.

Animation Retargeting Tool for Maya A tool for transferring animation data between rigs or transfer raw mocap from a skeleton to a custom rig. (The sc

Joaen 62 Dec 19, 2022
Reference management solution using Python and Notion.

notion-scholar Reference management solution using Python and Notion. The main idea of this app is to allow to furnish a Notion database using a BibTe

Thomas Hirtz 69 Dec 21, 2022
A service to display a quick summary of a project on GitHub.

A service to display a quick summary of a project on GitHub. Usage 📖 Paste the code below with details filled in as specified below into your Readme.

Rohit V 8 Dec 06, 2022
If Google News had a Python library

pygooglenews If Google News had a Python library Created by Artem from newscatcherapi.com but you do not need anything from us or from anyone else to

Artem Bugara 1.1k Jan 08, 2023
This is the core of the program which takes 5k SYMBOLS and looks back N years to pull in the daily OHLC data of those symbols and saves them to disc.

This is the core of the program which takes 5k SYMBOLS and looks back N years to pull in the daily OHLC data of those symbols and saves them to disc.

Daniel Caine 1 Jan 31, 2022
Free APN For Python

Free APN For Python

XENZI GANZZ 4 Apr 22, 2022
Python based scripts for obtaining system information from Linux.

sysinfo Python based scripts for obtaining system information from Linux. Python2 and Python3 compatible Output in JSON format Simple scripts and exte

Petr Vavrin 70 Dec 20, 2022
A streaming animation of all the edits to a given Wikipedia page.

WikiFilms! What is it? A streaming animation of all the edits to a given Wikipedia page. How it works. It works by creating a "virtual camera," which

Tal Zaken 2 Jan 18, 2022
Simple calculator with random number button and dark gray theme created with PyQt6

Calculator Application Simple calculator with random number button and dark gray theme created with : PyQt6 Python 3.9.7 you can download the dark gra

Flamingo 2 Mar 07, 2022
Python - Aprendendo Python na ByLearn

PYTHON Identação Escopo Pai Escopo filho Escopo neto Variaveis

Italo Rafael 3 May 31, 2022
Mengzhan (John) code for Closed Loop Control system of Sharp Wave Ripples in Hippocampus CA3 region

ClosedLoopControl_Yu Mengzhan (John) code for Closed Loop Control system of Sharp Wave Ripples in Hippocampus CA3 region Creating Python Virtual Envir

Mengzhan (John) Liufu 1 Jan 22, 2022
Script de monitoramento das teclas do teclado, salvando todos os dados digitados em um arquivo de log juntamente com os dados de rede.

listenerPython Script de monitoramento das teclas do teclado, salvando todos os dados digitados em um arquivo de log juntamente com os dados de rede.

Vinícius Azevedo 4 Nov 27, 2022