Identifies the faulty wafer before it can be used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells.

Overview

Retrainable-Faulty-Wafer-Detector

Aim of the project:

In electronics, a wafer (also called a slice or substrate) is a thin slice of semiconductor, such as crystalline silicon (c-Si), used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate for microelectronic devices built in and upon the wafer. The project aims to successfully identify the state of the provided wafer by classifying it between one of the two-class +1 (good, can be used as a substrate) or -1 (bad, the substrate need to be replaced) and then train the model on this data so that it can continuously update itself with the environment and become more generalized with time. In this regard, a training and prediction dataset is provided to build a machine learning classification model, which can predict the wafer quality.

Data Description:

The columns of provided data can be classified into 3 parts: wafer name, sensor values and label. The wafer name contains the batch number of the wafer, whereas the sensor values obtained from the measurement carried out on the wafer. The label column contains two unique values +1 and -1 that identifies if the wafer is good or need to be replaced. Additionally, we also require a schema file, which contains all the relevant information about the training files such as file names, length of date value in the file name, length of time value in the file name, number of columns, name of the columns, and their datatype.

Directory creation:

All the necessary folders were created to effectively separate the files so that the end-user can get easy access to them.

Data Validation:

In this step, we matched our dataset with the provided schema file to match the file names, the number of columns it should contain, their names as well as their datatype. If the files matched with the schema values then they are considered good files on which we can train or predict our model, however, if it didn't match then they are moved to the bad folder. Moreover, we also identify the columns with null values. If all the data in a column is missing then the file is also moved to the bad folder. On the contrary, if only a fraction of data in a column is missing then we initially fill it with NaN and considered it as good data.

Data Insertion in Database:

First, open a connection to the database if it exists otherwise create a new one. A table with the name train_good_raw_dt or pred_good_raw_dt is created in the database, based on the training or prediction process, for inserting the good data files obtained from the data validation step. If the table is already present then new files are inserted in that table as we want training to be done on new as well as old training files. In the end, the data in a stored database is exported as a CSV file, to be used for the model training.

Data Pre-processing and Model Training:

In the training section, first, the data is checked for the NaN values in the columns. If present, impute the NaN values using the KNN imputer. The columns with zero standard deviation were also identified and removed as they don't give any information during model training. A prediction schema was created based on the remained dataset columns. Afterwards, the KMeans algorithm is used to create clusters in the pre-processed data. The optimum number of clusters is selected by plotting the elbow plot, and for the dynamic selection of the number of clusters, we are using the "KneeLocator" function. The idea behind clustering is to implement different algorithms to train data in different clusters. The Kmeans model is trained over pre-processed data and the model is saved for further use in prediction. After clusters are created, we find the best model for each cluster. We are using four algorithms, Random Forest, K-Neighbours, Logistic Regression and XGBoost. For each cluster, both the algorithms are passed with the best parameters derived from GridSearch. We calculate the AUC scores for all the models and select the one with the best score. Similarly, the best model is selected for each cluster. For every cluster, the models are saved so that they can be used in future predictions. In the end, the confusion matrix of the model associated with every cluster is also saved to give the a glance over the performance of the models.

Prediction:

In data prediction, first, the essential directories are created. The data validation, data insertion and data processing steps are similar to the training section. The KMeans model created during training is loaded, and clusters for the pre-processed prediction data is predicted. Based on the cluster number, the respective model is loaded and is used to predict the data for that cluster. Once the prediction is made for all the clusters, the predictions along with the Wafer names are saved in a CSV file at a given location.

Retraining:

After the prediction, the prediction data is merged with the previous training dataset and then the models were retrained on this data using the hyperparameter values obtained from the GridSearch. The cycle repeats with every prediction it does and learns from the newly acquired data, making it more robust.

Deployment:

We will be deploying the model to Heroku Cloud.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiasts | Machine Learning | Deep Learning | Advanced Computer Vision.
Arun Singh Babal
Decentralized intelligent voting application.

DiVA Decentralized intelligent voting application. Hack the North 2021. Inspiration Following the previous US election, many voters were fearful that

Ali Shariatmadari 4 Jun 05, 2022
Comprehensive Python Cheatsheet

Comprehensive Python Cheatsheet

Jure Šorn 31.3k Dec 30, 2022
Python module used to generate random facts

Randfacts is a python library that generates random facts. You can use randfacts.get_fact() to return a random fun fact. Disclaimer: Facts are not gua

Tabulate 14 Dec 14, 2022
Experimental Brawl Stars v36.218 server emulator written in Python.

Brawl Stars v36 Experimental Brawl Stars v36.218 server emulator written in Python. Requirements: Python 3.7 or higher colorama Running the server In

8 Oct 31, 2021
Anki for desktop computers

Anki This repo contains the source code for the computer version of Anki. If you'd like to try development builds of Anki but don't feel comfortable b

Ankitects 12.9k Jan 09, 2023
RxPY - The Reactive Extensions for Python (RxPY)

The Reactive Extensions for Python (RxPY) A library for composing asynchronous and event-based programs using observable collections and query operato

ReactiveX 4.4k Dec 29, 2022
An almost fully customizable language made in python!

Whython is a project language, the idea of it is that anyone can download and edit the language to make it suitable to what they want.

Julian 47 Nov 05, 2022
Chalice - A tool to facilitate Python based lambda deployment

Chalice is a tool to facilitate Python based lambda deployment. This repo contains the output of my basic exploration of this tool.

Csilla Bessenyei 1 Feb 03, 2022
Pyfetch - Simple Fetch written in Python

pyfetch Simple Fetch written in Python Screenshots Install Clone this repository

2 Sep 02, 2022
List of all D&D 5e monsters: WotC + popular third-party sourcebooks

Xio's Guide to Monsters If you're a DM like me, and you have multiple sources of D&D 5e monsters that include WotC as well as third-party suppliers, y

20 Jan 06, 2023
A male and female dog names python package

A male and female dog names python package

Fayas Noushad 3 Dec 12, 2021
Terminal compatible with ansi-bbs. Meant to be a prototype, but published because why not.

pybbsterm: Terminal emulator for calling BBSs. Use cases (non-exhaustive) Explore terminal protocols. Connect to BBSs. Highlights Python 3.8+ code. Bu

Roc Vallès i Domènech 9 Apr 29, 2022
Python library to decorate and beautify strings

outputformater Python library to decorate and beautify your standard output 💖 I

Felipe Delestro Matos 259 Dec 13, 2022
This is a method to build your own qgis configuration packages using osgeo4W.

This is a method to build your own qgis configuration packages using osgeo4W. Then you can automate deployment in your organization with a controled and trusted environnement.

Régis Haubourg 26 Dec 05, 2022
A python script to make leaderboards using a CSV with the runners name, IDs and Flag Emojis

SrcLbMaker A python script to make speedrun.com global leaderboards. Installation You need python 3.6 or higher. First, go to the folder where you wan

2 Jul 25, 2022
The ldapconsole script allows you to perform custom LDAP requests to a Windows domain

ldapconsole The ldapconsole script allows you to perform custom LDAP requests to a Windows domain. Features Authenticate with password Authenticate wi

Podalirius 38 Dec 09, 2022
Analyze FnO trends by using NSE Bhav copy

BhavFnO Analyze FnO trends by using NSE Bhav copy Download entire BhavFnO folder and unzip it In that folder open command window

33 Jan 04, 2023
The code behind sqlfmt.com, a web UI for sqlfmt

The code behind sqlfmt.com, a web UI for sqlfmt

Ted Conbeer 2 Dec 14, 2022
An Airflow operator to call the main function from the dbt-core Python package

airflow-dbt-python An Airflow operator to call the main function from the dbt-core Python package Motivation Airflow running in a managed environment

Tomás Farías Santana 93 Jan 08, 2023
Solves Maths24 problems for you!

maths24-solver Solves Maths24 problems for you! Enjoy this open scource project! You can edit modify and share! My wishes is for you to use this proje

6 Nov 07, 2021