Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

Overview

Breast Cancer Classification

  Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms. The metrics below have been used to determine these algorithms performance.

  • Accuracy
  • Precision
  • Recall
  • F Score

Accuracy may produce misleading results so because of that I also added some metrics which some of them are more reliable (e.g. F Score).

Algorithms

  Logistic regression, SVM (Support Vector Machines), decision trees, random forest, naive bayes, k-nearest neighbor algorithms have been used and for each of them metrics are calculated and results are shown.

Data Preprocessing

  The dataset contains no missing rows or columns so we can start feature selection. To do that I used correlation map to show the correlation between features. And I eliminated mostly correlated features like perimeter_mean and perimeter_worst. After this process we have 18 features.

image

Then we apply data normalization and our data is ready for classification.

# Data normalization
standardizer = StandardScaler()
X = standardizer.fit_transform(X)

Train and Test Split

I have split my dataset as %30 test, % 70 training and set random_state parameter to 0 as shown.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

After splitting dataset, I created dictionaries for algorithms and metrics. And in one for loop every model trained and tested.

models = {'Logistic Regression': LogisticRegression(), 'Support Vector Machines': LinearSVC(),
          'Decision Trees': DecisionTreeClassifier(), 'Random Forest': RandomForestClassifier(),
          'Naive Bayes': GaussianNB(), 'K-Nearest Neighbor': KNeighborsClassifier()}

accuracy, precision, recall, f_score = {}, {}, {}, {}

for key in models.keys():
    # Fit the classifier model
    models[key].fit(X_train, y_train)

    # Classification
    classification = models[key].predict(X_test)

    # Calculate Accuracy, Precision, Recall and F Score Metrics
    accuracy[key] = accuracy_score(classification, y_test)
    precision[key] = precision_score(classification, y_test)
    recall[key] = recall_score(classification, y_test)
    f_score[key] = f1_score(classification, y_test)

Results

As you can see the figure below, most successful classification algorithm seems to logistic regression. And decision tress has the worst performance.

image

To see the values algorithms got for each metric see the table below.

Algorithm Accuracy Precision Recall F Score
Logistic Regression 0.97 0.95 0.96 0.96
SVM 0.95 0.95 0.93 0.94
Decision Trees 0.86 0.84 0.80 0.82
Random Forest 0.94 0.93 0.90 0.92
Naive Bayes 0.90 0.87 0.85 0.86
K-Nearest Neighbor 0.91 0.85 0.91 0.88

Conclusion

I have tuned few parameters for example training and test size, random state and most of the algorithms performed close enough to each other. For different datasets this code can be used. You may need to change feature selection part and if your dataset has missing values you should fill in these values as well. Other than these things you can perform classification with different kind of algorithms.

Owner
Mert Sezer Ardal
Mert Sezer Ardal
Machine Learning Algorithms

Machine-Learning-Algorithms In this project, the dataset was created through a survey opened on Google forms. The purpose of the form is to find the p

Göktuğ Ayar 3 Aug 10, 2022
Module for statistical learning, with a particular emphasis on time-dependent modelling

Operating system Build Status Linux/Mac Windows tick tick is a Python 3 module for statistical learning, with a particular emphasis on time-dependent

X - Data Science Initiative 410 Dec 14, 2022
A Tools that help Data Scientists and ML engineers train and deploy ML models.

Domino Research This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers

Domino Data Lab 73 Oct 17, 2022
A visual dataflow programming language for sklearn

Persimmon What is it? Persimmon is a visual dataflow language for creating sklearn pipelines. It represents functions as blocks, inputs and outputs ar

Álvaro Bermejo 194 Jan 04, 2023
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

Aaron Zuspan 76 Dec 15, 2022
Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning

The mljar-supervised is an Automated Machine Learning Python package that works with tabular data. I

MLJAR 2.4k Jan 02, 2023
Machine-care - A simple python script to take care of simple maintenance tasks

Machine care An simple python script to take care of simple maintenance tasks fo

2 Jul 10, 2022
A collection of interactive machine-learning experiments: 🏋️models training + 🎨models demo

🤖 Interactive Machine Learning experiments: 🏋️models training + 🎨models demo

Oleksii Trekhleb 1.4k Jan 06, 2023
Machine-learning-dell - Repositório com as atividades desenvolvidas no curso de Machine Learning

📚 Descrição Neste curso da Dell aprofundamos nossos conhecimentos em Machine Learning. 🖥️ Aulas (Em curso) 1.1 - Python aplicado a Data Science 1.2

Claudia dos Anjos 1 Jan 05, 2022
distfit - Probability density fitting

Python package for probability density function fitting of univariate distributions of non-censored data

Erdogan Taskesen 187 Dec 30, 2022
A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and the A* Search (using the Manhattan Distance Heuristic)

17 Aug 14, 2022
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning.

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported ha

Microsoft 1.1k Jan 04, 2023
PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

PyNNDescent PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbo

Leland McInnes 699 Jan 09, 2023
A logistic regression model for health insurance purchasing prediction

Logistic_Regression_Model A logistic regression model for health insurance purchasing prediction This code is using these packages, so please make sur

ShawnWang 1 Nov 29, 2021
Traingenerator 🧙 A web app to generate template code for machine learning ✨

Traingenerator 🧙 A web app to generate template code for machine learning ✨ 🎉 Traingenerator is now live! 🎉

Johannes Rieke 1.2k Jan 07, 2023
Bayesian optimization in JAX

Bayesian optimization in JAX

Predictive Intelligence Lab 26 May 11, 2022
Bayesian Additive Regression Trees For Python

BartPy Introduction BartPy is a pure python implementation of the Bayesian additive regressions trees model of Chipman et al [1]. Reasons to use BART

187 Dec 16, 2022
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Dec 22, 2022
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validat

The Apache Software Foundation 121 Dec 28, 2022
The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it inside a loop of Design, Model Development and Operations.

MLOps The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it insid

Maykon Schots 25 Nov 27, 2022