A machine learning project that predicts the price of used cars in the UK

Car Price Prediction

Car Image

Image Credit: AA Cars

Project Overview

  • Scraped data on 3,000 used cars from the AA Cars website using Python and BeautifulSoup.
  • Cleaned the data and built a model to help estimate the price of cars at auction.
  • Building a Flask web app to deploy to the cloud (currently in progress).

Packages/Tools Used

  • Python Version: 3.9
  • BeautifulSoup
  • Requests
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-Learn

Data

The data was scraped from multiple pages of the AA Cars website and stored as a CSV file. Each scraped record contains:

  • Name
  • Price
  • Year
  • Mileage
  • Engine
  • Transmission
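
The scraping step could look roughly like the sketch below. It is only an illustration: the HTML snippet, the CSS class names (`car-card`, `car-name` and so on) and the helper `parse_listings` are hypothetical and do not reflect the real AA Cars markup or the project's actual scraper. In practice each page's HTML would be fetched with Requests before being parsed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up markup standing in for one results page (not the real site HTML).
SAMPLE_HTML = """
<div class="car-card">
  <h3 class="car-name">Ford Fiesta 1.0 Zetec</h3>
  <span class="car-price">£8,995</span>
  <span class="car-year">2016</span>
  <span class="car-mileage">45,300 miles</span>
  <span class="car-engine">Petrol</span>
  <span class="car-transmission">Manual</span>
</div>
<div class="car-card">
  <h3 class="car-name">BMW 3 Series 320d</h3>
  <span class="car-price">£14,250</span>
  <span class="car-year">2017</span>
  <span class="car-mileage">60,120 miles</span>
  <span class="car-engine">Diesel</span>
  <span class="car-transmission">Automatic</span>
</div>
"""

FIELDS = ["name", "price", "year", "mileage", "engine", "transmission"]


def parse_listings(html):
    """Extract one dict per car card from a page of (hypothetical) markup."""
    soup = BeautifulSoup(html, "html.parser")
    cars = []
    for card in soup.select("div.car-card"):
        row = {}
        for field in FIELDS:
            node = card.select_one(f".car-{field}")
            # Missing fields are stored as None and handled during cleaning.
            row[field] = node.get_text(strip=True) if node else None
        cars.append(row)
    return cars


cars = parse_listings(SAMPLE_HTML)

# Write the raw rows as CSV (to an in-memory buffer here; a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(cars)
csv_text = buffer.getvalue()
```

Looping `parse_listings` over several result pages and appending the rows before writing would give the multi-page CSV described above.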

Data Cleaning

The features (columns) contained messy entries and were tidied using custom functions. The following steps were taken:

  • Removed duplicate rows, which would otherwise skew the analysis.
  • Deleted the rows with missing values, since they make up less than 1% of the data.
  • Extracted the manufacturer of each car from the name column.
  • Corrected some values in the manufacturer column by merging similar values and fixing those that were wrongly extracted.
  • Removed the pound symbol and the commas from the values in the price column.
  • Created an age column by subtracting the values in the year column from the current year, 2021. Age is an easier column to work with.
  • Removed the commas, spaces and "miles" text from the values in the mileage column.
  • Corrected some values in the engine and transmission columns by merging similar values and fixing those that were wrongly extracted.
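
The cleaning steps above can be sketched with a few small functions. This is a minimal sketch, not the project's actual code: the function names, the `MANUFACTURER_FIXES` table and the sample strings are assumptions made for illustration, and only the standard library is used.

```python
import re

# Hypothetical corrections for wrongly extracted or inconsistent manufacturers.
MANUFACTURER_FIXES = {
    "vw": "Volkswagen",
    "land": "Land Rover",
    "mercedes": "Mercedes-Benz",
}


def clean_price(raw):
    """'£12,495' -> 12495: drop the pound symbol and commas."""
    return int(raw.replace("£", "").replace(",", "").strip())


def clean_mileage(raw):
    """'45,300 miles' -> 45300: keep only the digits."""
    return int(re.sub(r"\D", "", raw))


def car_age(year, current_year=2021):
    """Age in years relative to the reference year used in the project."""
    return current_year - int(year)


def extract_manufacturer(name):
    """Take the first word of the name, then apply known corrections."""
    first_word = name.split()[0]
    return MANUFACTURER_FIXES.get(first_word.lower(), first_word)


def clean_rows(rows):
    """Drop duplicates and incomplete rows, then tidy each remaining row."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        has_missing = any(value in (None, "") for value in row.values())
        if key in seen or has_missing:
            continue
        seen.add(key)
        cleaned.append({
            "manufacturer": extract_manufacturer(row["name"]),
            "price": clean_price(row["price"]),
            "age": car_age(row["year"]),
            "mileage": clean_mileage(row["mileage"]),
            "engine": row["engine"].strip().title(),
            "transmission": row["transmission"].strip().title(),
        })
    return cleaned


raw_rows = [
    {"name": "VW Golf 1.6 TDI", "price": "£12,495", "year": "2016",
     "mileage": "45,300 miles", "engine": "diesel", "transmission": "manual"},
    # Exact duplicate of the first row: removed.
    {"name": "VW Golf 1.6 TDI", "price": "£12,495", "year": "2016",
     "mileage": "45,300 miles", "engine": "diesel", "transmission": "manual"},
    # Missing mileage: removed.
    {"name": "Ford Focus", "price": "£9,000", "year": "2015",
     "mileage": None, "engine": "petrol", "transmission": "manual"},
]
tidy_rows = clean_rows(raw_rows)
```

Taking the first word of the name is a simple heuristic that fails for two-word brands, which is why a correction table like the one assumed here is needed afterwards.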

Exploratory Data Analysis

  • The number of cars from each manufacturer (figure: Car manufacturer distribution)

  • The number of cars from each year (figure: Year distribution)

  • The number of cars with each engine type (figure: Car engine distribution)

  • The number of cars with each transmission type (figure: Car transmission distribution)

  • The word cloud of all car manufacturers (figure: Car manufacturer wordcloud)
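
The count plots above all follow the same pattern: count the rows per category, then draw a bar chart. A minimal sketch is shown below, assuming the cleaned data is available as a list of dictionaries; the helper names `count_by` and `plot_counts`, the sample rows and the output file name are made up for illustration, and the original plots may well have been drawn with Seaborn instead.

```python
from collections import Counter


def count_by(rows, column):
    """Return (value, count) pairs for one column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


def plot_counts(counts, title, output_path):
    """Save a bar chart of the counts; Matplotlib is imported only when used."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    labels = [label for label, _ in counts]
    values = [value for _, value in counts]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel("Number of cars")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)


# Made-up rows standing in for the cleaned dataset.
rows = [
    {"manufacturer": "Ford"}, {"manufacturer": "BMW"}, {"manufacturer": "Ford"},
    {"manufacturer": "Audi"}, {"manufacturer": "Ford"}, {"manufacturer": "BMW"},
]
manufacturer_counts = count_by(rows, "manufacturer")

# To save the chart:
# plot_counts(manufacturer_counts, "Car manufacturer distribution", "manufacturers.png")
```

The same two calls work for the year, engine and transmission columns by changing the column name and title.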

Model Building

  • The 'name' and 'year' columns were dropped because the manufacturer and age columns derived from them already carry the relevant information.
  • The categorical features (manufacturer, engine and transmission) were encoded as numerical data, and all feature values were scaled so that they fall in the same range.
  • Linear Regression, Ridge Regression, Random Forest Regressor, AdaBoost Regressor, Support Vector Regressor and CatBoost Regressor models were all built.
  • Root mean squared error (RMSE), the square root of the mean of the squared differences between the true and predicted values, was the metric used to evaluate the performance of each model.
  • The CatBoost Regressor model had the best performance, and its hyperparameters were tuned using GridSearchCV to improve it further.
  • The tuned model was tested on new data and gave good predictions.
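
The modelling steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data is synthetic (generated by the hypothetical `make_synthetic_cars` with an invented price rule), the column order is assumed, and a Random Forest Regressor is tuned here in place of CatBoost so that the example depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def make_synthetic_cars(n_rows=200, seed=0):
    """Made-up rows: manufacturer, engine, transmission, age, mileage."""
    rng = np.random.default_rng(seed)
    makers = rng.choice(["Ford", "BMW", "Audi"], size=n_rows)
    engines = rng.choice(["Petrol", "Diesel"], size=n_rows)
    gearboxes = rng.choice(["Manual", "Automatic"], size=n_rows)
    age = rng.integers(0, 15, size=n_rows)
    mileage = age * 8000 + rng.integers(0, 5000, size=n_rows)
    # Invented price rule, used only so the model has a signal to learn.
    price = 25000 - 1200 * age - 0.05 * mileage + 3000 * (makers == "BMW")
    price = price + rng.normal(0, 500, size=n_rows)
    X = np.empty((n_rows, 5), dtype=object)
    X[:, 0], X[:, 1], X[:, 2] = makers, engines, gearboxes
    X[:, 3], X[:, 4] = age, mileage
    return X, price


def build_search():
    """Encode categoricals, scale numerics, and grid-search a regressor."""
    preprocess = ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), [0, 1, 2]),
        ("numeric", StandardScaler(), [3, 4]),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(random_state=0)),
    ])
    param_grid = {
        "model__n_estimators": [50, 100],
        "model__max_depth": [None, 8],
    }
    # scikit-learn maximises scores, hence the negated RMSE.
    return GridSearchCV(pipeline, param_grid,
                        scoring="neg_root_mean_squared_error", cv=3)


X, y = make_synthetic_cars()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

search = build_search()
search.fit(X_train, y_train)
test_rmse = rmse(y_test, search.predict(X_test))
print(search.best_params_, round(test_rmse, 1))
```

Putting the encoder and scaler inside the pipeline means they are refitted on each cross-validation fold, so the held-out fold never influences the preprocessing.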

A Flask web app is currently under construction.

NB: I am open to constructive criticism of this project.

Owner
Victor Umunna