Tutorial repo for an end-to-end Data Science project

Overview

End-to-end Data Science project

This is the repo with the notebooks, code, and additional material used in the ITI's workshop. The goal of the sessions was to illustrate the end-to-end process of an real project.

Additional material

In addition to the notebooks and code, the following material is also available:

Problem statement

Our (fictional) client is an IT educational institute. They have reached out to us has reach out with the following: “IT jobs and technologies keep evolving quickly. This makes our field to be one of the most interesting out there. But on the other hand, such fast development confuses our students. They do not know which skills they need to learn for which job. “Do I need to learn C++ to be a Data Scientist?” “Do DevOps and System admins use the same technologies?” “I really like JavaScript; can I use it in Data Analytics?” Those are some of the questions that our students ask. Could you please develop a data-driven solution for our students to answer such questions? They mostly want to understand the relationships between the jobs and the technologies.


Level guide

Basic Intermediate Advanced
Business case Decide on the KPIs that you will positively influence Calculate the expected financial returns
Data collection Decide on and collect a suitable data source for your business case Decide on, collect and connect multiple data sources for better performance
Legal review Get basic information about the local data privacy law Study the local data privacy law
Cookie Cutter Create the standard directory structure
Git Use Git's GUI to track on master branch Use Git's CLI to track on Dev branch and merge back to Master Decide on a branching strategy and solve merge conflicts
Environments Install python packages using conda Create a dedicated conda environment Share your environment and install it on a different machine
Data cleaning Use basic statistics to filter out non-sense entries Use advanced statistics and unsupervised learning to filter out non-sense entries Calculate a 'sanity probability value' for each data point and use it later as the weight
Descriptive analytics Calculate summary statistics to provide data insights Produce visualizations to provide deeper understanding Apply unsupervised learning to provide even deeper understanding
Predictive analytics Create a single baseline model Create multiple hyper-tuned models. Benchmark their performance Combine the chosen models via ensemble and provide prediction confidence
Prescriptive analytics Recommend the action that the user should take
Software Engineering Refactor your notebooks to simple python scripts Create a production OOP class for predictions Expose your model using an API
MLops Export and load models from pickle files Track your models using Mlflow Create and run a docker image for your project
Product Create a Web App / GUI to expose prediction functionality Add the relevant historical insights, predictions and optimization results Collect users' feedback and retrain your model accordingly
Owner
Deena Gergis
Deena Gergis
Pytorch and Keras Implementations of Hyperspectral Image Classification -- Traditional to Deep Models: A Survey for Future Prospects.

The repository contains the implementations for Hyperspectral Image Classification -- Traditional to Deep Models: A Survey for Future Prospects. Model

Ankur Deria 115 Jan 06, 2023
Mouse Brain in the Model Zoo

Deep Neural Mouse Brain Modeling This is the repository for the ongoing deep neural mouse modeling project, an attempt to characterize the representat

Colin Conwell 15 Aug 22, 2022
Unofficial implementation of Pix2SEQ

Unofficial-Pix2seq: A Language Modeling Framework for Object Detection Unofficial implementation of Pix2SEQ. Please use this code with causion. Many i

159 Dec 12, 2022
Official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th ICML Workshop on AutoML)

Automated Learning Rate Scheduler for Large-Batch Training The official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th

Kakao Brain 35 Jan 04, 2023
Official Pytorch implementation of paper "Reverse Engineering of Generative Models: Inferring Model Hyperparameters from Generated Images"

Reverse_Engineering_GMs Official Pytorch implementation of paper "Reverse Engineering of Generative Models: Inferring Model Hyperparameters from Gener

100 Dec 18, 2022
[ICCV 2021 Oral] NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

NerfingMVS Project Page | Paper | Video | Data NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo Yi Wei, Shaohui

Yi Wei 369 Dec 24, 2022
A PyTorch implementation of "Predict then Propagate: Graph Neural Networks meet Personalized PageRank" (ICLR 2019).

APPNP ⠀ A PyTorch implementation of Predict then Propagate: Graph Neural Networks meet Personalized PageRank (ICLR 2019). Abstract Neural message pass

Benedek Rozemberczki 329 Dec 30, 2022
FedGS: A Federated Group Synchronization Framework Implemented by LEAF-MX.

FedGS: Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT Preparation For instructions on generating data, plea

Lizonghang 9 Dec 22, 2022
FaceOcc: A Diverse, High-quality Face Occlusion Dataset for Human Face Extraction

FaceExtraction FaceOcc: A Diverse, High-quality Face Occlusion Dataset for Human Face Extraction Occlusions often occur in face images in the wild, tr

16 Dec 14, 2022
Neural Architecture Search Powered by Swarm Intelligence 🐜

Neural Architecture Search Powered by Swarm Intelligence 🐜 DeepSwarm DeepSwarm is an open-source library which uses Ant Colony Optimization to tackle

288 Oct 28, 2022
Colab notebook for openai/glide-text2im.

GLIDE text2im on Colab This repository provides a Colab notebook to produce images conditioned on text prompts with GLIDE [1]. Usage Run text2im.ipynb

Wok 19 Oct 19, 2022
[ECCV'20] Convolutional Occupancy Networks

Convolutional Occupancy Networks Paper | Supplementary | Video | Teaser Video | Project Page | Blog Post This repository contains the implementation o

622 Dec 30, 2022
CRF-RNN for Semantic Image Segmentation - PyTorch version

This repository contains the official PyTorch implementation of the "CRF-RNN" semantic image segmentation method, published in the ICCV 2015

Sadeep Jayasumana 170 Dec 13, 2022
RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

[Paper] [Хабр] [Model Card] [Colab] [Kaggle] RuDOLPH 🦌 🎄 ☃️ One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP Russian Diffusio

AI Forever 232 Jan 04, 2023
Open-source implementation of Google Vizier for hyper parameters tuning

Advisor Introduction Advisor is the hyper parameters tuning system for black box optimization. It is the open-source implementation of Google Vizier w

tobe 1.5k Jan 04, 2023
Cross-platform-profile-pic-changer - Script to change profile pictures across multiple platforms

cross-platform-profile-pic-changer script to change profile pictures across mult

4 Jan 17, 2022
Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

Meng Liu 2 Jul 19, 2022
quantize aware training package for NCNN on pytorch

ncnnqat ncnnqat is a quantize aware training package for NCNN on pytorch. Table of Contents ncnnqat Table of Contents Installation Usage Code Examples

62 Nov 23, 2022
High-Resolution Image Synthesis with Latent Diffusion Models

Latent Diffusion Models arXiv | BibTeX High-Resolution Image Synthesis with Latent Diffusion Models Robin Rombach*, Andreas Blattmann*, Dominik Lorenz

CompVis Heidelberg 5.6k Dec 30, 2022
Official Pytorch implementation of "Unbiased Classification Through Bias-Contrastive and Bias-Balanced Learning (NeurIPS 2021)

Unbiased Classification Through Bias-Contrastive and Bias-Balanced Learning (NeurIPS 2021) Official Pytorch implementation of Unbiased Classification

Youngkyu 17 Jan 01, 2023