🚀 An end-to-end ML application using PyTorch, W&B, FastAPI, Docker, Streamlit and Heroku

Creating an End-to-End ML Application w/ PyTorch

🚀 This project was created using the Made With ML boilerplate template. Check it out to start creating your own ML applications.

Overview

  • Why do we need to build end-to-end applications?
    • By building e2e applications, you ensure that your code is organized, tested, interactive and easy to scale up or integrate into larger pipelines.
    • If you're someone in industry and are looking to showcase your work to future employers, it's no longer enough to just have code on Jupyter notebooks. ML is just another tool and you need to show that you can use it in conjunction with all the other software engineering disciplines (frontend, backend, devops, etc.). The perfect way to do this is to create end-to-end applications that utilize all these different facets.
  • What are the components of an end-to-end ML application?
    1. Basic experimentation in Jupyter notebooks.
      • We aren't going to completely dismiss notebooks because they're still a great tool for iterating quickly. Check out the notebook for our task here → notebook
    2. Moving our code from notebooks to organized scripts.
      • Once we did some basic development (on downsized datasets), we want to move our code to scripts to reduce technical debt. We'll create functions and classes for different parts of the pipeline (data, model, train, etc.) so we can easily make them robust for different circumstances.
      • We used our own boilerplate to organize our code before moving any of the code from our notebook.
    3. Proper logging and testing for your code.
      • Log key events (preprocessing, training performance, etc.) using the built-in logging library. Also use logging to see new inputs and outputs during prediction to catch issues, etc.
      • You also need to properly test your code. You will add and update your functions and their tests over time but it's important to at least start testing crucial pieces of your code from the beginning. These typically include sanity checks with preprocessing and modeling functions to catch issues early. There are many options for testing Python code but we'll use pytest here.
    4. Experiment tracking.
      • We use Weights and Biases (WandB), where you can easily track all the metrics of your experiment, config files, performance details, etc. for free. Check out the Dashboards page for an overview and tutorials.
      • When you're developing your models, start with simple approaches first and then slowly add complexity. You should clearly document (README, articles and WandB reports) and save your progression from simple to more complex models so your audience can see the improvements. The ability to write well and document your thinking process is a core skill to have in research and industry.
      • WandB also has free tools for hyperparameter tuning (Sweeps) and for data/pipeline/model management (Artifacts).
    5. Robust prediction pipelines.
      • When we actually deploy an ML application for the real world to use, we don't just look at the softmax scores.
      • Before even doing any forward pass, we need to analyze the input and deem if it's within the manifold of the training data. If it's something new (or adversarial) we shouldn't send it down the ML pipeline because the results cannot be trusted.
      • During processes like preprocessing, we need to constantly observe what the model received. For example, if the input has a bunch of unknown tokens then we need to flag the prediction because it may not be reliable.
      • After the forward pass we need to do tests on the model's output as well. If the predicted class has a mediocre test set performance, then we need the class probability to be above some critical threshold. Similarly we can relax the threshold for classes where we do exceptionally well.
    6. Wrap your model as an API.
      • Now we start to modularize larger operations (single/batch predict, get experiment details, etc.) so others can use our application without having to execute granular code. There are many options for this like Flask, Django, FastAPI, etc. but we'll use FastAPI for its ease of use and performance.
      • We can also use a Dockerfile to create a Docker image that runs our API. This is a great way to package our entire application to scale it (horizontally and vertically) depending on requirements and usage.
    7. Create an interactive frontend for your application.
      • The best way to showcase your work is to let others easily play with it. We'll be using Streamlit to very quickly create an interactive medium for our application and use Heroku to serve it (1000 hours of usage per month).
      • This is also a great skill to have: in industry you'll need to build demos like this for key stakeholders, and they are great to include in documentation as well.
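
The checks described under "Robust prediction pipelines" can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the 30% unknown-token cutoff, the default threshold and the per-class thresholds are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Set


def unknown_ratio(tokens: List[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are not in the training vocabulary."""
    if not tokens:
        return 1.0  # an empty input carries no known signal
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


def check_prediction(
    tokens: List[str],
    vocab: Set[str],
    probabilities: Dict[str, float],
    class_thresholds: Dict[str, float],
    max_unknown_ratio: float = 0.3,   # hypothetical cutoff
    default_threshold: float = 0.5,   # hypothetical fallback threshold
) -> dict:
    """Pick the most probable class and flag reasons not to trust it."""
    flags = []

    # Input check: too many out-of-vocabulary tokens.
    if unknown_ratio(tokens, vocab) > max_unknown_ratio:
        flags.append("too_many_unknown_tokens")

    # Output check: class-specific probability threshold.
    predicted = max(probabilities, key=probabilities.get)
    threshold = class_thresholds.get(predicted, default_threshold)
    if probabilities[predicted] < threshold:
        flags.append("below_class_threshold")

    return {
        "predicted_class": predicted,
        "probability": probabilities[predicted],
        "flags": flags,
        "trusted": not flags,
    }
```

In this sketch, a class with mediocre test set performance would get a stricter entry in `class_thresholds` (for example 0.85), while a class the model handles well could get a looser one (for example 0.6).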

Set up

virtualenv -p python3.6 venv
source venv/bin/activate
pip install -r requirements.txt
pip install torch==1.4.0

Download embeddings

python text_classification/utils.py
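
This README does not show what the script does internally. GloVe files are plain text, with one token per line followed by its vector components separated by spaces, so reading them can be sketched as below. The function name `parse_glove` is hypothetical and is not taken from `utils.py`.

```python
from typing import Dict, Iterable, List


def parse_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines into a token -> vector mapping."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        embeddings[token] = [float(value) for value in values]
    return embeddings
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, e.g. `parse_glove(open(path, encoding="utf-8"))`, where `path` points at the downloaded embeddings file.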

Training

python text_classification/train.py \
    --data-url https://raw.githubusercontent.com/madewithml/lessons/master/data/news.csv --lower --shuffle --use-glove

Endpoints

uvicorn text_classification.app:app --host 0.0.0.0 --port 5000 --reload
GOTO: http://localhost:5000/docs

Prediction

Scripts

python text_classification/predict.py --text 'The Canadian government officials proposed the new federal law.'

cURL

curl "http://localhost:5000/predict" \
    -X POST -H "Content-Type: application/json" \
    -d '{
            "inputs":[
                {
                    "text":"The Wimbledon tennis tournament starts next week!"
                },
                {
                    "text":"The Canadian government officials proposed the new federal law."
                }
            ]
        }' | json_pp

Requests

import json
import requests

headers = {
    'Content-Type': 'application/json',
}

data = {
    "experiment_id": "latest",
    "inputs": [
        {
            "text": "The Wimbledon tennis tournament starts next week!"
        },
        {
            "text": "The Canadian minister signed in the new federal law."
        }
    ]
}

response = requests.post('http://localhost:5000/predict',
                         headers=headers, data=json.dumps(data))
results = json.loads(response.text)
print(json.dumps(results, indent=2, sort_keys=False))

Streamlit

streamlit run text_classification/streamlit.py
GOTO: http://localhost:8501

Tests

pytest
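
As an illustration of the kind of sanity checks pytest collects, here is a minimal sketch of a test module. The `preprocess_text` function is a hypothetical stand-in written for this example; the project's real preprocessing code in `data.py` may differ.

```python
import re


def preprocess_text(text: str, lower: bool = True) -> str:
    """Hypothetical preprocessing: lowercase, drop punctuation, tidy spaces."""
    if lower:
        text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


# pytest discovers functions whose names start with `test_`.
def test_preprocess_lowercases():
    assert preprocess_text("Hello World") == "hello world"


def test_preprocess_strips_punctuation_and_whitespace():
    assert preprocess_text("  The   law, finally!  ") == "the law finally"


def test_preprocess_keeps_case_when_asked():
    assert preprocess_text("Hello", lower=False) == "Hello"
```

Saved under `tests/` in a file whose name starts with `test_`, these functions would be picked up by the `pytest` command above.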

Docker

  1. Build image
docker build -t text-classification:latest -f Dockerfile .
  2. Run container
docker run -d -p 5000:5000 -p 6006:6006 --name text-classification text-classification:latest

Heroku

Set `WANDB_API_KEY` as an environment variable.
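
The wandb client reads `WANDB_API_KEY` from the environment on its own, but an application can also check for it at startup so that a missing key fails fast with a clear message. A minimal sketch, assuming a hypothetical helper name `get_wandb_api_key`:

```python
import os
from typing import Mapping, Optional


def get_wandb_api_key(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the W&B API key from the environment or raise a clear error."""
    # Accepting a mapping makes the helper easy to test without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    key = environ.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WANDB_API_KEY is not set; add it as an environment variable."
        )
    return key
```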

Directory structure

text-classification/
├── datasets/                           - datasets
├── logs/                               - directory of log files
|   ├── errors/                           - error log
|   └── info/                             - info log
├── tests/                              - unit tests
├── text_classification/                - ml scripts
|   ├── app.py                            - app endpoints
|   ├── config.py                         - configuration
|   ├── data.py                           - data processing
|   ├── models.py                         - model architectures
|   ├── predict.py                        - prediction script
|   ├── streamlit.py                      - streamlit app
|   ├── train.py                          - training script
|   └── utils.py                          - load embeddings and utilities
├── wandb/                              - wandb experiment runs
├── .dockerignore                       - files to ignore on docker
├── .gitignore                          - files to ignore on git
├── CODE_OF_CONDUCT.md                  - code of conduct
├── CODEOWNERS                          - code owner assignments
├── CONTRIBUTING.md                     - contributing guidelines
├── Dockerfile                          - dockerfile to containerize app
├── LICENSE                             - license description
├── logging.json                        - logger configuration
├── Procfile                            - process script for Heroku
├── README.md                           - this README
├── requirements.txt                    - requirements
├── setup.sh                            - streamlit setup for Heroku
└── sweeps.yaml                         - hyperparameter wandb sweeps config
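
The tree lists `logging.json` as the logger configuration, but its contents are not shown here. A configuration of that kind is typically loaded with the standard library's `logging.config.dictConfig`. The sketch below uses an inline dictionary with a single console handler; the logger name, format string and handler are assumptions for illustration, and a file-based setup like the `logs/` directory above would use file handlers instead.

```python
import logging
import logging.config

# Hypothetical configuration; the project's logging.json may differ.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "minimal": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "minimal",
            "level": "INFO",
        },
    },
    "loggers": {
        "text_classification": {
            "handlers": ["console"],
            "level": "INFO",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger("text_classification")

# Log key events such as preprocessing or training milestones.
logger.info("Preprocessing finished")
```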

Overfit to small subset

python text_classification/train.py \
    --data-url https://raw.githubusercontent.com/madewithml/lessons/master/data/news.csv --lower --shuffle --data-size 0.1 --num-epochs 3
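
The idea behind `--shuffle --data-size 0.1` is to train on a small slice of the data and confirm the model can overfit it before running on everything. How `train.py` implements these flags is not shown in this README; the following is a minimal sketch with a hypothetical helper `take_subset` and an arbitrary seed.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_subset(
    rows: Sequence[T],
    data_size: float,
    shuffle: bool = True,
    seed: int = 1234,  # arbitrary seed so the subset is reproducible
) -> List[T]:
    """Return the first `data_size` fraction of rows, optionally shuffled."""
    if not 0.0 < data_size <= 1.0:
        raise ValueError("data_size must be in the interval (0, 1]")
    rows = list(rows)  # copy so the caller's data is left untouched
    if shuffle:
        random.Random(seed).shuffle(rows)
    count = max(1, int(len(rows) * data_size))
    return rows[:count]
```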

Experiments

  1. Random, unfrozen, embeddings
python text_classification/train.py \
    --data-url https://raw.githubusercontent.com/madewithml/lessons/master/data/news.csv --lower --shuffle
  2. GloVe, frozen, embeddings
python text_classification/train.py \
    --data-url https://raw.githubusercontent.com/madewithml/lessons/master/data/news.csv --lower --shuffle --use-glove --freeze-embeddings
  3. GloVe, unfrozen, embeddings
python text_classification/train.py \
    --data-url https://raw.githubusercontent.com/madewithml/lessons/master/data/news.csv --lower --shuffle --use-glove

Next steps

End-to-end topics that will be covered in subsequent lessons.

  • Utilizing wrappers like PyTorch Lightning to structure the modeling even more while getting some very useful utility.
  • Data / model version control (Artifacts, DVC, MLFlow, etc.)
  • Experiment tracking options (MLFlow, KubeFlow, WandB, Comet, Neptune, etc.)
  • Hyperparameter tuning options (Optuna, Hyperopt, Sweeps)
  • Multi-process data loading
  • Dealing with imbalanced datasets
  • Distributed training for much larger models
  • GitHub Actions for automatic testing during commits
  • Prediction fail safe techniques (input analysis, class-specific thresholds, etc.)

Helpful docker commands

• Build image

docker build -t madewithml:latest -f Dockerfile .

• Run container if using CMD ["python", "app.py"] or ENTRYPOINT [ "/bin/sh", "entrypoint.sh"]

docker run -p 5000:5000 --name madewithml madewithml:latest

• Get inside container if using CMD ["/bin/bash"]

docker run -p 5000:5000 -it madewithml /bin/bash

• Run container with mounted volume

docker run -p 5000:5000 -v $PWD:/root/madewithml/ --name madewithml madewithml:latest

• Other flags

-d: detached
-ti: interactive terminal

• Clean up

docker stop $(docker ps -a -q)     # stop all containers
docker rm $(docker ps -a -q)       # remove all containers
docker rmi $(docker images -a -q)  # remove all images