This project shows how to serve a TensorFlow-based image classification model as a web service with TFServing, Docker, and Kubernetes (GKE).

Overview

Deploying ML models with CPU based TFServing, Docker, and Kubernetes

By: Chansung Park and Sayak Paul

This project shows how to serve a TensorFlow image classification model as RESTful and gRPC-based services with TFServing, Docker, and Kubernetes. The idea is to first create a custom TFServing Docker image that contains a TensorFlow model, and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). We also use GitHub Actions to automate the whole procedure whenever a new TensorFlow model is released.

👋 NOTE

  • Even though this project uses an image classification model, its structure and techniques can be used to serve other models as well.
  • There is a counterpart project that uses FastAPI instead of TFServing. If you want to know how to convert a TensorFlow model to an ONNX-optimized model and deploy it on a k8s cluster, check out this repo.

Deploying the model as a service with k8s

  • Prerequisites: Before doing anything else, you have to create a GKE cluster and service accounts with appropriate roles. You also need to obtain GCP credentials so that GitHub Actions can access GCP resources. More detailed information is available here.
flowchart LR
    A[First: Environmental Setup]-->B;
    B[Second: Build TFServing Image]-->C[Third: Deploy on GKE];
  • To deploy a custom TFServing Docker image, we define a deployment.yml workflow file, which is triggered only when there is a new release for the current repository. It is subdivided into three parts that do the following tasks:
    • First subtask handles the environmental setup.
      • GCP Authentication (GCP credential has to be provided in GitHub Secret)
      • Install gcloud CLI toolkit
      • Authenticate Docker to push images to GCR (Google Container Registry)
      • Connect to the designated GKE cluster
    • Second subtask handles building a custom TFServing image.
      • Download and extract the latest released model from the current repository
      • Run the CPU-optimized TFServing image, which is compiled from source (FYI: the image is gcr.io/gcp-ml-172005/tfs-resnet-cpu-opt, and it is publicly available)
      • Copy the extracted model into the running container
      • Commit the changes of the running container and give it a new image name
      • Push the committed image
    • Third subtask handles deploying the custom TFServing image to GKE cluster.
      • Pick one of the scenarios from the various experiments
      • Download the Kustomize toolkit to handle overlay configurations
      • Update the image tag to the newly built one with Kustomize
      • By provisioning Deployment, Service, and ConfigMap, the custom TFServing image gets deployed.
        • NOTE: ConfigMap is only used for batching enabled scenarios to inject batching configurations dynamically into the Deployment.
    • In order to use this repo for your own purpose, please read this document to know what environment variables have to be set.
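
The second subtask (run the base image, copy the model in, commit, push) can be sketched as a list of Docker CLI calls. This is only an illustration of the pattern, not the contents of the workflow file: the container name `serving_base`, the model name `resnet`, the `/models/<name>` path, the `MODEL_NAME` environment variable, the local model directory, and the target image name are all assumptions.

```python
import shlex
import subprocess


def image_build_commands(base_image, model_dir, target_image,
                         container="serving_base", model_name="resnet"):
    """Return the Docker CLI calls that bake a model into a TFServing image."""
    return [
        # Start the CPU-optimized base image as a background container.
        ["docker", "run", "-d", "--name", container, base_image],
        # Copy the extracted model into the container (assumed model path).
        ["docker", "cp", model_dir, f"{container}:/models/{model_name}"],
        # Commit the container under a new image name; MODEL_NAME is an assumption.
        ["docker", "commit", "--change", f"ENV MODEL_NAME {model_name}",
         container, target_image],
        # Push the committed image to the registry.
        ["docker", "push", target_image],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = image_build_commands(
    base_image="gcr.io/gcp-ml-172005/tfs-resnet-cpu-opt",
    model_dir="./saved_model",                           # hypothetical local path
    target_image="gcr.io/my-project/tfs-resnet-custom",  # hypothetical image name
)
run_all(commands)  # dry run: prints the commands without invoking Docker
```

Keeping the commands as data makes it easy to review them in a dry run before letting a CI job execute them.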

If the entire workflow completes without errors, you will see something similar to the text below. As you can see, two external interfaces (8500 for gRPC, 8501 for RESTful) are exposed. You can check the complete logs in the past runs.

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                          AGE
tfs-server       LoadBalancer   xxxxxxxxxx     xxxxxxxxxx      8500:30869/TCP,8501:31469/TCP    23m
kubernetes       ClusterIP      xxxxxxxxxx     <none>          443/TCP                         160m
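
Once the external IP is assigned, the RESTful interface can be exercised through TFServing's `/v1/models/<name>:predict` route. The sketch below only builds the request and does not send it. The host placeholder, the model name `resnet`, the dummy input, and the `REST_PORT` value are assumptions; use whichever port your Service maps to the REST API.

```python
import json
import urllib.request

REST_PORT = 8501  # assumption: replace with the port your Service exposes for REST


def build_predict_request(host, port, model_name, instances):
    """Build a POST request for TFServing's REST predict route."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


# One dummy 2x2 RGB "image"; a real model expects its own input shape.
dummy_image = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
request = build_predict_request("EXTERNAL-IP", REST_PORT, "resnet", [dummy_image])
print(request.full_url)

# Sending the request needs a reachable server, so the call stays commented out:
# with urllib.request.urlopen(request, timeout=10) as response:
#     predictions = json.loads(response.read())["predictions"]
```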

Load testing

We used Locust to conduct load tests for both TFServing and FastAPI. Below are the results for TFServing (gRPC) on various setups; you can find the results for FastAPI (RESTful) in a separate repo. For specific instructions on how to install Locust and run a load test, follow this separate document.

Hypothesis

  • This is a follow-up to the ONNX-optimized FastAPI deployment project, so we wanted to know how the CPU-optimized TensorFlow runtime compares to the ONNX-based one.
  • TFServing's objective is to maximize throughput while keeping tail latency below certain bounds. We wanted to see whether this holds, how reliably it delivers good throughput, and how much throughput is sacrificed to maintain that reliability.
  • According to TFServing's official documentation, TFServing achieves the best performance when it is deployed on fewer, larger (in terms of CPU and RAM) machines. We wanted to estimate how large a machine and how many nodes are enough. For this, we prepared a set of setups that combine different numbers of nodes, numbers of CPU cores, and RAM capacities.
  • TFServing has a number of configurable options for tuning performance. In particular, we wanted to find out how different values of the --tensorflow_inter_op_parallelism, --tensorflow_intra_op_parallelism, and --enable_batching options give different results.
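
To make the last point concrete, here is a minimal sketch that assembles a `tensorflow_model_server` command line for one setup. The three tuning flags are the ones named above. The helper name, the model name `resnet`, and the base path are hypothetical, and tying intra-op parallelism to the core count is an assumption of this sketch.

```python
def server_args(cpu_cores, inter_op, enable_batching=False,
                model_name="resnet", model_base_path="/models/resnet"):
    """Assemble tensorflow_model_server arguments for one experiment setup."""
    args = [
        "tensorflow_model_server",
        f"--model_name={model_name}",            # hypothetical model name
        f"--model_base_path={model_base_path}",  # hypothetical path
        # Number of threads used to run independent ops concurrently.
        f"--tensorflow_inter_op_parallelism={inter_op}",
        # Number of threads used inside a single op; set to the core count here.
        f"--tensorflow_intra_op_parallelism={cpu_cores}",
    ]
    if enable_batching:
        # Server-side batching trades per-request latency for throughput.
        args.append("--enable_batching=true")
    return args


plain = server_args(cpu_cores=8, inter_op=4)
batched = server_args(cpu_cores=8, inter_op=2, enable_batching=True)
print(" ".join(plain))
print(" ".join(batched))
```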

Conclusion

From the results above,

  • TFServing focuses more on reliability than on performance (in terms of throughput). In all cases, no failures were observed, and the response time was consistent.
  • Req/s is lower than for the ONNX-optimized FastAPI deployment, so TFServing sacrifices some performance to achieve reliability. However, note that TFServing comes with many built-in features that most ML serving scenarios require, such as multi-model serving, dynamic batching, and model versioning. Those features possibly make TFServing heavier than a simple FastAPI server.
    • NOTE: We spawned requests every second to clearly see how TFServing behaves with an increasing number of clients. So the Req/s figures do not reflect a real-world situation where clients send requests at arbitrary times.
  • 8vCPU + 16GB RAM seems to be a large enough machine. At least, a bigger RAM size doesn't help much. We might achieve better performance with more than 8 CPU cores, but going beyond 8 cores is somewhat costly.
  • In all cases, the optimal value of --tensorflow_inter_op_parallelism seems to be 4. The value of --tensorflow_intra_op_parallelism is fixed to the number of CPU cores since it specifies the number of threads used to parallelize the execution of an individual op.
  • --enable_batching could give you better performance. However, since TFServing then doesn't respond immediately to each request, there is a trade-off.
  • Considering the cost trade-off, our recommendation from the experiment is to choose the 2n-8c-16r-interop4 configuration unless you care about dynamic batching capabilities. Alternatively, you can write a similar setup for smaller machines by referencing 2n-8c-16r-interop2-batch.
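
The configuration names used in the recommendation encode the setup: node count, CPU cores, RAM in GB, the inter-op value, and an optional batch suffix. A small sketch of decoding such a name follows; the `Setup` class and `parse_setup` helper are hypothetical and not part of the repository.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    nodes: int
    cpu_cores: int
    ram_gb: int
    inter_op: int
    batching: bool


# <nodes>n-<cores>c-<ram>r-interop<value>, optionally followed by -batch
_PATTERN = re.compile(r"^(\d+)n-(\d+)c-(\d+)r-interop(\d+)(-batch)?$")


def parse_setup(name):
    """Decode a setup name such as '2n-8c-16r-interop4' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized setup name: {name!r}")
    nodes, cores, ram, inter_op, batch = match.groups()
    return Setup(int(nodes), int(cores), int(ram), int(inter_op), batch is not None)


recommended = parse_setup("2n-8c-16r-interop4")
batched = parse_setup("2n-8c-16r-interop2-batch")
print(recommended)
print(batched)
```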

👋 NOTE

  • Locust doesn't have built-in support for writing a gRPC-based client, so we wrote one ourselves. If you are curious about the implementation, check out this locustfile.py.
  • In the plot legend, n is the number of nodes (pods), c is the number of CPU cores, r is the RAM capacity, interop is the value of --tensorflow_inter_op_parallelism, and batch means that batching is enabled for that configuration.
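
A custom client for a load-testing tool mainly has to time each call and report success or failure itself. The sketch below shows that core idea with the standard library only and is not the authors' locustfile.py. `timed_call`, `collect`, and `fake_predict` are hypothetical names, and in a real locustfile the `report` callback would typically forward to Locust's request event rather than append to a list.

```python
import time


def timed_call(name, func, report, *args, **kwargs):
    """Call func, measure its latency, and hand the outcome to report()."""
    start = time.perf_counter()
    result = None
    error = None
    try:
        result = func(*args, **kwargs)
    except Exception as exc:  # a load test records failures instead of crashing
        error = exc
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    report(name=name, response_time=elapsed_ms, exception=error)
    return result


# Stand-in collector that simply stores what would be reported.
records = []


def collect(**fields):
    records.append(fields)


def fake_predict(x):
    """Hypothetical stand-in for a gRPC Predict call."""
    if x < 0:
        raise ValueError("negative input")
    return x * 2


ok = timed_call("predict", fake_predict, collect, 21)
failed = timed_call("predict", fake_predict, collect, -1)
print(ok, failed, len(records))
```

Catching the exception inside the wrapper keeps a failing request from stopping the simulated user, while still counting it as a failure.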

Future works

  • More load test comparisons with more ML inference frameworks such as NVIDIA's Triton Inference Server, KServe, and RedisAI.

  • Advancing this repo by providing semi-automatic model deployment. To be more specific, when code implementing a new ML model is submitted as a pull request, maintainers could trigger a model performance evaluation on GCP's Vertex Training via comments. The experiment results could be exposed through TensorBoard.dev or W&B. If the pull request is approved, the code will be merged, the trained model will be released, and it will be deployed on GKE.

Acknowledgements

We thank the ML-GDE program for providing GCP credit support.
