Containerized architectures for deep learning
Antje Barth @anbarth
Me
Data Enthusiast
Technical Evangelist
AI / ML / Deep Learning
Container / Kubernetes
Big Data
#CodeLikeAGirl
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
ML – Helicopter view
How good are your predictions?
• Accuracy
• Optimization
ML – The (enterprise) reality
• Wrangle large datasets
• Unify disparate systems
• Composability
• Manage pipeline complexity
• Improve training/serving consistency
• Improve portability
• Improve model quality
• Manage versions
[Diagram: everything around building a model: data ingestion, data analysis, data transform, data validation, data splitting, ad-hoc training, model validation, training at scale, roll-out, serving, logging, monitoring, plus distributed training, data versioning, hyperparameter (HP) tuning, experiment tracking, and a feature store, typically spread across many separate systems (SYSTEM 1 through SYSTEM 6, with SYSTEM 1.5 and 3.5 in between).]
The rise of ML pipeline tools & platforms
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Quick comparison
• Apache Airflow is a platform to programmatically author, schedule and monitor workflows. (https://airflow.apache.org/)
• The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. (https://www.kubeflow.org/)
• TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. (https://www.tensorflow.org/tfx)
• MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. (https://mlflow.org/)
How to scale to production?
Composability
Portability
Scalability
Wait a minute…
Virtual Machines are Computers in a Box
Containers are Applications in a Box
Containers?
Kubernetes?
Kubernetes is an API and agents
The Kubernetes API provides containers with scheduling, configuration, networking, and storage.
The Kubernetes runtime manages the containers.
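To make this concrete, a minimal sketch of how a container is described to the Kubernetes API as a Pod; names, image tag, and resource numbers are illustrative, not from the talk:

# The API object declares the container, its configuration, and its
# resource needs; the scheduler and kubelet take care of running it.
apiVersion: v1
kind: Pod
metadata:
  name: training-step                          # hypothetical name
spec:
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu  # placeholder image tag
      command: ["python", "train.py"]          # hypothetical entrypoint
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          nvidia.com/gpu: 1                    # exposed via the NVIDIA device plugin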
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Machine Learning on Kubernetes
• Kubernetes-native
• Run wherever k8s runs
• Move between local – dev – test – prod – cloud
• Use k8s to manage ML tasks
• CRDs (Custom Resource Definitions) for distributed training (see the sketch below)
• Adopt k8s patterns
• Microservices
• Manage infrastructure declaratively
• Support for multiple ML frameworks
• TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
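As a sketch of what such a CRD-driven job can look like, here is a minimal Kubeflow PyTorchJob for distributed training; names and image are assumptions, and field details may vary by training-operator version:

# Sketch of a CRD-based distributed training job (Kubeflow PyTorchJob).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: dist-train                            # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                   # required container name
              image: myregistry/train:latest  # hypothetical image
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/train:latest  # hypothetical image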
Kubernetes ML/DL Landscape
Source: https://twimlai.com/kubernetes-ebook/
https://landscape.lfai.foundation/
https://landscape.cncf.io/
Introducing Kubeflow
Make it easy for everyone to develop,
deploy, and manage portable, scalable
ML everywhere.
Kubeflow components
Composability
• Build and deploy re-usable, portable, scalable machine learning workflows based on Docker containers.
• Use the libraries/frameworks of your choice
Example: the Kubeflow "deployer" component lets you deploy a trained model as a plain TF Serving model server:
https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
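Outside of that component, the same result can be sketched by hand as a plain Kubernetes Deployment running the stock TF Serving image; the model name and storage path below are assumptions:

# Illustrative only: serving a SavedModel with stock TF Serving on Kubernetes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-serving                       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model-serving
  template:
    metadata:
      labels:
        app: my-model-serving
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving            # public TF Serving image
          args:
            - --model_name=my_model            # hypothetical model name
            - --model_base_path=gs://my-bucket/my_model   # hypothetical path
          ports:
            - containerPort: 8501              # TF Serving REST port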
Back to our ML enterprise workflow!
[Diagram: the same workflow stages as before (data ingestion through training, serving, and monitoring), now with Kubeflow components such as metadata tracking and serving mapped onto them.]
Portability
Containers for Deep Learning
[Diagram: a TensorFlow container image bundles Python, TensorFlow, Keras, Horovod, NumPy, SciPy, scikit-learn, pandas, OpenMPI and others, plus CPU-specific packages (MKL) and GPU-specific packages (cuDNN, cuBLAS, NCCL, CUDA toolkit). The image runs on a container runtime on top of the host OS, NVIDIA drivers, and the underlying infrastructure. The same image is pushed from the development system to a container registry and pulled onto the training cluster, giving ML environments that are consistent and portable.]
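A minimal sketch of the pull side of this workflow, assuming the image was pushed to a private registry; registry, secret, and tag names are placeholders:

# The training cluster pulls the same image that was built and pushed
# from the development system.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train                                      # hypothetical name
spec:
  imagePullSecrets:
    - name: my-registry-creds                         # hypothetical registry secret
  containers:
    - name: tensorflow
      image: registry.example.com/team/tf-train:1.0   # hypothetical image
      imagePullPolicy: IfNotPresent
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU libraries (CUDA, cuDNN) travel inside the image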
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Model works great! But I need six nodes.
Data Scientist IT Ops
Credit: @aronchick
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Data Scientist IT Ops
apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
spec:
  replicaSpecs:
    replicas: 6
    CPU: 1
    GPU: 1
    containers: gcr.io/myco/myjob:1.0
Credit: @aronchick
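The snippet above is a simplified illustration; a fuller sketch of the same idea against the current kubeflow.org/v1 TFJob API might look like this (all values illustrative; the image name is the one from the slide):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: my-job                              # hypothetical name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 6
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow              # required container name
              image: gcr.io/myco/myjob:1.0  # image from the slide (illustrative)
              resources:
                requests:
                  cpu: "1"
                limits:
                  nvidia.com/gpu: 1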
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Data Scientist IT Ops
[Diagram: six GPU nodes spun up for the job.]
Credit: @aronchick
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Job’s done!
Data Scientist IT Ops
Credit: @aronchick
Agenda
• Motivation
• ML pipeline tools and platforms
• Container > Kubernetes > Kubeflow
• Deep Learning Demo
• Conclusion
DEMO “Doppelganger App”
Implementing Image Similarity search
Recap:
The “Kube”flow
• Deploy Kubernetes & Kubeflow
• Experiment in Jupyter
• Build Docker Image
• Train at Scale
• Build Model Server
• Deploy Model
• Integrate Model into App
• Operate
[Diagram: demo architecture. A data scientist works in a Jupyter Notebook pod; Dockerfiles define a training job and an inference service; training runs as "Train Model" pods on Kubernetes worker nodes #1-#3; the Doppelganger model is served by Seldon Core engine pods behind an Istio gateway that routes REST API traffic (curl).]
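For the serving side shown above, a minimal sketch of what a Seldon Core deployment can look like; the names and image are assumptions, not taken from the demo repository:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: doppelganger                                 # hypothetical name
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: doppelganger-model
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: doppelganger-model             # must match the graph node name
                image: myregistry/doppelganger:1.0   # hypothetical image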
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Conclusion & Take-aways
• Platform matters
• Composability – Portability – Scalability
• Containerized architectures
• Kubernetes + Machine Learning = Kubeflow
• Start building!
https://github.com/antje/doppelganger
More information
• Kubeflow
https://www.kubeflow.org/
https://github.com/kubeflow/kubeflow
• TensorFlow Extended (TFX)
https://www.tensorflow.org/tfx
• The Definitive Guide to Machine Learning Platforms
https://twimlai.com/mlplatforms-ebook/
• Amazon Elastic Kubernetes Service (Amazon EKS)
https://eksworkshop.com
https://github.com/aws-samples/machine-learning-using-k8s
Session page on conference website O’Reilly Events App
Rate today’s session
Thank you!
antje.official
antje@anbarth
Antje Barth