Containerized architectures for deep learning
Antje Barth @anbarth
Me
Data Enthusiast
Technical Evangelist
AI / ML / Deep Learning
Container / Kubernetes
Big Data
#CodeLikeAGirl
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
ML – Helicopter view
How good are your predictions?
• Accuracy
• Optimization
ML – The (enterprise) reality
• Wrangle large datasets
• Unify disparate systems
• Composability
• Manage pipeline complexity
• Improve training/serving consistency
• Improve portability
• Improve model quality
• Manage versions
[Diagram: everything around building a model: data ingestion, data analysis, data transform, data validation, data splitting, ad-hoc training, model validation, training at scale, roll-out, serving, logging, monitoring, plus distributed training, data versioning, hyperparameter (HP) tuning, experiment tracking, and a feature store, typically spread across many separate systems (SYSTEM 1 through SYSTEM 6, with SYSTEM 1.5 and 3.5 in between).]
The rise of ML pipeline tools & platforms
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Quick comparison
• Apache Airflow is a platform to programmatically author, schedule and monitor workflows. (https://airflow.apache.org/)
• The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. (https://www.kubeflow.org/)
• TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. (https://www.tensorflow.org/tfx)
• MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. (https://mlflow.org/)
How to scale to production?
Composability
Portability
Scalability
Wait a minute…
Virtual Machines are Computers in a Box
Containers are Applications in a Box
Containers?
Kubernetes?
Kubernetes is an API and agents
The Kubernetes API provides containers with scheduling, configuration, networking, and storage.
The Kubernetes runtime manages the containers.
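To make this concrete, a minimal sketch of how a container is described to the Kubernetes API as a Pod; names, image tag, and resource numbers are illustrative, not from the talk:

# The API object declares the container, its configuration, and its
# resource needs; the scheduler and kubelet take care of running it.
apiVersion: v1
kind: Pod
metadata:
  name: training-step                          # hypothetical name
spec:
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu  # placeholder image tag
      command: ["python", "train.py"]          # hypothetical entrypoint
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          nvidia.com/gpu: 1                    # exposed via the NVIDIA device plugin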
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Machine Learning on Kubernetes
• Kubernetes-native
• Run wherever k8s runs
• Move between local – dev – test – prod – cloud
• Use k8s to manage ML tasks
• CRDs (Custom Resource Definitions) for distributed training (see the sketch below)
• Adopt k8s patterns
• Microservices
• Manage infrastructure declaratively
• Support for multiple ML frameworks
• TensorFlow, PyTorch, scikit-learn, XGBoost, etc.
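As a sketch of what such a CRD-driven job can look like, here is a minimal Kubeflow PyTorchJob for distributed training; names and image are assumptions, and field details may vary by training-operator version:

# Sketch of a CRD-based distributed training job (Kubeflow PyTorchJob).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: dist-train                            # hypothetical name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                   # required container name
              image: myregistry/train:latest  # hypothetical image
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: myregistry/train:latest  # hypothetical image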
Kubernetes ML/DL Landscape
Source: https://twimlai.com/kubernetes-ebook/
https://landscape.lfai.foundation/
https://landscape.cncf.io/
Introducing Kubeflow
Make it easy for everyone to develop,
deploy, and manage portable, scalable
ML everywhere.
Kubeflow components
Composability
• Build and deploy re-usable, portable, scalable machine learning workflows based on Docker containers.
• Use the libraries/frameworks of your choice
Example: the Kubeflow "deployer" component lets you deploy a trained model as a plain TF Serving model server:
https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/deployer
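Outside of that component, the same result can be sketched by hand as a plain Kubernetes Deployment running the stock TF Serving image; the model name and storage path below are assumptions:

# Illustrative only: serving a SavedModel with stock TF Serving on Kubernetes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-serving                       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model-serving
  template:
    metadata:
      labels:
        app: my-model-serving
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving            # public TF Serving image
          args:
            - --model_name=my_model            # hypothetical model name
            - --model_base_path=gs://my-bucket/my_model   # hypothetical path
          ports:
            - containerPort: 8501              # TF Serving REST port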
Back to our ML enterprise workflow!
[Diagram: the same workflow stages as before (data ingestion through training, serving, and monitoring), now with Kubeflow components such as metadata tracking and serving mapped onto them.]
Portability
Containers for Deep Learning
[Diagram: a TensorFlow container image bundles Python, TensorFlow, Keras, Horovod, NumPy, SciPy, scikit-learn, pandas, OpenMPI and others, plus CPU-specific packages (MKL) and GPU-specific packages (cuDNN, cuBLAS, NCCL, CUDA toolkit). The image runs on a container runtime on top of the host OS, NVIDIA drivers, and the underlying infrastructure. The same image is pushed from the development system to a container registry and pulled onto the training cluster, giving ML environments that are consistent and portable.]
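A minimal sketch of the pull side of this workflow, assuming the image was pushed to a private registry; registry, secret, and tag names are placeholders:

# The training cluster pulls the same image that was built and pushed
# from the development system.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train                                      # hypothetical name
spec:
  imagePullSecrets:
    - name: my-registry-creds                         # hypothetical registry secret
  containers:
    - name: tensorflow
      image: registry.example.com/team/tf-train:1.0   # hypothetical image
      imagePullPolicy: IfNotPresent
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU libraries (CUDA, cuDNN) travel inside the image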
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Model works great! But I need six nodes.
Data Scientist IT Ops
Credit: @aronchick
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Data Scientist IT Ops
apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
spec:
  replicaSpecs:
    replicas: 6
    CPU: 1
    GPU: 1
    containers: gcr.io/myco/myjob:1.0
Credit: @aronchick
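The snippet above is a simplified illustration; a fuller sketch of the same idea against the current kubeflow.org/v1 TFJob API might look like this (all values illustrative; the image name is the one from the slide):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: my-job                              # hypothetical name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 6
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow              # required container name
              image: gcr.io/myco/myjob:1.0  # image from the slide (illustrative)
              resources:
                requests:
                  cpu: "1"
                limits:
                  nvidia.com/gpu: 1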
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Data Scientist IT Ops
[Diagram: six GPU nodes spun up for the job.]
Credit: @aronchick
Scalability
• Kubernetes - Autoscaling Jobs
• Describe the job, let Kubernetes take care of the rest
• CPU, RAM, Accelerators
• TFJobs delete themselves when finished; the node pool autoscales back down
Job’s done!
Data Scientist IT Ops
Credit: @aronchick
Agenda
• Motivation
• ML pipeline tools and platforms
• Container > Kubernetes > Kubeflow
• Deep Learning Demo
• Conclusion
DEMO “Doppelganger App”
Implementing Image Similarity search
Recap:
The “Kube”flow
• Deploy Kubernetes & Kubeflow
• Experiment in Jupyter
• Build Docker Image
• Train at Scale
• Build Model Server
• Deploy Model
• Integrate Model into App
• Operate
[Diagram: demo architecture. A data scientist works in a Jupyter Notebook pod; Dockerfiles define a training job and an inference service; training runs as "Train Model" pods on Kubernetes worker nodes #1-#3; the Doppelganger model is served by Seldon Core engine pods behind an Istio gateway that routes REST API traffic (curl).]
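For the serving side shown above, a minimal sketch of what a Seldon Core deployment can look like; the names and image are assumptions, not taken from the demo repository:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: doppelganger                                 # hypothetical name
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: doppelganger-model
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: doppelganger-model             # must match the graph node name
                image: myregistry/doppelganger:1.0   # hypothetical image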
Agenda
• Motivation
• ML pipeline tools and platforms
• Machine Learning on Kubernetes
• Deep Learning Demo
• Conclusion
Conclusion & Take-aways
• Platform matters
• Composability – Portability – Scalability
• Containerized architectures
• Kubernetes + Machine Learning = Kubeflow
• Start building!
https://github.com/antje/doppelganger
More information
• Kubeflow
https://www.kubeflow.org/
https://github.com/kubeflow/kubeflow
• TensorFlow Extended (TFX)
https://www.tensorflow.org/tfx
• The Definitive Guide to Machine Learning Platforms
https://twimlai.com/mlplatforms-ebook/
• Amazon Elastic Kubernetes Service (Amazon EKS)
https://eksworkshop.com
https://github.com/aws-samples/machine-learning-using-k8s
Session page on conference website O’Reilly Events App
Rate today’s session
Thank you!
antje.official
antje@anbarth
Antje Barth