Deep learning beyond the learning
@joerg_schad @dcos
Jörg Schad
Technical Community Lead / Developer, Deep Learning
● Core Mesos developer at Mesosphere
● Twitter: @joerg_schad
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Promise
3
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Process
4
Step 1: Training (In Data Center - Over Hours/Days/Weeks):
Input: Lots of Labeled Data ("Dog") → Deep neural network model → Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous):
New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: Some insight
5
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
6
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
Cloud Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Streaming of requests
...
Open Source Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Kafka stream of requests
Kubeflow
Deep Learning Pipeline
(Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models — Distributed Data Storage and Streaming, Data Preparation and Analysis, Deep Learning Tools and Distributed Hosting, Building Machine Learning Model, Model Serving, Sending Model to Clients, Monitoring & Operations)
© 2017 Mesosphere, Inc. All Rights Reserved.
Training Challenges
11
Step 1: Training
(In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data ("Dog") → Deep neural network model → Output: Trained Model
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameter
■ #Layer
■ #Units per Layer
■ Learning Rate
■ ….
Data Management
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2018 Mesosphere, Inc. All Rights Reserved. 13
Input Data Management
Input: Lots of Labeled Data
Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra
© 2018 Mesosphere, Inc. All Rights Reserved. 14
Data Preparation
Challenges
● Data is typically not ready to be consumed by an ML job*
● Data Cleaning
● Missing/incorrect labels
● Data Preparation
● Same Format
● Same Distribution
Solutions
* Demo datasets are a fortunate exception :)
Users
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2018 Mesosphere, Inc. All Rights Reserved. 16
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Users
© 2018 Mesosphere, Inc. All Rights Reserved. 17
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Users
Frameworks
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
19
© 2018 Mesosphere, Inc. All Rights Reserved.
20
What is TensorFlow?
“An open-source software library for Machine Intelligence” - tensorflow.org
● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks
© 2018 Mesosphere, Inc. All Rights Reserved. 21
What is TensorFlow?
“An open-source software library for Machine Intelligence” - tensorflow.org
● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
(Stack diagram: TensorFlow Library — Python; Dataflow Executor, Compute Kernel Implementations, Networking, etc.; GPUs, CPUs)
© 2017 Mesosphere, Inc. All Rights Reserved. 22
© 2018 Mesosphere, Inc. All Rights Reserved. 23
Alternatives
© 2018 Mesosphere, Inc. All Rights Reserved. 24
Alternatives
tf.enable_eager_execution()
https://www.tensorflow.org/get_started/eager
© 2018 Mesosphere, Inc. All Rights Reserved. 25
Data Analytics Ecosystem
© 2018 Mesosphere, Inc. All Rights Reserved.
APIs
26
© 2018 Mesosphere, Inc. All Rights Reserved. 27
Challenges
● Different Frameworks
● No one rules them all
Solutions
● Pick the right tool
● PMML if needed
Deep Learning Frameworks
Cluster
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2017 Mesosphere, Inc. All Rights Reserved.
Trained
Model
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your data on a single-node → Output Trained Model
29
Input
Data Set
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address
of the machine where those computations will be performed
● Deploy your code on every machine
● Train your data on the cluster → Output Trained Model
30
Trained
Model
Input
Data Set
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● …
31
© 2018 Mesosphere, Inc. All Rights Reserved.
Resource Isolation and Allocation
32
© 2018 Mesosphere, Inc. All Rights Reserved.
TPU
33
© 2018 Mesosphere, Inc. All Rights Reserved.
TPUs
34
© 2017 Mesosphere, Inc. All Rights Reserved. 35
Datacenter
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines (TensorFlow, Jenkins, Kafka, Spark, ...)
© 2018 Mesosphere, Inc. All Rights Reserved.
(DC/OS overview diagram: "Datacenter and Cloud as a Single Computing Resource, Powered by Apache Mesos" — microservices, containers & dev tools (100+ more) and data services, machine learning & AI (20+ more) running on physical infrastructure, virtual machines, public clouds, datacenter and edge; platform capabilities: Security & Compliance, Application-Aware Automation, Multitenancy, Hybrid Cloud Management)
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow*
37
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
* Any Distributed System
Deploy, Scale, Configure, Recover ... at 3 AM
Typical Datacenter: siloed, over-provisioned servers, low utilization (HDFS, Kafka, Kubernetes, Flink, TensorFlow)
© 2018 Mesosphere, Inc. All Rights Reserved.
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
39
MESOS ARCHITECTURE
(Diagram: three Mesos Masters; Mesos Agents running executors and tasks — Cassandra Executor/Task, Spark Executor/Task, Docker Executor/Task, CDB Executor — alongside services; framework schedulers — Flink, Spark, Kafka — registered with the Masters)
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
40
● Hard-coding a “ClusterSpec” is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they “inherit” from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
41
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP
of the machine where those computations will be performed
● Deploy your code on every machine
● Train your data on the cluster → Output Trained Model
42
Trained
Model
Input
Data Set
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
43
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
44
● Wrapper script to abstract away distributed TensorFlow configuration
○ Separates “deployer” responsibilities from “developer” responsibilities
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
User
Code
Wrapper
Script
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
45
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
Model Management
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2018 Mesosphere, Inc. All Rights Reserved.
Recall
47
Step 1: Training (In Data Center - Over Hours/Days/Weeks):
Input: Lots of Labeled Data ("Dog") → Deep neural network model → Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous):
New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
© 2017 Mesosphere, Inc. All Rights Reserved.
Many Models
48
Step 1: Training (In Data Center - Over Hours/Days/Weeks):
Input: Lots of Labeled Data ("Dog") → Deep neural network model → Output: Trained Model
© 2018 Mesosphere, Inc. All Rights Reserved. 49
Challenges
● Many Models
● Different Hyperparameters
● Different Models
● New Training Data
● ...
Solutions
● Persistent Storage + Metadata
Model Management
GFS
© 2017 Mesosphere, Inc. All Rights Reserved.
TensorFlow Hub
50
https://www.tensorflow.org/hub/
Serving
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2018 Mesosphere, Inc. All Rights Reserved. 52
Challenges
● How to Deploy Models?
● Zero Downtime
● Canary
Solutions
● TensorFlow Serving
Model Serving
© 2018 Mesosphere, Inc. All Rights Reserved.
TensorFlow Lite
53
https://www.tensorflow.org/mobile/tflite/
Challenges
● Small/Fast model without losing too
much performance
● 500 KB models….
© 2018 Mesosphere, Inc. All Rights Reserved.
Rendezvous Architecture
54
https://mapr.com/ebooks/machine-learning-logistics/
Monitoring
(Deep Learning Pipeline overview diagram: Data & Streaming | Users | Frameworks & Cluster | Models)
© 2018 Mesosphere, Inc. All Rights Reserved. 56
Challenges
● Understand {...}
● Debug
● Model Quality
● Accuracy
● Training Time
● …
● Overall Architecture
● Availability
● Latencies
● ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tool
Monitoring
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
57
tfdbg
https://www.tensorflow.org/programmers_guide/debugger
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
58
tfdbg
- GUI currently in alpha
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
© 2018 Mesosphere, Inc. All Rights Reserved.
Profiling
59
Performance optimization for different
devices
- Keep device occupied
Profiling!
+
Experience!
https://www.tensorflow.org/performance/performance_guide
© 2018 Mesosphere, Inc. All Rights Reserved.
Platforms
60
● AWS SageMaker
  + Spark, MXNet, TF
  + Serving/AB
  - Cloud only
● Google Datalab/ML Engine
  + TF, Keras, scikit-learn, XGBoost
  + Serving/AB
  - Cloud only
  - No control of Docker images
● Kubeflow
  + TF everywhere
  - TF only
● DC/OS
  + Flexibility (all of the above)
  + GPU support
  - More manual setup
© 2018 Mesosphere, Inc. All Rights Reserved. 61
Demo
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
© 2018 Mesosphere, Inc. All Rights Reserved.
Related Work
62
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
© 2018 Mesosphere, Inc. All Rights Reserved.
Special Thanks to All Collaborators
63
Ben Wood Robin Oh
Evan Lezar Art Rand
Gabriel Hartmann Chris Lambert
Bo Hu
Sam Pringle Kevin Klues
© 2018 Mesosphere, Inc. All Rights Reserved.
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow
Questions and Links
64
