Shingo Omura (@everpeace), Preferred Networks, Inc.
Kubeflow Meetup #1 2018-09-26
(Cloud Native Meetup Tokyo #5)
Kubeflow Operators
1
Shingo Omura
• Engineer, Preferred Networks, Inc.
• Dev/Ops in-house GPU clusters
• chainer usability improvement on clouds
• kubeflow/chainer-operator developer
– spin up distributed chainer jobs
with one yaml !!
• @everpeace (twitter)
• shingo.omura (facebook)
2
We’re Hiring!!
Shingo Omura
Keynote at July Tech Festa 2018
SlideShare
Kubernetes Meetup Tokyo #13
28th(Fri) at Yahoo Japan!!!
3
Please Join!!
Today’s Topic
4
c.f. Kubeflow Deep Dive – David Aronchick & Jeremy Lewi, Google, KubeCon + CloudNativeCon Europe 2018
Training!!
Kubeflow supports multiple ML frameworks
New!!
0.3.0
HOROVOD
5
How? ➔ Operators and CRDs !!
Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
What is CRD !?
kind: CustomResourceDefinition
…
spec:
  kind: MyKind
What is Operator !?
Operator
6
What is CRD !?
Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk from www.flaticon.com is licensed by CC 3.0 BY
Custom Resource Definition
kind: CustomResourceDefinition
…
spec:
  kind: MyKind
Custom Resource
kind: MyKind
metadata:
  name: my-name
7
What is Operator !?
Custom Resource
kind: MyKind
metadata:
  name: my-name
Custom Resource & Cluster State ➔ Operator ➔ Cluster State
8
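The operator idea on this slide can be sketched as a reconcile function: it compares the desired state declared in a custom resource with the observed cluster state and returns the actions needed to make them match. This is a toy illustration only, not how any Kubeflow operator is implemented; the `replicas` field, the pod naming scheme, and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CustomResource:
    kind: str
    name: str
    replicas: int  # hypothetical field, not shown on the slide


def reconcile(resource: CustomResource, observed_pods: List[str]) -> List[str]:
    """Return the actions that move the observed state toward the desired state."""
    # Desired state: one pod per replica, named <resource name>-<index>
    # (an assumed naming scheme, used only for this sketch).
    desired_pods = [f"{resource.name}-{i}" for i in range(resource.replicas)]
    actions = []
    # Create whatever is declared but missing from the cluster.
    for pod in desired_pods:
        if pod not in observed_pods:
            actions.append(f"create {pod}")
    # Delete whatever exists in the cluster but is not declared.
    for pod in observed_pods:
        if pod not in desired_pods:
            actions.append(f"delete {pod}")
    return actions


if __name__ == "__main__":
    cr = CustomResource(kind="MyKind", name="my-name", replicas=2)
    print(reconcile(cr, ["my-name-0", "my-name-5"]))
```

A real operator runs this kind of comparison repeatedly, triggered by changes to the custom resource or to the cluster state, so that calling it again after the actions are applied yields no further actions.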
Kubeflow’s multi ML framework support
apiVersion: kubeflow.org/v1alpha*
kind: **Job
...
Operator
CRDs          Operators           ksonnet packages
TFJob         tf-operator         examples
PyTorchJob    pytorch-operator    pytorch-job
MPIJob        mpi-operator        mpi-job
MXJob         mxnet-operator      mxnet-job
Caffe2Job     caffe2-operator     no pkg for caffe2
ChainerJob    chainer-operator    chainer-job
* mpi-operator supports horovod jobs
* examples package contains TFJob
Icons made by Gregor Cresnar, Kiranshastry, Icon Pond, Icon Monk, Freepik from www.flaticon.com is licensed by CC 3.0 BY
9
10
All the CRDs support single-node and multi-node machine learning jobs
A CLI-supported framework for extensible Kubernetes configurations
ksonnet
11
ksonnet saves us from editing lengthy YAML files!
12
ksonnet saves us from editing lengthy YAML files!
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: sample
  namespace: user-omura
spec:
  tfReplicaSpecs:
    Ps:
      template:
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
      tfReplicaType: PS
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            …….
13
How Do
Kubeflow Operators Work??
14
Two Different Distributed Training Job Styles
Icons made by Eucalyp, Smashicons from www.flaticon.com is licensed by CC 3.0 BY
Parameter Servers Style
Parameter servers
● calculate gradient averages
● send them back to Workers
Workers
● train (calculate gradients) in parallel
● send them to parameter servers
All-Reduce Style
Workers
● train (calculate gradients) in parallel
● exchange them with each other
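The difference between the two styles can be sketched in a few lines of plain Python. This is a single-process simulation under simplifying assumptions, not any framework's actual communication code: gradients are plain lists of floats, and the all-reduce shown is the naive form in which every worker receives every peer's gradient and averages locally (real implementations such as ring all-reduce are more bandwidth-efficient). All function names are hypothetical.

```python
from typing import List

Gradient = List[float]


def average(grads: List[Gradient]) -> Gradient:
    """Element-wise mean of equally sized gradient vectors."""
    return [sum(values) / len(grads) for values in zip(*grads)]


def parameter_server_step(worker_grads: List[Gradient]) -> List[Gradient]:
    """Workers send gradients to a server; the server averages them and sends the result back."""
    server_avg = average(worker_grads)  # computed once, on the parameter server
    return [list(server_avg) for _ in worker_grads]  # one copy returned per worker


def all_reduce_step(worker_grads: List[Gradient]) -> List[Gradient]:
    """Workers exchange gradients with each other; each worker averages locally."""
    results = []
    for _ in worker_grads:
        received = list(worker_grads)  # this worker's own gradient plus every peer's
        results.append(average(received))  # computed independently on each worker
    return results


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]
    print(parameter_server_step(grads))
    print(all_reduce_step(grads))
```

Both functions leave every worker holding the same averaged gradient; what differs is where the averaging happens and who talks to whom, which is why the two styles need different sets of processes (and therefore different CRDs).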
15
HOROVOD
16
TFJob structure (Parameter Server style)
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
spec:
  cleanPodPolicy: ... # controls deletion of pods when a job terminates (Running, All, None)
  tfReplicaSpecs:
    Chief: … # orchestrates training and performs tasks like checkpointing the model
    Evaluator: … # computes evaluation metrics as the model is trained
    Ps: … # parameter servers
    Worker: # does the actual work of training the model; worker 0 might also act as the chief
      replicas: ... # number of replicas
      restartPolicy: # behaviour when pods exit (Always, OnFailure, ExitCode, Never)
      template: … # PodTemplate
cf. https://www.kubeflow.org/docs/guides/components/tftraining/
17
Anatomy of TFJobs
TFJob ➔ tf-operator ➔ k8s (Pods, Service)
● expands a TFJob into Pods and Services
● retries Pods when they exit, according to restartPolicy
● cleans up Pods when the job finishes, according to cleanPodPolicy
18
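The two policies above can be read as small decision functions. The sketch below is an illustration of that reading, not tf-operator's code. The policy names come from the TFJob slide; the function names are hypothetical, and the exit-code ranges used for `ExitCode` (1-127 treated as permanent errors, 128-255 as retryable) are an assumption about the documented convention that should be checked against the Kubeflow docs.

```python
from typing import Dict, List


def should_restart(restart_policy: str, exit_code: int) -> bool:
    """Decide whether an exited pod should be restarted."""
    if restart_policy == "Always":
        return True
    if restart_policy == "Never":
        return False
    if restart_policy == "OnFailure":
        return exit_code != 0
    if restart_policy == "ExitCode":
        # Assumption: 1-127 is a permanent error, 128-255 is a retryable error.
        return 128 <= exit_code <= 255
    raise ValueError(f"unknown restartPolicy: {restart_policy}")


def pods_to_clean(clean_pod_policy: str, pod_phases: Dict[str, str]) -> List[str]:
    """Pick the pods to delete once the job has terminated."""
    if clean_pod_policy == "None":
        return []
    if clean_pod_policy == "All":
        return sorted(pod_phases)
    if clean_pod_policy == "Running":
        return sorted(name for name, phase in pod_phases.items() if phase == "Running")
    raise ValueError(f"unknown cleanPodPolicy: {clean_pod_policy}")


if __name__ == "__main__":
    print(should_restart("ExitCode", 137))
    print(pods_to_clean("Running", {"worker-0": "Succeeded", "ps-0": "Running"}))
```

Under this reading, `Running` is the natural cleanup choice for parameter-server jobs: parameter servers typically keep running after the workers finish, so only those leftover pods need deleting.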
ChainerJob structure (All-Reduce style)
apiVersion: kubeflow.org/v1alpha2
kind: ChainerJob
spec:
  backend: mpi # defines the protocol used to initiate process groups (only ‘mpi’ is supported now)
  master: # initiates and orchestrates the distributed job
    activeDeadlineSeconds: # same as in JobSpec
    backoffLimit: # same as in JobSpec
    ...
  workerSets: # a set of workerSets (for defining heterogeneous workers)
    workerSetName: # your own workerSet name
      replicas: # number of replicas in the workerSet
      mpiConfig: # you can define the number of slots for each worker
      template: # PodTemplate
cf. https://www.kubeflow.org/docs/guides/components/chainer/
19
Anatomy of ChainerJob
ChainerJob ➔ chainer-operator ➔ k8s (ConfigMap, Job, Service, StatefulSets, Pods)
● expands a ChainerJob into a ConfigMap, Job, Service and StatefulSets
● fault tolerance is borrowed from Job and StatefulSets
● scales down when the job finishes, for cleanup
20
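One plausible reason for pairing StatefulSets with a ConfigMap is that StatefulSet pods get stable names of the form `<statefulset name>-<ordinal>`, so an MPI hostfile listing every worker can be generated up front. The sketch below illustrates that idea only; whether chainer-operator stores a hostfile in its ConfigMap, and how it names its StatefulSets, are assumptions here. The `<job>-workerset-<set>` naming and the `build_hostfile` helper are hypothetical. The `host slots=N` line format is the one used by Open MPI hostfiles.

```python
from typing import Dict, List


def build_hostfile(job_name: str, worker_sets: Dict[str, Dict[str, int]]) -> List[str]:
    """Build MPI hostfile lines for every worker pod in every workerSet."""
    lines = []
    for set_name, config in worker_sets.items():
        # Hypothetical StatefulSet name; the real operator may use another scheme.
        statefulset = f"{job_name}-workerset-{set_name}"
        for ordinal in range(config["replicas"]):
            # StatefulSet pods are named <statefulset name>-<ordinal>.
            lines.append(f"{statefulset}-{ordinal} slots={config['slots']}")
    return lines


if __name__ == "__main__":
    worker_sets = {
        "gpu": {"replicas": 2, "slots": 1},
        "cpu": {"replicas": 1, "slots": 4},
    }
    print("\n".join(build_hostfile("sample", worker_sets)))
```

Heterogeneous workers map naturally onto this: each workerSet contributes its own replicas with its own slot count.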
Demo Time!!
demo script
Icons made by Eucalyp from www.flaticon.com is licensed by CC 3.0 BY
21
PFN is looking for people who want to take on the challenge of building
efficient and flexible machine learning clusters together with us
https://www.preferred-networks.jp/jobs
We’re Hiring!!
22
Icons made by Vincent Le Moign from https://icon-icons.com/ licensed by CC 3.0 BY
Thank you for Listening!!
Any Questions?
23
