Introduction to DL Platform
Changjian Gao
Table of Contents
• Intro

• Why?

• Goals & Non-Goals

• Heterogeneous Resources and Multi-tenant

• Distributed Training

• Deep Learning as Software Engineering
Deep Learning Frameworks
Why DL platform?
Hidden Technical Debt in Machine Learning Systems, NIPS’15
Goals
• Deep Learning as Software Engineering (think about CI/CD)

• Heterogeneous resources management (CPU, GPU etc.)

• Multi-tenant management (sharing and isolation)

• Distributed training

• Multiple DL frameworks support

• Easy tuning and diagnosis (logs, metrics, profiling etc.)

• User-friendly interface (CLI, Web UI etc.)

• AutoML

• Feature and model sharing

• Maybe: elastic DL, model zoo
Non-Goals
• Invent yet another DL framework

• Intrusive design
Heterogeneous Resources and Multi-tenant
K8s
• Good
• Good for heterogeneous resource management and isolation

• Basic multi-tenant management (namespace etc.; quota sketch below)

• PVCs make data isolation easy

• Active community

• Bad
• Batch workload scheduling

• Flexible multi-tenant management

• YAML isn’t user-friendly (too tedious)

• So many new concepts (pod, service, deployment etc.)
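Below is a minimal sketch of the out-of-the-box isolation K8s offers a tenant: one namespace per team plus a ResourceQuota that also caps GPUs. It uses the official `kubernetes` Python client; the team name and limits are illustrative.

    # Sketch: one namespace per tenant plus a ResourceQuota (CPU, memory, GPU).
    # Team name and limits are illustrative, not a recommendation.
    from kubernetes import client, config

    config.load_kube_config()              # or load_incluster_config() in-cluster
    core = client.CoreV1Api()

    team = "team-a"                        # hypothetical tenant

    # A namespace gives basic isolation for workloads, services and PVCs.
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=team)))

    # A ResourceQuota caps what the tenant can request, including GPUs.
    core.create_namespaced_resource_quota(
        namespace=team,
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=team + "-quota"),
            spec=client.V1ResourceQuotaSpec(hard={
                "requests.cpu": "32",
                "requests.memory": "128Gi",
                "requests.nvidia.com/gpu": "4",
            })))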
K8s - Scheduling
• The default scheduler isn’t suited for batch workloads

• DL jobs are usually batch workloads (especially distributed training)

• What we miss from other schedulers (e.g. YARN):

• Gang scheduling (a.k.a. coscheduling)

• Fair-share and capacity scheduler

• Queue

• Priority

• Preemption
K8s - Scheduling
• Volcano (gang-scheduling sketch below)
• Batch system built on K8s

• CNCF sandbox project

• Led by Huawei Cloud

• SIG Scheduling
• K8s scheduling framework (since 1.15)

• Led by IBM and Alibaba Cloud

• Scheduler Plugins
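As a concrete example of gang scheduling, here is a hedged sketch of a Volcano PodGroup created through the Python client's CustomObjectsApi: the job's pods are only scheduled once all minMember replicas can start together. Name, namespace and queue are illustrative.

    # Sketch: a Volcano PodGroup asking for gang scheduling -- pods of the job
    # are only placed when all `minMember` replicas can start together.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    pod_group = {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "PodGroup",
        "metadata": {"name": "dist-train", "namespace": "team-a"},
        "spec": {
            "minMember": 4,          # all 4 workers, or none at all
            "queue": "default",      # Volcano queue used for fair share
        },
    }

    api.create_namespaced_custom_object(
        group="scheduling.volcano.sh", version="v1beta1",
        namespace="team-a", plural="podgroups", body=pod_group)
    # The job's pods then set spec.schedulerName: volcano and join this group.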
K8s - GPU Sharing
• GPU sharing is hard (the default device plugin only allocates whole GPUs; sketch below)

• Current solutions:

• GPU Sharing Scheduler Extender (Alibaba Cloud)

• GPU Manager (Tencent Cloud)

• Virtual GPU Device Plugin (AWS)

• Multi-Instance GPUs (Nvidia)
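For context, a plain GPU request through the default NVIDIA device plugin looks like the sketch below: nvidia.com/gpu only accepts whole numbers, which is exactly why fractional sharing needs one of the extensions above. Image, names and namespace are illustrative.

    # Sketch: requesting a GPU the standard way. `nvidia.com/gpu` limits must
    # be whole numbers, so fractional sharing needs a scheduler extension.
    from kubernetes import client, config

    config.load_kube_config()

    container = client.V1Container(
        name="trainer",
        image="nvcr.io/nvidia/tensorflow:21.02-tf2-py3",   # illustrative image
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}),                # whole GPUs only
    )

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-trainer", namespace="team-a"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)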
Distributed Training
Goals
• High scaling efficiency
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Parameter Server
Large Scale Distributed Deep Networks, NIPS’12
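A toy, framework-free sketch of the parameter-server pattern from the paper: a central server owns the weights, workers pull them, compute gradients on their own data shard, and push updates back asynchronously. compute_gradient and the data are stand-ins for a real model.

    # Toy sketch of the parameter-server pattern: the server owns the weights,
    # each "worker" pulls them, computes a gradient on its shard, pushes it back.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.weights = np.zeros(dim)
            self.lr = lr

        def pull(self):
            return self.weights.copy()

        def push(self, grad):                  # asynchronous SGD update
            self.weights -= self.lr * grad

    def compute_gradient(weights, x, y):       # stand-in: squared-error loss
        pred = x @ weights
        return 2 * (pred - y) * x

    server = ParameterServer(dim=3)
    shards = [(np.random.randn(3), 1.0) for _ in range(4)]   # one shard per worker

    for step in range(100):
        for x, y in shards:                    # each worker in turn
            w = server.pull()
            server.push(compute_gradient(w, x, y))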
Ring Allreduce
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
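A minimal Horovod + tf.keras sketch of where ring allreduce enters: hvd.DistributedOptimizer averages gradients across workers, and rank 0 broadcasts the initial weights. Hyperparameters are illustrative; launch with horovodrun.

    # Minimal Horovod sketch: each worker trains on its data, gradients are
    # averaged with ring allreduce inside DistributedOptimizer.
    # Launch with e.g. `horovodrun -np 4 python train.py`.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each worker process to one local GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

    # Scale the learning rate by world size; wrap the optimizer for allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0

    model.fit(
        x, y, batch_size=64, epochs=1,
        # Keep workers consistent by broadcasting initial weights from rank 0.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,
    )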
K8s - Operator
• The Operator pattern aims to capture the key aim of a human operator
who is managing a service or set of services

• Invented by CoreOS (now part of Red Hat)

• Useful operators for distributed training (TFJob sketch below):

• kubeflow/tf-operator (TensorFlow, PS mode)

• kubeflow/pytorch-operator (PyTorch, allreduce/DDP mode)

• kubeflow/mxnet-operator (MXNet, PS mode)

• kubeflow/mpi-operator (Any framework, Allreduce mode)
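As an illustration of the operator workflow, a hedged sketch of submitting a TFJob (handled by kubeflow/tf-operator) through the Python client. The spec follows the kubeflow.org/v1 TFJob CRD; image, namespace and replica counts are made up.

    # Sketch: a TFJob with 2 PS and 4 Worker replicas, created as a custom
    # resource. tf-operator turns it into pods, services and env config.
    from kubernetes import client, config

    def replica(count, image):
        return {
            "replicas": count,
            "restartPolicy": "Never",
            "template": {"spec": {"containers": [
                {"name": "tensorflow", "image": image}]}},
        }

    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": "mnist-dist", "namespace": "team-a"},
        "spec": {"tfReplicaSpecs": {
            "PS": replica(2, "example.com/mnist-train:latest"),
            "Worker": replica(4, "example.com/mnist-train:latest"),
        }},
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace="team-a",
        plural="tfjobs", body=tfjob)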
Deep Learning as Software Engineering
Kubeflow Pipelines
• Reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK (pipeline sketch below)

• Integrates with K8s from day one (Kubeflow = Kubernetes + Workflow)

• DAG orchestration based on Argo

• Heavily relies on K8s operators (i.e. CRDs)

• Web UI and API

• Led by Google Cloud
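A small sketch with the (v1) Kubeflow Pipelines SDK: two container steps wired into a DAG and compiled into an Argo workflow. Images, file paths and parameters are illustrative.

    # Sketch with the Kubeflow Pipelines v1 SDK (ContainerOp style).
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="train-and-eval",
                  description="Hypothetical two-step training pipeline")
    def train_and_eval(epochs: int = 1):
        train = dsl.ContainerOp(
            name="train",
            image="example.com/train:latest",
            arguments=["--epochs", epochs],
            # The file's content becomes the step output named "model".
            file_outputs={"model": "/out/model_path.txt"},
        )
        dsl.ContainerOp(
            name="evaluate",
            image="example.com/eval:latest",
            # Consuming train's output creates the DAG edge automatically.
            arguments=["--model", train.outputs["model"]],
        )

    # Produces an Argo workflow that the Kubeflow Pipelines UI/API can run.
    kfp.compiler.Compiler().compile(train_and_eval, "pipeline.yaml")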
Kubeflow Pipelines
MLflow
• An open source platform for the machine learning lifecycle (tracking sketch below)

• Integrates with K8s experimentally

• Relies on the K8s Job resource

• Web UI and API

• Led by Databricks
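A minimal MLflow tracking sketch: log params, metrics and an artifact for one run, then browse them in the web UI. The tracking URI, experiment name and logged values are illustrative.

    # Sketch: tracking one training run with MLflow.
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.example.com:5000")  # or local ./mlruns
    mlflow.set_experiment("dl-platform-demo")

    with mlflow.start_run():
        mlflow.log_param("lr", 0.01)
        mlflow.log_param("batch_size", 64)
        for epoch in range(3):
            mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
        mlflow.log_artifact("model.h5")   # assumes training wrote this file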
MLflow
Thanks
