Introduction to DL Platform
Changjian Gao
Table of Contents
• Intro

• Why?

• Goals & Non-Goals

• Heterogeneous Resources and Multi-tenant

• Distributed Training

• Deep Learning as Software Engineering
Deep Learning Frameworks
Why DL platform?
Hidden Technical Debt in Machine Learning Systems, NIPS’15
Goals
• Deep Learning as Software Engineering (think about CI/CD)

• Heterogeneous resources management (CPU, GPU etc.)

• Multi-tenant management (sharing and isolation)

• Distributed training

• Multiple DL frameworks support

• Easy tuning and diagnosis (logs, metrics, profiling etc.)

• User-friendly interface (CLI, Web UI etc.)

• AutoML

• Feature and model sharing

• Maybe: elastic DL, model zoo
Non-Goals
• Invent yet another DL framework

• Intrusive design
Heterogeneous Resources and Multi-tenant
K8s
• Good
• Good for heterogeneous resource management and isolation

• Basic multi-tenant management (namespace etc.; quota sketch below)

• PVCs make data isolation easy

• Active community

• Bad
• Batch workload scheduling

• Flexible multi-tenant management

• YAML isn’t user-friendly (too tedious)

• So many new concepts (pod, service, deployment etc.)
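Below is a minimal sketch of the out-of-the-box isolation K8s offers a tenant: one namespace per team plus a ResourceQuota that also caps GPUs. It uses the official `kubernetes` Python client; the team name and limits are illustrative.

    # Sketch: one namespace per tenant plus a ResourceQuota (CPU, memory, GPU).
    # Team name and limits are illustrative, not a recommendation.
    from kubernetes import client, config

    config.load_kube_config()              # or load_incluster_config() in-cluster
    core = client.CoreV1Api()

    team = "team-a"                        # hypothetical tenant

    # A namespace gives basic isolation for workloads, services and PVCs.
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=team)))

    # A ResourceQuota caps what the tenant can request, including GPUs.
    core.create_namespaced_resource_quota(
        namespace=team,
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=team + "-quota"),
            spec=client.V1ResourceQuotaSpec(hard={
                "requests.cpu": "32",
                "requests.memory": "128Gi",
                "requests.nvidia.com/gpu": "4",
            })))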
K8s - Scheduling
• The default scheduler isn’t suited for batch workloads

• DL jobs are usually batch workloads (especially distributed training)

• What we miss from other schedulers (e.g. YARN):

• Gang scheduling (a.k.a. coscheduling)

• Fair-share and capacity scheduler

• Queue

• Priority

• Preemption
K8s - Scheduling
• Volcano (gang-scheduling sketch below)
• Batch system built on K8s

• CNCF sandbox project

• Led by Huawei Cloud

• SIG Scheduling
• K8s scheduling framework (since 1.15)

• Led by IBM and Alibaba Cloud

• Scheduler Plugins
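As a concrete example of gang scheduling, here is a hedged sketch of a Volcano PodGroup created through the Python client's CustomObjectsApi: the job's pods are only scheduled once all minMember replicas can start together. Name, namespace and queue are illustrative.

    # Sketch: a Volcano PodGroup asking for gang scheduling -- pods of the job
    # are only placed when all `minMember` replicas can start together.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    pod_group = {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "PodGroup",
        "metadata": {"name": "dist-train", "namespace": "team-a"},
        "spec": {
            "minMember": 4,          # all 4 workers, or none at all
            "queue": "default",      # Volcano queue used for fair share
        },
    }

    api.create_namespaced_custom_object(
        group="scheduling.volcano.sh", version="v1beta1",
        namespace="team-a", plural="podgroups", body=pod_group)
    # The job's pods then set spec.schedulerName: volcano and join this group.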
K8s - GPU Sharing
• GPU sharing is hard (the default device plugin only allocates whole GPUs; sketch below)

• Current solutions:

• GPU Sharing Scheduler Extender (Alibaba Cloud)

• GPU Manager (Tencent Cloud)

• Virtual GPU Device Plugin (AWS)

• Multi-Instance GPUs (Nvidia)
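For context, a plain GPU request through the default NVIDIA device plugin looks like the sketch below: nvidia.com/gpu only accepts whole numbers, which is exactly why fractional sharing needs one of the extensions above. Image, names and namespace are illustrative.

    # Sketch: requesting a GPU the standard way. `nvidia.com/gpu` limits must
    # be whole numbers, so fractional sharing needs a scheduler extension.
    from kubernetes import client, config

    config.load_kube_config()

    container = client.V1Container(
        name="trainer",
        image="nvcr.io/nvidia/tensorflow:21.02-tf2-py3",   # illustrative image
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}),                # whole GPUs only
    )

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-trainer", namespace="team-a"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)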
Distributed Training
Goals
• High scaling efficiency
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Parameter Server
Large Scale Distributed Deep Networks, NIPS’12
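A toy, framework-free sketch of the parameter-server pattern from the paper: a central server owns the weights, workers pull them, compute gradients on their own data shard, and push updates back asynchronously. compute_gradient and the data are stand-ins for a real model.

    # Toy sketch of the parameter-server pattern: the server owns the weights,
    # each "worker" pulls them, computes a gradient on its shard, pushes it back.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.weights = np.zeros(dim)
            self.lr = lr

        def pull(self):
            return self.weights.copy()

        def push(self, grad):                  # asynchronous SGD update
            self.weights -= self.lr * grad

    def compute_gradient(weights, x, y):       # stand-in: squared-error loss
        pred = x @ weights
        return 2 * (pred - y) * x

    server = ParameterServer(dim=3)
    shards = [(np.random.randn(3), 1.0) for _ in range(4)]   # one shard per worker

    for step in range(100):
        for x, y in shards:                    # each worker in turn
            w = server.pull()
            server.push(compute_gradient(w, x, y))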
Ring Allreduce
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
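A minimal Horovod + tf.keras sketch of where ring allreduce enters: hvd.DistributedOptimizer averages gradients across workers, and rank 0 broadcasts the initial weights. Hyperparameters are illustrative; launch with horovodrun.

    # Minimal Horovod sketch: each worker trains on its data, gradients are
    # averaged with ring allreduce inside DistributedOptimizer.
    # Launch with e.g. `horovodrun -np 4 python train.py`.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each worker process to one local GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

    # Scale the learning rate by world size; wrap the optimizer for allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0

    model.fit(
        x, y, batch_size=64, epochs=1,
        # Keep workers consistent by broadcasting initial weights from rank 0.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
        verbose=1 if hvd.rank() == 0 else 0,
    )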
K8s - Operator
• The Operator pattern aims to capture the key aim of a human operator
who is managing a service or set of services

• Invented by CoreOS (now part of Red Hat)

• Useful operators for distributed training (TFJob sketch below):

• kubeflow/tf-operator (TensorFlow, PS mode)

• kubeflow/pytorch-operator (PyTorch, allreduce/DDP mode)

• kubeflow/mxnet-operator (MXNet, PS mode)

• kubeflow/mpi-operator (Any framework, Allreduce mode)
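As an illustration of the operator workflow, a hedged sketch of submitting a TFJob (handled by kubeflow/tf-operator) through the Python client. The spec follows the kubeflow.org/v1 TFJob CRD; image, namespace and replica counts are made up.

    # Sketch: a TFJob with 2 PS and 4 Worker replicas, created as a custom
    # resource. tf-operator turns it into pods, services and env config.
    from kubernetes import client, config

    def replica(count, image):
        return {
            "replicas": count,
            "restartPolicy": "Never",
            "template": {"spec": {"containers": [
                {"name": "tensorflow", "image": image}]}},
        }

    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": "mnist-dist", "namespace": "team-a"},
        "spec": {"tfReplicaSpecs": {
            "PS": replica(2, "example.com/mnist-train:latest"),
            "Worker": replica(4, "example.com/mnist-train:latest"),
        }},
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace="team-a",
        plural="tfjobs", body=tfjob)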
Deep Learning as Software Engineering
Kubeflow Pipelines
• Reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK (pipeline sketch below)

• Integrates with K8s from day one (Kubeflow = Kubernetes + Workflow)

• DAG orchestration based on Argo

• Heavily relies on K8s operators (i.e. CRDs)

• Web UI and API

• Led by Google Cloud
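A small sketch with the (v1) Kubeflow Pipelines SDK: two container steps wired into a DAG and compiled into an Argo workflow. Images, file paths and parameters are illustrative.

    # Sketch with the Kubeflow Pipelines v1 SDK (ContainerOp style).
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="train-and-eval",
                  description="Hypothetical two-step training pipeline")
    def train_and_eval(epochs: int = 1):
        train = dsl.ContainerOp(
            name="train",
            image="example.com/train:latest",
            arguments=["--epochs", epochs],
            # The file's content becomes the step output named "model".
            file_outputs={"model": "/out/model_path.txt"},
        )
        dsl.ContainerOp(
            name="evaluate",
            image="example.com/eval:latest",
            # Consuming train's output creates the DAG edge automatically.
            arguments=["--model", train.outputs["model"]],
        )

    # Produces an Argo workflow that the Kubeflow Pipelines UI/API can run.
    kfp.compiler.Compiler().compile(train_and_eval, "pipeline.yaml")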
Kubeflow Pipelines
MLflow
• An open source platform for the machine learning lifecycle (tracking sketch below)

• Integrates with K8s experimentally

• Relies on the K8s Job resource

• Web UI and API

• Led by Databricks
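A minimal MLflow tracking sketch: log params, metrics and an artifact for one run, then browse them in the web UI. The tracking URI, experiment name and logged values are illustrative.

    # Sketch: tracking one training run with MLflow.
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.example.com:5000")  # or local ./mlruns
    mlflow.set_experiment("dl-platform-demo")

    with mlflow.start_run():
        mlflow.log_param("lr", 0.01)
        mlflow.log_param("batch_size", 64)
        for epoch in range(3):
            mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)
        mlflow.log_artifact("model.h5")   # assumes training wrote this file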
MLflow
Thanks
