Cognitive Systems / v3.1 / May 28 / © 2018 IBM Corporation
OpenPOWER ADG
IBM Deep Learning Cluster Reference Architecture
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
IBM Systems Hardware Europe
OpenPOWER ADG
Welcome, everyone, to the AI and OpenPOWER event
Founding Members in 2013
Ecosystem
This is What A Revolution Looks Like (© 2018 OpenPOWER Foundation)
Ecosystem layers: Chip / SoC; I/O / Storage / Acceleration; Boards / Systems; Software; System / Integration; Implementation / HPC / Research
328+ members, 33 countries, 70+ ISVs
Active membership from all layers of the stack
100k+ Linux applications running on Power; 2,300 ISVs have written code on Linux
Partners bring systems to market: 150+ OpenPOWER Ready certified products, 20+ systems manufacturers, 40+ POWER-based systems shipping or in development, 100+ collaborative innovations under way
POWER Roadmap
OpenPOWER in Action
Academic Membership
A*STAR, ASU, ASTRI, Moscow State University, Carnegie Mellon University, CDAC, Colorado School of Mines, CINECA, CFMS, Coimbatore Institute of Technology, Dalian University of Technology, GSIC, Hartree Centre, ICM, IIIT Bangalore, IIT Bombay, Indian Institute of Technology Roorkee, ICCS, INAF, FZ Jülich, LSU, BSC, Nanyang Technological University, National University of Singapore, NIT Mangalore, NIT Warangal, Northeastern University in China, ORNL, OSU, Rice, Rome HPC Center, LLNL, Sandia, SASTRA University, Seoul National University, Shanghai Jiao Tong University, SICSR, TEES, Tohoku University, Tsinghua University, University of Arkansas, SDSC, Unicamp, University of Central Florida, University of Florida, University of Hawaii, University of Hyderabad, University of Illinois, University of Michigan, University of Oregon, University of Patras, University of Southern California, TACC, Waseda University, IISc, Loyola, IIT Roorkee
Currently 100+ academic members in the OpenPOWER Foundation
Goals of the Academia Discussion Group
§ Provide training and exchange of experience and know-how
§ Provide a platform for networking among academic members
§ Work on engagement of the HPC community
§ Enable co-design and co-development activities
OpenPOWER Foundation
A growing number of academic organizations have become members of the OpenPOWER Foundation.
The Academia Discussion Group provides a platform for training, networking, engagement and enablement of co-design.
Those who have not yet joined are welcome to join:
https://members.openpowerfoundation.org/wg/AcademiaDG/mail/index
The OpenPOWER AI Virtual University focuses on bringing together industry, government and academic expertise to connect and help shape the future of AI.
https://www.youtube.com/channel/UCYLtbUp0AH0ZAv5mNut1Kcg
IBM Deep Learning Cluster Reference Architecture
Distributed Deep Learning Approach
SINGLE ACCELERATOR (1x accelerator, one system) | DATA PARALLEL (4x accelerators) | MODEL PARALLEL (4x accelerators) | DATA AND MODEL PARALLEL (4x n accelerators across systems 1..n)
Training time decreases from left (single accelerator) to right (data and model parallel); in the data-parallel schemes the training data is split across accelerators and systems.
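Below is a minimal sketch of the data-parallel approach, assuming PyTorch with the NCCL backend and a launcher (torchrun or mpirun) that sets RANK, WORLD_SIZE and LOCAL_RANK; the model, dataset and hyper-parameters are placeholders, not part of the reference architecture.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; replace with a real network and dataset.
    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)              # each rank reads a disjoint shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                    # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()         # DDP all-reduces gradients here
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Model parallelism instead splits the layers of one network across GPUs; the combined scheme runs this data-parallel loop across n such model-parallel groups.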
Phases of AI development
Experimentation Phase
– Single-node
– Small-scale data
– Algorithm prototyping and hyper-parameters
Scaling Phase
– Multi-node
– Medium-scale data (local SSDs or NVMe drives)
Production Phase
– Cluster deployment
– Upstream data pipeline
– Inference
Challenges in Deep Learning
§ Storage performance / Data-pipeline
§ Network performance
§ Orchestration
§ Management and monitoring of the cluster
§ Monitoring of DL training or DL inference
§ Scaling
§ Efficiency
§ Data ingest
§ ILM
§ Backup
§ Accelerated rate of new DL frameworks and versions
§ Software refresh cycle
Deep Learning Scaling Challenges
§ Model replication
§ Device placement for variables
§ Fault tolerance
§ Sessions and Servers
§ Monitoring training session
§ Data splitting
Some Data Scientist Considerations
Data Size
– The entire model might not fit onto a single GPU if the size of the input data is especially large
– A shared file system is required if the number of records is prohibitively large
– If the number of records is large, convergence can be sped up using multiple GPUs or distributed models
Model Size
– Splitting the model across multiple GPUs (model parallelism) is required if the size of the network exceeds the memory of the GPUs used
# Updates
– Multi-GPU configurations on a single server (4, 6, 8 GPUs) should be taken into consideration when the number and size of the updates are considerable (see the memory-footprint sketch after this list)
Hardware
– Network speed plays a crucial role in distributed model settings
– InfiniBand RDMA and MPI play an important role (MPI latency is 1–3 µs/message due to OS bypass)
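To make the model-size and update-size points concrete, here is a rough back-of-the-envelope sketch; the parameter counts, FP32 precision and Adam-like optimizer-state factor are illustrative assumptions, not figures from this architecture.

def estimate_footprint(num_params, bytes_per_param=4, optimizer_state_factor=2):
    """Return (weights_gb, gradients_gb, optimizer_gb, total_gb).

    bytes_per_param=4 assumes FP32; optimizer_state_factor=2 assumes an
    Adam-like optimizer that keeps two extra tensors per parameter.
    """
    gb = 1024 ** 3
    weights = num_params * bytes_per_param / gb
    grads = weights                      # one gradient value per parameter
    opt_state = weights * optimizer_state_factor
    return weights, grads, opt_state, weights + grads + opt_state

if __name__ == "__main__":
    for n in (100_000_000, 1_000_000_000):          # hypothetical parameter counts
        w, g, o, total = estimate_footprint(n)
        print(f"{n / 1e6:.0f}M params: weights {w:.1f} GB + grads {g:.1f} GB "
              f"+ optimizer {o:.1f} GB = {total:.1f} GB")
    # Per-step all-reduce traffic in data-parallel training is roughly the gradient
    # size; if the total approaches the 16 GB of a V100, consider model parallelism
    # or large-model support.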
Standards
§ Mellanox InfiniBand
§ RDMA over InfiniBand
§ NVIDIA GPUs and related software
§ Containers
§ Workload Managers (LSF, SLURM, Kubernetes, etc.)
§ xCAT
§ High-Performance File System
§ Python 2.x and/or 3.x
§ DL Frameworks (Caffe, TensorFlow, Torch, etc.)
§ SSD/NVMe
Functional Requirements
§ NVIDIA GPUs in SXM2 form factor
§ InfiniBand EDR interconnect with no over-subscription
§ Islands approach for large clusters
§ Inter-island 1:2 InfiniBand over-subscription
§ High-performance file system using SSDs, NVMe or flash
§ MPI
§ Job scheduler support for GPU-based containers
§ Job scheduler Python integration
§ DL framework support for NVLink
§ Distributed Deep Learning
§ Large Model Support
§ HDFS support
§ IPMI support
§ Management and monitoring of the infrastructure with xCAT or similar, with a web interface
§ Visualization of distributed deep learning training activities
Non-Functional Requirements
§ Accessibility
§ Auditability and Control
§ Availability
§ Backup
§ Fault tolerance (e.g. Operational System Monitoring, Measuring, and Management)
§ Open Source Frameworks
§ Resilience
§ Scalability in an integrated way (from 2 nodes to 2,000 nodes)
§ Security and Privacy
§ Throughput
§ Performance / short training times
§ Platform compatibility
Architecture Decisions
Containers vs Bare Metal
Architecture Decisions
Storage
Architecture for an experimental IBM Deep Learning System
Hardware Overview
Data scientist workstations connect over an InfiniBand EDR point-to-point link to POWER accelerated servers with GPUs, which use internal SAS drives and NVMe devices for local data.
Architecture for small IBM Deep Learning Cluster
Hardware Overview
Architecture for small IBM Deep Learning Cluster
Hardware Overview for fully containerized environment
Architecture for large IBM Deep Learning Cluster
Hardware Overview
Architecture for small to large IBM Deep Learning Cluster
Storage – Spectrum Scale
Powered by IBM Spectrum Scale: a single global namespace with automated data placement and data migration across flash, disk, tape, shared-nothing clusters and a transparent cloud tier.
Access protocols: File (POSIX, NFS, SMB), Block (iSCSI), Object (Swift, S3), Analytics (transparent HDFS, Spark), OpenStack (Cinder, Glance, Manila).
Data services: encryption, compression, Spectrum Scale RAID on JBOD/JBOF, worldwide data distribution (AFM) across sites A/B/C, and disaster recovery to a DR site with AFM-DR.
The deep learning cluster accesses the file system over native RDMA over InfiniBand; data scientist workstations, traditional applications and long-term (tape) storage share the same namespace.
Architecture for large IBM Deep Learning Cluster
Compute (InfiniBand) Networking
Fat-tree fabric with three switch levels: L3 core switches (L3-1..L3-X) interconnect the islands; within each compute island, L1 leaf switches (L1-1..L1-Y) and L2 switches (L2-1..L2-Z) provide 18x links down to the compute nodes; the management and I/O island provides 18x links to the login/service nodes and 18x links to the IBM ESS.
NOTE: The number of InfiniBand switches depends on the number of compute nodes, the required oversubscription, and the number of available IB ports per switch (see the sizing sketch below).
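A small sizing sketch for this note, assuming a two-level fat tree of 36-port EDR switches; the node counts and oversubscription ratios below are example inputs, not prescriptions from the reference architecture.

import math

def size_fabric(num_nodes, ports_per_switch=36, oversubscription=1.0):
    """Estimate leaf/spine switch counts for a two-level fat tree.

    oversubscription is the downlink:uplink ratio (1.0 means non-blocking,
    2.0 matches the 1:2 inter-island ratio from the requirements slide).
    """
    # Split each leaf's ports between downlinks (to nodes) and uplinks (to spines).
    down = math.floor(ports_per_switch * oversubscription / (oversubscription + 1))
    up = ports_per_switch - down
    leaves = math.ceil(num_nodes / down)
    spines = math.ceil(leaves * up / ports_per_switch)
    return leaves, spines, down, up

if __name__ == "__main__":
    for nodes, ratio in [(18, 1.0), (162, 1.0), (162, 2.0)]:
        leaves, spines, down, up = size_fabric(nodes, oversubscription=ratio)
        print(f"{nodes} nodes @ {ratio}:1 -> {leaves} leaf + {spines} spine switches "
              f"({down} down / {up} up ports per leaf)")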
Architecture for large IBM Deep Learning Cluster
Management Networking
Architecture for large IBM Deep Learning Cluster
Docker Containers (only for HPC-based customers)
Physical View of small IBM Deep Learning Cluster
Hardware rack view
Compute Nodes (9x)
• Shown with decorative bezel; hardware viewable behind bezel
Network Switch Location
• Shown with blank cover; 3 EIA
Empty Space
• 2 EIA; space reserved in the back for power, cooling, cabling escape
Empty Space
• 1 EIA; space reserved in the back for power, cabling escape
Compute Nodes (9x)
• Shown with decorative bezel; hardware viewable behind bezel
Physical View of small IBM Deep Learning Cluster – Sample Scalability
Hardware rack view
Scale by a factor of:
- 2x storage (capacity and performance)
- 3.2x compute
- 1:1 IB oversubscription
Architecture for an experimental IBM Deep Learning System
Software Overview
Option 1: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker
Option 2: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, Docker, ICP with Kubernetes, running PowerAI Base, PowerAI Vision and DSX Local containers
Option 3: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker, IBM Spectrum LSF
Architecture for an experimental IBM Deep Learning System
Software Overview
Option 1 compute nodes: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, Docker, ICP Compute, running PowerAI Base, PowerAI Vision and DSX Local containers; paired with master nodes running RHEL 7.5, Mellanox OFED 4, Docker, ICP Master with Kubernetes, xCAT and Grafana
Option 2 compute nodes: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker, LSF Client; paired with master nodes running RHEL 7.5, Mellanox OFED 4, IBM Spectrum MPI and LSF Master
IBM Cloud Private Architecture Overview
Containerized environment based on Kubernetes
Architecture Overview for IBM Deep Learning Cluster
Hardware Components
§ Login Nodes (40c POWER9, 2x V100 GPUs, 256 GB RAM, 2x 960 GB SSD, IB EDR, 10GE, 1Gbps)
§ Service/Master Nodes (40c POWER, 256 GB RAM, 4x 960 GB SSD, IB EDR, 10GE)
§ CES Nodes (40c POWER, 256 GB RAM, 2x 960 GB SSD, IB EDR, 10GE)
§ Compute/Worker Nodes (40c POWER9, 4x V100 GPUs, 512 GB RAM, 2x 960 GB SSD, 1x 1.6 TB NVMe adapter, IB EDR, 1Gbps)
§ EDR Mellanox InfiniBand switches with 36 ports, including IB cables
§ IBM Ethernet switches for management (48x 1Gbps ports and 4x 10GE ports), including cables and SFP+
§ IBM ESS GS2S, with InfiniBand EDR and a 10GE network for storage
IBM Newell
AC922 System Architecture Overview
Architecture Overview for IBM Deep Learning Cluster
Operational Model 1
Data scientists access 2x IBM AC922 via SSHv2 and HTTP, using the DIGITS web UI, a Python CLI and AI Vision.
Architecture Overview for IBM Deep Learning Cluster
Operational Model 2
Data scientists access 2x IBM AC922 via SSHv2 and HTTP, using Jupyter Notebook, a Python CLI and TensorBoard.
LSF New GPU Scheduling Options
GPU mode management
§ The user can request the desired GPU mode for the job. If the mode of a GPU needs to be changed for a job to run, its original mode is restored after the job completes.
GPU allocation policy
§ Support for reserving physical GPU resources
§ A "best-effort" GPU allocation policy that considers CPU-GPU affinity, current GPU mode and GPU job load
§ CUDA_VISIBLE_DEVICES is exported for use in job pre/post scripts (see the sketch below)
Integrated support for IBM Spectrum MPI
§ Per-task environment variables CUDA_VISIBLE_DEVICES%d are exported
§ IBM Spectrum MPI applies the correct CVD mask to each task
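A minimal sketch of how a job (or a pre/post script) might consume the exported variable, assuming Python is available on the compute node; only the CUDA_VISIBLE_DEVICES name comes from the slide, the rest is illustrative.

import os

def allocated_gpus():
    """Return the physical GPU indices the scheduler has made visible to this task.

    LSF (or Spectrum MPI, for per-task masks) exports CUDA_VISIBLE_DEVICES,
    e.g. "0,2"; CUDA-based frameworks then see only those devices, renumbered from 0.
    """
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in cvd.split(",") if i.strip() != ""]

if __name__ == "__main__":
    gpus = allocated_gpus()
    print(f"This task was allocated {len(gpus)} GPU(s): physical indices {gpus}")
    # Inside the job, device 0 always refers to the first allocated GPU,
    # so framework code can simply address devices 0..len(gpus)-1.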
LSF Docker Support
Starting with LSF 10.1.0.3, we provide support for NVIDIA's distribution of Docker, which allows LSF's CPU, cgroup and GPU allocation functionality to work correctly.

Begin Application
NAME = nvidia-docker
CONTAINER = nvidia-docker[image(nvidia/cuda) options(--rm --net=host --ipc=host --sig-proxy=false) starter(lsfadmin)]
End Application

$ bsub -app nvidia-docker -gpu "num=1" ./ibm-powerai
HW Design: Elastic Storage Server (ESS)
Software
§ IBM Spectrum Scale for IBM Elastic Storage Server
§ Red Hat Enterprise Linux
Data Server Summary
§ 2x 20-core POWER8 3.42 GHz
§ 2x 256 GB DDR4 memory
§ 4x 100 Gb/s InfiniBand EDR
Storage SSD Enclosures
§ 2x 24x 3.84 TB SSD (288 SSDs)
§ ~128 TB usable capacity (8+2 parity)
§ Burst buffer capacity: sum of all NVMe devices in the compute nodes
HW Design: Burst Buffer Integration
§ Compute node SSD uses a standard XFS Linux file system
§ Burst buffer is a file transfer service (an application-level stage-in sketch follows this list)
– Raw data transfer uses NVMe over Fabrics (formerly called FlashDirect)
– Think of it as RDMA targeting NVMe memory
§ Data is transferred between the ESS I/O node and the NVMe PCIe device
– Data is placed directly onto (or pulled from) the NVMe PCIe device, avoiding CPU/GPU usage
– Hardware offload support is built into ConnectX-5
§ File system
– The burst buffer determines where to place data on the NVMe PCIe device, consistent with where the file system expects it
– Optimized for direct placement of data
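The burst buffer itself is a file transfer service between the ESS and the node-local NVMe device, but its effect on a training job is essentially "stage the working set onto the local NVMe/XFS file system before training starts". Here is a hedged, application-level sketch of that stage-in step; the shared and local paths are hypothetical placeholders, and the copy stands in for the NVMe-over-Fabrics transfer the real service performs.

import shutil
from pathlib import Path

# Hypothetical paths: a dataset on the shared Spectrum Scale file system and
# a staging area on the node-local NVMe device (mounted as XFS).
SHARED_DATASET = Path("/gpfs/datasets/imagenet")      # placeholder path
LOCAL_NVME = Path("/nvme/scratch/imagenet")           # placeholder path

def stage_in(src: Path, dst: Path) -> Path:
    """Copy the dataset to node-local NVMe if it is not already there.

    This mimics, at application level, what the burst buffer service achieves
    with NVMe over Fabrics: hot data sits next to the GPUs before training starts.
    """
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dst)
    return dst

if __name__ == "__main__":
    data_dir = stage_in(SHARED_DATASET, LOCAL_NVME)
    print(f"Point the training input pipeline at {data_dir}")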
Thank you
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
florin.manaila@de.ibm.com
