Cognitive Systems / v3.1 / May 28 / © 2018 IBM Corporation
OpenPOWER ADG
IBM Deep Learning Cluster Reference Architecture
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
IBM Systems Hardware Europe
OpenPOWER ADG
Welcome, everyone, to the AI and OpenPOWER event
Founding Members in 2013
Ecosystem
This is What A Revolution Looks Like (© 2018 OpenPOWER Foundation)
Ecosystem layers: Chip / SoC; I/O / Storage / Acceleration; Boards / Systems; Software; System / Integration; Implementation / HPC / Research
328+ members, 33 countries, 70+ ISVs
Active membership from all layers of the stack
100k+ Linux applications running on Power; 2,300 ISVs have written code on Linux
Partners bring systems to market: 150+ OpenPOWER Ready certified products, 20+ systems manufacturers, 40+ POWER-based systems shipping or in development, 100+ collaborative innovations under way
POWER Roadmap
OpenPOWER in Action
Academic Membership
A*STAR, ASU, ASTRI, Moscow State University, Carnegie Mellon University, CDAC, Colorado School of Mines, CINECA, CFMS, Coimbatore Institute of Technology, Dalian University of Technology, GSIC, Hartree Centre, ICM, IIIT Bangalore, IIT Bombay, Indian Institute of Technology Roorkee, ICCS, INAF, FZ Jülich, LSU, BSC, Nanyang Technological University, National University of Singapore, NIT Mangalore, NIT Warangal, Northeastern University in China, ORNL, OSU, Rice, Rome HPC Center, LLNL, Sandia, SASTRA University, Seoul National University, Shanghai Jiao Tong University, SICSR, TEES, Tohoku University, Tsinghua University, University of Arkansas, SDSC, Unicamp, University of Central Florida, University of Florida, University of Hawaii, University of Hyderabad, University of Illinois, University of Michigan, University of Oregon, University of Patras, University of Southern California, TACC, Waseda University, IISc, Loyola, IIT Roorkee
Currently 100+ academic members in the OpenPOWER Foundation
Goals of the Academia Discussion Group
§ Provide training and exchange of experience and know-how
§ Provide a platform for networking among academic members
§ Work on engagement of the HPC community
§ Enable co-design and co-development activities
OpenPOWER Foundation
A growing number of academic organizations have become members of the OpenPOWER Foundation.
The Academia Discussion Group provides a platform for training, networking, engagement and enablement of co-design.
Those who have not yet joined are welcome to join:
https://members.openpowerfoundation.org/wg/AcademiaDG/mail/index
The OpenPOWER AI Virtual University focuses on bringing together industry, government and academic expertise to connect and help shape the future of AI.
https://www.youtube.com/channel/UCYLtbUp0AH0ZAv5mNut1Kcg
IBM Deep Learning Cluster Reference Architecture
Distributed Deep Learning Approach
SINGLE ACCELERATOR (1x accelerator, one system) | DATA PARALLEL (4x accelerators) | MODEL PARALLEL (4x accelerators) | DATA AND MODEL PARALLEL (4x n accelerators across systems 1..n)
Training time decreases from left (single accelerator) to right (data and model parallel); in the data-parallel schemes the training data is split across accelerators and systems.
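Below is a minimal sketch of the data-parallel approach, assuming PyTorch with the NCCL backend and a launcher (torchrun or mpirun) that sets RANK, WORLD_SIZE and LOCAL_RANK; the model, dataset and hyper-parameters are placeholders, not part of the reference architecture.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; replace with a real network and dataset.
    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)              # each rank reads a disjoint shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                    # reshuffle the shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()         # DDP all-reduces gradients here
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Model parallelism instead splits the layers of one network across GPUs; the combined scheme runs this data-parallel loop across n such model-parallel groups.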
Phases of AI development
Experimentation Phase
– Single-node
– Small-scale data
– Algorithm prototyping and hyper-parameters
Scaling Phase
– Multi-node
– Medium-scale data (local SSDs or NVMe drives)
Production Phase
– Cluster deployment
– Upstream data pipeline
– Inference
Challenges in Deep Learning
§ Storage performance / Data-pipeline
§ Network performance
§ Orchestration
§ Management and monitoring of the cluster
§ Monitoring of DL training or DL inference
§ Scaling
§ Efficiency
§ Data ingest
§ ILM
§ Backup
§ Accelerated rate of new DL frameworks and versions
§ Software refresh cycle
Deep Learning Scaling Challenges
§ Model replication
§ Device placement for variables
§ Fault tolerance
§ Sessions and Servers
§ Monitoring training session
§ Data splitting
Some Data Scientist Considerations
Data Size
– The entire model might not fit onto a single GPU if the size of the input data is especially large
– A shared file system is required if the number of records is prohibitively large
– If the number of records is large, convergence can be sped up using multiple GPUs or distributed models
Model Size
– Splitting the model across multiple GPUs (model parallelism) is required if the size of the network exceeds the memory of the GPUs used
# Updates
– Multi-GPU configurations on a single server (4, 6, 8 GPUs) should be taken into consideration when the number and size of the updates are considerable (see the memory-footprint sketch after this list)
Hardware
– Network speed plays a crucial role in distributed model settings
– InfiniBand RDMA and MPI play an important role (MPI latency is 1–3 µs/message due to OS bypass)
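To make the model-size and update-size points concrete, here is a rough back-of-the-envelope sketch; the parameter counts, FP32 precision and Adam-like optimizer-state factor are illustrative assumptions, not figures from this architecture.

def estimate_footprint(num_params, bytes_per_param=4, optimizer_state_factor=2):
    """Return (weights_gb, gradients_gb, optimizer_gb, total_gb).

    bytes_per_param=4 assumes FP32; optimizer_state_factor=2 assumes an
    Adam-like optimizer that keeps two extra tensors per parameter.
    """
    gb = 1024 ** 3
    weights = num_params * bytes_per_param / gb
    grads = weights                      # one gradient value per parameter
    opt_state = weights * optimizer_state_factor
    return weights, grads, opt_state, weights + grads + opt_state

if __name__ == "__main__":
    for n in (100_000_000, 1_000_000_000):          # hypothetical parameter counts
        w, g, o, total = estimate_footprint(n)
        print(f"{n / 1e6:.0f}M params: weights {w:.1f} GB + grads {g:.1f} GB "
              f"+ optimizer {o:.1f} GB = {total:.1f} GB")
    # Per-step all-reduce traffic in data-parallel training is roughly the gradient
    # size; if the total approaches the 16 GB of a V100, consider model parallelism
    # or large-model support.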
Standards
§ Mellanox InfiniBand
§ RDMA over InfiniBand
§ NVIDIA GPUs and related software
§ Containers
§ Workload Managers (LSF, SLURM, Kubernetes, etc.)
§ xCAT
§ High-Performance File System
§ Python 2.x and/or 3.x
§ DL Frameworks (Caffe, TensorFlow, Torch, etc.)
§ SSD/NVMe
Functional Requirements
§ NVIDIA GPUs in SXM2 form factor
§ InfiniBand EDR interconnect with no over-subscription
§ Islands approach for large clusters
§ Inter-island 1:2 InfiniBand over-subscription
§ High-performance file system using SSDs, NVMe or flash
§ MPI
§ Job scheduler support for GPU-based containers
§ Job scheduler Python integration
§ DL framework support for NVLink
§ Distributed Deep Learning
§ Large Model Support
§ HDFS support
§ IPMI support
§ Management and monitoring of the infrastructure with xCAT or similar, with a web interface
§ Visualization of distributed deep learning training activities
Non-Functional Requirements
§ Accessibility
§ Auditability and Control
§ Availability
§ Backup
§ Fault tolerance (e.g. Operational System Monitoring, Measuring, and Management)
§ Open Source Frameworks
§ Resilience
§ Scalability in an integrated way (from 2 nodes to 2,000 nodes)
§ Security and Privacy
§ Throughput
§ Performance / short training times
§ Platform compatibility
Architecture Decisions
Containers vs Bare Metal
Architecture Decisions
Storage
Architecture for an experimental IBM Deep Learning System
Hardware Overview
Data scientist workstations connect over an InfiniBand EDR point-to-point link to POWER accelerated servers with GPUs, which use internal SAS drives and NVMe devices for local data.
Architecture for small IBM Deep Learning Cluster
Hardware Overview
Architecture for small IBM Deep Learning Cluster
Hardware Overview for fully containerized environment
Architecture for large IBM Deep Learning Cluster
Hardware Overview
Architecture for small to large IBM Deep Learning Cluster
Storage – Spectrum Scale
Powered by IBM Spectrum Scale: a single global namespace with automated data placement and data migration across flash, disk, tape, shared-nothing clusters and a transparent cloud tier.
Access protocols: File (POSIX, NFS, SMB), Block (iSCSI), Object (Swift, S3), Analytics (transparent HDFS, Spark), OpenStack (Cinder, Glance, Manila).
Data services: encryption, compression, Spectrum Scale RAID on JBOD/JBOF, worldwide data distribution (AFM) across sites A/B/C, and disaster recovery to a DR site with AFM-DR.
The deep learning cluster accesses the file system over native RDMA over InfiniBand; data scientist workstations, traditional applications and long-term (tape) storage share the same namespace.
Architecture for large IBM Deep Learning Cluster
Compute (InfiniBand) Networking
Fat-tree fabric with three switch levels: L3 core switches (L3-1..L3-X) interconnect the islands; within each compute island, L1 leaf switches (L1-1..L1-Y) and L2 switches (L2-1..L2-Z) provide 18x links down to the compute nodes; the management and I/O island provides 18x links to the login/service nodes and 18x links to the IBM ESS.
NOTE: The number of InfiniBand switches depends on the number of compute nodes, the required oversubscription, and the number of available IB ports per switch (see the sizing sketch below).
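A small sizing sketch for this note, assuming a two-level fat tree of 36-port EDR switches; the node counts and oversubscription ratios below are example inputs, not prescriptions from the reference architecture.

import math

def size_fabric(num_nodes, ports_per_switch=36, oversubscription=1.0):
    """Estimate leaf/spine switch counts for a two-level fat tree.

    oversubscription is the downlink:uplink ratio (1.0 means non-blocking,
    2.0 matches the 1:2 inter-island ratio from the requirements slide).
    """
    # Split each leaf's ports between downlinks (to nodes) and uplinks (to spines).
    down = math.floor(ports_per_switch * oversubscription / (oversubscription + 1))
    up = ports_per_switch - down
    leaves = math.ceil(num_nodes / down)
    spines = math.ceil(leaves * up / ports_per_switch)
    return leaves, spines, down, up

if __name__ == "__main__":
    for nodes, ratio in [(18, 1.0), (162, 1.0), (162, 2.0)]:
        leaves, spines, down, up = size_fabric(nodes, oversubscription=ratio)
        print(f"{nodes} nodes @ {ratio}:1 -> {leaves} leaf + {spines} spine switches "
              f"({down} down / {up} up ports per leaf)")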
Architecture for large IBM Deep Learning Cluster
Management Networking
Architecture for large IBM Deep Learning Cluster
Docker Containers (only for HPC-based customers)
Physical View of small IBM Deep Learning Cluster
Hardware rack view
Compute Nodes (9x)
• Shown with decorative bezel; hardware viewable behind bezel
Network Switch Location
• Shown with blank cover; 3 EIA
Empty Space
• 2 EIA; space reserved in the back for power, cooling, cabling escape
Empty Space
• 1 EIA; space reserved in the back for power, cabling escape
Compute Nodes (9x)
• Shown with decorative bezel; hardware viewable behind bezel
Physical View of small IBM Deep Learning Cluster – Sample Scalability
Hardware rack view
Scale by a factor of:
- 2x storage (capacity and performance)
- 3.2x compute
- 1:1 IB oversubscription
Architecture for an experimental IBM Deep Learning System
Software Overview
Option 1: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker
Option 2: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, Docker, ICP with Kubernetes, running PowerAI Base, PowerAI Vision and DSX Local containers
Option 3: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker, IBM Spectrum LSF
Architecture for an experimental IBM Deep Learning System
Software Overview
Option 1 compute nodes: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, Docker, ICP Compute, running PowerAI Base, PowerAI Vision and DSX Local containers; paired with master nodes running RHEL 7.5, Mellanox OFED 4, Docker, ICP Master with Kubernetes, xCAT and Grafana
Option 2 compute nodes: RHEL 7.5, Mellanox OFED 4, CUDA 9, cuDNN 7, IBM Spectrum MPI, PowerAI 5.1, Docker, Anaconda, nvidia-docker, LSF Client; paired with master nodes running RHEL 7.5, Mellanox OFED 4, IBM Spectrum MPI and LSF Master
IBM Cloud Private Architecture Overview
Containerized environment based on Kubernetes
Architecture Overview for IBM Deep Learning Cluster
Hardware Components
§ Login Nodes (40c POWER9, 2x V100 GPUs, 256 GB RAM, 2x 960 GB SSD, IB EDR, 10GE, 1Gbps)
§ Service/Master Nodes (40c POWER, 256 GB RAM, 4x 960 GB SSD, IB EDR, 10GE)
§ CES Nodes (40c POWER, 256 GB RAM, 2x 960 GB SSD, IB EDR, 10GE)
§ Compute/Worker Nodes (40c POWER9, 4x V100 GPUs, 512 GB RAM, 2x 960 GB SSD, 1x 1.6 TB NVMe adapter, IB EDR, 1Gbps)
§ EDR Mellanox InfiniBand switches with 36 ports, including IB cables
§ IBM Ethernet switches for management (48x 1Gbps ports and 4x 10GE ports), including cables and SFP+
§ IBM ESS GS2S, with InfiniBand EDR and a 10GE network for storage
IBM Newell
AC922 System Architecture Overview
Architecture Overview for IBM Deep Learning Cluster
Operational Model 1
Data scientists access 2x IBM AC922 via SSHv2 and HTTP, using the DIGITS web UI, a Python CLI and AI Vision.
Architecture Overview for IBM Deep Learning Cluster
Operational Model 2
Data scientists access 2x IBM AC922 via SSHv2 and HTTP, using Jupyter Notebook, a Python CLI and TensorBoard.
LSF New GPU Scheduling Options
GPU mode management
§ The user can request the desired GPU mode for the job. If the mode of a GPU needs to be changed for a job to run, its original mode is restored after the job completes.
GPU allocation policy
§ Support for reserving physical GPU resources
§ A "best-effort" GPU allocation policy that considers CPU-GPU affinity, current GPU mode and GPU job load
§ CUDA_VISIBLE_DEVICES is exported for use in job pre/post scripts (see the sketch below)
Integrated support for IBM Spectrum MPI
§ Per-task environment variables CUDA_VISIBLE_DEVICES%d are exported
§ IBM Spectrum MPI applies the correct CVD mask to each task
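A minimal sketch of how a job (or a pre/post script) might consume the exported variable, assuming Python is available on the compute node; only the CUDA_VISIBLE_DEVICES name comes from the slide, the rest is illustrative.

import os

def allocated_gpus():
    """Return the physical GPU indices the scheduler has made visible to this task.

    LSF (or Spectrum MPI, for per-task masks) exports CUDA_VISIBLE_DEVICES,
    e.g. "0,2"; CUDA-based frameworks then see only those devices, renumbered from 0.
    """
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in cvd.split(",") if i.strip() != ""]

if __name__ == "__main__":
    gpus = allocated_gpus()
    print(f"This task was allocated {len(gpus)} GPU(s): physical indices {gpus}")
    # Inside the job, device 0 always refers to the first allocated GPU,
    # so framework code can simply address devices 0..len(gpus)-1.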
LSF Docker Support
Starting with LSF 10.1.0.3, we provide support for NVIDIA's distribution of Docker, which allows LSF's CPU, cgroup and GPU allocation functionality to work correctly.

Begin Application
NAME = nvidia-docker
CONTAINER = nvidia-docker[image(nvidia/cuda) options(--rm --net=host --ipc=host --sig-proxy=false) starter(lsfadmin)]
End Application

$ bsub -app nvidia-docker -gpu "num=1" ./ibm-powerai
HW Design: Elastic Storage Server (ESS)
Software
§ IBM Spectrum Scale for IBM Elastic Storage Server
§ Red Hat Enterprise Linux
Data Server Summary
§ 2x 20-core POWER8 3.42 GHz
§ 2x 256 GB DDR4 memory
§ 4x 100 Gb/s InfiniBand EDR
Storage SSD Enclosures
§ 2x 24x 3.84 TB SSD (288 SSDs)
§ ~128 TB usable capacity (8+2 parity)
§ Burst buffer capacity: sum of all NVMe devices in the compute nodes
HW Design: Burst Buffer Integration
§ Compute node SSD uses a standard XFS Linux file system
§ Burst buffer is a file transfer service (an application-level stage-in sketch follows this list)
– Raw data transfer uses NVMe over Fabrics (formerly called FlashDirect)
– Think of it as RDMA targeting NVMe memory
§ Data is transferred between the ESS I/O node and the NVMe PCIe device
– Data is placed directly onto (or pulled from) the NVMe PCIe device, avoiding CPU/GPU usage
– Hardware offload support is built into ConnectX-5
§ File system
– The burst buffer determines where to place data on the NVMe PCIe device, consistent with where the file system expects it
– Optimized for direct placement of data
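The burst buffer itself is a file transfer service between the ESS and the node-local NVMe device, but its effect on a training job is essentially "stage the working set onto the local NVMe/XFS file system before training starts". Here is a hedged, application-level sketch of that stage-in step; the shared and local paths are hypothetical placeholders, and the copy stands in for the NVMe-over-Fabrics transfer the real service performs.

import shutil
from pathlib import Path

# Hypothetical paths: a dataset on the shared Spectrum Scale file system and
# a staging area on the node-local NVMe device (mounted as XFS).
SHARED_DATASET = Path("/gpfs/datasets/imagenet")      # placeholder path
LOCAL_NVME = Path("/nvme/scratch/imagenet")           # placeholder path

def stage_in(src: Path, dst: Path) -> Path:
    """Copy the dataset to node-local NVMe if it is not already there.

    This mimics, at application level, what the burst buffer service achieves
    with NVMe over Fabrics: hot data sits next to the GPUs before training starts.
    """
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dst)
    return dst

if __name__ == "__main__":
    data_dir = stage_in(SHARED_DATASET, LOCAL_NVME)
    print(f"Point the training input pipeline at {data_dir}")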
Thank you
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
florin.manaila@de.ibm.com
