“Advanced Global-Scale Networking
Supporting Data-Intensive
Artificial Intelligence Applications”
Joint Networks Summit
San Diego State University
January 30, 2020
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
Creating a Tightly-Coupled, Yet Highly Distributed
Cyberinfrastructure for Big Data Analysis
Vision:
Use Optical Fiber to Connect
Big Data Generators and Consumers,
Creating a “Big Data” Freeway System
“The Bisection Bandwidth of a Cluster Interconnect,
but Deployed on a 20-Campus Scale.”
This Vision Has Been Building for 15 Years
Source: Maxine Brown, OptIPuter Project Manager
The OptIPuter
Exploits a New World
in Which
the Central Architectural Element
is Optical Networking,
Not Computers.
Distributed Cyberinfrastructure
to Support
Data-Intensive Scientific Research
and Collaboration
PI Smarr,
2002-2009
Integrated “OptIPlatform” Cyberinfrastructure System:
A 10Gbps Lightpath Cloud
National LambdaRail
Campus
Optical
Switch
Data Repositories & Clusters
HPC
HD/4k Video Images
HD/4k Video Cams
End User
OptIPortal
10G
Lightpath
HD/4k Telepresence
Instruments
LS 2009
Slide
2000-2015 Using UCSD Campus as Development Prototype:
NSF OptIPuter, Quartzite, Prism Awards
PI Papadopoulos,
2013-2015
PI Smarr,
2002-2009
PI Papadopoulos,
2004-2007
Before the PRP: ESnet’s ScienceDMZ Accelerates Science Research:
DOE & NSF Partnering on Science Engagement and Technology Adoption
Science
DMZ
Data Transfer
Nodes
(DTN/FIONA)
Network
Architecture
(zero friction)
Performance
Monitoring
(perfSONAR)
ScienceDMZ Coined in 2010 by ESnet
Basis of PRP Architecture and Design
http://fasterdata.es.net/science-dmz/
Slide Adapted From Inder Monga, ESnet
DOE
NSF
NSF Campus Cyberinfrastructure Program
Has Made Over 250 Awards (2012-2018)
Quartzite
Prism
(GDC)
2015 Vision: The Pacific Research Platform Will Connect Science DMZs
Creating a Regional End-to-End Science-Driven Community Cyberinfrastructure
NSF CC*DNI Grant
$6.3M 10/2015-10/2020
In Year 5 Now
PI: Larry Smarr, UC San Diego Calit2
Co-PIs:
• Camille Crittenden, UC Berkeley CITRIS,
• Tom DeFanti, UC San Diego Calit2/QI,
• Philip Papadopoulos, UCI
• Frank Wuerthwein, UCSD Physics and SDSC
Source: John Hess, CENIC
ESnet: Given Fast Networks, Need
DMZs and Fast/Tuned DTNs
Letters of Commitment from:
• 50 Researchers from 15 Campuses
• 32 IT/Network Organization Leaders
Supercomputer
Centers
Terminating the Fiber Optics - Data Transfer Nodes (DTNs):
Flash I/O Network Appliances (FIONAs)
UCSD-Designed FIONAs Solved the Disk-to-Disk Data Transfer Problem
at Near Full Speed on Best-Effort 10G, 40G and 100G Networks
FIONAs Designed by UCSD’s Phil Papadopoulos, John Graham,
Joe Keefe, and Tom DeFanti
Two FIONA DTNs at UC Santa Cruz: 40G & 100G
Up to 192 TB Rotating Storage
Add Up to 8 Nvidia GPUs Per 2U FIONA
To Add Machine Learning Capability
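The slides do not name the transfer software the FIONAs run, so the following is only a minimal illustrative sketch of the disk-to-disk problem they solve: stream a file from one host's disk to another's over TCP and report the achieved rate. The hostnames, port, and file paths are hypothetical placeholders, and a production DTN would use tuned, parallel transfer tools rather than this script.

```python
# Illustrative disk-to-disk throughput sketch (not the FIONA software itself).
# Run "receive" on one host, then "send" on the other; names are placeholders.
import socket
import sys
import time

CHUNK = 4 * 1024 * 1024  # 4 MiB reads/writes to keep the disks streaming


def report(verb: str, nbytes: int, seconds: float) -> None:
    gbps = nbytes * 8 / seconds / 1e9
    print(f"{verb} {nbytes / 1e9:.1f} GB in {seconds:.1f} s = {gbps:.2f} Gb/s")


def send(host: str, port: int, path: str) -> None:
    """Read a file from disk and stream it to the receiver."""
    start, sent = time.time(), 0
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            sock.sendall(chunk)
            sent += len(chunk)
    report("sent", sent, time.time() - start)


def receive(port: int, path: str) -> None:
    """Accept one connection and write the incoming stream to disk."""
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        start, received = time.time(), 0
        with conn, open(path, "wb") as f:
            while chunk := conn.recv(CHUNK):
                f.write(chunk)
                received += len(chunk)
    report("received", received, time.time() - start)


if __name__ == "__main__":
    # e.g.  python dtn_sketch.py receive 5201 /data/copy.bin
    #       python dtn_sketch.py send fiona.example.edu 5201 /data/big.bin
    if sys.argv[1] == "send":
        send(sys.argv[2], int(sys.argv[3]), sys.argv[4])
    else:
        receive(int(sys.argv[2]), sys.argv[3])
```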
2017-2020: CHASE-CI Grant Adds a Machine Learning Layer
Built on Top of the Pacific Research Platform
Caltech
UCB
UCI UCR
UCSD
UCSC
Stanford
MSU
UCM
SDSU
NSF Grant for 256 High Speed “Cloud” GPUs
For 32 ML Faculty & Their Students at 10 Campuses
To Train AI Algorithms on Big Data
Original PRP
CENIC/PW Link
2018-2021: Toward the National Research Platform (NRP) -
Using CENIC & Internet2 to Connect Quilt Regional R&E Networks
“Towards
The NRP”
3-Year Grant
Funded
by NSF
$2.5M
October 2018
PI Smarr
Co-PIs Altintas
Papadopoulos
Wuerthwein
Rosing
DeFanti
NSF CENIC Link
Original PRP
CENIC/PW Link
2018/2019: PRP Game Changer!
Using Kubernetes to Orchestrate Containers Across the PRP
“Kubernetes is a way of stitching together
a collection of machines into,
basically, a big computer,”
--Craig McLuckie, Google
and now CEO and Founder of Heptio
"Everything at Google runs in a container."
--Joe Beda, Google
PRP’s Nautilus Hypercluster Adopted Kubernetes to Orchestrate Software Containers
and Rook, Which Runs Inside of Kubernetes, to Manage Distributed Storage
https://rook.io/
“Kubernetes with Rook/Ceph Allows Us to Manage Petabytes of Distributed Storage
and GPUs for Data Science,
While We Measure and Monitor Network Use.”
--John Graham, Calit2/QI UC San Diego
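The slide describes the pattern (Kubernetes scheduling containers, Rook/Ceph providing distributed storage) without showing configuration. A minimal sketch of that pattern with the official Kubernetes Python client follows; the namespace, storage class, container image, and sizes are hypothetical placeholders, not the actual Nautilus settings.

```python
# Minimal sketch of the pattern described above, using the official
# "kubernetes" Python client: claim Rook/Ceph-backed storage, then run a GPU
# container that mounts it. Namespace, storage class, and image names are
# hypothetical placeholders, not the actual Nautilus configuration.
from kubernetes import client, config

config.load_kube_config()                      # reads your kubectl context
core = client.CoreV1Api()

NAMESPACE = "my-lab"                           # hypothetical namespace

# 1) PersistentVolumeClaim against a (hypothetical) Rook/Ceph storage class.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="rook-ceph-block",  # placeholder storage class
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(NAMESPACE, pvc)

# 2) Pod that requests one NVIDIA GPU and mounts the claim.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-gpu"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="nvcr.io/nvidia/pytorch:20.01-py3",   # example image
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}),
            volume_mounts=[client.V1VolumeMount(
                name="data", mount_path="/data")],
        )],
        volumes=[client.V1Volume(
            name="data",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="training-data"))],
    ),
)
core.create_namespaced_pod(NAMESPACE, pod)
```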
Security Technologies Utilized in the
PRP / CHASE-CI Nautilus Hypercluster
• CILogon Federated Authentication
• Secure Namespaces for Multi-Institution Collaborations
• Calico Policy-Driven Network Security
Nautilus Uses CILogon for Federated Identity
Already Adopted by Campuses
Nautilus Namespaces
Enable Secure Multi-Institution Collaborations
✔ Isolate user space
✔ Create an environment to collaborate
✔ Define policies
Nautilus is Using Calico
For Policy-Driven Network Security
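Calico enforces standard Kubernetes NetworkPolicy objects (its own extended policy CRDs are not shown on the slide). As a minimal sketch of the kind of namespace-scoped policy such a deployment might start from, here is a default-deny-ingress policy created with the same Python client; the namespace and policy names are hypothetical.

```python
# Sketch of a namespace-scoped "default deny ingress" NetworkPolicy of the kind
# Calico can enforce; policy and namespace names are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-ingress"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),  # empty selector = every pod in the namespace
        policy_types=["Ingress"],               # no ingress rules listed, so all ingress is denied
    ),
)
net.create_namespaced_network_policy("my-lab", policy)
```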
Installing FIONAs Across California in Late 2018 and Early 2019
To Enhance Users' CPU and GPU Computing, Data Posting, and Data Transfers
UC Merced
Stanford UC Santa Barbara
UC Riverside
UC Santa Cruz
UC Irvine
100G NVMe 6.4TB
Caltech
40G 192TB
UCSF
40G 160TB HPWREN
40G 160TB
4 FIONA8s*
Calit2/UCI
35 FIONA2s
17 FIONA8s
2x40G 160TB HPWREN
UCSD
100G Epyc NVMe
100G Gold NVMe
8 FIONA8s + 5 FIONA8s
SDSC @ UCSD
1 FIONA8
40G 160TB
UCR 40G 160TB
USC
100G NVMe 6.4TB
2x40G 160TB
UCLA
1 FIONA8*
40G 160TB
Stanford U
2 FIONA8s*
40G 192TB
UCSB
4.5 FIONA8s
100G NVMe 6.4TB
40G 160TB
UCSC
PRP’s California Nautilus Hypercluster Connected
via the CENIC 100G Network
10 FIONA2s
2 FIONA8
40G 160TB
UCM
15-Campus Nautilus Cluster:
4360 CPU Cores 134 Hosts
~1.7 PB Storage
407 GPUs, ~4000 cores each
40G 160TB HPWREN
100G NVMe 6.4TB
1 FIONA8* 2 FIONA4s
FPGAs + 2PB BeeGFS
SDSU
PRP Disks
10G 3TB
CSUSB
Minority Serving Institution
CHASE-CI
100G 48TB
NPS
*= July RT
40G 192TB
USD
CENIC/PW Link
40G 3TB
U Hawaii
40G 160TB
NCAR-WY
40G 192TB
UWashington
10G FIONA1
40G FIONA
UIC
40G 3TB
StarLight
PRP/TNRP’s United States Nautilus Hypercluster FIONAs
Now Connect 4 More Regional Networks and 3 Internet2 Storage Sites
100G FIONA
I2 Chicago
100G FIONA
I2 Kansas City
100G FIONA
I2 NYC
PRP Global Nautilus Hypercluster Is Rapidly Adding International Partners
Beyond Our Original Partner in Amsterdam
Netherlands
10G 35TB
UvA
PRP
Transoceanic Nodes Show Distance is Not a Barrier
to Above 5 Gb/s Disk-to-Disk Performance
PRP’s Current
International
Partners
Guam
Australia
Korea
Singapore
40G FIONA6
40G 28TB
KISTI
10G 96TB
U of Guam
100G 35TB
U of Queensland
GRP Workshop 9/17-18/2019
at Calit2@UCSD
Operational Metrics: Containerized Trace Route Tool Allows Realtime Visualization
of Status of PRP Network Links on a National and Global Scale
Source: Dima Mishin, SDSC, 9/16/2019
Guam
Univ. Queensland
Australia
LIGO
UK
Netherlands
Korea
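The slide does not show how the containerized probe works internally. Purely as an illustration, a link-status probe of this kind could wrap the system traceroute and emit timestamped per-hop records for a dashboard to visualize; the target hostnames below are hypothetical placeholders.

```python
# Illustrative sketch of a simple link probe: run the system traceroute against
# each remote DTN and record the raw output with a timestamp. This is not the
# PRP tool itself; hostnames are hypothetical placeholders.
import json
import subprocess
import time

TARGETS = [
    "dtn.uog.example.edu",        # Guam (placeholder)
    "dtn.uq.example.edu.au",      # Queensland (placeholder)
    "dtn.kisti.example.kr",       # Korea (placeholder)
]


def probe(host: str) -> dict:
    """Run traceroute once and return a record suitable for a dashboard backend."""
    result = subprocess.run(
        ["traceroute", "-n", "-w", "2", host],
        capture_output=True, text=True, timeout=120,
    )
    return {
        "host": host,
        "timestamp": time.time(),
        "ok": result.returncode == 0,
        "hops": result.stdout.splitlines()[1:],   # drop the header line
    }


if __name__ == "__main__":
    for target in TARGETS:
        print(json.dumps(probe(target)))
```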
PRP’s Nautilus Forms a Multi-Application
Powerful Distributed “Big Data” Storage and Machine-Learning Computer
Source: grafana.nautilus.optiputer.net on 1/27/2020
PRP is Science-Driven:
Connecting Multi-Campus Application Teams and Devices
Earth
Sciences
UC San Diego, UC Berkeley, UC Merced
100 Gbps FIONA at UCSC Allows for Downloads to the UCSC Hyades Cluster
from the LBNL NERSC Supercomputer for DESI Science Analysis
300 images per night.
100MB per raw image
120GB per night
250 images per night.
530MB per raw image
800GB per night
Source: Peter Nugent, LBNL
Professor of Astronomy, UC Berkeley
Precursors to
LSST and NCSA
NSF-Funded Cyberengineer
Shaw Dong @UCSC
Receiving FIONA
Feb 7, 2017
Global Scientific Instruments Will Produce Ultralarge Datasets Continuously
Requiring Dedicated Optical Fiber and Supercomputers
Large Synoptic Survey Telescope (LSST)
3.2 Gpixel Camera
Tracks ~40B Objects,
Creates 1-10M Alerts/Night
Within 1 Minute of Observing
1000 Supernovas Discovered/Night
Use PRP-Like CI to Connect NCSA Repository
To Remote Astronomy Big Data Analysis Users?
Collaboration on Atmospheric Water in the West
Between UC San Diego and UC Irvine
Big Data Collaboration with:
CW3E Director: F. Martin Ralph, UC San Diego
CHRS Director: Soroosh Sorooshian, UC Irvine
Source: Scott Sellars, PhD at CHRS; Postdoc at CW3E
Rapid 4D Object Segmentation of NASA Water Vapor Data -
“Stitching” in Time and Space
NASA MERRA-2 –
Water Vapor Data
Across the Globe
4D Object Constructed
(Lat, Lon, Value, Time)
Object Detection,
Segmentation and Tracking
Scott L. Sellars1, John Graham1, Dima Mishin1, Kyle Marcus2 , Ilkay Altintas2, Tom DeFanti1, Larry Smarr1,
Joulien Tatar3, Phu Nguyen4, Eric Shearer4, and Soroosh Sorooshian4
1Calit2@UCSD; 2SDSC; 3Office of Information Technology, UCI; 4Center for Hydrometeorology and Remote Sensing, UCI
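The production CW3E/CHRS code is not shown on the slide. As a minimal sketch of the underlying idea, thresholded water-vapor objects can be "stitched" across time and space by running a connected-component label over the (time, lat, lon) grid; the input file, array shape, and threshold below are hypothetical placeholders.

```python
# Minimal sketch of "stitching" thresholded water-vapor objects across time and
# space with a connected-component label; the threshold, array shape, and input
# file are hypothetical placeholders, not the CW3E/CHRS production workflow.
import numpy as np
from scipy import ndimage

# Hypothetical water-vapor field, shape (time, lat, lon).
wv = np.load("merra2_water_vapor.npy")           # placeholder input file

mask = wv > 20.0                                 # placeholder physical threshold

# 3x3x3 connectivity so an object can extend through adjacent time steps
# as well as neighboring grid cells ("stitching" in time and space).
structure = np.ones((3, 3, 3), dtype=bool)
labels, n_objects = ndimage.label(mask, structure=structure)

print(f"found {n_objects} space-time objects")

# Per-object duration: number of time steps each labeled object spans.
for obj_id in range(1, n_objects + 1):
    t_steps = np.unique(np.nonzero(labels == obj_id)[0]).size
    print(f"object {obj_id}: spans {t_steps} time steps")
```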
[Workflow diagram: Calit2 FIONAs with GPUs at UC Irvine and UC San Diego, linked to SDSC's Comet over the Pacific Research Platform (10-100 Gb/s); complete workflow time reduced from 19.2 days to 52 minutes]
PRP Shortened Workflow From 19.2 Days to 52 Minutes -
532 Times Faster!
Source: Scott Sellars, CW3E
PRP Optical Fiber Connects Data Servers for
High Performance Wireless Research and Education Network (HPWREN)
• PRP Uses CENIC
100G Optical Fiber
to Link UCSD, SDSU
& UCI HPWREN
Servers
– Data Redundancy
– Disaster Recovery
– High Availability
– Kubernetes Handles
Software Containers
and Data
UCI
UCSD
SDSU
Source: Frank Vernon and
Hans-Werner Braun, HPWREN
UCI Antenna Dedicated
June 27, 2017
Once a Wildfire is Spotted, PRP Brings High-Resolution Weather Data
to Fire Modeling Workflows in WIFIRE
Real-Time
Meteorological Sensors
Weather Forecast
Landscape data
WIFIRE Firemap
Fire Perimeter
Workflow
PRP
Source: Ilkay Altintas, SDSC
California Public-Private Partnership Plan
To Scale Fixed Wireless With CENIC Optical Fiber Backhaul
Co-Existence of Interactive and
Non-Interactive Computing on PRP
GPU Simulations Are Needed to Improve the Ice Model,
Resulting in a Significant Improvement
in Pointing Resolution for Multi-Messenger Astrophysics
NSF Large-Scale Observatories
Are Beginning to Utilize PRP Compute Resources
IceCube
Number of Requested PRP Nautilus GPUs
Has Gone Up 4X in 2019
4X
https://grafana.nautilus.optiputer.net/d/fHSeM5Lmk/k8s-compute-resources-cluster-gpus?orgId=1&fullscreen&panelId=2&from=1546329600000&to=1577865599000
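Grafana dashboards in Kubernetes monitoring stacks like this are typically backed by Prometheus. Assuming such an endpoint is reachable and exposes the usual kube-state-metrics resource-request metric (both are assumptions, not details taken from the slide), the same requested-GPU trend could be pulled programmatically:

```python
# Sketch of pulling a requested-GPU time series from a Prometheus endpoint of
# the kind that typically backs a Kubernetes Grafana dashboard. The endpoint
# URL and the exact metric/query are assumptions, not taken from the slide.
import datetime as dt
import requests

PROM_URL = "https://prometheus.example.edu/api/v1/query_range"   # placeholder
QUERY = 'sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})'

params = {
    "query": QUERY,
    "start": dt.datetime(2019, 10, 1).timestamp(),
    "end": dt.datetime(2019, 12, 31).timestamp(),
    "step": "1h",
}
resp = requests.get(PROM_URL, params=params, timeout=30)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    print(f"peak requested GPUs over the quarter: {max(values):.0f}")
```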
Using Grafana to Track PRP Requested GPUs
10/1-12/31/19
IceCube
Mark Alber Group, UCR: Quantitative Modeling in Biology
Gary Cottrell, UCSD: Deep Learning
Volkan Vural, UCSD: Workflow Analysis
SunCAVE, UCSD
Nuno Vasconcelos, UCSD: Domain-adaptation ML
Hao Su, UCSD: AI in Robotics
Nuno Vasconcelos, UCSD: OOWL Synthetic Data ML
Dinesh Bharadia, UCSD: Autonomous Driving
Jeff Krichmar, UCI: Reinforcement Learning
Xinyu Zhang, UCSD: Wireless ML
Ravi Ramamoorthi, UCSD: Viscomp ML
Nuno Vasconcelos, UCSD: Crowdcounting Image ML
Kurt Schoenhoff, JCU: Semantic ML
Ravi Ramamoorthi, UCSD: CG ML
Frank Wuerthwein, UCSD: LHC CMS
Padhraic Smyth, UCI: Language ML
Alex Feltus, Clemson: Oncogenomics
Xiaolong Wang, UCSD: Robot ML
The Open Science Grid (OSG)
Has Been Integrated With the PRP
In aggregate ~ 200,000 Intel x86 cores
used by ~400 projects
Source: Frank Würthwein,
OSG Exec Director; PRP Co-PI; UCSD/SDSC
OSG Federates ~100 Clusters Worldwide
All OSG User
Communities
Use HTCondor for
Resource Orchestration
SDSC
U. Chicago, FNAL
Caltech
Distributed
OSG Petabyte
Storage Caches
The Open Science Grid Delivers
1.8 Billion Core-Hours Per Year to Over 50 Fields of Science
NCSA Delivered
~35,000 Core-Hours
Per Year in 1990
https://gracc.opensciencegrid.org/dashboard/db/gracc-home
Running a 51k GPU Burst for
Multi-Messenger Astrophysics
with IceCube Across
All Available GPUs in the Cloud
Frank Würthwein - OSG Executive Director, PRP co-PI
Igor Sfiligoi - Lead Scientific Researcher
UCSD/SDSC
The In-Cloud Network Testing
Ahead of Time
In-Cloud Storage Showed Great Scalability,
Exceeding 1 Tbps
Networking Inside a Cloud Region Is Not a Concern!
[Chart: aggregate storage throughput in Gbps vs. number of compute instances, for AWS, Azure, and GCP]
Probably
Can Scale
Much Higher
The U.S. Network Testing
Ahead of Time
Cloud to PRP/TNRP/OSG/I2 Networking in the US
Is Good, But Not Great,
At Least as Compared to the In-Cloud Networking
[Map: measured cloud-to-on-prem throughputs across US East and US West: 60 Gbps, 15-35 Gbps, and several 20 Gbps links]
One of the Reasons for Pre-Staging the Inputs Inside of the Clouds
Legend:
Blue dots are PRP/I2 On-Prem DTNs,
Yellow are Edge of Cloud Regions
The Global Network Testing
Ahead of Time
Cloud to On-Prem International Networking
[Map: measured cloud-to-on-prem throughputs to Australia, Korea, and Europe: 14-24, 7-32, 6-44, 1-18, and 11 Gbps]
Legend:
Blue dots are PRP On-Prem DTNs
Yellow are Edge of Cloud Regions
Slide Source: Frank Würthwein, Igor Sfiligoi
The Idea
• Integrate All GPUs Available for Sale Worldwide into a
Single HTCondor Pool
– Use 28 Regions Across AWS, Azure, and Google Cloud
for a Burst of a Couple Hours, or so
• IceCube Submits Their Photon Propagation Workflow to
this HTCondor Pool.
– The Input, Jobs on the GPUs, and Output are All Part of
a Single Globally Distributed System
– This Demo Used Just the Standard HTCondor Tools
Run a GPU Burst Relevant in Scale
for Future Exascale HPC Systems
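The demo used only standard HTCondor tools (per the slide above). As a minimal sketch of what feeding GPU jobs into a single HTCondor pool looks like with the recent HTCondor Python bindings, with the wrapper script, input files, and job count all hypothetical placeholders rather than the actual IceCube photon-propagation workflow:

```python
# Minimal sketch of submitting GPU jobs to one HTCondor pool with the HTCondor
# Python bindings; executable, input files, and job count are placeholders.
import htcondor

submit = htcondor.Submit({
    "executable": "run_photon_prop.sh",        # placeholder wrapper script
    "arguments": "$(Process)",                 # one chunk of input per job
    "request_gpus": "1",                       # one GPU per job
    "request_cpus": "1",
    "request_memory": "4GB",
    "transfer_input_files": "inputs_$(Process).tar.gz",
    "should_transfer_files": "YES",
    "output": "job_$(Process).out",
    "error": "job_$(Process).err",
    "log": "burst.log",
})

schedd = htcondor.Schedd()                     # the local submit node
result = schedd.submit(submit, count=1000)     # queue 1000 GPU jobs (placeholder count)
print("submitted cluster", result.cluster())
```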
Science with 51,000 GPUs
Achieved as Peak Performance
[Chart: GPUs in use vs. time in minutes; each color is a different cloud region in the US, EU, or Asia; 28 regions in use in total]
Peaked at 51,500 GPUs
~380 Petaflops of FP32
Summary of Stats at Peak - 8 Generations of NVIDIA GPUs Used
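A quick back-of-the-envelope check of the peak figures quoted above (not from the slide itself):

```python
# Rough consistency check of the quoted peak: ~380 PFLOPS FP32 across 51,500 GPUs.
peak_gpus = 51_500
peak_fp32_pflops = 380

avg_tflops_per_gpu = peak_fp32_pflops * 1_000 / peak_gpus
print(f"~{avg_tflops_per_gpu:.1f} TFLOPS FP32 per GPU on average")
# Prints ~7.4 TFLOPS, plausible for a mix of eight NVIDIA GPU generations.
```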
2020-2025 NRP Future:
SDSC’s EXPANSE Will Use CHASE-CI-Developed Composable Systems
~$20M over 5 Years
PI Mike Norman, SDSC
PRP/TNRP/CHASE-CI Support and Community:
• US National Science Foundation (NSF) awards to UCSD, NU, and SDSC
- CNS-1456638, CNS-1730158, ACI-1540112, ACI-1541349, & OAC-1826967
- OAC-1450871 (NU) and OAC-1659169 (SDSU)
• UC Office of the President, Calit2 and Calit2’s UCSD Qualcomm Institute
• San Diego Supercomputer Center and UCSD’s Research IT and Instructional IT
• Partner Campuses: UCB, UCSC, UCI, UCR, UCLA, USC, UCD, UCSB, SDSU, Caltech, NU,
UWash, UChicago, UIC, UHM, CSUSB, HPWREN, UMo, MSU, NYU, UNeb, UNC, UIUC,
UTA/Texas Advanced Computing Center, FIU, KISTI, UVA, AIST
• CENIC, Pacific Wave/PNWGP, StarLight/MREN, The Quilt, Kinber, Great Plains Network,
NYSERNet, LEARN, Open Science Grid, Internet2, DOE ESnet, NCAR/UCAR & Wyoming
Supercomputing Center, AWS, Google, Microsoft, Cisco