Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Kai Huang
Intel Corporation
Jason Dai
Intel Corporation
Agenda
Overview of Analytics Zoo
Introduction to Ray
Motivations for Ray On Apache Spark
Implementation details and API design
Real-world use cases
Conclusion
AI on Big Data: Accelerating Data Analytics + AI Solutions at Scale

BigDL: distributed, high-performance deep learning framework for Apache Spark*
https://github.com/intel-analytics/bigdl

Analytics Zoo: unified analytics + AI platform for TensorFlow*, PyTorch*, Keras*, BigDL, Ray* and Apache Spark*
https://github.com/intel-analytics/analytics-zoo
Analytics Zoo
https://github.com/intel-analytics/analytics-zoo
Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, InferenceModel
Automated ML Workflow: AutoML for Time Series, Automatic Cluster Serving
Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
Compute Environment: Laptop, K8s Cluster, Hadoop Cluster, Spark Cluster; Python Libraries (Numpy/Pandas/sklearn/…), DL Frameworks (TF/PyTorch/OpenVINO/…), Distributed Analytics (Spark/Flink/Ray/…); powered by oneAPI

Unified Data Analytics and AI Platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray.
Unified Data Analytics and AI Platform
Easily prototype end-to-end pipelines that apply AI models to big data.
"Zero" code change from laptop to distributed cluster.
Seamlessly deploy on production Hadoop/K8s clusters.
Automate the process of applying machine learning to big data.

Seamless Scaling from Laptop to Distributed Big Data Clusters
Prototype on a laptop using sample data, experiment on clusters with historical data, then move to production deployment with a distributed data pipeline.
Ray
Ray is a fast and simple framework for building and running distributed applications. Ray Core provides an easy Python interface for parallelism using remote functions and actors.
import ray
ray.init()

# Remote function: a stateless task
@ray.remote(num_cpus=1)  # resource requirements, e.g. num_cpus, num_gpus, …
def f(x):
    return x * x

# Executed in parallel
ray.get([f.remote(i) for i in range(5)])

# Remote actor: a stateful worker
@ray.remote(num_cpus=1)  # resource requirements, e.g. num_cpus, num_gpus, …
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
# Executed in parallel
ray.get([c.increment.remote() for c in counters])
Ray
Ray is packaged with several high-level libraries to accelerate machine learning workloads:
- Tune: scalable experiment execution and hyperparameter tuning
- RLlib: scalable reinforcement learning
- RaySGD: distributed training wrappers
https://github.com/ray-project/ray/
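For instance, a Tune hyperparameter sweep needs only a plain Python function. A minimal sketch, assuming a Ray version that supports the function-based Tune API with tune.report; the objective below is a toy invented for illustration:

import ray
from ray import tune

def objective(config):
    # Toy "training" loop: the reported score depends on the sampled lr.
    for step in range(10):
        tune.report(score=config["lr"] * step)

ray.init()
analysis = tune.run(objective,
                    config={"lr": tune.grid_search([0.001, 0.01, 0.1])})
print(analysis.get_best_config(metric="score", mode="max"))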
Motivations for RayOnSpark
Demand to embrace emerging AI technologies on production data.
Effort required to directly deploy Ray applications on existing Hadoop/Spark clusters.
Challenge of preparing the Python environment on each node without modifying the cluster.
Need for a unified system for both big data analytics and advanced AI applications.
Implementation of RayOnSpark
Leverages conda-pack and Spark for runtime Python package distribution.
RayContext on the Spark driver launches Ray across the cluster.
Ray processes exist alongside Spark executors; for each Spark executor, a Ray Manager is created to manage the Ray processes.
In-memory Spark RDDs and DataFrames can be consumed directly by Ray applications.
RayOnSpark thus allows Ray applications to seamlessly integrate into Spark data processing pipelines.
(Diagram: launching Ray* on Apache Spark*)
Interface of RayOnSpark
Three-step programming with minimal code changes:
1. Initiate, or reuse an existing, SparkContext.
2. Initiate RayContext.
3. Shut down the RayContext and SparkContext after tasks finish.
More instructions at: https://analytics-zoo.github.io/master/#ProgrammingGuide/rayonspark/
# RayOnSpark code: set up Spark and launch Ray across the cluster
import ray
from zoo import init_spark_on_yarn
from zoo.ray import RayContext

sc = init_spark_on_yarn(hadoop_conf, conda_name,
                        num_executors, executor_cores, …)
ray_ctx = RayContext(sc, object_store_memory, …)
ray_ctx.init()

# Pure Ray code: runs unchanged on the Ray cluster launched above
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
ray.get([c.increment.remote() for c in counters])

# RayOnSpark code: tear down after tasks finish
ray_ctx.stop()
sc.stop()
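As noted in the implementation section, data prepared with Spark can be handed to Ray applications. A deliberately simplified illustration, reusing sc and the Ray cluster from the snippet above and assuming the result is small enough to collect to the driver (the real integration keeps data distributed across the cluster):

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)  # Spark data processing
data_ref = ray.put(rdd.collect())  # move the (small) result into Ray's object store

@ray.remote
def summarize(data):  # data_ref is automatically dereferenced into the list
    return sum(data)

print(ray.get(summarize.remote(data_ref)))  # 9900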
Use Cases of RayOnSpark
Scalable AutoML for time series prediction.
- Automates the feature generation, model selection and hyperparameter tuning processes.
- See more at: https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/automl
Source: Yao, Q., Wang, M., et al., Taking the Human out of Learning Applications: A Survey on Automated Machine Learning.
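As a rough sketch of the user-facing flow (the class and argument names below, TimeSequencePredictor, dt_col and target_col, are assumptions based on the zoo.automl module linked above; consult the repository for the exact API):

from zoo.automl.regression.time_sequence_predictor import TimeSequencePredictor  # assumed import path

tsp = TimeSequencePredictor(dt_col="datetime",   # timestamp column (assumed name)
                            target_col="value")  # value to forecast (assumed name)
pipeline = tsp.fit(train_df)  # searches features, models and hyperparameters on Ray
predictions = pipeline.predict(test_df)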
Use Cases of RayOnSpark
Data-parallel pre-processing and distributed training pipelines for deep neural networks.
- Use PySpark or Ray for parallel data loading and processing.
- Use RayOnSpark to implement thin wrappers that automatically set up the distributed environment.
- Run distributed training with either framework-native modules or Horovod (from Uber) as the backend.
- Users only need to write a single-node training script and make minimal code changes to achieve distributed training, as the sketch below illustrates.
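A minimal sketch of the thin-wrapper idea using plain Ray actors and NumPy for a linear model (illustrative only; the actual wrappers delegate to framework-native modules or Horovod, and Ray is assumed to be already initialized, e.g. via ray_ctx.init()):

import numpy as np
import ray

@ray.remote
class Worker:
    # Holds one data shard and a replica of the model parameters.
    def __init__(self, shard):
        self.x, self.y = shard
        self.w = np.zeros(self.x.shape[1])

    def gradient(self):
        # Gradient of the mean squared error for a linear model on the local shard.
        err = self.x @ self.w - self.y
        return self.x.T @ err / len(self.y)

    def apply(self, grad, lr=0.1):
        self.w -= lr * grad

# One shard per worker; in practice these come from the Spark/Ray data pipeline.
shards = [(np.random.rand(100, 5), np.random.rand(100)) for _ in range(4)]
workers = [Worker.remote(s) for s in shards]

for _ in range(20):
    grads = ray.get([w.gradient.remote() for w in workers])
    avg = sum(grads) / len(grads)  # average the gradients (the all-reduce step, done on the driver here)
    ray.get([w.apply.remote(avg) for w in workers])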
Drive-thru Recommendation System at Burger King
Burger King performs Spark ETL tasks first, followed by distributed MXNet training.
Similar to RaySGD, we implement a lightweight shim layer around native MXNet modules for easy deployment on YARN clusters.
The entire pipeline runs on a single cluster; no extra data transfer is needed.
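A highly simplified sketch of the shim-layer idea (not Burger King's actual code): MXNet's distributed KVStore is configured through DMLC_* environment variables, which each Ray actor can set before training starts.

import os
import ray

@ray.remote
class MXNetRunner:
    # One actor per MXNet process (worker, server or scheduler).
    def setup(self, role, scheduler_ip, scheduler_port, num_workers, num_servers):
        os.environ.update({
            "DMLC_ROLE": role,  # "worker", "server" or "scheduler"
            "DMLC_PS_ROOT_URI": scheduler_ip,
            "DMLC_PS_ROOT_PORT": str(scheduler_port),
            "DMLC_NUM_WORKER": str(num_workers),
            "DMLC_NUM_SERVER": str(num_servers),
        })

    def run(self, train_fn=None):
        import mxnet as mx  # for server/scheduler roles, the import itself starts the process
        if os.environ["DMLC_ROLE"] == "worker":
            kv = mx.kvstore.create("dist_sync")  # joins the parameter-server group
            train_fn(kv)  # the user's single-node training script, largely unchanged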
Conclusion
RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a
More information on Analytics Zoo:
▪ https://github.com/intel-analytics/analytics-zoo
▪ https://analytics-zoo.github.io/
We are working on full support and more out-of-the-box solutions for easily scaling out Python AI pipelines from a single node to a cluster based on Ray and Spark.
Accelerate Your Data Analytics & AI Journey with Intel
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.