!1
!2
The big data stack is a collection of distributed systems
[Architecture diagram: unstructured sources (logs, files, and media), structured business/custom apps, and Internet of Things data flow through Ingest → Store → Prep & Train (Python, Scala, Spark, SQL) → Model & Serve, backed by a data lake, Amazon Athena, Amazon Redshift, and machine learning with Amazon SageMaker]
!3
My app failed
!4
My data pipeline is missing its SLA
!5
Our Cloud costs are out of control
!6
What is the real root cause?
!7
What enterprises are facing: Monitoring Data is Siloed
A survey of 6,000+ enterprise IT professionals from Australia, Canada, France, Germany, the UK, and the USA reveals that 91% are struggling with siloed monitoring data
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!8
What enterprises are facing: Reactive Approach
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!9
What enterprises are facing: High MTTR (mean time to resolution)
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!10
Solving this as a Data + AI problem
1. Lots of data about how the applications run
2. Most root causes are common and recurring
3. Much of the root cause analysis and recommendations can be automated and improved via learning
!11
Using AI/ML to address distributed application performance & operations management:
AIOps
!12
First: Bring all monitoring data to a single platform
One complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
!13
Then: Apply algorithms to analyze the data & (whenever possible) take actions automatically
Built-in intelligence & automation on the same complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
!14
Building such a platform requires innovation:
• In data collection & transport: non-intrusive, low overhead, support for transient/elastic clusters
• In data storage: variety, scale, asynchronous arrival
• In algorithms to provide insights: real-time, combining expert knowledge with ML
• In algorithms to take actions: reliable, predictable
!15
Innovative solutions designed for production enterprise environments:
• Application autotuning
• Holistic cluster optimization
!16
spark.driver.cores                    2
spark.executor.cores                  10
…
spark.sql.shuffle.partitions          300
spark.sql.autoBroadcastJoinThreshold  20MB
…
SKEW('orders', 'o_custId')            true
spark.catalog.cacheTable("orders")    true
…
We represent this configuration setting as a vector X; performance is a function of X.
Tuning is often done by trial and error.
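Before any algorithm can search this space, the configuration has to be encoded as a numeric vector X. A minimal sketch of one possible encoding (the helper and encoding scheme are illustrative, not Unravel's actual format):

```python
# Encode a Spark configuration as a numeric vector X for the tuning algorithms.
# The parameter names mirror the slide; the encoding scheme itself is illustrative.

def encode_config(conf: dict) -> list:
    """Map each parameter to a number: booleans to 0/1, sizes like '20MB' to MB."""
    def to_number(v):
        if isinstance(v, bool):
            return 1.0 if v else 0.0
        if isinstance(v, str) and v.upper().endswith("MB"):
            return float(v[:-2])
        return float(v)
    return [to_number(conf[k]) for k in sorted(conf)]  # fixed key order -> fixed dims

conf = {
    "spark.driver.cores": 2,
    "spark.executor.cores": 10,
    "spark.sql.shuffle.partitions": 300,
    "spark.sql.autoBroadcastJoinThreshold": "20MB",
}
X = encode_config(conf)   # one point in the configuration space
```

Sorting the keys gives every app's configuration the same dimension order, so settings from different runs are directly comparable.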
!17
Given: App + Goal
• Goal: Find the setting of X that best meets the goal
• Challenge: The response surface y = f(X) is unknown
!18
A new world
INPUTS
1. App = Spark Query
2. Goal = Speedup
“I need to make this app faster”
!19
A new world
[Chart: app duration over time as tuning proceeds]
• In the blink of an eye, the user gets recommendations to make the app 30% faster
• As the user finishes checking email, she has a verified run that is 60% faster
• The user comes back from lunch: a verified run that is 90% faster
90% faster!
!20
Autotuning workflow
[Diagram: the user submits (App, Goal); the Orchestrator launches probes with setting Xnext on cluster services (on-premises and Cloud); monitoring data plus historical data & probe data feed the Probe Algorithm and the Recommendation Algorithm]
!21
1. Bootstrap: get an initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Probe Algorithm: select the next probe Xnext based on all history and probe data available so far, choosing the setting with maximum expected improvement EIP(X). Repeat until the stopping condition is reached.
!22
[Four snapshots of a 1-D response surface (performance y vs. setting x1) as probing proceeds, showing the surrogate model and EIP(X): some probes explore uncertain regions, others exploit promising ones]
Balance exploration vs. exploitation
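One common way to realize this probe-selection step is Bayesian optimization with a Gaussian-process surrogate, where expected improvement naturally trades off exploration (high uncertainty) against exploitation (good predicted performance). The talk does not specify Unravel's actual algorithm; the one-dimensional numpy sketch below (illustrative kernel, hyperparameters, and data) only shows the mechanics of scoring candidate settings:

```python
import math
import numpy as np

def gp_posterior(X, y, Xs, length=2.0, noise=1e-6):
    """Posterior mean/std of a Gaussian process with an RBF kernel.
    Kernel choice and hyperparameters are illustrative."""
    def k(a, b):
        d = a.reshape(-1, 1) - b.reshape(1, -1)
        return np.exp(-0.5 * (d / length) ** 2)
    K_inv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(X, Xs)                                   # cross-covariance, shape (n, m)
    mu = Ks.T @ K_inv @ y                           # posterior mean at candidates
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)   # posterior variance at candidates
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    """EI for minimization: expected amount by which a candidate beats best_y."""
    z = (best_y - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return (best_y - mu) * cdf + sigma * pdf

# Probed settings <Xi, yi>; lower y is better (e.g., app duration in minutes)
X = np.array([4.0, 7.0, 11.0])
y = np.array([6.0, 3.0, 5.0])
candidates = np.linspace(4, 12, 81)
mu, sigma = gp_posterior(X, y, candidates)
ei = expected_improvement(mu, sigma, y.min())
X_next = candidates[np.argmax(ei)]                  # setting to probe next
```

EI is near zero at already-probed points (no uncertainty, no expected gain) and grows where the surrogate predicts good performance, where uncertainty is high, or both.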
!23
Automated tuning of a failed Spark application
• Failed due to OOM
• Fixed conf setting: a more memory-efficient configuration
!24
Challenges and Limitations
• Need to account for data volume changes
• Not applicable to certain root causes
- Small files
- SQL queries that need to be rewritten
- Node/network issues
- Pool resource allocation
Need user input!
!25
!26
Innovative solutions designed for production enterprise environments:
• Application autotuning
• Holistic cluster optimization
!27
Holistic cluster optimization
One complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
Examples of optimization beyond the app level:
• Scheduler configurations for queues
• Schema/data insights
• Cluster defaults for certain parameters
• Cluster cost optimization
!28
Single-app tuning is not scalable
• A cluster may run 1000s of apps every day
• Defaults are often bad: many apps could use some tuning
[Same configuration-vs-performance illustration as earlier: spark.driver.cores, spark.executor.cores, spark.sql.shuffle.partitions, spark.sql.autoBroadcastJoinThreshold, SKEW('orders', 'o_custId'), spark.catalog.cacheTable("orders"), …]
!29
Better idea: Change the cluster defaults instead!
• Every app uses the new values automatically
• Challenge: new default could be better for some
apps but worse for others
• Data-driven approach: identify the best new default
based on the cluster’s workload
• New default not necessarily the best for a given app,
but the best for the input workload
!30
Finding new default values based on workload
1. Specify a time window on the input workload, and other optional parameters
2. Go through all the apps in this window
3. For each parameter:
   • Identify one or more candidate values that improve the performance of the workload
   • Evaluate the benefit and risk, if any, of each candidate
   • Suggest one final candidate
[Chart: histogram of apps using different defaults, annotated with reward (memory saved) and risk (% of jobs that still run)]
50% memory savings for 97% of the workload
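The benefit/risk step can be sketched for one parameter (executor memory, matching the memory-savings example): the reward is memory saved across apps that still fit under the candidate default, and the risk proxy is the fraction of apps whose observed peak exceeds it. The helper, schema, and numbers are all illustrative:

```python
def evaluate_candidate(peak_mb_per_app, current_default_mb, candidate_mb):
    """Reward: total memory saved across apps that still fit under the candidate.
    Risk proxy: fraction of apps whose observed peak exceeds the candidate."""
    fits = [p for p in peak_mb_per_app if p <= candidate_mb]
    saved_mb = len(fits) * max(current_default_mb - candidate_mb, 0)
    still_run = len(fits) / len(peak_mb_per_app)
    return saved_mb, still_run

# Observed peak executor memory (MB) per app in the window -- illustrative data
peaks = [900, 1100, 1500, 800, 3900, 1200, 950, 1000, 1300, 1050]
current_default = 4096
for candidate in (4096, 2048, 1536):
    saved, ok = evaluate_candidate(peaks, current_default, candidate)
    print(f"default={candidate}MB  saved={saved}MB  apps-still-run={ok:.0%}")
```

In this toy data, lowering the default from 4096 MB to 2048 MB saves memory on 9 of 10 apps while one outlier would need an app-level override, the same reward/risk shape as the "50% savings for 97% of the workload" result.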
!31
Evaluate new cluster defaults
• Compare two windows, before and after changing the cluster defaults
  - Assumes workload characteristics don’t change drastically
• Measure KPIs:
  - Total # of apps per day
  - Vcore-hours per day
  - Memory-hours per day
  - % of apps using the default
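Computed over comparable windows, these KPIs reduce to simple per-day rates. A small sketch of the comparison (the record schema and all numbers are illustrative):

```python
def kpis(window):
    """Per-day KPIs for an observation window; field names are illustrative."""
    d = window["days"]
    return {
        "apps_per_day": window["apps"] / d,
        "vcore_hours_per_day": window["vcore_hours"] / d,
        "memory_gb_hours_per_day": window["memory_gb_hours"] / d,
        "pct_apps_on_default": 100.0 * window["apps_on_default"] / window["apps"],
    }

# Aggregates for the windows before/after the defaults change -- illustrative
before = {"days": 7, "apps": 14000, "vcore_hours": 8400,
          "memory_gb_hours": 33600, "apps_on_default": 12600}
after = {"days": 7, "apps": 14700, "vcore_hours": 6300,
         "memory_gb_hours": 21000, "apps_on_default": 13230}
change = {k: kpis(after)[k] - kpis(before)[k] for k in kpis(before)}
```

A drop in vcore-hours and memory-hours per day with a steady or rising app count is the signal that the new defaults helped rather than just shifting the workload.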
!32
One Customer (using Hive on MapReduce)
2x throughput increase and 2x reduction in cost!
• Compare # apps/day and Vcore-hours/day before
and after the change
!33
Spark cluster defaults
• spark.executor.memory
• spark.driver.memory
• spark.executor.cores
• spark.driver.cores
• spark.default.parallelism
• spark.executor.instances
• spark.yarn.driver.memoryOverhead
• spark.dynamicAllocation.enabled
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.initialExecutors
• spark.shuffle.service.enabled
• spark.sql.shuffle.partitions
• spark.sql.autoBroadcastJoinThreshold
Coming soon!
!34
Understanding cluster cost
[Diagram: the cluster workload feeds both forecasting and chargeback]
!35
Cluster cost optimization
• Common challenges
- How many nodes should I allocate?
- What types of nodes should I use?
- Improving autoscaling rules for spiky workloads
• Combine resource utilization data with cluster defaults
• Use cases:
- On-prem to Cloud migration
- Ephemeral and permanent clusters on the Cloud
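For the "how many nodes" question, one simple sizing policy is to provision for peak vcore demand plus headroom, derived from the resource-utilization data above. The policy, headroom value, and data below are illustrative, not a prescribed rule:

```python
import math

def nodes_needed(vcore_demand, vcores_per_node, headroom=0.2):
    """Provision for peak vcore demand plus fractional headroom (illustrative policy)."""
    peak = max(vcore_demand)
    return math.ceil(peak * (1 + headroom) / vcores_per_node)

hourly_vcore_demand = [120, 340, 510, 480, 200, 90]   # illustrative utilization data
print(nodes_needed(hourly_vcore_demand, vcores_per_node=16))
```

Sizing for the peak suits permanent clusters; for spiky workloads, the same demand series instead drives autoscaling thresholds so the cluster only holds peak capacity while it is needed.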
!36
Summary
• Rich opportunities to address distributed application performance management with a Data+AI approach
• Data collection: non-intrusive, scalable, asynchronous
• Expert knowledge + learning from data + use-case driven
Free trial? unraveldata.com/free-trial
Join our team? Email eric@unraveldata.com

Use Machine Learning to Get the Most out of Your Big Data Clusters