!1
!2
The big data stack is a collection of distributed systems
[Architecture diagram: unstructured sources (logs, files, and media), structured business/custom apps, and Internet of Things data flow through Ingest → Store → Prep & Train (Python, Scala, Spark, SQL) → Model & Serve, backed by a data lake, Amazon Athena, Amazon Redshift, and machine learning with Amazon SageMaker]
!3
My app failed
!4
My data pipeline is missing its SLA
!5
Our Cloud costs are out of control
!6
What is the real root cause?
!7
What enterprises are facing: Monitoring Data is Siloed
A survey of 6,000+ enterprise IT professionals from Australia, Canada, France, Germany, the UK, and the USA reveals that 91% are struggling with siloed monitoring data
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!8
What enterprises are facing: Reactive Approach
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!9
What enterprises are facing: High MTTR (mean time to resolution)
Published: https://blog.appdynamics.com/aiops/aiops-platforms-transform-performance-monitoring/
!10
Solving this as a Data + AI problem
1. Lots of data about how the applications run
2. Most root causes are common and recurring
3. Much of the root cause analysis and recommendations can be automated and improved via learning
!11
Using AI/ML to address distributed application performance & operations management:
AIOps
!12
First: Bring all monitoring data to a single platform
One complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
!13
Then: Apply algorithms to analyze the data & (whenever possible) take actions automatically
Built-in intelligence & automation on the same complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
!14
Building such a platform requires innovation:
• In data collection & transport: non-intrusive, low overhead, support for transient/elastic clusters
• In data storage: variety, scale, asynchronous arrival
• In algorithms to provide insights: real-time, combining expert knowledge with ML
• In algorithms to take actions: reliable, predictable
!15
Innovative solutions designed for production enterprise environments:
• Application autotuning
• Holistic cluster optimization
!16
spark.driver.cores                    2
spark.executor.cores                  10
…
spark.sql.shuffle.partitions          300
spark.sql.autoBroadcastJoinThreshold  20MB
…
SKEW('orders', 'o_custId')            true
spark.catalog.cacheTable("orders")    true
…
We represent this configuration setting as a vector X; performance is a function of X.
Tuning is often done by trial and error.
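Before any algorithm can search this space, the configuration has to be encoded as a numeric vector X. A minimal sketch of one possible encoding (the helper and encoding scheme are illustrative, not Unravel's actual format):

```python
# Encode a Spark configuration as a numeric vector X for the tuning algorithms.
# The parameter names mirror the slide; the encoding scheme itself is illustrative.

def encode_config(conf: dict) -> list:
    """Map each parameter to a number: booleans to 0/1, sizes like '20MB' to MB."""
    def to_number(v):
        if isinstance(v, bool):
            return 1.0 if v else 0.0
        if isinstance(v, str) and v.upper().endswith("MB"):
            return float(v[:-2])
        return float(v)
    return [to_number(conf[k]) for k in sorted(conf)]  # fixed key order -> fixed dims

conf = {
    "spark.driver.cores": 2,
    "spark.executor.cores": 10,
    "spark.sql.shuffle.partitions": 300,
    "spark.sql.autoBroadcastJoinThreshold": "20MB",
}
X = encode_config(conf)   # one point in the configuration space
```

Sorting the keys gives every app's configuration the same dimension order, so settings from different runs are directly comparable.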
!17
Given: App + Goal
• Goal: Find the setting of X that best meets the goal
• Challenge: The response surface y = f(X) is unknown
!18
A new world
INPUTS
1. App = Spark Query
2. Goal = Speedup
“I need to make this app faster”
!19
A new world
[Chart: app duration over time as tuning proceeds]
• In the blink of an eye, the user gets recommendations to make the app 30% faster
• As the user finishes checking email, she has a verified run that is 60% faster
• The user comes back from lunch: a verified run that is 90% faster
90% faster!
!20
Autotuning workflow
[Diagram: the user submits (App, Goal); the Orchestrator launches probes with setting Xnext on cluster services (on-premises and Cloud); monitoring data plus historical data & probe data feed the Probe Algorithm and the Recommendation Algorithm]
!21
1. Bootstrap: get an initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Probe Algorithm: select the next probe Xnext based on all history and probe data available so far, choosing the setting with maximum expected improvement EIP(X). Repeat until the stopping condition is reached.
!22
[Four snapshots of a 1-D response surface (performance y vs. setting x1) as probing proceeds, showing the surrogate model and EIP(X): some probes explore uncertain regions, others exploit promising ones]
Balance exploration vs. exploitation
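One common way to realize this probe-selection step is Bayesian optimization with a Gaussian-process surrogate, where expected improvement naturally trades off exploration (high uncertainty) against exploitation (good predicted performance). The talk does not specify Unravel's actual algorithm; the one-dimensional numpy sketch below (illustrative kernel, hyperparameters, and data) only shows the mechanics of scoring candidate settings:

```python
import math
import numpy as np

def gp_posterior(X, y, Xs, length=2.0, noise=1e-6):
    """Posterior mean/std of a Gaussian process with an RBF kernel.
    Kernel choice and hyperparameters are illustrative."""
    def k(a, b):
        d = a.reshape(-1, 1) - b.reshape(1, -1)
        return np.exp(-0.5 * (d / length) ** 2)
    K_inv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(X, Xs)                                   # cross-covariance, shape (n, m)
    mu = Ks.T @ K_inv @ y                           # posterior mean at candidates
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)   # posterior variance at candidates
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    """EI for minimization: expected amount by which a candidate beats best_y."""
    z = (best_y - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return (best_y - mu) * cdf + sigma * pdf

# Probed settings <Xi, yi>; lower y is better (e.g., app duration in minutes)
X = np.array([4.0, 7.0, 11.0])
y = np.array([6.0, 3.0, 5.0])
candidates = np.linspace(4, 12, 81)
mu, sigma = gp_posterior(X, y, candidates)
ei = expected_improvement(mu, sigma, y.min())
X_next = candidates[np.argmax(ei)]                  # setting to probe next
```

EI is near zero at already-probed points (no uncertainty, no expected gain) and grows where the surrogate predicts good performance, where uncertainty is high, or both.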
!23
Automated tuning of a failed Spark application
• Failed due to OOM
• Fixed conf setting: a more memory-efficient configuration
!24
Challenges and Limitations
• Need to account for data volume changes
• Not applicable to certain root causes
- Small files
- SQL queries that need to be rewritten
- Node/network issues
- Pool resource allocation
Need user input!
!25
!26
Innovative solutions designed for production enterprise environments:
• Application autotuning
• Holistic cluster optimization
!27
Holistic cluster optimization
One complete correlated view: History Server API, Resource Manager API, container metrics, logs, metadata, data statistics, SQL query plans, configuration
Examples of optimization beyond the app level:
• Scheduler configurations for queues
• Schema/data insights
• Cluster defaults for certain parameters
• Cluster cost optimization
!28
Single-app tuning is not scalable
• A cluster may run 1000s of apps every day
• Defaults are often bad: many apps could use some tuning
[Same configuration-vs-performance illustration as earlier: spark.driver.cores, spark.executor.cores, spark.sql.shuffle.partitions, spark.sql.autoBroadcastJoinThreshold, SKEW('orders', 'o_custId'), spark.catalog.cacheTable("orders"), …]
!29
Better idea: Change the cluster defaults instead!
• Every app uses the new values automatically
• Challenge: new default could be better for some
apps but worse for others
• Data-driven approach: identify the best new default
based on the cluster’s workload
• New default not necessarily the best for a given app,
but the best for the input workload
!30
Finding new default values based on workload
1. Specify a time window on the input workload, and other optional parameters
2. Go through all the apps in this window
3. For each parameter:
   • Identify one or more candidate values that improve the performance of the workload
   • Evaluate the benefit and risk, if any, of each candidate
   • Suggest one final candidate
[Chart: histogram of apps using different defaults, annotated with reward (memory saved) and risk (% of jobs that still run)]
50% memory savings for 97% of the workload
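The benefit/risk step can be sketched for one parameter (executor memory, matching the memory-savings example): the reward is memory saved across apps that still fit under the candidate default, and the risk proxy is the fraction of apps whose observed peak exceeds it. The helper, schema, and numbers are all illustrative:

```python
def evaluate_candidate(peak_mb_per_app, current_default_mb, candidate_mb):
    """Reward: total memory saved across apps that still fit under the candidate.
    Risk proxy: fraction of apps whose observed peak exceeds the candidate."""
    fits = [p for p in peak_mb_per_app if p <= candidate_mb]
    saved_mb = len(fits) * max(current_default_mb - candidate_mb, 0)
    still_run = len(fits) / len(peak_mb_per_app)
    return saved_mb, still_run

# Observed peak executor memory (MB) per app in the window -- illustrative data
peaks = [900, 1100, 1500, 800, 3900, 1200, 950, 1000, 1300, 1050]
current_default = 4096
for candidate in (4096, 2048, 1536):
    saved, ok = evaluate_candidate(peaks, current_default, candidate)
    print(f"default={candidate}MB  saved={saved}MB  apps-still-run={ok:.0%}")
```

In this toy data, lowering the default from 4096 MB to 2048 MB saves memory on 9 of 10 apps while one outlier would need an app-level override, the same reward/risk shape as the "50% savings for 97% of the workload" result.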
!31
Evaluate new cluster defaults
• Compare two windows, before and after changing the cluster defaults
  - Assumes workload characteristics don’t change drastically
• Measure KPIs:
  - Total # of apps per day
  - Vcore-hours per day
  - Memory-hours per day
  - % of apps using the default
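Computed over comparable windows, these KPIs reduce to simple per-day rates. A small sketch of the comparison (the record schema and all numbers are illustrative):

```python
def kpis(window):
    """Per-day KPIs for an observation window; field names are illustrative."""
    d = window["days"]
    return {
        "apps_per_day": window["apps"] / d,
        "vcore_hours_per_day": window["vcore_hours"] / d,
        "memory_gb_hours_per_day": window["memory_gb_hours"] / d,
        "pct_apps_on_default": 100.0 * window["apps_on_default"] / window["apps"],
    }

# Aggregates for the windows before/after the defaults change -- illustrative
before = {"days": 7, "apps": 14000, "vcore_hours": 8400,
          "memory_gb_hours": 33600, "apps_on_default": 12600}
after = {"days": 7, "apps": 14700, "vcore_hours": 6300,
         "memory_gb_hours": 21000, "apps_on_default": 13230}
change = {k: kpis(after)[k] - kpis(before)[k] for k in kpis(before)}
```

A drop in vcore-hours and memory-hours per day with a steady or rising app count is the signal that the new defaults helped rather than just shifting the workload.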
!32
One Customer (using Hive on MapReduce)
2x throughput increase and 2x reduction in cost!
• Compare # apps/day and Vcore-hours/day before
and after the change
!33
Spark cluster defaults
• spark.executor.memory
• spark.driver.memory
• spark.executor.cores
• spark.driver.cores
• spark.default.parallelism
• spark.executor.instances
• spark.yarn.driver.memoryOverhead
• spark.dynamicAllocation.enabled
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.initialExecutors
• spark.shuffle.service.enabled
• spark.sql.shuffle.partitions
• spark.sql.autoBroadcastJoinThreshold
Coming soon!
!34
Understanding cluster cost
[Diagram: the cluster workload feeds both forecasting and chargeback]
!35
Cluster cost optimization
• Common challenges
- How many nodes should I allocate?
- What types of nodes should I use?
- Improving autoscaling rules for spiky workloads
• Combine resource utilization data with cluster defaults
• Use cases:
- On-prem to Cloud migration
- Ephemeral and permanent clusters on the Cloud
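For the "how many nodes" question, one simple sizing policy is to provision for peak vcore demand plus headroom, derived from the resource-utilization data above. The policy, headroom value, and data below are illustrative, not a prescribed rule:

```python
import math

def nodes_needed(vcore_demand, vcores_per_node, headroom=0.2):
    """Provision for peak vcore demand plus fractional headroom (illustrative policy)."""
    peak = max(vcore_demand)
    return math.ceil(peak * (1 + headroom) / vcores_per_node)

hourly_vcore_demand = [120, 340, 510, 480, 200, 90]   # illustrative utilization data
print(nodes_needed(hourly_vcore_demand, vcores_per_node=16))
```

Sizing for the peak suits permanent clusters; for spiky workloads, the same demand series instead drives autoscaling thresholds so the cluster only holds peak capacity while it is needed.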
!36
Summary
• Rich opportunities to address distributed application performance management with a Data+AI approach
• Data collection: non-intrusive, scalable, asynchronous
• Expert knowledge + learning from data + use-case driven
Free trial? unraveldata.com/free-trial
Join our team? Email eric@unraveldata.com

Use Machine Learning to Get the Most out of Your Big Data Clusters