Lessons Learned From Managing Thousands of Apache Spark Clusters at Scale
Josh Rosen and Henry Davidge
About Us
• Josh
– Apache Spark Committer; contributing since 2012
• Henry
– Software engineer on cluster management team
– BS Yale 2014
• Both love data + Spark
About Databricks
• TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
• PRODUCT: Unified Analytics Platform
• MISSION: Making Big Data Simple
Monitoring challenges at Databricks
• Databricks is an inherently complex system
– Many Spark clusters, configurations, and versions
– Customers can run arbitrary code on their clusters
– Deep integration with cloud providers
• Scale
– 300k+ metrics/min from Databricks
– 2M+ metrics/min from customers
– 200+ MB/second of logs
Background: Databricks Architecture
• Control plane in our AWS account
• Spark clusters in customers' AWS accounts
[Diagram: Databricks control-plane services issue Create / Configure / Terminate requests to clusters running in Customer 1 and Customer 2 accounts; customer requests go to the Databricks services]
Background: Data pipeline
• Services output three streams
– Fast path metrics
– Rich structured logs
– Unstructured service logs
• Everything to Kinesis
[Diagram: Databricks services and customer logs & metrics flow into Amazon Kinesis]
Background: Data pipeline
• Structured Streaming job reads raw logs from Kinesis and saves them as Parquet in S3 (a minimal sketch follows below)
• Batch and streaming jobs perform additional processing on the S3 data and output additional Parquet
[Diagram: raw logs flow from Amazon Kinesis through Structured Streaming into Amazon S3, where batch & streaming jobs pick them up]
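A minimal sketch of this first hop, assuming the Databricks Kinesis source ("kinesis" format with streamName/region/initialPosition options); the stream name, region, bucket, and paths are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("log-ingest").getOrCreate()

// Read raw records from Kinesis. The format name and options follow the Databricks
// Kinesis connector; the stream name and region here are illustrative.
val rawLogs = spark.readStream
  .format("kinesis")
  .option("streamName", "service-logs")
  .option("region", "us-west-2")
  .option("initialPosition", "latest")
  .load()

// Persist the decoded records as Parquet in S3 for the downstream batch/streaming jobs.
rawLogs
  .selectExpr("CAST(data AS STRING) AS raw", "approximateArrivalTimestamp")
  .writeStream
  .format("parquet")
  .option("path", "s3://logs-bucket/raw/")
  .option("checkpointLocation", "s3://logs-bucket/checkpoints/raw/")
  .start()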
Story 1: Tracking AWS anomalies
Stuff happens
• Many failure modes possible during provisioning
– Limits
– Environment issues
– Bugs
• Want to catch the bugs!
– Without many false positives
Observing failures
{
  "instance-id": STRING,
  "api-error-code": STRING,          // Request rejected
  "instance-state-reason": STRING,   // Terminated after launch
  "instance-status": STRING,         // Health status from AWS
  "customer-metadata": OBJECT
}
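A hedged sketch of how these events could be loaded with Spark: the schema mirrors the field names above, while the S3 path and the string-map typing of customer-metadata are assumptions.

import org.apache.spark.sql.types._

// Schema mirroring the event shape above. Treating customer-metadata as a
// string-to-string map is an assumption for illustration.
val failureEventSchema = StructType(Seq(
  StructField("instance-id", StringType),
  StructField("api-error-code", StringType),         // request rejected
  StructField("instance-state-reason", StringType),  // terminated after launch
  StructField("instance-status", StringType),        // health status from AWS
  StructField("customer-metadata", MapType(StringType, StringType))
))

// Hypothetical location of the raw failure events.
val failureEvents = spark.read
  .schema(failureEventSchema)
  .json("s3://logs-bucket/instance-events/")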
Analyzing failures
• Structured Streaming query reads from a file source
• Strip known error patterns (limits, nonexistent resources)
• If a new error type is seen, send a low-priority alert (sketched below)
– If seen by multiple customers, page the on-call engineer
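A sketch of this logic, reusing the schema from the previous sketch; the known error codes, the "customer-id" metadata key, and the console sink standing in for real alerting are all assumptions:

import org.apache.spark.sql.functions._

// Error codes that are expected during provisioning and should not alert (illustrative list).
val knownErrorCodes = Seq("InstanceLimitExceeded", "InvalidAMIID.NotFound")

val newErrors = spark.readStream
  .schema(failureEventSchema)                      // schema from the sketch above
  .parquet("s3://logs-bucket/instance-events/")    // hypothetical path
  .filter(!col("api-error-code").isin(knownErrorCodes: _*))

// Count distinct affected customers per unknown error code; page only when more than
// one customer is hit, otherwise raise a low-priority alert.
val alerts = newErrors
  .groupBy(col("api-error-code"))
  .agg(approx_count_distinct(col("customer-metadata").getItem("customer-id")).as("customers"))
  .withColumn("severity", when(col("customers") > 1, "page-on-call").otherwise("low-priority"))

alerts.writeStream
  .outputMode("update")
  .format("console")        // stand-in for the real alerting sink
  .option("checkpointLocation", "s3://logs-bucket/checkpoints/aws-anomalies/")
  .start()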
One day in May...
• AWS bug caused Spot requests to be rejected
• The alert allowed us to identify and notify affected customers
• We informed AWS of the issue, and they patched it
• Our Spark cluster helped yours!
Story 2: Discovering bugs from unstructured logs
Goal: monitor Spark logs for errors
• Search Spark logs for error messages to discover issues and determine their scope/impact (see the query sketch after this list):
– Which customers are impacted by an error?
– How frequently is that error occurring?
– Does the error only affect certain versions of Spark?
– Are there long-term trends in the data?
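An illustrative query answering several of these questions at once; the normalized_spark_errors table and its columns (error_signature, customer_id, spark_version, event_date) are hypothetical:

import org.apache.spark.sql.functions._

// Which customers are hit, how often, on which Spark versions, and over what time span.
spark.table("normalized_spark_errors")
  .groupBy("error_signature", "spark_version")
  .agg(
    countDistinct("customer_id").as("impacted_customers"),
    count(lit(1)).as("occurrences"),
    min("event_date").as("first_seen"),
    max("event_date").as("last_seen"))
  .orderBy(desc("occurrences"))
  .show(20, truncate = false)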
Challenge: false positives
• Errors may be fixed in newer Spark versions but continue to occur in old versions
• Raw logs can be very messy: many near-duplicate errors due to variables being included in log messages
Solution: normalize, deduplicate & filter
• Normalize: replace constants in logs (numbers, IP addresses, customer names) with placeholders
• Deduplicate: store (count, version, set(customers), example) instead of raw logs
• Filter: use patterns to (conditionally) ignore known errors or to surface only new errors (errors that appeared for the first time); see the sketch below
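A minimal sketch of all three steps, assuming a hypothetical logsWithVersions DataFrame with message, customer_id, and spark_version columns; the regex patterns and suppression strings are illustrative:

import org.apache.spark.sql.functions._

// Normalize: replace variable parts (IP addresses first, then bare numbers) with
// placeholders so that near-duplicate messages collapse into one signature.
val withSignature = logsWithVersions.withColumn("signature",
  regexp_replace(
    regexp_replace(col("message"), "\\b\\d{1,3}(\\.\\d{1,3}){3}\\b", "<IP>"),
    "\\b\\d+\\b", "<NUM>"))

// Deduplicate: keep aggregates plus one example line instead of every raw log.
val deduped = withSignature
  .groupBy("signature", "spark_version")
  .agg(
    count(lit(1)).as("occurrences"),
    collect_set("customer_id").as("customers"),
    first("message").as("example"))

// Filter: drop signatures that match known, already-triaged error patterns.
val suppressionPatterns = Seq("Connection reset by peer", "Executor heartbeat timed out")
val interesting = deduped.filter(
  !suppressionPatterns.map(p => col("signature").contains(p)).reduce(_ || _))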
Pipeline overview
• High-level overview of the pipeline:
[Diagram: raw logs are joined with service version info, then flow through fast normalize, deduplicate/aggregate, slow normalize, and final aggregation; error suppression patterns and historic data are applied to yield non-suppressed errors and new/interesting errors, which feed alerts, reports, dashboards, and storage for historical analysis]
Example 1: SPARK-19691
• spark.range(10)
    .selectExpr("cast (id as decimal) as x")
    .selectExpr("percentile(x, 0.5)")
    .collect()
• Failed with java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
Role in QA
• Proactively detect bugs in unreleased versions:
– Compare error profiles between staging and production environments (a sketch follows below)
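One way such a comparison could look, assuming hypothetical prod_error_signatures and staging_error_signatures tables keyed by the normalized signature:

// Error signatures seen in staging but never in production may indicate regressions
// in an unreleased version. Table names here are assumptions.
val prodSignatures = spark.table("prod_error_signatures").select("signature")
val stagingErrors  = spark.table("staging_error_signatures")

val newInStaging = stagingErrors.join(prodSignatures, Seq("signature"), "left_anti")
newInStaging.show(truncate = false)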
Lessons Learned
• Structure proactively
– Much easier to implement at logging time
• Strive for small data
– Normalize, deduplicate, find patterns
• Alert at multiple urgency levels
For more details
• Log collection pipeline:
– “A Journey into Databricks' Pipelines: Journey and Lessons Learned”
– Somewhat out of date, but thorough
• Structured metrics pipeline:
– “Monitoring Large-Scale Spark Clusters at Databricks”
– Kinesis + Prometheus + Grafana
• Error log analysis:
– “Monitoring error logs at Databricks”
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Thank You.
joshrosen@databricks.com
hhd@databricks.com
