Databricks’ Data Pipelines:
Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline
on top of Apache Spark
BS from Xiamen University
Ph.D. from The University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
Building a data pipeline is hard
• At-least-once or exactly-once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
• Unified solution, everything in the same place
• All ETL jobs run on the Databricks platform
• Platform for Data Engineers and Scientists
• Test out new Spark and Databricks features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
Classic Lambda Data Pipeline
[Diagram: services (service 0 … service x), each with a log collector, feed a centralized messaging system; Delta ETL and Batch ETL jobs consume from it and write to a storage system.]
Databricks Data Pipeline Overview
[Diagram, built up across several slides: customer deployments (Customer Dep 0, 1, 2, …) and the Databricks deployment each run clusters (Cluster 0, 1, 2, …) whose services ship logs through a per-machine log-daemon into Amazon Kinesis. In the Databricks deployment, a Sync daemon reads Kinesis and writes raw record batches (JSON) to the Databricks Filesystem (DBFS); scheduled Databricks Jobs run ETL jobs that turn them into Parquet tables for data analysis, while real-time analysis reads straight from Kinesis.]
Log collection (Log-daemon)
• Fault tolerance and at-least-once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
Log Daemon Architecture
[Diagram: each service writes an active.log that log rotation rolls into hourly files (2015-11-30-20.log, 2015-11-30-19.log, …). Inside the log daemon, one logStream per service pairs a reader, which tails the rotated files and records its progress in state files, with a producer; a shared Message Producer publishes the records to Kinesis topics (topic-1, topic-2, …).]
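The at-least-once guarantee falls out of the checkpointing order: a reader publishes a line first, and only then records its new file offset in the state file. A minimal Scala sketch of that loop (the file layout, publish hook, and single-pass shape are illustrative assumptions, not the daemon's actual code):

```scala
import java.io.RandomAccessFile
import java.nio.file.{Files, Paths, StandardOpenOption}

object LogTailSketch {
  // Read the last checkpointed byte offset for a log file (0 if none yet).
  def readOffset(stateFile: String): Long = {
    val p = Paths.get(stateFile)
    if (Files.exists(p)) new String(Files.readAllBytes(p)).trim.toLong else 0L
  }

  // Tail the log once from the checkpointed offset, publishing each line and
  // checkpointing the new offset only after the publish succeeds. A crash
  // between publish and checkpoint re-sends on restart: at-least-once,
  // never data loss.
  def tailOnce(logFile: String, stateFile: String, publish: String => Unit): Unit = {
    val f = new RandomAccessFile(logFile, "r")
    try {
      f.seek(readOffset(stateFile))
      var line = f.readLine()
      while (line != null) {
        publish(line) // e.g. hand off to the Kinesis producer
        Files.write(Paths.get(stateFile), f.getFilePointer.toString.getBytes,
          StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
        line = f.readLine()
      }
    } finally f.close()
  }
}
```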
Sync Daemon
• Reads from Kinesis and writes to DBFS
• Buffers and writes in batches (128 MB or 5 minutes)
• Output partitioned by date
• A long-running Apache Spark job
• Easy to scale up and down
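As a rough sketch of a job in this shape, here is what it might look like with Spark Streaming's stock Kinesis receiver; the real sync daemon is custom, and the stream name, output path, and the 5-minute-only flush trigger (the 128 MB size trigger is omitted) are illustrative assumptions:

```scala
import java.time.LocalDate
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object SyncDaemonSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sync-daemon-sketch")
    // Flush roughly every 5 minutes (time trigger only in this sketch).
    val ssc = new StreamingContext(conf, Minutes(5))

    val records = KinesisUtils.createStream(
      ssc, "sync-daemon-sketch", "log-topic-1", // hypothetical stream name
      "https://kinesis.us-west-2.amazonaws.com", "us-west-2",
      InitialPositionInStream.TRIM_HORIZON, Minutes(5),
      StorageLevel.MEMORY_AND_DISK_2)

    records.map(bytes => new String(bytes, "UTF-8")).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Partition output by date so downstream ETL jobs can prune by day.
        val day = LocalDate.now().toString
        rdd.saveAsTextFile(s"/mnt/raw-logs/date=$day/batch-${System.currentTimeMillis()}")
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```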
ETL Jobs
[Diagram: raw records land in the Databricks Filesystem inside the Databricks deployment. A Delta job (every 10 mins) reads only the new files for the current day and appends with no dedup; a Batch job (daily) re-reads all files for the previous day, dedups, and overwrites. Both run as Databricks Jobs and write the ETL tables (Parquet).]
ETL Jobs
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as Parquet tables
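A hedged sketch of what "same code for Delta and Batch" can look like: one shared transform with two thin entry points (paths, columns, and names are illustrative, not the actual jobs):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EtlJobSketch {
  val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

  // The shared transformation: parse raw JSON records into typed columns.
  def transform(raw: DataFrame): DataFrame =
    raw.selectExpr("timestamp", "level", "service", "message")

  // Delta mode: read newly synced files for the current day and append
  // without deduplication; duplicates are tolerated until the batch job runs.
  def runDelta(day: String): Unit = {
    val raw = spark.read.json(s"/mnt/raw-logs/date=$day")
    transform(raw).write.mode("append").parquet(s"/mnt/etl/logs/date=$day")
  }

  // Batch mode: re-read everything for the previous day, deduplicate, and
  // overwrite the partition, restoring exactly-once output.
  def runBatch(day: String): Unit = {
    val raw = spark.read.json(s"/mnt/raw-logs/date=$day")
    transform(raw).dropDuplicates()
      .write.mode("overwrite").parquet(s"/mnt/etl/logs/date=$day")
  }
}
```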
Lessons Learned
- Partition pruning can save a lot of time and money
Pruning reduced one query from 2,800 seconds to just 15 seconds. But don't add too many partition levels: each extra level makes metadata discovery slower and more expensive.
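Concretely, with a date-partitioned table, a filter on the partition column keeps Spark from ever touching the other day-directories. An illustrative snippet (paths and columns assumed; `spark` is the usual SparkSession):

```scala
// Illustrative only: a table laid out as /mnt/etl/logs/date=YYYY-MM-DD/...
val logs = spark.read.parquet("/mnt/etl/logs")
// The filter on the partition column means Spark lists and reads only the
// date=2016-05-26 directory; all other day-partitions are pruned away.
logs.filter("date = '2016-05-26' AND level = 'ERROR'").count()
```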
Lessons Learned
- High S3 costs: lots of LIST requests
Metadata discovery on S3 is expensive, and Spark SQL refreshes its metadata cache even after write operations.
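One mitigation we can sketch (not the only fix; paths illustrative) is to scope reads to explicit partition directories so discovery never has to LIST the whole table:

```scala
// Reading the table root forces Spark to discover (LIST) every partition:
val whole = spark.read.parquet("/mnt/etl/logs")
// Reading the leaf directory for one day scopes the LIST calls to that
// partition only -- a simple way to cut S3 request costs for daily jobs:
val oneDay = spark.read.parquet("/mnt/etl/logs/date=2016-05-26")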
Running It All in Databricks - Jobs
Running It All in Databricks - Spark
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
Monitoring
Monitor logs by log level. A bug introduced at 2016-05-26 01:00:00 UTC was caught this way and a fix deployed within 2 hours.
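The chart boils down to a group-by over the ETL'd log table. An illustrative query (columns assumed; `window` is Spark 2.0's time-window function):

```scala
import org.apache.spark.sql.functions.{col, window}

// Count log records per level per hour -- the query behind a dashboard
// like the one above. Paths and column names are illustrative.
spark.read.parquet("/mnt/etl/logs")
  .groupBy(window(col("timestamp"), "1 hour"), col("level"))
  .count()
  .orderBy("window")
```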
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real-time analytics
DevOps issues are taken off the table:
- No need to manage a huge cluster
- Jobs are isolated, they don’t cannibalize each other’s resources
- Can launch any Spark version
Ongoing & Future Work
Structured Streaming
- Reduce pipeline complexity:
Sync Daemon + Delta + Batch Jobs => a single Structured Streaming job
- Reduce latency:
Availability of data in seconds instead of minutes
- Event-time dashboards
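A hedged sketch of the consolidated shape, reading the raw JSON landing zone as a file stream (source choice, schema, and paths are all assumptions about the future design):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("streaming-etl-sketch").getOrCreate()
val schema = new StructType()
  .add("timestamp", TimestampType).add("level", StringType)
  .add("service", StringType).add("message", StringType)

// One continuous query replaces the sync daemon plus the Delta and Batch
// jobs: new raw files are picked up, transformed, and appended to the
// Parquet table with the checkpoint guaranteeing end-to-end progress.
spark.readStream.schema(schema).json("/mnt/raw-logs")
  .writeStream
  .format("parquet")
  .option("path", "/mnt/etl/logs")
  .option("checkpointLocation", "/mnt/etl/_checkpoints/logs")
  .start()
```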
Try Apache Spark with Databricks
http://databricks.com/try
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks booth, 3:45-6:00 pm!
