Databricks’ Data Pipelines:
Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline
on top of Apache Spark
BS from Xiamen University
Ph.D. from The University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
Building a data pipeline is hard
• At-least-once or exactly-once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
• Unified solution, everything in the same place
• All ETL jobs run on the Databricks platform
• Platform for Data Engineers and Scientists
• Test out new Spark and Databricks features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
Classic Lambda Data Pipeline
[Diagram: services (service 0 … service x), each with a log collector, feed a centralized messaging system; Delta ETL and Batch ETL jobs consume from it and write to a storage system.]
Databricks Data Pipeline Overview
[Diagram, built up across several slides: customer deployments (Customer Dep 0, 1, 2, …) and the Databricks deployment each run clusters (Cluster 0, 1, 2, …) whose services ship logs through a per-machine log-daemon into Amazon Kinesis. In the Databricks deployment, a Sync daemon reads Kinesis and writes raw record batches (JSON) to the Databricks Filesystem (DBFS); scheduled Databricks Jobs run ETL jobs that turn them into Parquet tables for data analysis, while real-time analysis reads straight from Kinesis.]
Log collection (Log-daemon)
• Fault tolerance and at-least-once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
Log Daemon Architecture
[Diagram: each service writes an active.log that log rotation rolls into hourly files (2015-11-30-20.log, 2015-11-30-19.log, …). Inside the log daemon, one logStream per service pairs a reader, which tails the rotated files and records its progress in state files, with a producer; a shared Message Producer publishes the records to Kinesis topics (topic-1, topic-2, …).]
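The at-least-once guarantee falls out of the checkpointing order: a reader publishes a line first, and only then records its new file offset in the state file. A minimal Scala sketch of that loop (the file layout, publish hook, and single-pass shape are illustrative assumptions, not the daemon's actual code):

```scala
import java.io.RandomAccessFile
import java.nio.file.{Files, Paths, StandardOpenOption}

object LogTailSketch {
  // Read the last checkpointed byte offset for a log file (0 if none yet).
  def readOffset(stateFile: String): Long = {
    val p = Paths.get(stateFile)
    if (Files.exists(p)) new String(Files.readAllBytes(p)).trim.toLong else 0L
  }

  // Tail the log once from the checkpointed offset, publishing each line and
  // checkpointing the new offset only after the publish succeeds. A crash
  // between publish and checkpoint re-sends on restart: at-least-once,
  // never data loss.
  def tailOnce(logFile: String, stateFile: String, publish: String => Unit): Unit = {
    val f = new RandomAccessFile(logFile, "r")
    try {
      f.seek(readOffset(stateFile))
      var line = f.readLine()
      while (line != null) {
        publish(line) // e.g. hand off to the Kinesis producer
        Files.write(Paths.get(stateFile), f.getFilePointer.toString.getBytes,
          StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
        line = f.readLine()
      }
    } finally f.close()
  }
}
```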
Sync Daemon
• Reads from Kinesis and writes to DBFS
• Buffers and writes in batches (128 MB or 5 minutes)
• Output partitioned by date
• A long-running Apache Spark job
• Easy to scale up and down
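As a rough sketch of a job in this shape, here is what it might look like with Spark Streaming's stock Kinesis receiver; the real sync daemon is custom, and the stream name, output path, and the 5-minute-only flush trigger (the 128 MB size trigger is omitted) are illustrative assumptions:

```scala
import java.time.LocalDate
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object SyncDaemonSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sync-daemon-sketch")
    // Flush roughly every 5 minutes (time trigger only in this sketch).
    val ssc = new StreamingContext(conf, Minutes(5))

    val records = KinesisUtils.createStream(
      ssc, "sync-daemon-sketch", "log-topic-1", // hypothetical stream name
      "https://kinesis.us-west-2.amazonaws.com", "us-west-2",
      InitialPositionInStream.TRIM_HORIZON, Minutes(5),
      StorageLevel.MEMORY_AND_DISK_2)

    records.map(bytes => new String(bytes, "UTF-8")).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Partition output by date so downstream ETL jobs can prune by day.
        val day = LocalDate.now().toString
        rdd.saveAsTextFile(s"/mnt/raw-logs/date=$day/batch-${System.currentTimeMillis()}")
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```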
ETL Jobs
[Diagram: raw records land in the Databricks Filesystem inside the Databricks deployment. A Delta job (every 10 mins) reads only the new files for the current day and appends with no dedup; a Batch job (daily) re-reads all files for the previous day, dedups, and overwrites. Both run as Databricks Jobs and write the ETL tables (Parquet).]
ETL Jobs
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as Parquet tables
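A hedged sketch of what "same code for Delta and Batch" can look like: one shared transform with two thin entry points (paths, columns, and names are illustrative, not the actual jobs):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EtlJobSketch {
  val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

  // The shared transformation: parse raw JSON records into typed columns.
  def transform(raw: DataFrame): DataFrame =
    raw.selectExpr("timestamp", "level", "service", "message")

  // Delta mode: read newly synced files for the current day and append
  // without deduplication; duplicates are tolerated until the batch job runs.
  def runDelta(day: String): Unit = {
    val raw = spark.read.json(s"/mnt/raw-logs/date=$day")
    transform(raw).write.mode("append").parquet(s"/mnt/etl/logs/date=$day")
  }

  // Batch mode: re-read everything for the previous day, deduplicate, and
  // overwrite the partition, restoring exactly-once output.
  def runBatch(day: String): Unit = {
    val raw = spark.read.json(s"/mnt/raw-logs/date=$day")
    transform(raw).dropDuplicates()
      .write.mode("overwrite").parquet(s"/mnt/etl/logs/date=$day")
  }
}
```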
Lessons Learned
- Partition pruning can save a lot of time and money
Pruning reduced one query from 2,800 seconds to just 15 seconds. But don't add too many partition levels: each extra level makes metadata discovery slower and more expensive.
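Concretely, with a date-partitioned table, a filter on the partition column keeps Spark from ever touching the other day-directories. An illustrative snippet (paths and columns assumed; `spark` is the usual SparkSession):

```scala
// Illustrative only: a table laid out as /mnt/etl/logs/date=YYYY-MM-DD/...
val logs = spark.read.parquet("/mnt/etl/logs")
// The filter on the partition column means Spark lists and reads only the
// date=2016-05-26 directory; all other day-partitions are pruned away.
logs.filter("date = '2016-05-26' AND level = 'ERROR'").count()
```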
Lessons Learned
- High S3 costs: lots of LIST requests
Metadata discovery on S3 is expensive, and Spark SQL refreshes its metadata cache even after write operations.
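One mitigation we can sketch (not the only fix; paths illustrative) is to scope reads to explicit partition directories so discovery never has to LIST the whole table:

```scala
// Reading the table root forces Spark to discover (LIST) every partition:
val whole = spark.read.parquet("/mnt/etl/logs")
// Reading the leaf directory for one day scopes the LIST calls to that
// partition only -- a simple way to cut S3 request costs for daily jobs:
val oneDay = spark.read.parquet("/mnt/etl/logs/date=2016-05-26")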
Running It All in Databricks - Jobs
Running It All in Databricks - Spark
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
Monitoring
Monitor logs by log level. A bug introduced at 2016-05-26 01:00:00 UTC was caught this way and a fix deployed within 2 hours.
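The chart boils down to a group-by over the ETL'd log table. An illustrative query (columns assumed; `window` is Spark 2.0's time-window function):

```scala
import org.apache.spark.sql.functions.{col, window}

// Count log records per level per hour -- the query behind a dashboard
// like the one above. Paths and column names are illustrative.
spark.read.parquet("/mnt/etl/logs")
  .groupBy(window(col("timestamp"), "1 hour"), col("level"))
  .count()
  .orderBy("window")
```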
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real-time analytics
DevOps issues are taken off the table:
- No need to manage a huge cluster
- Jobs are isolated, they don’t cannibalize each other’s resources
- Can launch any Spark version
Ongoing & Future Work
Structured Streaming
- Reduce pipeline complexity:
Sync Daemon + Delta + Batch Jobs => a single Structured Streaming job
- Reduce latency:
Availability of data in seconds instead of minutes
- Event-time dashboards
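A hedged sketch of the consolidated shape, reading the raw JSON landing zone as a file stream (source choice, schema, and paths are all assumptions about the future design):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("streaming-etl-sketch").getOrCreate()
val schema = new StructType()
  .add("timestamp", TimestampType).add("level", StringType)
  .add("service", StringType).add("message", StringType)

// One continuous query replaces the sync daemon plus the Delta and Batch
// jobs: new raw files are picked up, transformed, and appended to the
// Parquet table with the checkpoint guaranteeing end-to-end progress.
spark.readStream.schema(schema).json("/mnt/raw-logs")
  .writeStream
  .format("parquet")
  .option("path", "/mnt/etl/logs")
  .option("checkpointLocation", "/mnt/etl/_checkpoints/logs")
  .start()
```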
Try Apache Spark with Databricks
http://databricks.com/try
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks booth, 3:45-6:00 pm!
