InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/datadog-cloud
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
The Evolution of a Data Project
• Python script
• SQL on live DB
• SQL on reporting DB
• Terrible confusion
• Hadoop / Spark cluster
What needs fixing
image: Pexels
• One cluster: data lock-in.
• Want cluster time? You have to wait.
• Clusters are underutilized and EXPENSIVE
Elastic Big Data Platform @ Datadog
Doug Daniels
Director, Engineering
What’s our big data platform do?
WHOM: Data Engineers, Data Scientists
do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation
WITH: Spark, Hadoop (Pig), Python (Luigi)
Exploring the platform
COPIOUS TOOLING
CLOUD STORAGE
ELASTIC COMPUTE
CLOUD STORAGE
What do we store?
150 Integrations
…and more
What’s time series data?
timestamp 1447020511
metric system.cpu.idle
value 98.16687
tags host:i-xyz,
role:cassandra, …
We collect
over a trillion
of these per day
…and growing!
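To make the shape concrete, one of these points can be modeled as a tiny record. Here is a sketch in Python using the field names from the slide; it is an illustration, not Datadog's actual internal schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetricPoint:
    """One time series point, shaped like the example above."""
    timestamp: int        # Unix epoch seconds, e.g. 1447020511
    metric: str           # e.g. "system.cpu.idle"
    value: float          # e.g. 98.16687
    tags: List[str] = field(default_factory=list)

point = MetricPoint(
    timestamp=1447020511,
    metric="system.cpu.idle",
    value=98.16687,
    tags=["host:i-xyz", "role:cassandra"],
)
```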
Where to put the petabytes? Amazon S3.
How data gets to S3
Kafka → Go ingestion (Buffer, Sort + Dedupe, Upload) → internal format on Amazon S3
→ Luigi/Spark/Pig (Partition + Sort, Write Parquet, Update Metastore)
→ Parquet on Amazon S3, with metadata in the Hive Metastore
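A minimal sketch of that second hop, written as a modern PySpark job with hypothetical bucket paths and JSON standing in for the internal format (the real jobs ran under Luigi and also in Pig):

```python
from pyspark.sql import SparkSession

# Hypothetical locations; the real pipeline uses Datadog-internal paths.
RAW = "s3://example-bucket/raw/metrics/dt=2015-11-08/"
OUT = "s3://example-bucket/parquet/metrics/"

spark = (SparkSession.builder
         .appName("raw-to-parquet")
         .enableHiveSupport()             # lets saveAsTable update the Hive Metastore
         .getOrCreate())

df = spark.read.json(RAW)                 # JSON stands in for the internal format

(df.repartition("metric")                 # partition...
   .sortWithinPartitions("timestamp")     # ...and sort
   .write
   .mode("overwrite")
   .partitionBy("metric")                 # one directory per metric in S3
   .option("path", OUT)
   .saveAsTable("metrics"))               # registers the table in the metastore
```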
Isn’t this a job for HDFS?
What we don’t love about HDFS
• Causes the “one cluster” problem
• Come for the storage, get stuck with the servers
• No Java? No data!
S3 is flexible!
• Read data from as many clusters as you want
• Store unlimited stuff(*) with no management
• Rock solid: durability (99.999999999%), availability (99.99%)
• Access from any programming language
* Accepting the laws of physics and your credit card limit
Decouple data and compute
(BREAK THE RULES!)
Breaking the rules is fine.
In benchmarks, S3 is ~2X slower than HDFS.
It’s not all roses
Listing is slooooow
(A CAUTIONARY TALE)
How to fix slow listing
• Bigger files
• Parallelize it
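As a sketch of the “parallelize it” half (bucket name and prefix layout are hypothetical): instead of one sequential scan, list disjoint key prefixes concurrently with boto3.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "example-bucket"    # hypothetical
PREFIXES = [f"parquet/metrics/dt=2015-11-{d:02d}/" for d in range(1, 31)]

def list_keys(prefix):
    """List every key under one prefix; each call pages independently."""
    s3 = boto3.client("s3")  # one client per thread is the safe pattern
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# One slow sequential listing becomes many small, parallel listings.
with ThreadPoolExecutor(max_workers=16) as pool:
    all_keys = [k for keys in pool.map(list_keys, PREFIXES) for k in keys]
print(len(all_keys), "objects")
```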
No way to quickly move data
HDFS: a task writes to an intermediate location, then an atomic move publishes it to the final one.
S3: a task writes straight to the final location; there is no atomic move.
• Say goodbye to speculative execution
• Say hello to better task timeouts
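Because two speculative attempts of one task would race to write the same final S3 key, speculation goes off and liveness checks get tighter. A sketch of what that could look like with standard Spark settings; the exact values are illustrative assumptions, not Datadog's configuration:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-direct-write")
         # No atomic rename on S3, so never let two task attempts
         # race to produce the same final output file:
         .config("spark.speculation", "false")
         # Compensate with tighter liveness checks instead of speculation:
         .config("spark.network.timeout", "300s")
         .getOrCreate())
```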
But really: We 💜 S3
This is a great system.
✓ Data accessible from many clusters
✓ Storage is easy to manage
✓ It’s a multi-language paradise up in here
ELASTIC COMPUTE
TRADITIONALLY: one cluster to compute it all
Instead, we run many, many clusters
• New cluster for every automated job
• 10–20 clusters at a time
• Median lifetime: 2 hrs
Why so many clusters?
Total isolation
We know what’s happening and why
No more waiting on loaded clusters
• Tailor each cluster to the work you want to do
• Scale up when you need results faster
• Data scientists and data engineers don’t have to wait
Pick the best hardware for each job
• c3 for CPU-bound jobs
• r3 for memory-bound jobs
• m1.xlarge if you don’t care (cheap!)
== ~30% savings over general-purpose hardware
100% spot-instance clusters, all the time.*
* (ok, most of the time)
Ridiculous savings! Disappearing clusters!
How we do spot clusters in the big data platform
• Bid the on-demand price, pay the spot price
• Fall back to on-demand instances if you can’t get spot
• Monitor everything: jobs, clusters, spot market
• ☞ Save up to 80% off the on-demand price
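A hedged boto3 sketch of the bidding strategy: request spot instance groups with the bid set at the on-demand price. All names, counts, and the price are illustrative assumptions; the on-demand fallback and market monitoring live in the surrounding tooling.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

ON_DEMAND_PRICE = "0.420"   # illustrative; look up the real on-demand price

# Bid the on-demand price: you still pay the (lower) spot market price,
# and you only lose instances when spot rises above on-demand.
response = emr.run_job_flow(
    Name="nightly-rollup",                      # hypothetical job name
    ReleaseLabel="emr-4.1.0",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "c3.2xlarge",
             "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "SPOT", "BidPrice": ON_DEMAND_PRICE,
             "InstanceType": "c3.2xlarge", "InstanceCount": 20},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster disappears with the job
    },
    LogUri="s3://example-bucket/emr-logs/",     # ship logs to S3 (see debugging below)
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```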
Monitor the spot price; switch hardware when the market gets volatile.
We like this strategy a lot!
Before: the cluster is oversubscribed, everyone waits in line to do their work, and lots of expensive hardware sits idle when everyone’s gone. Now:
✓ No waiting for the cluster you need
✓ No waste from hardware sitting idle
✓ Spot clusters are affordable enough to use everywhere
What’s challenging, though?
Many things that disappear.
COPIOUS TOOLING
Platform as a service: Web and APIs, CLI
Jobs, Clusters, Schedules, Users, Code, Monitoring, Logs, and more
Big Data Platform Architecture
DATA: Amazon S3
CLUSTER: EMR
WEB: Web, API
WORKER: Pig Workers, Spark Workers, Luigi Workers
USER: CLI, API Clients, Job Scheduler
STORAGE: Metadata DB, Queueing, Logs
Datadog Monitoring
How to find the right cluster when they disappear?
Cluster tagging for discovery
#anomaly-detection
#monitor-report
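One plausible way to implement this with EMR tags; the tag key, values, and cluster id are assumptions based on the slide's examples:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Tag the cluster at (or right after) launch so it stays findable,
# even though the cluster itself is short-lived.
emr.add_tags(
    ResourceId="j-EXAMPLE123",          # hypothetical cluster id
    Tags=[{"Key": "job", "Value": "anomaly-detection"}],
)

def clusters_for_job(job_name):
    """Discovery: scan active clusters and match on the tag."""
    ids = []
    for summary in emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]:
        detail = emr.describe_cluster(ClusterId=summary["Id"])["Cluster"]
        tags = {t["Key"]: t["Value"] for t in detail.get("Tags", [])}
        if tags.get("job") == job_name:
            ids.append(summary["Id"])
    return ids

print(clusters_for_job("anomaly-detection"))
```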
How to monitor many disappearing clusters?
Dynamic Monitoring on Tags: dashboards and monitors scoped to tags
Example: monitors for the anomaly-detection job use the scope cluster_tags:anomaly-detection
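One plausible way to express such a tag-scoped monitor through Datadog's public API; the metric, threshold, and keys are placeholders, and only the tag scope mirrors the slide:

```python
from datadog import initialize, api

initialize(api_key="API_KEY", app_key="APP_KEY")   # placeholders

# The monitor is scoped to a tag, not to a specific cluster, so it
# keeps working as tagged clusters come and go.
api.Monitor.create(
    type="metric alert",
    query=("avg(last_10m):avg:system.load.1{cluster_tags:anomaly-detection} "
           "by {host} > 8"),
    name="High load on anomaly-detection clusters",
    message="Load is high on an anomaly-detection cluster node.",
)
```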
How to debug problems when the cluster’s gone?
Debugging In a Post-Cluster World
• Send all logs to S3: HDFS, YARN, Pig, Spark
• Visualize the pipeline: Lipstick for Pig, Spark History Server, Luigi task flow
• Preserve historical monitoring data: keep history, by tag, after the cluster disappears
How to handle certain cluster failure in your jobs?
Automatic cleanup and restart
Luigi: design for failure. Task A feeds task B; when the cluster dies mid-run (❌), partial output is cleaned up and the workflow restarts, redoing only what never completed.
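A minimal Luigi sketch of that idea, with hypothetical task and bucket names: each task's output is an S3 target whose existence marks completion, so rerunning the workflow after a dead cluster skips finished tasks and redoes only the missing ones.

```python
import luigi
from luigi.contrib.s3 import S3Target


class RollupMetrics(luigi.Task):
    """Hypothetical upstream job; its output is a complete S3 object, never a partial file."""
    day = luigi.DateParameter()

    def output(self):
        # Existence of this marker object == the task completed.
        return S3Target(f"s3://example-bucket/rollups/{self.day}/_SUCCESS")

    def run(self):
        # ... launch the cluster and run the job here ...
        with self.output().open("w") as marker:
            marker.write("done")


class Report(luigi.Task):
    """Hypothetical downstream job; Luigi reruns it only if its output is missing."""
    day = luigi.DateParameter()

    def requires(self):
        return RollupMetrics(self.day)

    def output(self):
        return S3Target(f"s3://example-bucket/reports/{self.day}.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("...")
```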
CLOUD STORAGE + ELASTIC COMPUTE + COPIOUS TOOLING
Recommendations for Cloud Big Data
• Use S3 for permanent data, not HDFS
• Start from EMR if building yourself
• Look into a PaaS: Netflix Genie, Qubole, Databricks
• Tag your clusters for dynamic monitoring
• Design for failure with a workflow tool (Luigi, Airflow)
Thanks!
Want to work with us on Spark, Hadoop, Kafka, Parquet, and more?
jobs.datadoghq.com
DM me @ddaniels888 or doug@datadoghq.com