InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/datadog-cloud
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
The Evolution of a Data Project
• Python script
• SQL on live DB
• SQL on reporting DB
• Terrible confusion
• Hadoop / Spark cluster
What needs fixing
image: Pexels
• One cluster: data lock-in.
• Want cluster time? You have to wait.
• Clusters are underutilized and EXPENSIVE
Elastic Big Data Platform @ Datadog
Doug Daniels
Director, Engineering
What’s our big data platform do?
WHOM: Data Engineers, Data Scientists
do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation
WITH: Spark, Hadoop (Pig), Python (Luigi)
Exploring the platform
COPIOUS TOOLING
CLOUD STORAGE
ELASTIC COMPUTE
CLOUD STORAGE
What do we store?
150 Integrations
…and more
What’s time series data?
timestamp 1447020511
metric system.cpu.idle
value 98.16687
tags host:i-xyz,
role:cassandra, …
We collect
over a trillion
of these per day
…and growing!
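To make the shape concrete, one of these points can be modeled as a tiny record. Here is a sketch in Python using the field names from the slide; it is an illustration, not Datadog's actual internal schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetricPoint:
    """One time series point, shaped like the example above."""
    timestamp: int        # Unix epoch seconds, e.g. 1447020511
    metric: str           # e.g. "system.cpu.idle"
    value: float          # e.g. 98.16687
    tags: List[str] = field(default_factory=list)

point = MetricPoint(
    timestamp=1447020511,
    metric="system.cpu.idle",
    value=98.16687,
    tags=["host:i-xyz", "role:cassandra"],
)
```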
Where to put the petabytes? Amazon S3.
How data gets to S3
Kafka → Go ingestion (Buffer, Sort + Dedupe, Upload) → internal format on Amazon S3
→ Luigi/Spark/Pig (Partition + Sort, Write Parquet, Update Metastore)
→ Parquet on Amazon S3, with metadata in the Hive Metastore
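A minimal sketch of that second hop, written as a modern PySpark job with hypothetical bucket paths and JSON standing in for the internal format (the real jobs ran under Luigi and also in Pig):

```python
from pyspark.sql import SparkSession

# Hypothetical locations; the real pipeline uses Datadog-internal paths.
RAW = "s3://example-bucket/raw/metrics/dt=2015-11-08/"
OUT = "s3://example-bucket/parquet/metrics/"

spark = (SparkSession.builder
         .appName("raw-to-parquet")
         .enableHiveSupport()             # lets saveAsTable update the Hive Metastore
         .getOrCreate())

df = spark.read.json(RAW)                 # JSON stands in for the internal format

(df.repartition("metric")                 # partition...
   .sortWithinPartitions("timestamp")     # ...and sort
   .write
   .mode("overwrite")
   .partitionBy("metric")                 # one directory per metric in S3
   .option("path", OUT)
   .saveAsTable("metrics"))               # registers the table in the metastore
```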
Isn’t this a job for HDFS?
What we don’t love about HDFS
• Causes the “one cluster” problem
• Come for the storage, get stuck with the servers
• No Java? No data!
S3 is flexible!
• Read data from as many clusters as you want
• Store unlimited stuff(*) with no management
• Rock solid: durability (99.999999999%), availability (99.99%)
• Access from any programming language
* Accepting the laws of physics and your credit card limit
Decouple data and compute
(BREAK THE RULES!)
Breaking the rules is fine.
In benchmarks, S3 is ~2X slower than HDFS.
It’s not all roses
Listing is slooooow
(A CAUTIONARY TALE)
How to fix slow listing
• Bigger files
• Parallelize it
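As a sketch of the “parallelize it” half (bucket name and prefix layout are hypothetical): instead of one sequential scan, list disjoint key prefixes concurrently with boto3.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "example-bucket"    # hypothetical
PREFIXES = [f"parquet/metrics/dt=2015-11-{d:02d}/" for d in range(1, 31)]

def list_keys(prefix):
    """List every key under one prefix; each call pages independently."""
    s3 = boto3.client("s3")  # one client per thread is the safe pattern
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# One slow sequential listing becomes many small, parallel listings.
with ThreadPoolExecutor(max_workers=16) as pool:
    all_keys = [k for keys in pool.map(list_keys, PREFIXES) for k in keys]
print(len(all_keys), "objects")
```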
No way to quickly move data
HDFS: a task writes to an intermediate location, then an atomic move publishes it to the final one.
S3: a task writes straight to the final location; there is no atomic move.
• Say goodbye to speculative execution
• Say hello to better task timeouts
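Because two speculative attempts of one task would race to write the same final S3 key, speculation goes off and liveness checks get tighter. A sketch of what that could look like with standard Spark settings; the exact values are illustrative assumptions, not Datadog's configuration:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-direct-write")
         # No atomic rename on S3, so never let two task attempts
         # race to produce the same final output file:
         .config("spark.speculation", "false")
         # Compensate with tighter liveness checks instead of speculation:
         .config("spark.network.timeout", "300s")
         .getOrCreate())
```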
But really: We 💜 S3
This is a great system.
✓ Data accessible from many clusters
✓ Storage is easy to manage
✓ It’s a multi-language paradise up in here
ELASTIC COMPUTE
TRADITIONALLY: one cluster to compute it all
Instead, we run many, many clusters
• New cluster for every automated job
• 10–20 clusters at a time
• Median lifetime: 2 hrs
Why so many clusters?
Total isolation
We know what’s happening and why
No more waiting on loaded clusters
• Tailor each cluster to the work you want to do
• Scale up when you need results faster
• Data scientists and data engineers don’t have to wait
Pick the best hardware for each job
• c3 for CPU-bound jobs
• r3 for memory-bound jobs
• m1.xlarge if you don’t care (cheap!)
== ~30% savings over general-purpose hardware
100% spot-instance clusters, all the time.*
* (ok, most of the time)
Ridiculous savings! Disappearing clusters!
How we do spot clusters in the big data platform
• Bid the on-demand price, pay the spot price
• Fall back to on-demand instances if you can’t get spot
• Monitor everything: jobs, clusters, spot market
• ☞ Save up to 80% off the on-demand price
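A hedged boto3 sketch of the bidding strategy: request spot instance groups with the bid set at the on-demand price. All names, counts, and the price are illustrative assumptions; the on-demand fallback and market monitoring live in the surrounding tooling.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

ON_DEMAND_PRICE = "0.420"   # illustrative; look up the real on-demand price

# Bid the on-demand price: you still pay the (lower) spot market price,
# and you only lose instances when spot rises above on-demand.
response = emr.run_job_flow(
    Name="nightly-rollup",                      # hypothetical job name
    ReleaseLabel="emr-4.1.0",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "c3.2xlarge",
             "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "SPOT", "BidPrice": ON_DEMAND_PRICE,
             "InstanceType": "c3.2xlarge", "InstanceCount": 20},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster disappears with the job
    },
    LogUri="s3://example-bucket/emr-logs/",     # ship logs to S3 (see debugging below)
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```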
Monitor the spot price; switch hardware when the market gets volatile.
We like this strategy a lot!
Before: the cluster is oversubscribed, everyone waits in line to do their work, and lots of expensive hardware sits idle when everyone’s gone. Now:
✓ No waiting for the cluster you need
✓ No waste from hardware sitting idle
✓ Spot clusters are affordable enough to use everywhere
What’s challenging, though?
Many things that disappear.
COPIOUS TOOLING
Platform as a service: Web and APIs, CLI
Jobs, Clusters, Schedules, Users, Code, Monitoring, Logs, and more
Big Data Platform Architecture
DATA: Amazon S3
CLUSTER: EMR
WEB: Web, API
WORKER: Pig Workers, Spark Workers, Luigi Workers
USER: CLI, API Clients, Job Scheduler
STORAGE: Metadata DB, Queueing, Logs
Datadog Monitoring
How to find the right cluster when they disappear?
Cluster tagging for discovery
#anomaly-detection
#monitor-report
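One plausible way to implement this with EMR tags; the tag key, values, and cluster id are assumptions based on the slide's examples:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Tag the cluster at (or right after) launch so it stays findable,
# even though the cluster itself is short-lived.
emr.add_tags(
    ResourceId="j-EXAMPLE123",          # hypothetical cluster id
    Tags=[{"Key": "job", "Value": "anomaly-detection"}],
)

def clusters_for_job(job_name):
    """Discovery: scan active clusters and match on the tag."""
    ids = []
    for summary in emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]:
        detail = emr.describe_cluster(ClusterId=summary["Id"])["Cluster"]
        tags = {t["Key"]: t["Value"] for t in detail.get("Tags", [])}
        if tags.get("job") == job_name:
            ids.append(summary["Id"])
    return ids

print(clusters_for_job("anomaly-detection"))
```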
How to monitor many disappearing clusters?
Dynamic Monitoring on Tags: dashboards and monitors scoped to tags
Example: monitors for the anomaly-detection job use the scope cluster_tags:anomaly-detection
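One plausible way to express such a tag-scoped monitor through Datadog's public API; the metric, threshold, and keys are placeholders, and only the tag scope mirrors the slide:

```python
from datadog import initialize, api

initialize(api_key="API_KEY", app_key="APP_KEY")   # placeholders

# The monitor is scoped to a tag, not to a specific cluster, so it
# keeps working as tagged clusters come and go.
api.Monitor.create(
    type="metric alert",
    query=("avg(last_10m):avg:system.load.1{cluster_tags:anomaly-detection} "
           "by {host} > 8"),
    name="High load on anomaly-detection clusters",
    message="Load is high on an anomaly-detection cluster node.",
)
```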
How to debug problems when the cluster’s gone?
Debugging In a Post-Cluster World
• Send all logs to S3: HDFS, YARN, Pig, Spark
• Visualize the pipeline: Lipstick for Pig, Spark History Server, Luigi task flow
• Preserve historical monitoring data: keep history, by tag, after the cluster disappears
How to handle certain cluster failure in your jobs?
Automatic cleanup and restart
Luigi: design for failure. Task A feeds task B; when the cluster dies mid-run (❌), partial output is cleaned up and the workflow restarts, redoing only what never completed.
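A minimal Luigi sketch of that idea, with hypothetical task and bucket names: each task's output is an S3 target whose existence marks completion, so rerunning the workflow after a dead cluster skips finished tasks and redoes only the missing ones.

```python
import luigi
from luigi.contrib.s3 import S3Target


class RollupMetrics(luigi.Task):
    """Hypothetical upstream job; its output is a complete S3 object, never a partial file."""
    day = luigi.DateParameter()

    def output(self):
        # Existence of this marker object == the task completed.
        return S3Target(f"s3://example-bucket/rollups/{self.day}/_SUCCESS")

    def run(self):
        # ... launch the cluster and run the job here ...
        with self.output().open("w") as marker:
            marker.write("done")


class Report(luigi.Task):
    """Hypothetical downstream job; Luigi reruns it only if its output is missing."""
    day = luigi.DateParameter()

    def requires(self):
        return RollupMetrics(self.day)

    def output(self):
        return S3Target(f"s3://example-bucket/reports/{self.day}.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("...")
```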
CLOUD STORAGE + ELASTIC COMPUTE + COPIOUS TOOLING
Recommendations for Cloud Big Data
• Use S3 for permanent data, not HDFS
• Start from EMR if building yourself
• Look into a PaaS: Netflix Genie, Qubole, Databricks
• Tag your clusters for dynamic monitoring
• Design for failure with a workflow tool (Luigi, Airflow)
Thanks!
Want to work with us on Spark, Hadoop, Kafka, Parquet, and more?
jobs.datadoghq.com
DM me @ddaniels888 or doug@datadoghq.com