Demystifying Data Engineering

Data engineering
• Software engineering with an emphasis on
dealing with large amounts of data
• A “specialty” of software engineering

Why now?
• There has always been value in scale, but it
was previously too difficult / expensive
• Economics and technology advances make
these scales accessible

Enable others to answer questions on a
dataset within latency constraints

Data engineering
• Distributed systems – consensus,
consistency, availability, etc.
• Parallel processing
• Databases
• Queuing

Data engineering
• Human-fault tolerance
• Metrics and monitoring
• Multi-tenancy

BackType
• When I joined:
• Comment search by keyword
• Comment search by user
• Basic stats on commenters
• Link search on Twitter

BackType
• Kyoto Cabinet
• Custom workers
• Custom crawlers

BackType
• Inflexible
• Prone to corruption
• Heavy operational burden
• Not scalable
• Not fault-tolerant

BackType
• Enables asking any question (with high
latency)
• Allows exploration and experimentation
• Establishes human-fault tolerance

Collector
Collector
Collector
Collector

ElephantDB
• Export results of MapReduce pipelines for
querying
• Low-latency querying, but results are out
of date by many hours
• Incredibly simple
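
The export-then-query idea can be illustrated with a toy sketch. This is not ElephantDB's API: the function names, the MD5-based sharding, and the in-memory dicts are all assumptions made for illustration. A batch job produces key/value results, they are split into read-only shards, and the serving side only ever performs lookups.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard with a stable hash so the exporter and the reader agree."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def export_shards(results: dict, num_shards: int) -> list:
    """Batch side: split precomputed results into one index per shard."""
    shards = [{} for _ in range(num_shards)]
    for key, value in results.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards

def lookup(shards: list, key: str, default=None):
    """Serving side: read-only lookup against the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key, default)

# Hypothetical output of a batch job (key format is made up for this sketch).
batch_output = {
    "example.com/a|hour=1": 3,
    "example.com/b|hour=2": 1,
}
shards = export_shards(batch_output, num_shards=4)
```

Because the shards are never modified after export, serving stays simple; the cost is that answers are only as fresh as the last batch run.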

Data engineering
• Infrastructure
• Data pipelines
• Abstractions

Data pipeline example
Tweets (S3) → Normalize URLs → Compute hour bucket → Sum by hour/url → Emit ElephantDB indexes
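
The batch steps above (normalize, bucket by hour, sum) can be sketched in plain Python. This is a minimal single-machine illustration, not the MapReduce job itself: the `(timestamp, url)` input shape, the normalization rules, and all function names are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def hour_bucket(epoch_seconds: int) -> int:
    """Number of whole hours since the Unix epoch."""
    return epoch_seconds // 3600

def count_by_hour_and_url(tweets):
    """Sum occurrences per (hour bucket, normalized URL) over the whole batch."""
    counts = Counter()
    for epoch_seconds, url in tweets:
        counts[(hour_bucket(epoch_seconds), normalize_url(url))] += 1
    return counts

# Hypothetical input records: (epoch seconds, URL found in the tweet).
tweets = [
    (3600, "HTTP://Example.com/a/"),
    (3700, "http://example.com/a#top"),
    (7300, "http://example.com/a"),
]
counts = count_by_hour_and_url(tweets)
```

The resulting `(hour, url) -> count` map is the kind of key/value output that would then be exported for querying.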

Data pipeline example
Tweets (Kafka) → Normalize URLs → Compute hour bucket → Update hour/url bucket → Cassandra
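
The streaming variant differs in that each tweet updates its bucket as it arrives, so counts are current after every event. A minimal sketch follows, with an in-memory dict standing in for the Cassandra table and a Python list standing in for the Kafka stream; the class and function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path.rstrip("/"), parts.query, "")
    )

class HourlyUrlCounts:
    """In-memory stand-in for a counter table keyed by (hour, url)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def update(self, epoch_seconds: int, url: str):
        """Apply one event: increment the bucket it falls into."""
        key = (epoch_seconds // 3600, canonical_url(url))
        self._counts[key] += 1
        return key

    def get(self, hour: int, url: str) -> int:
        """Read the current count without creating an empty bucket."""
        return self._counts.get((hour, canonical_url(url)), 0)

store = HourlyUrlCounts()
# Hypothetical events, processed one at a time as they would arrive.
for epoch_seconds, url in [(3600, "http://Example.com/a/"),
                           (3700, "http://example.com/a")]:
    store.update(epoch_seconds, url)
```

Unlike the batch version, a mistaken update here mutates stored state directly, which is one reason the two approaches trade latency against ease of recovery.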

Abstraction example
MapReduce → Cascading → Cascalog

Infrastructure
• HDFS
• MapReduce
• Kafka
• Storm
• Spark
• Cassandra
• HBase
• ElephantDB
• Zookeeper

Streaming compute team at Twitter
• Started streaming compute team at Twitter
• One shared Storm cluster for entire
company

Multi-tenancy
• Independent applications on same cluster
• Topologies should not affect one another

Resource allocation
• Topologies should be given an appropriate
amount of resources

Initial approach
• Use Mesos to provide resource guarantees
• Users include resources needed as part of
topology submission

Solution
• Implement new scheduler which gives
production topologies dedicated hardware
• Only the Storm team can configure production
topologies
• Left-over machines are used as failover or
for in-development topologies
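
The allocation policy above can be sketched as a small function. This is a hypothetical illustration, not Storm's scheduler interface: the `allocate` name, the host names, and the input shapes are assumptions. Production topologies each receive dedicated machines, and whatever remains forms the shared pool for failover and in-development topologies.

```python
def allocate(machines, production_needs):
    """Assign dedicated machines to production topologies.

    machines: list of host names available in the cluster.
    production_needs: dict of topology name -> number of dedicated machines.
    Returns (dedicated, leftover): a dict of topology -> hosts, plus the
    hosts left for failover and in-development topologies.
    """
    free = list(machines)
    dedicated = {}
    # Sort by name so the assignment is deterministic.
    for topology, needed in sorted(production_needs.items()):
        if needed > len(free):
            raise ValueError(f"not enough machines for {topology}")
        dedicated[topology], free = free[:needed], free[needed:]
    return dedicated, free

# Hypothetical cluster and production configuration.
dedicated, leftover = allocate(
    ["m1", "m2", "m3", "m4", "m5"],
    {"ads": 2, "search": 1},
)
```

Since no host appears in more than one production assignment, one topology cannot take resources from another, which is the isolation property the slide describes.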

Data Engineering vs Data Science
• Well-defined problems
• No special statistics skills required
• Larger scope
• Not just analytics

Open source
• Almost all major Big Data tools are open
source (e.g. Hadoop, Storm, Spark, Kafka,
Cassandra, HBase)
• Many have commercial support

Open source
• Very important for recruiting data
engineers
• Strong developers want to work at places
where they can be involved with open
source

Open source
• Develop a technology brand for company
(in conjunction with a tech blog)
• Creating a popular open source project can
give you access to lots of strong engineers

Open source
• Identify strong engineers in the community
you may want to recruit
• Learn best practices and get help from the
people who know the tools the best
• *Do not* expect to get “free work” on
your projects

Ideal data engineer
• Strong software engineering skills
• Abstraction
• Testing
• Version control
• Refactoring

Ideal data engineer
• Strong algorithm skills

Ideal data engineer
• Good at digging into open source code
• Good at stress testing

Finding strong data engineers
• Standard “coding on the whiteboard”
interviews are nearly useless
• Use take-home projects to gauge general
programming ability
• Best of all is to see projects that require
data engineering
