Demystifying Data Engineering
Data engineering
• Software engineering with an emphasis on dealing with large amounts of data
• A “specialty” of software engineering
Why now?
• There has always been value in scale, but it was previously too difficult or expensive
• Economic and technological advances make these scales accessible
Enable others to answer questions on a dataset within latency constraints
Data engineering
• Distributed systems – consensus, consistency, availability, etc.
• Parallel processing
• Databases
• Queuing
Data engineering
• Human-fault tolerance
• Metrics and monitoring
• Multi-tenancy
BackType
• When I joined:
• Comment search by keyword
• Comment search by user
• Basic stats on commenters
• Link search on Twitter
BackType
[Architecture diagram: Kyoto Cabinet, custom workers, custom crawlers]
BackType
• Inflexible
• Prone to corruption
• Heavy operational burden
• Not scalable
• Not fault-tolerant
BackType
• Enable asking any question (with high latency)
• Allows exploration and experimentation
• Establishes human-fault tolerance
[Diagram: four Collector nodes]
ElephantDB
• Export results of MapReduce pipelines for querying
• Low latency querying but out of date by many hours
• Incredibly simple
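The idea of exporting batch results for low-latency, read-only querying can be sketched as follows. This is a minimal illustration and not ElephantDB's actual API or storage format: the names shard_for, export_shards and lookup are hypothetical, plain dictionaries stand in for on-disk shard files, and hashing keys with CRC32 is an assumption made for the example.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard with a stable hash so writers and readers agree."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def export_shards(results: dict, num_shards: int) -> list:
    """Split the output of a batch job into read-only key/value shards."""
    shards = [{} for _ in range(num_shards)]
    for key, value in results.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


def lookup(shards: list, key: str):
    """Serve a query by reading a single shard; returns None if the key is absent."""
    return shards[shard_for(key, len(shards))].get(key)


if __name__ == "__main__":
    batch_output = {"example.com/a": 12, "example.com/b": 3}
    shards = export_shards(batch_output, num_shards=4)
    print(lookup(shards, "example.com/a"))
```

The shards are never updated in place. Fresh results arrive only when the next batch run exports a new set of shards, which is why queries are fast but can be hours out of date.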
Data engineering
• Infrastructure
• Data pipelines
• Abstractions
Data pipeline example
Tweets (S3) → Normalize URLs → Compute hour bucket → Sum by hour/url → Emit ElephantDB indexes
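The batch steps in this pipeline (normalize URLs, compute the hour bucket, sum by hour/url) can be sketched on a single machine. This is only an illustration under stated assumptions: the input is taken to be (unix_timestamp, url) pairs, the normalization rules are made up for the example, and the function names are hypothetical. A real pipeline would run the same logic as MapReduce jobs over the tweets stored in S3.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Illustrative rules only: lowercase the host, drop the fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def hour_bucket(timestamp: float) -> int:
    """Number of whole hours since the Unix epoch."""
    return int(timestamp // 3600)


def count_by_hour_and_url(tweets) -> dict:
    """Sum mentions per (hour bucket, normalized URL) over the whole input."""
    counts = Counter()
    for timestamp, url in tweets:
        counts[(hour_bucket(timestamp), normalize_url(url))] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        (0, "HTTP://Example.com/a/"),
        (10, "http://example.com/a#x"),
        (3600, "http://example.com/a"),
    ]
    for key, total in sorted(count_by_hour_and_url(sample).items()):
        print(key, total)
```

Because the whole input is re-read on every run, the counts can always be recomputed from scratch if the normalization logic changes or a bug is found.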
Data pipeline example
Tweets (Kafka) → Normalize URLs → Compute hour bucket → Update hour/url bucket → Cassandra
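The streaming variant applies the same steps to one tweet at a time and updates a stored count instead of summing over the full dataset. A minimal sketch, assuming an in-memory dictionary in place of Cassandra and a plain Python iterable in place of a Kafka consumer; HourUrlCounter and its normalization rule are hypothetical names chosen for the example, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


class HourUrlCounter:
    """Incrementally maintains counts keyed by (hour bucket, normalized URL)."""

    def __init__(self, store=None):
        # The store only needs dict-like get/set; a real deployment would
        # issue counter updates against a database such as Cassandra.
        self.store = {} if store is None else store

    @staticmethod
    def normalize(url: str) -> str:
        """Illustrative rule only: lowercase the host and drop the fragment."""
        parts = urlsplit(url.strip())
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, ""))

    def update(self, timestamp: float, url: str) -> int:
        """Process one tweet and return the new count for its bucket."""
        key = (int(timestamp // 3600), self.normalize(url))
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]


if __name__ == "__main__":
    counter = HourUrlCounter()
    for ts, url in [(5, "http://Example.com/a"), (20, "http://example.com/a#top")]:
        print(counter.update(ts, url))
```

The trade-off against the batch version is that results are available within moments of a tweet arriving, but a bug in the update logic corrupts stored state rather than a recomputable output.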
Abstraction example
MapReduce → Cascading → Cascalog (each layer builds on the previous one)
Infrastructure
• HDFS
• MapReduce
• Kafka
• Storm
• Spark
• Cassandra
• HBase
• ElephantDB
• Zookeeper
Streaming compute team at Twitter
• Started the streaming compute team at Twitter
• One shared Storm cluster for the entire company
Multi-tenancy
• Independent applications on same cluster
• Topologies should not affect one another
Resource allocation
• Topologies should be given an appropriate amount of resources
Initial approach
• Use Mesos to provide resource guarantees
• Users include resources needed as part of topology submission
Solution
• Implement a new scheduler which gives production topologies dedicated hardware
• Only the Storm team can configure production topologies
• Left-over machines are used as failover or for in-development topologies
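The allocation policy described in this solution can be sketched as a small function. This is not Storm's scheduler code or API: assign_machines, its arguments and the machine names are hypothetical, and the sketch only shows the policy of reserving whole machines for production topologies and leaving the remainder as a shared pool.

```python
def assign_machines(machines, production_needs):
    """Give each production topology its own machines; return (assignment, leftover).

    machines: list of machine names in the cluster.
    production_needs: dict mapping topology name to the number of dedicated
    machines it requires (assumed to be set by the cluster operators only).
    """
    pool = list(machines)
    assignment = {}
    for topology, needed in production_needs.items():
        if needed > len(pool):
            raise ValueError(f"not enough machines for topology {topology!r}")
        # Dedicated hardware: these machines run nothing but this topology.
        assignment[topology] = pool[:needed]
        pool = pool[needed:]
    # Whatever remains serves as failover capacity or hosts in-development topologies.
    return assignment, pool


if __name__ == "__main__":
    cluster = ["m1", "m2", "m3", "m4", "m5"]
    dedicated, leftover = assign_machines(cluster, {"ads": 2, "search": 1})
    print(dedicated)
    print(leftover)
```

Since no two production topologies share a machine, one topology cannot starve another of CPU or memory, which addresses the multi-tenancy requirement without per-process resource accounting.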
Data Engineering vs Data Science
• Well-defined problems
• No special statistics skills required
• Larger scope
• Not just analytics
Open source
• Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase)
• Many have commercial support
Open source
• Very important for recruiting data engineers
• Strong developers want to work at places where they can be involved with open source
Open source
• Develop a technology brand for the company (in conjunction with a tech blog)
• Creating a popular open source project can give you access to lots of strong engineers
Open source
• Identify strong engineers in the community you may want to recruit
• Learn best practices and get help from the people who know the tools the best
• *Do not* expect to get “free work” on your projects
Ideal data engineer
• Strong software engineering skills
  • Abstraction
  • Testing
  • Version control
  • Refactoring
• Strong algorithm skills
• Good at digging into open source code
• Good at stress testing
Finding strong data engineers
• Standard “coding on the whiteboard” interviews are nearly useless
• Use take-home projects to gauge general programming ability
• Best of all is to review past projects that required data engineering
Questions?