Using Scalding for Data-Driven
Product Development
Sasha Ovsankin
LinkedIn
Presented to Scala By The Bay
Aug 9, 2014
/summary
Data-Driven
Product
Development
Scalding =
Hadoop + Scala
/data-driven
Your
Service
Value Data
/data-driven
Your
Amazing
Service
Value Data
/data-driven/linkedin
“Online” World
Web Applications
NoSQL Data Stores
“Offline” World (Hadoop)
HDFS
Hadoop Jobs
Tracking/logging
Analytics
Data Products
Messaging
Message delivery
Databases
/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
– http://lnkd.in/big-data-ecosystem
• Grid Operations
– http://lnkd.in/gridops2013
/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:
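import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    // split each input line on whitespace into words
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    // count the occurrences of each word
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )
}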
• Succinct and powerful
• High level of abstraction
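To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The object and method names (`LocalWordCount`, `wordCount`) are made up for illustration; only the shape of the pipeline (flatMap, groupBy, size) is meant to mirror the job above.

```scala
object LocalWordCount {
  // Split every line on whitespace, drop empty tokens,
  // then group identical words and count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+"""))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

The in-memory version only works for data that fits on one machine; the Scalding job keeps roughly the same vocabulary while the framework plans the distributed execution.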
/data-driven/problem/scaling
• Problem: Scaling
• Solution
– Distributed processing
– High-level description of algorithms
– Functional programming
…/solution/scalding
../problem/complexity
• Problem: Complexity
• Solution
– Consistent way of organizing data
• Self-describing data formats (Avro)
• File organization
– Type safety
– Modularization
…/solution/scalding
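One way to read the "type safety" bullet: when records are modeled as Scala types, a misspelled field is a compile error rather than a failure discovered after a job has been submitted. The sketch below uses plain Scala collections; the `PageView` record and its fields are hypothetical and not taken from any LinkedIn schema.

```scala
object TypedRecords {
  // Hypothetical record type; field names are illustrative only.
  final case class PageView(memberId: Long, page: String)

  // Count views per page. Writing `_.pgae` here would not compile,
  // whereas a misspelled symbol-named field is only caught later.
  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views
      .groupBy(_.page)
      .map { case (page, group) => (page, group.size) }
}
```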
/linkedin/hadoop/practices
• All online data end up in HDFS
– Avro encoding is standard
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
../solution/scala/killer-argument
• Map & reduce -- primitives
scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
res20: Int = 333833500
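A small sketch of why these two primitives matter for distribution: because addition is associative, the input can be cut into chunks, each chunk mapped and reduced independently, and the partial results combined afterwards. The chunks here stand in for input splits on a cluster; the names (`ChunkedSumOfSquares`, `direct`, `chunked`) are illustrative, and both methods assume a non-empty input since `reduce` fails on an empty collection.

```scala
object ChunkedSumOfSquares {
  def square(x: Int): Int = x * x

  // Single-pass map and reduce over the whole input.
  def direct(xs: Seq[Int]): Int =
    xs.map(square).reduce(_ + _)

  // Map and reduce each chunk on its own, then reduce the partial sums.
  def chunked(xs: Seq[Int], chunkSize: Int): Int =
    xs.grouped(chunkSize)
      .map(chunk => chunk.map(square).reduce(_ + _))
      .reduce(_ + _)
}
```

Both versions should agree for any chunk size, which is the property that lets a framework decide how to split the work.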
/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by
our team
– Pretty happy with readability, maintainability and tooling
support
• Dozens of flows are currently in production, and
counting
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
/summary
Data-Driven
Product
Development
Scalding =
Hadoop + Scala
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them
become more productive and successful
• We are looking for amazing people interested in
Software Engineering and Data Science
– http://linkedin.com/careers
Questions?
