Using Scalding for Data Driven Product Development at LinkedIn

Using Scalding for Data-Driven
Product Development
Sasha Ovsankin
LinkedIn
Presented to Scala By The Bay
Aug 9, 2014

/summary
Data-Driven
Product
Development
Scalding =
Hadoop + Scala

/data-driven
Your
Amazing
Service
Value Data

/data-driven/linkedin
“Online” World
Web Applications
NoSQL Data Stores
“Offline” World (Hadoop)
HDFS
Hadoop Jobs
Tracking/logging
Analytics
Data Products
Messaging
Message delivery
Databases

/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
– http://lnkd.in/big-data-ecosystem
• Grid Operations
– http://lnkd.in/gridops2013

/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop
framework
• Uses API similar to Scala collections:
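import com.twitter.scalding._

// Counts occurrences of each whitespace-separated word in the input file.
class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )
}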
• Succinct and powerful
• High level of abstraction
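
To make the collections analogy concrete, here is a rough sketch of the same word count written against plain in-memory Scala collections. The object name `LocalWordCount` is hypothetical and not from the talk; the code runs on a single machine and does not use Scalding or Hadoop.

```scala
// Plain Scala collections version of word count (no Hadoop, no Scalding).
// LocalWordCount is an illustrative name, not part of any library.
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // split each line into words
      .filter(_.nonEmpty)            // drop empty tokens caused by leading whitespace
      .groupBy(identity)             // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The shape (flatMap, then groupBy, then size) mirrors the Scalding job; the difference is that Scalding plans the same steps as Map/Reduce stages over files instead of evaluating them in memory.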

/data-driven/problem/scaling
• Problem: Scaling
• Solution
– Distributed processing
– High-level description of algorithms
– Functional programming
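
One way to see why a functional, high-level description helps with distribution is the sketch below. It is a single-machine illustration under assumptions: the name `ChunkedSum` is hypothetical, and each chunk merely stands in for the slice of data one worker might process.

```scala
// Sketch: a map step plus an associative reduce can be evaluated chunk by chunk.
// ChunkedSum is an illustrative name; nothing here runs on a cluster.
object ChunkedSum {
  // Map each element to its square, then reduce locally within one chunk.
  def partial(chunk: Seq[Int]): Long =
    chunk.map(x => x.toLong * x).sum

  // Combine the per-chunk results. Addition is associative, so the way
  // the data is split into chunks does not change the answer.
  def sumOfSquares(data: Seq[Int], chunkSize: Int): Long =
    data.grouped(chunkSize).map(partial).sum
}
```

Because the description says what to compute and not in which order, a framework is free to choose the chunking and placement.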

../problem/complexity
• Problem: Complexity
• Solution
– Consistent way of organizing data
• Self-describing data formats (Avro)
• File organization
– Type safety
– Modularization
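
As a small illustration of the type-safety point, the sketch below models a record as a Scala case class. The `ProfileView` record and its field names are assumptions made up for this example; they are not LinkedIn's actual schema.

```scala
object TypedRecords {
  // Hypothetical tracking record; field names are assumptions for illustration.
  final case class ProfileView(viewerId: Long, vieweeId: Long, timestampMs: Long)

  // Number of views received per member. A misspelled field or a wrong type
  // here is rejected by the compiler instead of failing partway through a job.
  def viewsPerMember(events: Seq[ProfileView]): Map[Long, Int] =
    events.groupBy(_.vieweeId).map { case (id, views) => id -> views.size }
}
```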

/linkedin/hadoop/practices
• All online data end up in HDFS
– Avro encoding is standard
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem

../solution/scala/killer-argument
• Map & reduce -- primitives
scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
res20: Int = 333833500

/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by
our team
– Pretty happy with readability, maintainability and tooling
support
• Dozens of flows are currently in production, and
counting
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]

/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them
become more productive and successful
• We are looking for amazing people interested in
Software Engineering and Data Science
– http://linkedin.com/careers
Questions?
