Taewook Eom
Data Infrastructure Team
SK planet
taewook@sk.com
2015-01-29
Scalding: Big Data Programming with Scala
Big Data Processing
Apache Storm
MapReduce, MR, Map-Reduce
https://twitter.com/brianabelson/status/506933310787186688
M = M+
MR = M+RM*
MRMR… = (M+RM*)+
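The notation above reads like a regular expression over stages: a map-only job is one or more map steps (M+), and a single MapReduce job is one or more maps, one reduce, then zero or more maps (M+RM*). As a rough illustration only, the sketch below simulates that shape on in-memory Scala collections rather than on Hadoop. The object and function names (`MapReduceSketch`, `tokenize`, `pairUp`, `reduceByKey`, `format`) are hypothetical and do not come from the slides.

```scala
// Illustration only: simulates the M+ R M* shape on in-memory collections.
object MapReduceSketch {
  // M+: two map steps that a planner could fuse into one map stage.
  val tokenize: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)
  val pairUp: String => (String, Int) = word => (word, 1)

  // R: group by key (the "shuffle") and sum the values for each key.
  def reduceByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }

  // M*: an optional map after the reduce, here formatting each result.
  val format: ((String, Int)) => String = { case (word, n) => s"$word\t$n" }

  def wordCount(lines: Seq[String]): Seq[String] = {
    val mapped  = lines.flatMap(tokenize).map(pairUp) // M+
    val reduced = reduceByKey(mapped)                 // R
    // Sorted by key only to make the output order deterministic.
    reduced.toSeq.sortBy(_._1).map(format)            // M*
  }
}
```

Chaining several such jobs, each feeding the next, gives the (M+RM*)+ form.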
Data Processing Pattern with MR
select
function
where (filter)
group by
join
order by
windowing function
analytics function
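To make the SQL-style patterns listed above concrete, here is a minimal sketch on plain Scala collections. It is not MapReduce or Scalding code; the `Order` and `User` types and all sample values are invented for illustration.

```scala
// Illustration only: SQL-like patterns expressed on in-memory collections.
object SqlLikeSketch {
  case class Order(userId: Int, item: String, price: Double)
  case class User(userId: Int, name: String)

  val orders = Seq(
    Order(1, "book", 12.0), Order(2, "pen", 1.5),
    Order(1, "lamp", 30.0), Order(3, "desk", 99.0))
  val users = Seq(User(1, "kim"), User(2, "lee"))

  // where (filter) + select
  val expensiveItems: Seq[String] =
    orders.filter(_.price > 10.0).map(_.item)

  // group by + aggregate function
  val totalByUser: Map[Int, Double] =
    orders.groupBy(_.userId).map { case (id, os) => id -> os.map(_.price).sum }

  // inner join on userId
  val joined: Seq[(String, String)] =
    for { u <- users; o <- orders if o.userId == u.userId } yield (u.name, o.item)

  // order by price, descending
  val byPriceDesc: Seq[String] = orders.sortBy(-_.price).map(_.item)

  // windowing-style running total over the rows in their given order
  val runningTotal: Seq[Double] = orders.map(_.price).scanLeft(0.0)(_ + _).tail
}
```

On a cluster, the filter and select steps map naturally onto the map side, while group by, join, and order by need the shuffle that precedes a reduce.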
Workflow management
http://docs.cascading.org/impatient/impatient6.html
Data Workflow = DAG (Directed Acyclic Graph)
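Treating a workflow as a DAG means each step can run as soon as all the steps it depends on have finished. The sketch below shows one simple way to derive such an order. The step names are hypothetical, and this is not how Cascading's planner is implemented; it only illustrates the idea.

```scala
// Illustration only: order the steps of a workflow DAG by their dependencies.
object DagSketch {
  // Each step maps to the set of steps it depends on (hypothetical names).
  val dependsOn: Map[String, Set[String]] = Map(
    "read"     -> Set.empty[String],
    "stopList" -> Set.empty[String],
    "tokenize" -> Set("read"),
    "count"    -> Set("tokenize"),
    "filter"   -> Set("tokenize", "stopList"),
    "write"    -> Set("count", "filter")
  )

  // Repeatedly schedule every step whose dependencies are already scheduled.
  @annotation.tailrec
  def schedule(remaining: Map[String, Set[String]],
               done: Vector[String] = Vector.empty): Vector[String] =
    if (remaining.isEmpty) done
    else {
      val ready = remaining.collect {
        case (step, deps) if deps.forall(d => done.contains(d)) => step
      }.toVector.sorted
      // If nothing is ready but steps remain, the graph has a cycle.
      require(ready.nonEmpty, "cycle detected: not a DAG")
      schedule(remaining -- ready, done ++ ready)
    }
}
```

Steps that become ready in the same round (for example `read` and `stopList`) have no dependency on each other, so a workflow engine is free to run them in parallel.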
Cascading http://www.cascading.org/
http://docs.cascading.org/impatient/impatient6.html
• Pipe abstraction = Plumbing
• Operators like SQL
• DAG based workflow management
http://sommerdyke.com/wp-content/uploads/2014/11/plumbing3.jpg/ http://grammarchicblog.files.wordpress.com/2013/08/plumber-manchester.jpg
http://www.stuartplumbing.com.au/wp-content/uploads/2014/02/pipes_253.jpg http://acpfl.co/wp-content/uploads/2014/11/Plumbers.png
http://www.slideshare.net/taewook/programming-cascading
Object-Oriented vs. Functional
http://scott.sauyet.com/Javascript/Talk/2014/01/FuncProgTalk/#slide-10 http://scott.sauyet.com/Javascript/Talk/2014/01/FuncProgTalk/#slide-13
OOP focuses on the differences in the data
Data and the operations upon it are tightly coupled
The central model for abstraction is the data itself
FP concentrates on consistent data structures
Data is only loosely coupled to functions
The central model for abstraction is the function, not the data structure
FP describes what should be done, not how to do it
OOP uses mostly imperative techniques
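The contrast between "how" and "what" can be seen in a small example. The sketch below computes the same result twice, once imperatively and once declaratively; the function names are made up for this illustration.

```scala
// Illustration only: the same computation in two styles.
object StyleSketch {
  // Imperative: spell out how, with a mutable accumulator and a loop.
  def sumOfEvenSquaresImperative(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) {
      if (x % 2 == 0) total += x * x
    }
    total
  }

  // Declarative: state what is wanted by combining generic functions.
  def sumOfEvenSquaresFunctional(xs: Seq[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).sum
}
```

The functional version relies only on `filter`, `map`, and `sum`, which work the same way on any sequence; that consistency is what the slide means by data being loosely coupled to functions.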
Data Processing, Functional Programming
SQL
http://scott.sauyet.com/Javascript/Talk/2014/01/FuncProgTalk/#slide-6
uses a consistent data structure (table: rows x cols)
uses functions that can be combined
is declarative, not imperative
Data is immutable and transformable by composable functions
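As a small sketch of this idea, the code below builds one text-cleaning function out of three smaller ones and applies it to an immutable list. All names and sample strings are assumptions chosen for illustration.

```scala
// Illustration only: immutable data transformed by composed functions.
object ComposeSketch {
  val trim: String => String = _.trim
  val lower: String => String = _.toLowerCase
  // Keeps lowercase letters, digits, and spaces; drops everything else.
  val stripPunct: String => String = _.replaceAll("[^a-z0-9 ]", "")

  // Compose the small functions into a single transformation.
  val scrub: String => String = trim andThen lower andThen stripPunct

  val raw: List[String] = List("  Hello, World! ", "Scala & Scalding")
  // map returns a new list; `raw` itself is never modified.
  val clean: List[String] = raw.map(scrub)
}
```

Because `raw` is never changed, the same input can be fed through other pipelines or reprocessed after a failure, which is a large part of why this style suits batch data processing.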
Scalable Language: Big Data
Seamless Java Interop: Hadoop runs on the JVM
Functional: Data Processing
REPL (Read-Eval-Print Loop): Interactive data analysis
Why Scala?
http://www.scala-lang.org/
Scala DSL for Cascading
Simple and concise syntax
maintained by Twitter
Scalding https://github.com/twitter/scalding
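To give a feel for the syntax, here is a word-count job in the style of the well-known example from the Scalding project, written against the Fields-based API. It is a sketch rather than code from this talk: the class name `WordCountJob` and the argument names `input` and `output` are assumptions, and it needs the `scalding-core` library on the classpath to compile.

```scala
import com.twitter.scalding._

// Sketch of a Fields-based Scalding job (hypothetical class and argument names).
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                       // read lines into the 'line field
    .flatMap('line -> 'word) { line: String =>  // split each line into words
      line.toLowerCase.split("\\s+")
    }
    .groupBy('word) { _.size }                  // count occurrences per word
    .write(Tsv(args("output")))                 // write (word, count) as TSV
}
```

The whole pipeline is a single chained expression, which is where the "simple and concise" claim comes from when compared with the equivalent Cascading code in Java.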
https://github.com/Cascading/Impatient/blob/master/part4/src/main/java/impatient/Main.java
https://github.com/sujitpal/hia-examples/blob/master/scala/scalding-impatient/src/main/scala/com/mycompany/impatient/Part4.scala
https://github.com/Cascading/Impatient/blob/master/part4/src/main/java/impatient/ScrubFunction.java
UDF(User-defined Function)
“If you need to write UDF’s all the time,
something is wrong with you.”
- Various authors of non-scalding frameworks
who happened to be completely WRONG
http://www.slideshare.net/danmckinley/scalding-at-etsy The Triumph of Scalding at Etsy (69/87)
At Etsy, it’s not just engineers who write and deploy code
– our designers and product managers regularly do too.
https://codeascraft.com/2014/12/22/engineering-rotation/ We Invite Everyone at Etsy to Do an Engineering Rotation: Here’s why
http://strataconf.com/strataeu2014/public/schedule/detail/37250 Data Consumers are better Data Producers
Etsy’s Data-Driven Culture
SBT Build script
- build.sbt, project/plugins.sbt
- libraryDependencies
- Main-Class in META-INF/MANIFEST.MF
Splitting project and deps JARs
Run command and arguments
https://github.com/taewookeom/scalding-example
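A minimal `build.sbt` along these lines might look like the sketch below. It is not copied from the repository above: the project name and every version number are assumptions, and the step that splits project and dependency JARs is not shown because the slides do not say which plugin handles it.

```scala
// build.sbt (sketch; the name and all version numbers are assumptions)
name := "scalding-example"

scalaVersion := "2.10.4"

// Cascading artifacts have historically been served from the Conjars repository.
resolvers += "Conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core" % "0.12.0",
  // Hadoop is supplied by the cluster at run time, so it is marked "provided".
  "org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided"
)

// Writes Main-Class into META-INF/MANIFEST.MF of the packaged JAR.
mainClass in (Compile, packageBin) := Some("com.twitter.scalding.Tool")
```

With `com.twitter.scalding.Tool` as the main class, a run command typically has the shape `hadoop jar <jar> <JobClass> --hdfs --input <path> --output <path>`, where the angle-bracket parts are placeholders. How the dependency JAR is shipped to the cluster depends on the setup and is left out here.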
Apache Spark™ is a fast and general engine for large-scale data processing.
https://spark.apache.org/
https://twitter.com/PGopalan/status/522747857288183808 https://twitter.com/drelu/status/523169685815042049
Next Try
Questions?
Questions.foreach( answer(_) )
http://www.slideshare.net/deview/a4de-view2012-scalamichinisougu Scala, an Encounter with the Unknown
http://www.slideshare.net/kthcorp/scala-15041890 Scala over Flowers http://goo.gl/O382Fh
https://twitter.github.io/scala_school/ko/index.html Scala School!
http://refcardz.dzone.com/refcardz/scala Refcardz: Getting Started with Scala
http://wrobstory.gitbooks.io/python-to-scala/ Python To Scala
http://mbonaci.github.io/scala/ Java developer's Scala cheatsheet
Learning Scala
Learning Scalding
http://docs.cascading.org/tutorials/scalding-data-processing/
https://github.com/twitter/scalding/wiki/Getting-Started
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
https://github.com/twitter/scalding/tree/master/tutorial
https://github.com/scalding-io/ProgrammingWithScalding
http://sujitpal.blogspot.kr/2012/08/scalding-for-impatient.html
https://github.com/snowplow/scalding-example-project
https://twitter.com/mfeathers/status/29581296216
https://twitter.com/taewooke/status/554776724290813953
