@Scalding
https://github.com/twitter/scalding


Oscar Boykin
Twitter
@posco

#hadoopsummit: I encourage live tweeting (mention @posco/@scalding)
• What is Scalding?
• Why Scala for Map/Reduce?
• How is it used at Twitter?
• What’s next for Scalding?
Yep, we’re counting words: Scalding jobs subclass Job.
Yep, we’re counting words: Logic is in the constructor.
Yep, we’re counting words: Functions can be called or defined inline.
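The job code shown on these slides isn't in this transcript, so here is a local sketch of the same word-count logic, assuming a plain Scala List in place of the Pipe (a real job subclasses Job and wires a TextLine source through flatMap, groupBy, and write):

```scala
// Word count on a local List; each step mirrors the Pipe operation
// named in the comment.
val lines = List("hadoop counts words", "scalding counts words")

val wordCounts: Map[String, Int] =
  lines
    .flatMap(_.split("""\s+"""))            // 'line -> 'word
    .groupBy(identity)                      // groupBy('word)
    .map { case (w, ws) => (w, ws.size) }   // { _.size }
```

Because the Pipe API mirrors `scala.collection`, this local version reads almost identically to the distributed one.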
Scalding Model

• Source objects read and write data (from HDFS, DBs, MemCache, etc.)
• Pipes represent the flow of data in the job. You can think of a Pipe as a distributed list.
Yep, we’re counting words: Read and write data through Source objects.
Yep, we’re counting words: Data is modeled as streams of named Tuples (of objects).
Why Scala

• The Scala language has a lot of built-in features that make domain-specific languages easy to implement.
• Map/Reduce is already within the functional paradigm.
• Scala’s collection API covers almost all usual use cases.
Word Co-occurrence
Word Co-occurrence: We can use standard Scala containers.
Word Co-occurrence: We can do real logic in the mapper without external UDFs.
Word Co-occurrence: Generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T]).
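A local sketch of the generalized “plus” idea, assuming Map values: plus merges keys and sums counts, which is exactly what a Monoid[Map[K, Int]] provides in scalding. The data here is invented for illustration.

```scala
// Monoid-style plus for count maps: merge keys, sum values.
def plus[K](a: Map[K, Int], b: Map[K, Int]): Map[K, Int] =
  b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

val lines = List("big data", "big data")

// One co-occurrence map per line, then monoid-sum them all.
val cooccur: Map[(String, String), Int] =
  lines.map { line =>
    val ws = line.split("""\s+""").toList
    (for (a <- ws; b <- ws if a != b) yield ((a, b), 1)).toMap
  }.reduce { (a, b) => plus(a, b) }
```

The same reduce works unchanged for sets or lists once plus is defined for them, which is the point of abstracting over Monoid[T].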
GroupBuilder: enabling parallel reductions

• groupBy takes a function that mutates a GroupBuilder.
• GroupBuilder adds fields which are reductions of (potentially different) inputs.
• On the left, we add 7 fields.
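The slide's code isn't in this transcript, so here is a local sketch of what "several reduction fields from one groupBy" means: a single pass per group producing multiple aggregates at once. The data and field choices are invented.

```scala
// One grouping, three reduction "fields" (count, sum, max) computed together.
val scores = List(("a", 1.0), ("a", 3.0), ("b", 2.0))

val stats: Map[String, (Int, Double, Double)] =
  scores.groupBy(_._1).map { case (key, rows) =>
    val vs = rows.map(_._2)
    key -> (vs.size, vs.sum, vs.max)
  }
```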
scald.rb

• driver script that compiles the job and runs it locally, or transfers it and runs it remotely.
• we plan to add EMR support.
Most functions in the API have very close analogs in scala.collection.Iterable.
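The analogy, made concrete on a local Iterable: each Pipe operation behaves like the same-named collection method, so intuition transfers directly.

```scala
val xs: Iterable[Int] = List(1, 2, 3, 4)

val doubled = xs.map(_ * 2)          // like pipe.map
val evens   = xs.filter(_ % 2 == 0)  // like pipe.filter
val grouped = xs.groupBy(_ % 2)      // like pipe.groupBy
```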
Cascading

• is the Java library that handles most of the map/reduce planning for scalding.
• has years of production use.
• is used, tested, and optimized by many teams (Concurrent Inc.; DSLs in Scala, Clojure, and Python at Twitter; Ruby at Etsy).
• has a (very fast) local mode that runs without Hadoop.
• has a flow planner designed to be portable (Cascading on Spark? Storm?)
mapReduceMap
• We abstract Cascading’s map-side
  aggregation ability with a function called
  mapReduceMap.
• If only mapReduceMaps are called, map-side
  aggregation works. If a foldLeft is called
  (which cannot be done map-side), scalding
  falls back to pushing everything to the
  reducers.
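A local sketch of the mapReduceMap shape just described: map each value into a partial aggregate, reduce partials with an associative function (which is why aggregation can begin map-side), then map the final aggregate to the answer. The signature here is an assumption modeled on the description, not scalding's API verbatim; the example computes a mean via (sum, count) partials.

```scala
// map -> associative reduce -> final map, the pattern behind map-side aggregation.
def mapReduceMap[T, X, U](xs: Seq[T])(m: T => X)(r: (X, X) => X)(m2: X => U): U =
  m2(xs.map(m).reduce(r))

val mean: Double =
  mapReduceMap(Seq(1.0, 2.0, 6.0))(v => (v, 1)) {
    case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2)   // associative: safe map-side
  } { case (s, c) => s / c }
```

A foldLeft, by contrast, threads an accumulator through values in order, so it cannot be split across mappers; that is why calling it forces everything to the reducers.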
Most Reductions are mapReduceMap.
Optimized Joins

• The map-side join is called joinWithTiny. It implements a left or inner join with a very small pipe.
• blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama).
• coming: combine the above to dynamically set replication on a per-key basis: only Gaga is replicated, and just the right amount.
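A local sketch of what joinWithTiny does, assuming invented data: the tiny side fits in memory, so each mapper can hold it as a hash map and join without any shuffle.

```scala
// The "very small pipe" is held in memory (broadcast to every mapper).
val large = List((1, "tweetA"), (2, "tweetB"), (3, "tweetC"))
val tiny  = Map(1 -> "en", 2 -> "es")

// Inner join: keep only ids present on the tiny side.
val inner = large.collect { case (id, t) if tiny.contains(id) => (id, t, tiny(id)) }

// Left join: keep everything, None where the tiny side has no match.
val left = large.map { case (id, t) => (id, t, tiny.get(id)) }
```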
Scalding @Twitter

• The revenue quality team (ads targeting, market insight, click prediction, traffic quality) uses scalding for all our work.
• Scala engineers throughout the company use it (e.g. storage, platform).
• More than 60 in-production scalding jobs, more than 200 ad-hoc jobs.
• Not our only tool: Pig, PyCascading, Cascalog, and Hive are also used.
Example: finding similarity

• A simple recommendation algorithm is cosine similarity.
• Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question.
• We’ve developed a Matrix library on top of scalding to make this easy.
Cosine Similarity: Matrices are strongly typed.
Cosine Similarity: The Col and Row types ((Int, Int) here) can be anything comparable. Strings are useful for text indices.
Cosine Similarity: The Value type (Double here) can be anything with a Ring[T] (plus/times).
Cosine Similarity: Operator overloading gives intuitive code.
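The Matrix code isn't in this transcript, so here is the same cosine similarity on local sparse row vectors (Map[Int, Double]); the helper names are invented. At scale, the Matrix library computes this with distributed products after normalizing rows to unit length.

```scala
// Sparse dot product: only keys present in `a` can contribute.
def dot(a: Map[Int, Double], b: Map[Int, Double]): Double =
  a.foldLeft(0.0) { case (acc, (k, v)) => acc + v * b.getOrElse(k, 0.0) }

def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double =
  dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

val u1 = Map(0 -> 1.0, 1 -> 1.0)   // a user's interaction vector
val u2 = Map(0 -> 2.0, 1 -> 2.0)   // same direction, larger magnitude
val u3 = Map(2 -> 5.0)             // disjoint interactions: orthogonal
```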
Matrix in foreground, map/reduce behind: With this syntax, we can focus on logic, not on how to map linear algebra to Hadoop.
Example uses:

• Do random walks on the follower graph. Matrix power iteration until convergence: (m * m * m * m).
• Dimensionality reduction of the follower graph (Matrix product by a lower-dimensional projection matrix).
• Triangle counting: (M*M*M).trace / 3
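The trace identity can be checked locally on a tiny dense adjacency matrix. A sketch, assuming a directed graph where a triangle is one directed 3-cycle (as in a follower graph), which is when trace(M³)/3 counts each triangle exactly once:

```scala
type Mat = Vector[Vector[Int]]

// Naive dense matrix product, standing in for the distributed Matrix product.
def mult(a: Mat, b: Mat): Mat =
  a.map { row =>
    b.head.indices.map { j =>
      row.indices.map(k => row(k) * b(k)(j)).sum
    }.toVector
  }

def trace(a: Mat): Int = a.indices.map(i => a(i)(i)).sum

// One directed 3-cycle: 0 follows 1, 1 follows 2, 2 follows 0.
val m: Mat = Vector(
  Vector(0, 1, 0),
  Vector(0, 0, 1),
  Vector(1, 0, 0)
)

val triangles = trace(mult(mult(m, m), m)) / 3
```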
What is next?

• Improve the logical flow planning (reorder commuting filters/projections before maps, etc.).
• Improve Matrix flow planning to narrow the gap to hand-optimized code.
One more thing:

• Type-safety geeks can relax: we just pushed a type-safe API, analogous to Scoobi/Scrunch/Spark, to scalding 0.6.0.
That’s it.


• follow and mention: @scalding @posco
• pull reqs: http://github.com/twitter/scalding

Scalding: Twitter's Scala DSL for Hadoop/Cascading
