Social Network Mining
    Solutions using Google App Engine Map Reduce

     J Singh, DataThinks.org
     October 19, 2011

MapReduce: A Genealogical Perspective
• Roots
   – Lisp, Scheme
   – APL


• Google MapReduce paper (OSDI), 2004
   – Exploit extreme parallelism of data


• Apache Top Level Project (Hadoop)

• GAE MapReduce borrows from these




© J Singh, 2011
Social Network Mining
• Finding people based on data in social networks
   –   Love and Romance
   –   Common interests
   –   Similar buying habits
   –   Similar voting propensities
   –   Location


• It's not a new problem
   – We have additional solutions for the old problem
        • Examples based on proprietary data: eHarmony, etc.
        • Early examples based on social network data: ShoutFlow,
          WhoIsJustLikeMe.



Based on clustering algorithms
• On-line demo of clustering

• Resource intensive.
   – Best done in batch mode

• Exploit data parallelism of the algorithm
   – App Engine Map Reduce, employing one map job for each cluster
   – App Engine Pipeline API, employing one stage of the pipeline for each 'step'

• But first, a detour into Map Reduce…
MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp / Scheme
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N, where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds or thousands of low-end servers run at the same time,
     so individual failures are routine and work must be retried
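
The Lisp and APL one-liners above have a direct Python counterpart. This is a minimal sketch using only the standard library; the helper name `square` and the variable names are chosen for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Per-element function; each call is independent of the others."""
    return x * x


numbers = [1, 2, 3, 4]

# map: apply square to every element (the part that is easy to distribute)
squares = list(map(square, numbers))

# reduce: fold the mapped values into a single result
sum_of_squares = reduce(add, squares)

# The APL expression +/ N is a plain sum-reduction over N
plain_sum = reduce(add, numbers)

print(squares, sum_of_squares, plain_sum)
```

Because each call to `square` touches only its own element, the map step can be split across machines without coordination; only the reduce step needs to bring values together.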



MapReduce Flow

   [Figure: MapReduce flow diagram]

MapReduce Components in GAE 2011
• Input Reader
   – Several provided by GAE, can write your own

• Map function: Written by Programmer

• Shuffle function:
   – Provided by GAE, can write your own

• Reduce function: Written by Programmer

• Output Writer
   – Several provided by GAE, can write your own
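
To make the five components concrete, here is a single-process sketch in plain Python. It does not use the GAE library; every function name and the sample data are assumptions made for illustration, and a real job would run the map and reduce calls on many shards.

```python
from collections import defaultdict


def input_reader(records):
    """Input reader: hands records to the mapper one at a time."""
    for record in records:
        yield record


def map_fn(record):
    """Map (written by the programmer): emit (interest, user) pairs."""
    user, interests = record
    for interest in interests:
        yield interest, user


def shuffle(pairs):
    """Shuffle: group all mapped values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce (written by the programmer): one output per key."""
    yield key, sorted(values)


def output_writer(results):
    """Output writer: here it simply collects results into a dict."""
    return dict(results)


def run_mapreduce(records):
    mapped = (pair for rec in input_reader(records) for pair in map_fn(rec))
    reduced = (out for key, values in shuffle(mapped)
               for out in reduce_fn(key, values))
    return output_writer(reduced)


# Hypothetical sample data: (user, list of interests)
people = [("ann", ["jazz", "chess"]), ("bob", ["chess"]), ("cy", ["jazz"])]
by_interest = run_mapreduce(people)
print(by_interest)
```

The programmer supplies only `map_fn` and `reduce_fn`; the other three pieces are the ones a framework can provide.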




Invoking GAE Map Reduce
class MapreducePipeline(…):
    def run(self,
            job_name,             # A string
            mapper_spec,          # Name of the mapper function
            reducer_spec,         # Name of the reducer function
            input_reader_spec,    # Name of the input reader
            output_writer_spec,   # Name of the output writer
            mapper_params,        # A dictionary
            reducer_params,       # A dictionary
            shards):              # An int
        …
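
As a rough illustration of how those arguments might be filled in, the sketch below only assembles them into a dictionary so it runs without App Engine. The job name, the `main.*` handler paths and the `main.Member` entity kind are hypothetical; the reader and writer class paths are believed to match the 2011 library but should be checked against the installed version.

```python
def build_job_args(entity_kind, shards=8):
    """Assemble arguments in the order listed in MapreducePipeline.run.

    All handler names below are placeholders, not part of any real app.
    """
    return {
        "job_name": "cluster_members",
        "mapper_spec": "main.assign_to_centroid",      # hypothetical mapper
        "reducer_spec": "main.recompute_centroid",     # hypothetical reducer
        "input_reader_spec":
            "mapreduce.input_readers.DatastoreInputReader",
        "output_writer_spec":
            "mapreduce.output_writers.BlobstoreOutputWriter",
        "mapper_params": {"entity_kind": entity_kind},
        "reducer_params": {"mime_type": "text/plain"},
        "shards": shards,
    }


job_args = build_job_args("main.Member")

# On App Engine, inside another pipeline's run() method, the job would
# then be started roughly like this (assumption, not executed here):
#   yield mapreduce_pipeline.MapreducePipeline(**job_args)
print(job_args["job_name"], job_args["shards"])
```

Note that the mapper, reducer, reader and writer are passed as dotted-name strings rather than as function objects, so the framework can import them on whichever worker picks up a shard.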


GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
   – Perform complex operations by piping data between primitives
   – But the primitives are not so primitive
   – Data lives in permanent storage between pipeline stages


• MapreducePipeline (prev page) was just one type of pipeline




Pipeline API Example Code
Split and Merge example


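  import pipeline               # import path depends on how the library is vendored
  from pipeline import common

  class aPipe(pipeline.Pipeline):
      def run(self, e_kind, prop_name, *value_list):
          all_bs = []
          # Split: start one bPipe child stage (defined elsewhere) per value
          for v in value_list:
              stage = yield bPipe(e_kind, prop_name, v)
              all_bs.append(stage)   # a future for the child's output
          # Merge: combine the children's outputs into one list
          yield common.Append(*all_bs)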
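
The generator mechanics behind this split-and-merge pattern can be tried without App Engine. The toy driver below is an assumption-laden simplification: the real Pipeline API hands back futures and persists data between stages, whereas this sketch runs each child immediately and sends its result straight back. All names (`run_stage`, `count_matching`, `append`, `a_stage`) are invented for illustration.

```python
def run_stage(stage_fn, *args):
    """Drive a generator stage: run each yielded child, send back its result.

    The result of the last yielded child is the stage's output.
    """
    gen = stage_fn(*args)
    result = None
    try:
        child = next(gen)
        while True:
            func, child_args = child
            result = func(*child_args)
            child = gen.send(result)
    except StopIteration:
        return result


def count_matching(records, prop_name, value):
    """Child stage: count records whose property equals value."""
    return [sum(1 for r in records if r.get(prop_name) == value)]


def append(*lists):
    """Merge stage: concatenate the children's outputs."""
    merged = []
    for part in lists:
        merged.extend(part)
    return merged


def a_stage(records, prop_name, *value_list):
    parts = []
    for v in value_list:                      # split: one child per value
        part = yield (count_matching, (records, prop_name, v))
        parts.append(part)
    yield (append, tuple(parts))              # merge


# Hypothetical sample records
records = [{"city": "Boston"}, {"city": "NYC"}, {"city": "Boston"}]
counts = run_stage(a_stage, records, "city", "Boston", "NYC", "LA")
print(counts)
```

The stage body reads like straight-line code, yet every `yield` is a point where a framework can schedule the child elsewhere and resume later.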




Pause and Assess
• Assertion:
   – GAE Map/Reduce is a complete solution for social network mining
   – We know it will scale; the question is how far.


• Working on one Proof of Concept for Social Network Mining
   – Recruiting a second test case


• Will report back in 3-4 months with data on
   – Performance
   – Cost
   – Limits of scalability


Adapting the algorithm to M/R
• Clustering Algorithm

   1. Create k randomly placed centroids

   2. Find the centroid (1 to k) closest to each data point
        • Map: each data point

   3. Move each centroid to the average of its members
        • Reduce: each centroid

   4. Repeat 2 and 3 until there is no more change
        • Connect to next stage using Pipeline API
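
The four steps above can be sketched in plain Python (Python 3 assumed), with step 2 written as a map function and step 3 as a reduce function. This is a single-process illustration, not the GAE job itself: the function names are invented, the initial centroids are passed in explicitly rather than placed at random so the run is repeatable, and a centroid that attracts no points simply keeps its old position.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))


def map_point(point, centroids):
    """Step 2 (map): emit (centroid index, point) for one data point."""
    yield closest(point, centroids), point


def reduce_centroid(index, members):
    """Step 3 (reduce): move one centroid to the average of its members."""
    n = len(members)
    yield index, tuple(sum(axis) / n for axis in zip(*members))


def kmeans(points, centroids, max_iter=100):
    centroids = [tuple(c) for c in centroids]       # step 1 done by caller
    for _ in range(max_iter):                       # step 4: repeat 2 and 3
        groups = defaultdict(list)                  # stands in for shuffle
        for point in points:
            for index, member in map_point(point, centroids):
                groups[index].append(member)
        updated = list(centroids)
        for index, members in groups.items():
            for i, centroid in reduce_centroid(index, members):
                updated[i] = centroid
        if updated == centroids:                    # no more change
            break
        centroids = updated
    return centroids


# Hypothetical data: two well-separated groups of 2-D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(final)
```

Each pass of the loop corresponds to one map/reduce job; chaining passes until the centroids stop moving is the part the slide assigns to the Pipeline API.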

About Us
• Involved with Map/Reduce and NoSQL technologies on several
  platforms
   – Google App Engine, MongoDB


• DataThinks.org is a new service of Early Stage IT
   – Building and operating “Big Data” analytics services




                           Thanks
