{
    "_id" : ObjectId("4dcd3ebc9278000000005158"),
    "timestamp" : ISODate("2011-05-13T14:22:46.777Z"),
    "binary" : BinData(0,""),
    "string" : "abc",
    "number" : 3,
    "subobj" : {"subA": 1, "subB": 2 },
    "array" : [1, 2, 3],
    "dbref" : [_id1, _id2, _id3]
    // padding: free space reserved after the document so it can grow in place
}
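For readers working from Python, here is a minimal PyMongo sketch (assuming a local mongod and the pymongo/bson packages) that builds an equivalent document: ObjectId, ISODate and BinData map to bson.ObjectId, datetime and bson.Binary respectively. The dbref ids are stand-ins for the elided _id1.._id3.

from datetime import datetime
from bson import Binary, ObjectId
from pymongo import MongoClient

db = MongoClient()["test"]   # connection and database name are illustrative

doc = {
    "_id": ObjectId("4dcd3ebc9278000000005158"),
    "timestamp": datetime(2011, 5, 13, 14, 22, 46, 777000),  # stored as ISODate
    "binary": Binary(b""),                                   # stored as BinData(0, "")
    "string": "abc",
    "number": 3,
    "subobj": {"subA": 1, "subB": 2},
    "array": [1, 2, 3],
    "dbref": [ObjectId(), ObjectId(), ObjectId()],  # stand-ins for _id1, _id2, _id3
}
db.coll.insert_one(doc)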
// Queries matching the document above:
db.coll.find({"string": "abc"});
db.coll.find({"string": /^a.*$/i});
db.coll.find({"subobj.subA": 1});
db.coll.find({"subobj.subB": {$exists: true}});
db.coll.find({"number": 3});
db.coll.find({"number": {$gt: 1}});
db.coll.find({"array": {$all: [1, 2]}});
db.coll.find({"array": {$in: [2, 4, 6]}});
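The same queries issued from Python with PyMongo, as a sketch reusing db from the snippet above; a compiled re pattern stands in for the shell's /^a.*$/i literal.

import re

db.coll.find({"string": "abc"})
db.coll.find({"string": re.compile("^a.*$", re.I)})   # shell regex /^a.*$/i
db.coll.find({"subobj.subA": 1})                      # dotted path reaches into subdocuments
db.coll.find({"subobj.subB": {"$exists": True}})
db.coll.find({"number": 3})
db.coll.find({"number": {"$gt": 1}})
db.coll.find({"array": {"$all": [1, 2]}})             # array must contain both 1 and 2
db.coll.find({"array": {"$in": [2, 4, 6]}})           # array shares at least one value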
// Update modifiers applied in place:
{ $set : {"string": "def"} }
{ $inc : {"number": 1} }
{ $pull : {"subobj": {"subB": 2 } } }
{ $addToSet : { "array" : { $each : [ 4, 5, 6 ] } } }
{ $set : {"newkey": "In-place"} }

// Resulting document:
{
    "_id" : ObjectId("4dcd3ebc9278000000005158"),
    "timestamp" : ISODate("2011-05-13T14:22:46.777Z"),
    "binary" : BinData(0,""),
    "string" : "def",
    "number" : 4,
    "subobj" : {"subA": 1, "subB": 2 },
    "array" : [1, 2, 3, 4, 5, 6],
    "dbref" : [_id1, _id2, _id3],
    "newkey" : "In-place"
}
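And the update operators from Python, again a sketch reusing db and doc from above; the filter is illustrative. Note that $pull operates on arrays, so the slide's $pull against the "subobj" subdocument is omitted here.

filt = {"_id": doc["_id"]}   # illustrative filter: target the sample document

db.coll.update_one(filt, {"$set": {"string": "def"}})
db.coll.update_one(filt, {"$inc": {"number": 1}})
db.coll.update_one(filt, {"$addToSet": {"array": {"$each": [4, 5, 6]}}})
db.coll.update_one(filt, {"$set": {"newkey": "In-place"}})  # grows into the padding, in place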
# Dumbo word count: the mapper emits (word, 1) for each word,
# the reducer sums the counts for each word.
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)



dumbo start wordcount.py \
       -hadoop /path/to/hadoop \
       -input wc_input.txt \
       -output wc_output
db.collection.insert(
  { hour: 0,
    userId: "1234",
    actionType: "login" }
);
// map: emit (tag, {count: 1}) for every tag of the current document
m = function() {
     this.tags.forEach(function(z) {
          emit(z, {count: 1});
     });
};
// reduce: sum the partial counts for one tag
r = function(key, values) {
     var total = 0;
     for (var i = 0; i < values.length; i++)
          total += values[i].count;
     return { count : total };
};
res = db.things.mapReduce(m, r, { out: "tag_counts" });  // output collection name is illustrative
// mapReduce also accepts an optional finalize function to post-process each reduced value
Pydoop: a Python MapReduce and HDFS API for Hadoop (Leo, Zanetti)

Third-Party Solutions: Summary of Features

- Hadoop-based: same limitations as Streaming (Dumbo) and Jython (Happy), except for ease of use
- Other implementations: good if you have your own cluster
- Hadoop is the most widespread implementation

                 Streaming   Jython    Pydoop
  C/C++ Ext      Yes         No        Yes
  Standard Lib   Full        Partial   Full
  MR API         No*         Full      Partial
  Java-like FW   No          Yes       Yes
  HDFS           No          Yes       Yes

(*) you can only write the map and reduce parts as executable scripts.
Hadoop Pipes

- Communication with the Java framework happens via persistent sockets
- The C++ app provides a factory used by the framework to create MR components
- Providing Mapper and Reducer is mandatory
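To make the factory model concrete, here is a word-count sketch in the style of Pydoop's pipes module; class and method names follow the early pydoop.pipes API and may differ between Pydoop versions.

import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):
    def map(self, context):
        # one input record per call; emit (word, "1") pairs
        for word in context.getInputValue().split():
            context.emit(word, "1")

class WordCountReducer(pp.Reducer):
    def reduce(self, context):
        # sum all partial counts for the current key
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        context.emit(context.getInputKey(), str(total))

if __name__ == "__main__":
    # the factory hands Mapper/Reducer instances to the C++ framework
    pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))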
Integration of Pydoop with C++

- Integration with Pipes: method calls flow from the framework through the C++ and the Pydoop API, ultimately reaching user-defined methods; results are wrapped by Boost and returned to the framework
- Integration with HDFS: function calls are initiated by Pydoop; results are wrapped and returned as Python objects to the app
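On the HDFS side, a small sketch of app-initiated calls through pydoop.hdfs; the paths are illustrative and exact signatures may vary by Pydoop version.

import pydoop.hdfs as hdfs

# app-initiated call: list a job output directory
for path in hdfs.ls("/user/wc_output"):
    print(path)

# results come back to the app as ordinary Python objects (str/bytes)
f = hdfs.open("/user/wc_output/part-00000")
print(f.read())
f.close()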
gawk '
 BEGIN{ reducenum='$REDUCE_NUM'; }
 { userid=$7; key=$8; }
 key ~ /a{GetLoginBonus}/ { incrby(userid,key,$9,a); next; }
 key ~ /a{SideJob}/       { incrby(userid,key,$11,a); next; }
 key ~ /a{CleanMyShop}/   { hincr(userid,key,$9,a); next; }
 key ~ /(GetAvatarPart|ChangeP|ChangeWakuwakuP|ChangeKonergy)/
                          { incrbydiff(userid,key,$9,a); next; }
 ...' $IN

# for reducer1 (such as "userid % reducenum == 0")
# command userid key value
MULTI
HINCRBY 1111 a{ChangeGreed} 3
HINCRBY 1111 a{GianEvent} 7
HINCRBY 1111 a{TeamChallenge} 5
HINCRBY 2222 a{Battle} 3
HINCRBY 2222 a{ChangeMoney} 3
...
EXEC
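For reference, the same transaction could be issued from Python with redis-py, whose pipeline(transaction=True) wraps the queued commands in MULTI ... EXEC; the connection details are assumptions.

import redis

r = redis.Redis(host="localhost", port=6379)
pipe = r.pipeline(transaction=True)   # queued commands run inside MULTI ... EXEC

pipe.hincrby("1111", "a{ChangeGreed}", 3)
pipe.hincrby("1111", "a{GianEvent}", 7)
pipe.hincrby("1111", "a{TeamChallenge}", 5)
pipe.hincrby("2222", "a{Battle}", 3)
pipe.hincrby("2222", "a{ChangeMoney}", 3)
pipe.execute()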