MongoDB Hadoop Connector
Luke Lovett
Maintainer, mongo-hadoop
https://github.com/mongodb/mongo-hadoop
Overview
• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up
Hadoop Overview
• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes
Use cases: Churn Analysis, Recommendation, Warehouse/ETL, Risk Modeling, Trade Surveillance, Predictive Analysis, Ad Targeting, Sentiment Analysis
MongoDB + Hadoop
• MongoDB backs application
• Satisfy queries in real-time
• MongoDB + Hadoop = application data analytics
Connector Overview
• Brings operational data into the analytical lifecycle
• Supports an evolving Hadoop ecosystem
– Apache Spark has made a huge entrance
• Makes MongoDB interaction seamless and natural
Connector Examples
MongoInputFormat MongoOutputFormat
BSONFileInputFormat BSONFileOutputFormat
Pig
data =
    LOAD 'mongodb://myhost/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;
Connector Examples
MongoInputFormat MongoOutputFormat
BSONFileInputFormat BSONFileOutputFormat
Hive
CREATE EXTERNAL TABLE mongo (
  title STRING,
  address STRUCT<from:STRING, to:STRING>)
STORED BY
  'com.mongodb.hadoop.hive.MongoStorageHandler'
TBLPROPERTIES('mongo.uri'='mongodb://myhost/db.collection');
Connector Examples
MongoInputFormat MongoOutputFormat
BSONFileInputFormat BSONFileOutputFormat
Spark (Python)
import pymongo_spark
pymongo_spark.activate()
rdd = sc.MongoRDD('mongodb://host/db.coll')
New Features
• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements
PySpark
• Python shell
• Submit jobs written in Python
• Problem: How do we provide a natural Python syntax
for accessing the connector inside the JVM?
• What we want:
– Support for PyMongo’s objects
– Have a natural API for working with MongoDB inside
Spark’s Python shell
PySpark
We need to understand:
• How do the JVM and Python work together in Spark?
• What does data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?
We need to take a look inside PySpark.
What’s Inside PySpark?
• Uses py4j to connect to JVM running Spark
• Communicates objects to/from JVM using Python’s
pickle protocol
• org.apache.spark.api.python.Converter converts
Writables to Java Objects and vice-versa
• Special PythonRDD type encapsulates JVM gateway
and necessary Converters, Picklers, and Constructors
for un-pickling
What’s Inside PySpark?
JVM Gateway
(diagram: the Python process communicates with the Spark JVM through a py4j gateway)
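As a rough illustration of that gateway (a sketch, not connector code): PySpark keeps a py4j gateway to the Spark JVM, reachable through the internal sc._gateway and sc._jvm attributes.

# Hedged sketch: poking at the py4j gateway PySpark already maintains.
# Run inside the PySpark shell, where `sc` exists; _gateway and _jvm are
# internal attributes, so this is for illustration only.
gateway = sc._gateway                    # py4j JavaGateway into the Spark JVM
jvm = sc._jvm                            # view of JVM classes through the gateway
print(jvm.java.lang.System.getProperty('java.version'))   # ordinary Java call
# Once the connector jar is on the classpath, its classes are reachable the
# same way, e.g. jvm.com.mongodb.hadoop.MongoInputFormat (assumption).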
What’s Inside PySpark?
PythonRDD
Python: keeps a reference to the SparkContext and the JVM gateway
Java: simply wraps a JavaRDD and performs some conversions
What’s Inside PySpark?
Pickler/Unpickler – What is a Pickle, anyway?
• a pickle is a Python object serialized into a byte stream that can be saved to a file
• the pickle protocol defines a set of opcodes that operate like a stack machine
• pickling turns a Python object into a stream of opcodes
• unpickling executes the opcodes, producing a Python object
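For reference, the disassembly on the next slide can be reproduced roughly like this (a sketch: the exact doc is inferred from the output shown, and the bson package comes from PyMongo):

import pickle
import pickletools
from bson.objectid import ObjectId   # provided by PyMongo

# A document like the one PyMongo returns; the ObjectId value is taken
# from the output on the next slide.
doc = {'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}

# Pickle the document, strip redundant opcodes, and disassemble.
pickletools.dis(pickletools.optimize(pickle.dumps(doc)))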
Example (pickle version 2)
>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
0: ( MARK
1: d DICT (MARK at 0)
2: S STRING '_id'
9: c GLOBAL 'copy_reg _reconstructor'
34: ( MARK
35: c GLOBAL 'bson.objectid ObjectId'
59: c GLOBAL '__builtin__ object'
79: N NONE
80: t TUPLE (MARK at 34)
81: R REDUCE
82: S STRING 'VK\xc7ln2\xab`\x8fS\x14\xea'
113: b BUILD
114: s SETITEM
115: S STRING 'hello'
124: S STRING 'world'
133: s SETITEM
134: . STOP
{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
What’s Inside PySpark?
Pickle protocol support is implemented by the Pyrolite library
Pyrolite - Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite
• Pyrolite library allows Spark to use Python’s Pickle protocol to
serialize/deserialize Python objects across the gateway.
• Hooks available for handling custom types in each direction
– registerCustomPickler – define how to turn a Java object
into a Python Pickle byte stream
– registerConstructor – define how to construct a Java object
for a given Python type
What’s Inside PySpark?
BSONPickler – translates Java -> PyMongo
PyMongo – MongoDB Python driver
https://github.com/mongodb/mongo-python-driver
Special handling for
- Binary
- BSONTimestamp
- Code
- DBRef
- ObjectId
- Regex
- Min/MaxKey
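A quick way to see this handling at work (a hedged sketch using the pymongo_spark helpers from the Spark example earlier; the URI is a placeholder):

# Hedged sketch: verify that BSON types survive the trip from the JVM into Python.
from bson.objectid import ObjectId
import pymongo_spark
pymongo_spark.activate()

rdd = sc.MongoRDD('mongodb://host/db.input')
doc = rdd.first()
# _id arrives as a real PyMongo ObjectId rather than a generic dict,
# thanks to the custom pickler/constructor registrations described above.
assert isinstance(doc['_id'], ObjectId)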
“PySpark” – Before Picture
>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.Text',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408, u'__class__':
u'org.bson.types.ObjectId', u'machine': 374500293, u'time': 1421872408000, u'date':
datetime.datetime(2015, 1, 21, 12, 33, 28), u'new': False, u'inc': -1652246148}, {u'Hello':
u'World'})
>>> # do some processing with the RDD
>>> processed_rdd = ...
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)
PySpark – After Picture
>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.MongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello':
u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB(
...     'mongodb://host/db.output')
MongoSplitter
• splitting: cutting up input data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• getting splitting right is very important for performance
• mongo-hadoop has several splitting improvements
MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
(diagram: connector, mongos, config servers, shard 0, shard 1)
MongoSplitter
Split per Shard Chunk
shards:
{ "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
{ "_id" : "shard02", "host" : "shard01/llp:27021,llp:27022,llp:27023" }
{ "_id" : "shard03", "host" : "shard01/llp:27024,llp:27025,llp:27026" }
databases:
{ "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
customer.emails
shard key: { "headers.From" : 1 }
chunks:
shard01 21
shard02 21
shard03 20
{ "headers.From" : { "$minKey": 1}} -->>
{ "headers.From" : "charlie@foo.com" } on : shard01 Timestamp(42, 1)
{ "headers.From" : "charlie@foo.com": 1} -->>
{ "headers.From" : "mildred@foo.com" } on : shard02 Timestamp(42, 1)
{ "headers.From" : "mildred@foo.com" } -->>
{ "headers.From" : { "$maxKey": 1 }} on : shard01 Timestamp(41, 1)
MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
(diagram: connector, mongos, config server, shard 0, shard 1)
MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
{"splitVector": "db.collection",
 "keyPattern": {"_id": 1},
 "maxChunkSize": 42}
(diagram: the _id_1 index divided into splits at _id values 0, 25, 50, 75, 100)
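The same command can be issued by hand with PyMongo to see the split points the connector would use (a hedged sketch; the connector runs the equivalent command internally, and host and namespace are placeholders):

from pymongo import MongoClient

client = MongoClient('mongodb://host:27017')
# maxChunkSize is in megabytes.
result = client.db.command('splitVector', 'db.collection',
                           keyPattern={'_id': 1}, maxChunkSize=42)
print(result['splitKeys'])   # e.g. [{'_id': 25}, {'_id': 50}, {'_id': 75}]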
MongoSplitter
Problem: empty/unbalanced splits
Query
{"createdOn":
  {"$lte": ISODate("2015-10-26T23:51:05.787Z")}}
• can use an index on "createdOn"
• splitVector can't split on a subset of the index
• some splits might be empty
MongoSplitter
Problem: empty/unbalanced splits
Query
{"createdOn":
  {"$lte": ISODate("2015-10-26T23:51:05.787Z")}}
Solutions
• Create a new collection with a subset of the data
• Create an index over the relevant documents only (see the sketch below)
• Learn to live with empty splits
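One possible reading of the second option is a partial index (a hedged sketch; requires MongoDB 3.2+, and the host, namespace, and cutoff date are placeholders matching the query above):

from datetime import datetime
from pymongo import MongoClient

coll = MongoClient('mongodb://host:27017')['db']['collection']
# Index only the documents the job will actually read.
coll.create_index(
    [('createdOn', 1)],
    partialFilterExpression={
        'createdOn': {'$lte': datetime(2015, 10, 26, 23, 51, 5)}})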
MongoSplitter
Alternatives
Filtering out empty splits:
mongo.input.split.filter_empty=true
• create a cursor for each split and check whether it is empty
• empty splits are dropped from the final list
• saves the resources of a task processing an empty split
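In PySpark this option rides along in the Hadoop configuration dict (a hedged sketch following the "before" picture; the URI and the extended-JSON query format are assumptions):

config = {
    'mongo.input.uri': 'mongodb://host/db.input',
    # query that only matches part of the collection (extended-JSON date)
    'mongo.input.query':
        '{"createdOn": {"$lte": {"$date": "2015-10-26T23:51:05.787Z"}}}',
    # discard splits whose query matches no documents
    'mongo.input.split.filter_empty': 'true',
}
rdd = sc.newAPIHadoopRDD(
    'com.mongodb.hadoop.MongoInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.io.MapWritable',
    None, None, config)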
MongoSplitter
Problem: empty/unbalanced splits
Query
{“published”: true}
• No index on “published” means splits more likely
unbalanced
• Query selects documents throughout index for split
pattern
MongoSplitter
Solution
MongoPaginatingSplitter
mongo.splitter.class=
com.mongodb.hadoop.splitter.MongoPaginatingSplitter
• one-time collection scan, but splits have efficient queries
• no empty splits
• splits of equal size (except the last)
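Selecting the splitter from PySpark works the same way, through the configuration dict (a hedged sketch; only the splitter class property comes from this slide, the other keys are placeholders):

config = {
    'mongo.input.uri': 'mongodb://host/db.input',
    'mongo.input.query': '{"published": true}',
    'mongo.splitter.class':
        'com.mongodb.hadoop.splitter.MongoPaginatingSplitter',
}
# Pass `config` to sc.newAPIHadoopRDD(...) exactly as in the earlier examples.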
MongoSplitter
• choose the right splitting algorithm
• more efficient splitting with input query
Future Work – Data Locality
• Processing happens where the data lives
• Hadoop
– namenode (NN) knows locations of blocks
– InputFormat can specify split locations
– jobtracker collaborates with NN to schedule tasks to
take advantage of data locality
• Spark
– RDD.getPreferredLocations
Future Work – Data Locality
https://jira.mongodb.org/browse/HADOOP-202
Idea:
• Data node/executor on same machine as shard
• Connector assigns work based on local chunks
Future Work – Data Locality
• Set up Spark executors or Hadoop data nodes on machines
with shards running
• Mark each InputSplit or Partition with the shard host that
contains it
Wrapping Up
• Investigated how Python works inside Spark
• Understood the connector's splitting algorithms
• Data locality with MongoDB is on the horizon
Thank You!
Questions?
Github:
https://github.com/mongodb/mongo-hadoop
Issue Tracker:
https://jira.mongodb.org/browse/HADOOP
