USING MONGODB TO BUILD A FAST AND SCALABLE CONTENT REPOSITORY
SOME CONTEXT
What We Do and What Problems We Try to Solve
NUXEO
we provide a Platform that developers can use to build highly
customized Content Applications
we provide components, and the tools to assemble them
everything we do is open source (for real)
various customers - various use cases:
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
me: developer & CTO - joined the Nuxeo project 10+ years ago
https://github.com/nuxeo
Document Oriented Database: Store JSON Documents
Document Repository: Manage Document attributes, hierarchy, blobs, security, lifecycle, versions
DOCUMENT REPOSITORY
Storage abstraction: be able to choose the right storage
depending on the constraints
depending on the environment
Manage Content Model
Schemas, Mixins, Facets
Manage Data level Security
Document level permissions
Blob level permissions
Versioning
Life-Cycle
Blob management
Efficient storage & CDN
HISTORY: NUXEO REPOSITORY & STORAGE
2006: Nuxeo Repository is based on ZODB (Python / Zope based)
This is not JSON in NoSQL, but Python serialization in an object DB
Concurrency and performance issues, bad transaction handling
2007: Nuxeo Platform 5.1 - Apache Jackrabbit (JCR based)
Mix SQL + Java Serialization + Lucene
Transaction and consistency issues
2009: Nuxeo 5.2 - Nuxeo VCS
SQL based repository : MVCC & ACID
very reliable, but some use cases cannot fit in a SQL DB!
2014: Nuxeo 5.9 - Nuxeo DBS
Document Based Storage repository
MongoDB is the reference backend
[Timeline figure labels: Object DB, Document DB, SQL DB]
FROM SQL TO NOSQL
Understanding the motivations
for moving to MongoDB
SQL BASED REPOSITORY - VCS
Search API is the most used :
search is the main scalability challenge
KEY LIMITATIONS OF THE SQL APPROACH
Impedance issue
storing Documents in tables is not easy
requires Caching and Lazy loading
Scalability
Document repository can become very large (versions, workflows ...)
scaling out a SQL DB is very complex (and never transparent)
Concurrency model
Heavy write load is an issue (Quotas, Inheritance)
Hard to maintain good Read & Write performance
NEED A DIFFERENT STORAGE MODEL!
FROM SQL TO NOSQL
NOSQL WITH MONGODB
No Impedance issue
One Nuxeo Document = One MongoDB Document
No Scalability issue for CRUD
native distributed architecture allows scale out
No Concurrency performance issue
Document Level "Transactions"
No application level cache is needed
No need to manage invalidations
THAT'S WHY WE INTEGRATED MONGODB
let's see the technical details
INTEGRATING MONGODB
Inside nuxeo-dbs storage adapter
DOCUMENT BASED STORAGE & MONGODB
STORING NUXEO DOCUMENTS IN MONGODB
{
  "ecm:id": "52a7352b-041e-49ed-8676-328ce90cc103",
  "ecm:primaryType": "MyFile",
  "ecm:majorVersion": NumberLong(2),
  "ecm:minorVersion": NumberLong(0),
  "dc:title": "My Document",
  "dc:contributors": [ "bob", "pete", "mary" ],
  "dc:created": ISODate("2014-07-03T12:15:07+0200"),
  ...
  "cust:primaryAddress": {
    "street": "1 rue René Clair", "zip": "75018", "city": "Paris", "country": "France" },
  "files:files": [
    { "name": "doc.txt", "length": 1234, "mime-type": "text/plain",
      "data": "0111fefdc8b14738067e54f30e568115" },
    { "name": "doc.pdf", "length": 29344, "mime-type": "application/pdf",
      "data": "20f42df3221d61cb3e6ab8916b248216" }
  ],
  "ecm:acp": [
    { "name": "local",
      "acl": [ { "grant": false, "perm": "Write", "user": "bob" },
               { "grant": true, "perm": "Read", "user": "members" } ] }
  ]
  ...
}
40+ fields by default (depends on config)
18 indexes
HIERARCHY
Parent-child relationship: ecm:parentId
Recursion optimized through an array: ecm:ancestorIds
• Maintained by framework (create, delete, move, copy)
{ ...
  "ecm:parentId": "3d7efffe-e36b-44bd-8d2e-d8a70c233e9d",
  "ecm:ancestorIds": [ "00000000-0000-0000-0000-000000000000",
                       "4f5c0e28-86cf-47b3-8269-2db2d8055848",
                       "3d7efffe-e36b-44bd-8d2e-d8a70c233e9d" ]
... }
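A minimal sketch of how an ancestor array like ecm:ancestorIds can be maintained on create and move, and why it removes the need for recursion when listing a subtree. This is an illustration in plain JavaScript, not the nuxeo-dbs implementation: the in-memory `repo` map and the helpers `createDoc`, `descendantsOf` and `moveDoc` are hypothetical names, and the short ids stand in for UUIDs.

```javascript
// Hypothetical in-memory stand-in for the MongoDB collection (assumption).
const ROOT_ID = "00000000-0000-0000-0000-000000000000";
const repo = new Map();
repo.set(ROOT_ID, { "ecm:id": ROOT_ID, "ecm:parentId": null, "ecm:ancestorIds": [] });

// On create: ancestors = the parent's ancestors followed by the parent itself.
function createDoc(id, parentId) {
  const parent = repo.get(parentId);
  if (!parent) throw new Error(`unknown parent ${parentId}`);
  const doc = {
    "ecm:id": id,
    "ecm:parentId": parentId,
    "ecm:ancestorIds": [...parent["ecm:ancestorIds"], parentId],
  };
  repo.set(id, doc);
  return doc;
}

// Subtree listing: one membership test per document, no recursion.
// In the mongo shell this would be an equality match on the array field.
function descendantsOf(folderId) {
  return [...repo.values()]
    .filter((d) => d["ecm:ancestorIds"].includes(folderId))
    .map((d) => d["ecm:id"]);
}

// On move: swap the old ancestor prefix for the new one, for the moved
// document and for every document in its subtree.
function moveDoc(id, newParentId) {
  const doc = repo.get(id);
  const newParent = repo.get(newParentId);
  if (!doc || !newParent) throw new Error("unknown document");
  if (newParentId === id || newParent["ecm:ancestorIds"].includes(id)) {
    throw new Error("cannot move a document into its own subtree");
  }
  const subtree = descendantsOf(id);
  const oldPrefixLength = doc["ecm:ancestorIds"].length;
  const newPrefix = [...newParent["ecm:ancestorIds"], newParentId];
  doc["ecm:parentId"] = newParentId;
  doc["ecm:ancestorIds"] = newPrefix;
  for (const childId of subtree) {
    const child = repo.get(childId);
    child["ecm:ancestorIds"] = [
      ...newPrefix,
      ...child["ecm:ancestorIds"].slice(oldPrefixLength),
    ];
  }
}

// Usage with made-up ids.
createDoc("ws", ROOT_ID);
createDoc("folder", "ws");
createDoc("file", "folder");
createDoc("archive", ROOT_ID);
const beforeMove = descendantsOf("ws");
moveDoc("folder", "archive");
const afterMove = descendantsOf("ws");
```

The trade-off is that a move rewrites one array per document in the moved subtree, in exchange for subtree reads that need a single indexed lookup.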
SECURITY
Generic ACP stored in ecm:acp field
{ ...
  "ecm:acp": [ {
    name: "local",
    acl: [ { "grant": false, "perm": "Write", "user": "bob" },
           { "grant": true, "perm": "Read", "user": "members" } ] } ]
... }
Precomputed Read ACLs to avoid post-filtering on search
• Simple set of identities having access
• Semantic restrictions on blocking
• Maintained by framework
• Search matches if the user's identities intersect the set
ecm:racl: ["Management", "Supervisors", "bob"]
db.default.find({"ecm:racl": {"$in": ["bob", "members", "Everyone"]}})
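The "search matches if intersection" rule can be sketched in plain JavaScript as follows. The sample documents and the `canRead` helper are hypothetical. The check mirrors what `$in` does on an array field: a document matches when at least one of the user's identities appears in its ecm:racl.

```javascript
// Identities of the current user: own name plus groups (sample values).
const principals = ["bob", "members", "Everyone"];

// Hypothetical documents carrying a precomputed Read ACL.
const sampleDocs = [
  { "dc:title": "Budget", "ecm:racl": ["Management", "Supervisors", "bob"] },
  { "dc:title": "Board minutes", "ecm:racl": ["Management"] },
  { "dc:title": "Handbook", "ecm:racl": ["Everyone"] },
];

// Readable when the two sets share at least one identity.
function canRead(doc, identities) {
  const allowed = new Set(doc["ecm:racl"]);
  return identities.some((identity) => allowed.has(identity));
}

const readableTitles = sampleDocs
  .filter((doc) => canRead(doc, principals))
  .map((doc) => doc["dc:title"]);
```

Because the check is part of the query itself, the database returns only readable documents and no post-filtering pass is needed on the application side.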
SEARCH
db.default.find({
  $and: [
    {"dc:title": {"$in": ["Workspaces", "Sections"]}},
    {"ecm:racl": {"$in": ["bob", "members", "Everyone"]}}
  ]
})
SELECT * FROM Document WHERE dc:title = 'Sections' OR dc:title = 'Workspaces'
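A rough sketch of the translation shown on this slide: the OR over dc:title collapses into a single `$in`, and the Read ACL clause is added with `$and`. The `buildFilter` helper is hypothetical and is not Nuxeo's query translator. The small `matches` evaluator only understands `$and` and `$in`, and exists so the filter can be tried without a database.

```javascript
// Hypothetical helper: title alternatives plus the caller's identities.
function buildFilter(titles, principals) {
  return {
    $and: [
      { "dc:title": { $in: titles } },
      { "ecm:racl": { $in: principals } },
    ],
  };
}

// Toy evaluator covering only $and and $in (scalar or array fields).
function matches(doc, filter) {
  if (filter.$and) {
    return filter.$and.every((clause) => matches(doc, clause));
  }
  return Object.entries(filter).every(([field, condition]) => {
    const value = doc[field];
    const values = Array.isArray(value) ? value : [value];
    return values.some((v) => condition.$in.includes(v));
  });
}

const filter = buildFilter(["Workspaces", "Sections"], ["bob", "members", "Everyone"]);

const visibleSection = { "dc:title": "Sections", "ecm:racl": ["members"] };
const hiddenSection = { "dc:title": "Sections", "ecm:racl": ["Management"] };
const otherTitle = { "dc:title": "Templates", "ecm:racl": ["Everyone"] };

const results = [visibleSection, hiddenSection, otherTitle].map((d) => matches(d, filter));
```

The security clause is always appended by the framework, so the user-facing query never has to mention ecm:racl.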
CONSISTENCY CHALLENGES
Unitary Document Operations are safe
No impedance issue
Large batch updates are not much of an issue
SQL DBs do not like long-running transactions anyway
Multi-document transactions are an issue
Workflows are a typical use case
Isolation issue
Other transactions can see intermediate states
Possible interleaving
Transactions cannot span multiple documents
Need to find a way to mitigate consistency issues
MITIGATING CONSISTENCY ISSUES
Transient State Manager
Run all operations in Memory
Flush to MongoDB as late as possible
Populate an Undo Log
Replay backward in case of Rollback
Recover partial Transaction Management
Commit / Rollback model
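The transient state and undo log ideas can be sketched like this. It is a simplified illustration, not the actual nuxeo-dbs code: the `TransientSession` class is hypothetical, a `Map` stands in for the MongoDB collection, and documents are replaced as whole objects rather than patched.

```javascript
class TransientSession {
  constructor(store) {
    this.store = store;       // Map standing in for the MongoDB collection
    this.pending = new Map(); // changes held in memory, not yet flushed
    this.undoLog = [];        // previous state of every flushed document
  }

  // Writes stay in memory until a flush is needed.
  write(id, doc) {
    this.pending.set(id, doc);
  }

  // Reads see the session's own pending changes first.
  read(id) {
    return this.pending.has(id) ? this.pending.get(id) : this.store.get(id);
  }

  // Push pending changes to the store, recording how to undo each one.
  flush() {
    for (const [id, doc] of this.pending) {
      this.undoLog.push({ id, existed: this.store.has(id), before: this.store.get(id) });
      this.store.set(id, doc);
    }
    this.pending.clear();
  }

  commit() {
    this.flush();
    this.undoLog = [];
  }

  // Drop unflushed changes, then replay the undo log backward.
  rollback() {
    this.pending.clear();
    while (this.undoLog.length > 0) {
      const { id, existed, before } = this.undoLog.pop();
      if (existed) this.store.set(id, before);
      else this.store.delete(id);
    }
  }
}

// Usage: a flush in the middle of a transaction, then a rollback.
const store = new Map([["doc1", { title: "v1" }]]);
const session = new TransientSession(store);
session.write("doc1", { title: "v2" });
session.write("doc2", { title: "new" });
session.flush(); // for example, forced by a query
const seenByOthers = store.get("doc1").title; // flushed but uncommitted
session.rollback();
const afterRollback = store.get("doc1").title;
```

Between the flush and the rollback, any other reader of the store sees the new state, which is why this model recovers commit and rollback but not isolation.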
But complete isolation is not possible
Need to flush transient state for queries
"uncommitted" changes are visible to others
"read uncommitted" at best
WHEN TO USE MONGODB OVER TRADITIONAL SQL?
MONGODB REPOSITORY
Typical use cases
THERE IS NOT ONE UNIQUE SOLUTION
Use each storage solution for what it does best
SQL DB
store content in an ACID way (Atomic, Consistent, Isolated, Durable)
consistency over availability
MongoDB
store content in a BASE way (Basic Availability, Soft state, Eventually consistent)
availability over consistency
Elasticsearch
provide powerful and scalable queries
Storage does not impact the application: this can be a deployment choice!
IDEAL USE CASES FOR MONGODB
HUGE REPOSITORY - HEAVY LOADING
Massive amount of Documents
hundreds of millions (x00,000,000)
Automatic versioning
create a version for every single change
Write intensive access
daily imports or updates
recursive updates (quotas, inheritance)
SQL DB collapses (on commodity hardware)
MongoDB handles the volume
BENCHMARKING MASS IMPORT
[Chart: mass import benchmark on commodity hardware. Surviving labels: SQL (with tuning), SQL, 7x faster]
BENCHMARKING READ + WRITE
Read & Write operations are competing
Write operations are not blocked
Test setup: c4.xlarge (Nuxeo), c4.2xlarge (DB)
[Chart label: SQL]
DATA LOADING OVERFLOW
Side effects of impedance mismatch
Lots of lazy loading
Very large Objects = lots of fragments
lots of lazy loading = latency issues
Cache thrashing issue
SQL mapping requires caching
reading lots of documents inside a single transaction
MongoDB has no impedance mismatch
no lazy loading
fast loading of big documents
no need for 2nd level cache
BENCHMARKING IMPEDANCE EFFECT
Process 20,000 documents
700 documents/s with SQL backend (cold cache)
6,000 documents/s with MongoDB / MMAPv1: x9
11,000 documents/s with MongoDB / WiredTiger: x15
Process 100,000 documents
750 documents/s with SQL backend (cold cache)
9,500 documents/s with MongoDB / MMAPv1: x13
11,500 documents/s with MongoDB / WiredTiger: x15
Process 200,000 documents
750 documents/s with SQL backend (cold cache)
14,000 documents/s with MongoDB / MMAPv1: x18
11,000 documents/s with MongoDB / WiredTiger: x15
processing benchmark
based on a real use case
ROBUST ARCHITECTURE
native distributed architecture
ReplicaSet: data redundancy & fault tolerance
Geographically Redundant Replica Set: host data on multiple hosting sites
[Diagram labels: active, active]
A REAL LIFE EXAMPLE
A REAL LIFE EXAMPLE - CONTEXT
Who: US Network Carrier
Goal: Provide VOD services
Requirements:
store videos
manage meta-data
manage workflows
generate thumbnails
generate conversions
manage availability
They chose Nuxeo to build their Video repository
A REAL LIFE EXAMPLE - CHALLENGES
Very Large Objects:
lots of meta-data (Dublin Core, ADI, ratings ...)
Massive daily updates
updates on rights and availability
Need to track all changes
prove what the availability was on a given date
looks like a good use case for MongoDB
lots of data + lots of updates
A REAL LIFE EXAMPLE - MONGODB CHOICE
because they have a good use case for MongoDB
Lots of large objects, lots of updates
because they wanted to use MongoDB
change work habits (open source, NoSQL)
doing a project with MongoDB is cool
they chose MongoDB
they are happy with it!
ANY QUESTIONS?
Thank You !
https://github.com/nuxeo
http://www.nuxeo.com/careers/
