Aggregation Framework MongoDB Days Munich

#mongodbdays @nleite #aggfwk
Aggregation Framework
Norberto Leite
Senior Solutions Architect, MongoDB

Agenda
• What is the Aggregation Framework?
• The Aggregation Pipeline
• Usage and Limitations
• Aggregation and Sharding
• Summary

What is the Aggregation
Framework?

Aggregation in Nutshell
• We're storing our data in MongoDB
• Our applications need to run ad-hoc queries for
grouping, summarizations, reporting, etc.
• We must have a way to reshape data easily to support
these access patterns
• You can use Aggregation Framework for this!

MapReduce is great, but…
• Extremely versatile, powerful
• Overkill for simple aggregation tasks
• Averages
• Summation
• Grouping
• Reshaping
• High level of complexity
• Difficult to program and debug

Aggregation Framework
• Executes in native code
– Written in C++
– JSON parameters
• Plays nice with sharding
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions

What is an Aggregation Pipeline?
• A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
$match $project $group $sort
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order

Pipeline Operators
• $sort
• Order documents
• $limit / $skip
• Paginate documents
• $redact
• Restrict documents
• $geoNear
• Proximity sort documents
• $let, $map
• Define variables
• $match
• Filter documents
• $project
• Reshape documents
• $group
• Summarize documents
• $unwind
• Expand documents

Our Example Data
{
"_id" : ObjectId("54523d2d25784427c6fabce1"),
"From" : "norberto@mongodb.com",
"To" : "mongodb-user@googlegroups.com",
"Date" : ISODate("2012-08-15T22:32:34Z"),
"body" : {
"text/plain" : ”Hello Munich, nice to see yalll!"
},
"Subject" : ”Live From MongoDB World"
}

$match
• Filter documents
– Uses existing query syntax
– No $where (server side Javascript)

Matching Field Values
{
subject: "Hello There",
words: 218,
from: "norberto@mongodb.com"
}
{ $match: {
from: "hipster@somemail.com"
}}
{
subject: "I love Hofbrauhaus",
words: 90,
}
{
subject: "MongoDB Rules!",
words: 100,
}
{
words: 100,
}

Matching with Query Operators
{
words: 218,
}
{ $match: {
words: {$gt: 100}
}}
{
words: 90,
}
{
words: 100,
}
{
words: 100,
}
{
words: 218,
}

$project
• Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields

Including and Excluding Fields
{
_id: 12345,
words: 218,
from:"norberto@mongodb.com"
to: [ "marc@mongodb.com",
"sam@mongodb.com" ],
account: "mongodb mail",
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}
{ $project: {
_id: 0,
subject: 1,
from: 1
}}
{
}

Renaming and Computing Fields
{ $project: {
spamIndex: {
$mul: ["$words",
"$replies"]
},
user: "$from"
}}
{
_id: 12345,
spamIndex: 72.6666 ,
user: "norberto@mongodb.com"
}
{
_id: 12345,
words: 218,
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}

Creating Sub-Document Fields
{ $project: {
subject: 1,
stats: {
replies: "$replies",
from: "$from",
date: "$date"
}}}
{
_id: 375,
stats: {
replies: 3,
from: "norberto@mongodb.com",
date: ISODate("2012-08-05")
}}
{
_id: 12345,
words: 218,
date: ISODate("2012-08-05"),
replies: 3,
folder: "Inbox",
...
}

$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by
default

Calculating An Average
{ $group: {
_id: "$from",
avgWords: { $avg:
"$words" }
}}
{
_id: "norberto@mongodb.com",
avgPages: 154
}
{
_id: "hipster@somemail.com",
avgPages: 100
}
{
words: 218,
}
{
words: 90,
}
{
words: 100,
}

Summing Fields and Counting
{ $group: {
_id: "$from",
words: { $sum: "$words" },
mails: { $sum: 1 }
}}
{
_id: "norberto@mongodb.com",
words: 308,
mails: 2
}
{
_id: "hipster@somemail.com",
words: 100,
mails: 1
}
{
words: 218,
}
{
words: 90,
}
{
words: 100,
}

$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate

Collecting Distinct Values
{ $unwind: "$to" }
{ subject: "2.8 will be great!",
to: "marc@mongodb.com",
account : "mongodb mail” }
{
_id: 2222,
subject: "2.8 will be great!",
"eliot@mongodb.com",
"asya@mongodb.com",
],
account: "mongodb mail"
}
to: "eliot@mongodb.com",
to: "asya@mongodb.com",

$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior

$redact
• Restrict access to Documents
– Use document fields to define privileges
– Apply conditional queries to validate users
• Field Level Access Control
– $$DESCEND, $$PRUNE, $$KEEP
– Applies to root and subdocument fields

$redact Example Data
{
_id: 375,
item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
Manufacturer: "Sony",
security: 0,
quantity: 12,
list: 4999,
pricing: {
security: 1,
sale: 2698,
wholesale: {
security: 2,
amount: 2300 }
}
}

Query by Security Level
security = 0
db.catalog.aggregate([
{
$match: {item: /^.*XBR55X900A*/}
},
{
$redact: {
$cond: {
if: { $lte: [ "$security", ?? ] },
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}])
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
"Manufacturer" : "Sony”,
"security" : 0,
"quantity" : 12,
"list" : 4999
}
{
"_id" : 375,
"item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV",
"Manufacturer" : "Sony",
"security" : 0,
"quantity" : 12,
"list" : 4999,
"pricing" : {
"security" : 1,
"sale" : 2698,
"wholesale" : {
"security" : 2,
"amount" : 2300
}
}
}
security = 2

$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage

$geonear Example Data
{
"_id" : 35089,
"city" : “Sony”,
"loc" : [
-86.048397,
32.979068
],
"pop" : 1584,
"state" : "AL”
}

Query by Proximity
db.catalog.aggregate([
{
$geoNear : {
near: [ -86.000, 33.000 ],
distanceField: "dist",
maxDistance: .050,
spherical: true,
num: 3
}
}])
{
"_id" : "35089",
"city" : "KELLYTON",
"loc" : [ -86.048397, 32.979068 ],
"pop" : 1584,
"state" : "AL",
"dist" : 0.0007971432165364155
},
{
"_id" : "35010",
"city" : "NEW SITE",
"loc" : [ -85.951086, 32.941445 ],
"pop" : 19942,
"state" : "AL",
"dist" : 0.0012479615347306806
},
{
"_id" : "35072",
"city" : "GOODWATER",
"loc" : [ -86.078149, 33.074642 ],
"pop" : 3813,
"state" : "AL",
"dist" : 0.0017333719627032555
}

Usage
• collection.aggregate([…], {<options>})
– Returns a cursor
– Takes an optional document to specify aggregation options
• allowDiskUse, explain
– Use $out to send results to a Collection
• db.runCommand({aggregate:<collection>, pipeline:[…]})
– Returns a document, limited to 16 MB

Collection
db.books.aggregate([
{ $project: { language: 1 }},
{ $group: { _id: "$language", numTitles: { $sum: 1 }}}
])
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }

Database Command
db.runCommand({
aggregate: "books",
pipeline: [
{ $project: { language: 1 }},
1 } {} }$ group: { _id: "$language", numTitles: { $sum:
]
})
{
result : [
{ _id: "Russian", numTitles: 1 },
{ _id: "English", numTitles: 2 }
],
“ok” : 1
}

Limitations
• Pipeline operator memory limits
– Stages limited to 100 MB
– Use “allowDiskUse” option to use disk for larger data sets
• Some BSON types unsupported
– Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope

Sharding
Result
mongos
Shard 1 (Primary)
$match, $project,
$group
Shard 2
$match, $project,
$group
Shard 3
excluded
Shard 4
$match, $project,
$group
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursors and
continues processing*
– Use explain to analyze pipeline split
– Early $match may excuse shards
– Potential CPU and memory implications
for primary shard host
* Prior to v2.6 second stage pipeline processing was done by mongos

Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data

Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
– https://jira.mongodb.org/browse/SERVER-3253

Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
– Grouping input sorted by _id
– Sorting with limited output

Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
– Replace pages of JavaScript
– Longer aggregation pipelines
• Quick aggregations from the shell

#mongodbdays #aggfwk #devs @mongodb
Obrigado!
Norberto Leite
SA | Eng – norberto@mongodb.com

Aggregation Framework MongoDB Days Munich

More Related Content

What's hot

Viewers also liked

Similar to Aggregation Framework MongoDB Days Munich

More from Norberto Leite

Recently uploaded

Aggregation Framework MongoDB Days Munich