MongoDB in Data Science
How to convert a Pandas proof-of-concept into a scalable product, and
why MongoDB is the key to success!
Who I am
Software Engineer
Compiler Engineer
Compiler Engineer
LLVM contributor
Software Engineer
R/D
Lead ML Engineer
Backend
Infrastructure
Sr. ML Engineer
What will we learn?
● Understand existing tools for delivering Data Science projects and when to use them.
● Why MongoDB could be crucial for your product and business
● How to easily productionize a Pandas Proof-of-Concept
● How to use MongoDB while being open to other technologies.
Motivation
Key factors
● Speed of inference
● Speed of development
Key factors: Speed of inference
● Feature Aggregation
● Model
● Prediction Service
Key factors: Speed of development
● Research: Data Scientist
● Productionization: Data/ML Engineer
What is Pandas?
The most popular Python library for data manipulation and data wrangling in the
data science community.
Source: numpy.org, scipy.org, matplotlib.org, scikit-learn.org, pandas.pydata.org
Source: Stack Overflow blog post by David Robinson
Why use Pandas DataFrames?
Drawbacks of Pandas
● Doesn’t have a persistence layer
● Doesn’t support primary and secondary indexes
○ As a result, not efficient for querying
● Doesn’t support multi-threading
Productionization options
● Real time service
● Batch Job
● Slow Inference
● Fast Inference
Real time service demo (recommendation)
● Event Store
● Model Training Job
● Model store
● Inference 1, Inference 2, Inference N
Real time service demo (recommendation)
● Event Store
● Inference Service (request / respond)
○ Feature Aggregation
○ Model Inference
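As a rough illustration of the Feature Aggregation step inside the inference service, the sketch below builds a MongoDB aggregation pipeline that counts one user's recent events per event type and hands the counts to a model. The field names (`user_id`, `event_type`, `created_at`), the 7-day window, the `recommend` function, and the `model` callable are all assumptions made for illustration and do not come from the talk. `collection` is assumed to be a PyMongo `Collection` (or anything with a compatible `aggregate` method).

```python
from datetime import datetime, timedelta, timezone


def build_feature_pipeline(user_id, since):
    """Pipeline that counts one user's events per event type since `since`."""
    return [
        # $match comes first so MongoDB can use an index on the filtered fields.
        {"$match": {"user_id": user_id, "created_at": {"$gte": since}}},
        {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
    ]


def aggregate_features(collection, user_id, window_days=7, now=None):
    """Run the pipeline and return {event_type: count} for the time window."""
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(days=window_days)
    cursor = collection.aggregate(build_feature_pipeline(user_id, since))
    return {doc["_id"]: doc["count"] for doc in cursor}


def recommend(collection, model, user_id):
    """Hypothetical request handler: aggregate features, then call the model."""
    features = aggregate_features(collection, user_id)
    return model(features)
```

Pushing the aggregation into the database means the service only receives a handful of small documents per request instead of loading raw events into a DataFrame.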
Things to avoid
● Forgetting to put indexes on your collection
● Putting indexes on every field
● Reading and writing from the same replica
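A minimal sketch of the first two points, assuming PyMongo and an events collection that is queried by `user_id` and `created_at` (both field names are hypothetical): create one compound index that matches the query pattern instead of indexing every field. For the third point, MongoDB connection strings accept a `readPreference` option; the host and replica set name in the URI below are placeholders.

```python
# Index directions as PyMongo defines them
# (pymongo.ASCENDING == 1, pymongo.DESCENDING == -1).
ASCENDING, DESCENDING = 1, -1

# Placeholder connection string for read-only clients: with
# readPreference=secondaryPreferred, reads go to a secondary when one is
# available, while writes always go to the primary.
READ_URI = (
    "mongodb://db.example.com:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)


def ensure_query_index(collection):
    """Create a single compound index matching the assumed query pattern."""
    return collection.create_index(
        [("user_id", ASCENDING), ("created_at", DESCENDING)],
        name="user_recent_events",
    )
```

Every extra index has to be updated on each write, which is why indexing only the fields that queries actually filter or sort on is usually the better trade-off. Reads from secondaries can lag slightly behind the primary, so this setting fits features that tolerate slightly stale data.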
But… we generate tons of user events!
Is this solution going to work for us?
Typical data pipeline
● user events
● Consumer 1, Consumer 2, Consumer N
● MongoDB, Postgres, DFS
Shrink down the amount of data
● Consumer: Filters (event_type, ...)
● MongoDB: TTL index
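The two techniques named on this slide could be sketched as follows, assuming PyMongo. The set of relevant event types, the `created_at` field, and the 30-day retention period are illustrative assumptions, not values from the talk. The consumer drops events the model does not need before they reach MongoDB, and a TTL index lets MongoDB delete old events automatically.

```python
from datetime import datetime, timezone

# Hypothetical set of event types the recommendation model actually uses.
RELEVANT_EVENT_TYPES = {"view", "add_to_cart", "purchase"}


def should_store(event):
    """Consumer-side filter: keep only events with a relevant event_type."""
    return event.get("event_type") in RELEVANT_EVENT_TYPES


def ensure_ttl_index(collection, days=30):
    """Create a TTL index so documents expire `days` after `created_at`.

    TTL indexes are single-field indexes on a field that holds a date.
    """
    return collection.create_index("created_at", expireAfterSeconds=days * 24 * 3600)


def consume(events, collection):
    """Filter a batch of events, timestamp them, and insert the survivors."""
    kept = [
        {**event, "created_at": datetime.now(timezone.utc)}
        for event in events
        if should_store(event)
    ]
    if kept:  # insert_many rejects an empty list
        collection.insert_many(kept)
    return len(kept)
```

The unfiltered stream can still go to cheaper storage through other consumers, so the MongoDB event store only holds what the real time service needs.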
Training Job
● Event Store
● Model Training Job
● Model store
● Inference 1, Inference 2, Inference N
Source: mongodb.com
Model Training job
● Event Store
● MongoDB Connector
● Model Training Job
Inference as a batch job
● Event Store
● MongoDB Connector
● Inference Job
Flexibility
● Spark DataFrame
● MongoDB Aggregate
● Pandas DataFrame
Batch Job versus Real Time Service
        Real Time Service                 Batch Job
Pros    On demand (scales as needed)      Easier to develop and maintain
Cons    Harder to develop and maintain    Constantly utilizing resources
Benefits of MongoDB
● Schema-Less
● Horizontally scalable
● Available as PaaS from many vendors.
● Has a huge community
● Easier to hire people
Summary
● Allows you to provide a real time experience
● Could help save expensive computational resources
● Provides a way to do real time as well as batch inference
We are hiring !!!
careers.shopbonsai.ca
References
● https://stackoverflow.blog/2017/09/14/python-growing-quickly/
● https://www.mongodb.com/products/spark-connector
● https://pandas.pydata.org/
● https://scikit-learn.org/
● https://matplotlib.org/
● https://www.scipy.org/
● https://www.numpy.org/
● https://iconscout.com/icon/device-management-mobile-computer-seo-tool-analyze-7
Thanks !!!

MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product Using MongoDB