Elevate Your Enterprise Architecture with an In-Memory Computing Strategy
Dylan Tong
Principal Solutions Architect
dylan.tong@mongodb.com
In-Memory Computing
How can we process data as fast as possible
by leveraging in-memory speed at its best?
What are the possibilities if we could?
High-frequency trading (HFT) is a program trading platform that uses
powerful computers to transact a large number of orders at very fast
speeds. It uses complex algorithms to analyze multiple markets and
execute orders based on market conditions.
Typically, the traders with the fastest execution speeds are more
profitable than traders with slower execution speeds.
Source: Investopedia
Speed Matters…
Amazon found that it increased revenue by 1% for every 100 ms of
improvement. [Source: Amazon]
A 1-second delay in page load time equals 11% fewer page views,
a 16% decrease in customer satisfaction, and 7% loss in
conversions. [Source: Aberdeen Group]
A study found that 27% of the participants who did mobile shopping
were dissatisfied due to the experience being too slow. [Source:
Forrester Consulting]
How Fast?
Latency       Unit      Normalized to 1 s
RAM access    100s ns   ~6 min
SSD access    100s µs   ~6 days
HDD access    10s ms    ~12 months
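The deck doesn't show the scaling, but the third column is consistent with multiplying every latency by one constant: roughly the factor that stretches a ~0.3 ns CPU cycle into a full second. A back-of-the-envelope check (the 0.3 ns baseline is our assumption, not the slide's):

  // One scale factor reproduces all three "normalized" rows:
  var scale = 1 / 0.3e-9;               // ~3.3 billion: 0.3 ns becomes 1 s
  print(100e-9 * scale / 60);           // RAM, 100 ns -> ~5.6 minutes (slide: ~6 min)
  print(100e-6 * scale / 86400);        // SSD, 100 µs -> ~3.9 days (slide: ~6 days, nearer 150 µs)
  print(10e-3 * scale / (86400 * 30));  // HDD, 10 ms  -> ~12.9 "months" (slide: ~12 months)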
Why Now?
*Average $/GB of RAM by year:
2015  $4.37
2013  $5.50
2010  $12.37
2005  $189
2000  $1,107
1995  $30,875
1990  $103,880
1985  $859,375
1980  $6,328,125
[Chart: average $/GB of RAM, 2005-2015, falling from ~$189 to ~$4.37]
Last 10 Years…
“Generally affordable”
*http://www.statisticbrain.com/average-historic-price-of-ram/
Why Now?
[Chart: average $/GB of RAM, 2010-2015, falling from ~$12.37 to ~$4.37; same price data as above]
“An Option at Scale”
Last 5 Years…
*http://www.statisticbrain.com/average-historic-price-of-ram/
"This will process these data using algorithms for machine
learning and artificial intelligence before sending the data
back to the car.
The zFAS board will in this way continuously extend its
capabilities to master even complex situations increasingly
better," Audi stated. "The piloted cars from Audi thus learn
more every day and with each new situation they
experience.”
Source: T3.com
The possibilities…
Challenges: Scale
Challenges: Cost Viability
An AWS X1 instance (~2 TB of RAM) = $34,777/yr. → ~$1.74M/yr. for infrastructure to support 100 TB
Storage Type   Avg. Cost ($/GB)   Cost at 100 TB
RAM            $5.00              ~$500K
SSD            $0.47-$1.00        $47K to $100K
HDD            $0.03              ~$3K
http://www.statisticbrain.com/average-cost-of-hard-drive-storage/
http://www.myce.com/news/ssd-price-per-gb-drops-below-0-50-how-low-can-they-go-70703/
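The arithmetic behind these figures, as a quick sanity check. The per-GB prices and the $34,777/yr. figure are the deck's; the instance count (100 TB at ~2 TB of RAM per X1) follows the speaker notes:

  var gb = 100 * 1024;                 // 100 TB = 102,400 GB
  print(gb * 5.00);                    // RAM: ~$512K (slide rounds to ~$500K)
  print(gb * 0.47);                    // SSD, low end:  ~$48K
  print(gb * 1.00);                    // SSD, high end: ~$102K
  print(gb * 0.03);                    // HDD: ~$3K
  print(Math.ceil(100 / 2) * 34777);   // 50 X1 instances -> ~$1.74M/yr.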
Challenges: Durability
Volatile Memory
• What happens when things fail, and what data may be lost?
• How does the system synchronize with your durable storage? Does it do this well, and is it simple to implement?
Challenges: Design Still Matters on RAM
Scenario: eCommerce Modernization Initiative

Business Problem: Customer experience is suffering during high-traffic events.
Technology Limitations:
• Too expensive to scale the system to support spike events.
• Scaling the system is hard, and engineering teams can't react fast enough to unexpected growth.
• A caching solution is in place, but it mostly helps only with read performance; synchronizing writes has been a development nightmare.

Business Problem: A lack of mobile customers in Europe and Asia has been attributed to latency issues.
Technology Limitation: It is difficult to extend the data architecture globally, so the effort has been put on hold.
Scenario: eCommerce Modernization Initiative (continued)

Business Problem: Below-industry conversion rates have been attributed partly to poor personalization.
Technology Limitations:
• Customer info is siloed across the enterprise, and it's too complicated to bring this data together so effective models can be built to drive personalization.
• A “Big Data” project to bring the data together to drive machine learning and cognitive capabilities in the platform failed: data scientists reported the platform was too slow to develop on and its performance impractical.

Business Problem: Business analysts have siloed views of the eCommerce channel, and information isn't getting to them fast enough.
Technology Limitations:
• Related to the limitations above.
• Integrating data into the data warehouse is slow and hard to maintain.
Scenario: eCommerce Modernization Initiative
[Diagram: the current platform. A Platform API and Platform Services sit over the eCommerce datastores (Orders, Product Catalog, Customer Data: profile, sessions, carts, personalization; Inventory) spread across NoSQL and RDBMS engines, alongside dependent external data sources and integrations: CRM, ERP, PIM, a data warehouse, and BI tools.]
Siloed Data-Sources Problem
[Diagram: Customer Data (profile, sessions, carts, personalization) and the Product Catalog are drawn from silos: NoSQL/RDBMS stores, CRM, ERP, PIM, partner sources (supplier databases, etc.), and a legacy mainframe. SLOW AND POOR SCALABILITY.]
Operational Single View
[Diagram: MongoDB as an Enterprise Data Hub. Customer Data (profile, sessions, carts, personalization) and the Product Catalog are consolidated into one operational single view, fed from the same NoSQL/RDBMS stores, CRM, ERP, PIM, partner sources (supplier databases, etc.), and legacy mainframe.]
Reference: MetLife Wall Presentation
{
  product_name: 'Acme Paint',
  color: ['Red', 'Green'],
  size_oz: [8, 32],
  finish: ['satin', 'eggshell']
}

{
  product_name: 'T-shirt',
  size: ['S', 'M', 'L', 'XL'],
  color: ['Heather Gray', ...],
  material: '100% cotton',
  wash: 'cold',
  dry: 'tumble dry low'
}

{
  product_name: 'Mountain Bike',
  brake_style: 'mechanical disc',
  color: 'grey',
  frame_material: 'aluminum',
  no_speeds: 21,
  package_height: '7.5x32.9x55',
  weight_lbs: 44.05,
  suspension_type: 'dual',
  wheel_size_in: 26
}

Documents in the same product catalog collection in MongoDB
Dynamic Schema
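A short sketch of what the dynamic schema means for queries; the collection name and queries below are illustrative, not from the deck:

  // All three products live in one collection; a query matches whichever
  // documents carry the field, whether the value is a scalar or an array:
  db.catalog.find({ color: 'Red' })                 // matches the paint ('Red' is in the array)
  db.catalog.find({ wheel_size_in: { $gte: 26 } })  // matches only the mountain bike
  // One secondary index covers 'color' whether it is scalar or array-valued:
  db.catalog.createIndex({ color: 1 })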
Still Agile, Scalable and Simple
• Flexible Data Model: facilitates agile development and continuous delivery methodologies.
• Scalability: scale out dynamically as demand grows.
In-Memory Storage Engine

High Performance:
• More predictable and lower latency on less in-memory infrastructure.

Infrastructure Optimization:
• Assign a data subset to the In-Memory SE via Zone Sharding.
• Optimize cost vs. performance without silos.

Rich Query Capability:
• Full MongoDB query and indexing support.

[Diagram: a mixed cluster of In-Memory SE nodes and WiredTiger nodes]
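A minimal sketch of bringing up an In-Memory SE node; the engine ships with MongoDB Enterprise, and the path and size here are illustrative:

  // Start a mongod on the in-memory engine (launch command shown as a comment):
  //   mongod --storageEngine inMemory --dbpath /data/inmem --inMemorySizeGB 16
  // From the mongo shell, confirm which engine a given node is running:
  db.serverStatus().storageEngine.name   // "inMemory" here; "wiredTiger" on durable nodes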
[Diagram: a zone-sharded cluster spanning WEST and EAST regions.
  Shard 1 (TAG: WEST, IN_MEM)
  Shard 2 (TAG: WEST, WT)
  Shard 3 (TAG: EAST, IN_MEM)
  Shard 4 (TAG: EAST, WT)
Updates are local reads/writes with strong consistency; session data is geographically localized and served with in-memory engine latency.]
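A sketch of the tagging the diagram implies, using MongoDB's shard-tag (zone) helpers; shard names, database, and shard key are hypothetical:

  // Tag the in-memory shards by region:
  sh.addShardTag("shard1", "WEST_MEM")   // In-Memory SE, WEST
  sh.addShardTag("shard3", "EAST_MEM")   // In-Memory SE, EAST
  // Shard the session data on a region-prefixed key:
  sh.enableSharding("ecommerce")
  sh.shardCollection("ecommerce.sessions", { region: 1, sessionId: 1 })
  // Pin each region's sessions to its in-memory shards:
  sh.addTagRange("ecommerce.sessions",
      { region: "WEST", sessionId: MinKey }, { region: "WEST", sessionId: MaxKey }, "WEST_MEM")
  sh.addTagRange("ecommerce.sessions",
      { region: "EAST", sessionId: MinKey }, { region: "EAST", sessionId: MaxKey }, "EAST_MEM")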
In-Memory Storage Engine

Durability and Fault-Tolerance:
• Mixed replica sets allow data to be replicated from the In-Memory SE to the WiredTiger SE.
• Full high availability: automatic failover, across geographies.
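A sketch of such a mixed replica set (hostnames hypothetical): the in-memory members take the traffic, while a WiredTiger member maintains a durable, replicated copy and can never become primary:

  rs.initiate({
    _id: "sessions",
    members: [
      { _id: 0, host: "mem1.example.net:27017", priority: 2 },              // In-Memory SE
      { _id: 1, host: "mem2.example.net:27017", priority: 1 },              // In-Memory SE
      { _id: 2, host: "wt1.example.net:27017", priority: 0, hidden: true }  // WiredTiger: durable copy
    ]
  })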
Advanced Personalization
[Diagram: an Operational Unified View on MongoDB, fed by the platform databases (NoSQL/RDBMS) and dependent external data sources and integrations: CRM, ERP, PIM, partner sources (supplier databases, etc.), and a legacy mainframe.]
1. Train/re-train ML models
2. Apply the models to a real-time stream of interactions
3. Drive targeted content, recommendations, etc.
Why Spark?
Speed. By exploiting in-memory optimizations, Spark
has shown up to 100x higher performance than
MapReduce running on Hadoop.
Simplicity. Easy-to-use APIs for operating on large
datasets. This includes a collection of sophisticated
operators for transforming and manipulating
semi-structured data.
Unified Framework. Packaged with higher-level libraries,
including support for SQL queries, machine learning,
stream and graph processing. These standard libraries
increase developer productivity and can be combined to
create complex workflows.
Operational Single View + Spark Connector
• Native Scala connector, certified by Databricks
• Exposes all Spark APIs & libraries
• Efficient data filtering with predicate pushdown, secondary indexes, & in-database aggregations (sketched below)
• Locality awareness to reduce data movement
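To make “predicate pushdown and in-database aggregation” concrete: instead of shipping whole collections to Spark, the connector can prepend an aggregation pipeline so MongoDB filters and projects first. An illustrative pipeline (collection and field names are hypothetical):

  // Filtering and projection run inside MongoDB, so only the needed fields
  // of matching documents ever travel to the Spark executors:
  db.interactions.aggregate([
    { $match: { channel: "mobile", ts: { $gte: ISODate("2016-01-01") } } },
    { $project: { _id: 0, customerId: 1, event: 1, ts: 1 } }
  ])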
Locality Awareness
[Diagram: the standard Spark topology: a driver program holding the SparkContext, a cluster manager, and tasks scheduled out to workers. Locality awareness schedules tasks near the MongoDB data they read, reducing data movement.]
Operational Single View + Spark Connector
Blend client data from multiple internal and external sources to drive real-time campaign optimization.
MongoDB + Spark at China Eastern
• 180M fare calculations & 1.6 billion searches per day.
• Their Oracle database peaked at 200 searches per second.
• Radically re-architected the fare engine to meet the required 100x growth in search traffic.
ETL: (Yesterday’s) Data at the Speed of Thought?
BI Connector

db.orders.aggregate([
  { $group: {
      _id: null,
      total: { $sum: "$price" }
  } }
])

SELECT SUM(price) AS total
FROM orders
Resources for You

Spark Connector
• Download: Spark Packages, GitHub
• Documentation
• Whitepaper: Turning Analytics into Real-Time Action
• Education: M233: Getting Started with Spark and MongoDB

In-Memory Storage Engine
• Download: Enterprise Server
• Documentation

BI Connector
• Download: BI Connector
• Documentation
Dylan Tong
Principal Solutions Architect
dylan.tong@mongodb.com
Q&A
Editor's Notes
  • #3 Put simply, there are two big questions that I think define and drive in-memory computing: How can we process data as fast as possible by leveraging in-memory speed at its best? Secondly, what are the possibilities if we could?
  • #4 Why do we care about speed? It matters in a lot of cases… In the financial world, it matters in areas like high-frequency trading, which is estimated to have accounted for 50-70% of trades over the past 5 years. HFT platforms transact a large number of orders at very fast speeds, and often use complex algorithms to analyze multiple markets and market conditions. Typically, the traders with the fastest execution speeds are more profitable than traders with slower execution speeds.
  • #5 Research by enterprises and analysts correlating performance, online experience and revenue is well documented. I list a few findings here from some analysts and Amazon, but there are other public studies from Google and Walmart demonstrating the same. The well-known Aberdeen Group study found that a 1-second delay in page load time equals 11% fewer page views, a 16% decrease in customer satisfaction, and a 7% loss in conversions. Translated to dollars, if your business earns just $100,000 a day, this equates to $2.5M in potential sales annually; faster is better. Slow online experiences translate to lost opportunities, and we as users and consumers can relate.
  • #6 So, how fast is in-memory? Here are the rough units that best measure data access times across different storage mediums. (Click) If we normalize to 1 s, it is clear that the difference in speed between RAM and even fast SSD storage is drastic.
  • #7 Some may already be nodding their heads… RAM isn't new technology, and we're aware that the price of RAM has dropped drastically over the decades. By 2010, the sharp decline in average cost had made RAM “generally affordable” for mainstream use; however, it is far from cheap, especially when we consider the data volumes that we work with today.
  • #8 However, prices continue to fall, and an average price of $4.37/GB in 2015 makes RAM an option even at scale for greenfield projects that need the speed.
  • #9 IoT is certainly not a space short of innovation and possibilities, and the ability to scale in-memory performance only makes the possibilities more exciting. I came across an article where Audi discusses plans for their connected self-driving car: they intend to send data collected from the car's sensors back to the cloud, process it with machine learning, and send the results back to the car so it can learn and better adapt to complex situations. “…machine learning it will mean adverse weather conditions, such as snow, which can affect sensors will be less of a problem as cars will have a thorough understanding of the piece of tarmac it is traversing.” Consider the future: the scale of every vehicle on the road, and the amount of collected data that needs to be processed. In-memory computing solutions will be needed to process big data fast, especially in the world of smart cars, where information will drive important decisions in real time.
  • #10 Despite the significant increase in the amount of RAM you can put on a single server in the past couple of years, there are still limits, and the data volumes we work with today continue to grow due to the type of applications we build and the type of data sources we analyze and mine. For many organizations, the bulk of workloads are being moved to or are already in the cloud, and the ability to scale on cloud infrastructure is critical. The ability to scale out and fit large datasets in RAM across servers is critical; if not for data volume, then for the compute to support large-scale services in the cloud.
  • #11 We previously discussed how cost has dropped dramatically, and while in-memory is an option at scale, it can still be cost-prohibitive for certain projects. Consider AWS's X1 instance: it impressively provides nearly 2 TB of RAM, but at a hefty price. At a scale of 100 TB, $1.74M per year just for infrastructure isn't an option for certain projects. The question is: does the problem really require all your data to be in RAM?
  • #12 While memory is magnitudes faster than other storage mediums, the difference in relative cost is also significant. With that said, in-memory solutions shouldn't be designed around needing your enterprise data architecture, or even a single application, to run entirely in-memory. The value of the data and the problem you're solving should dictate the right medium, and an in-memory solution should integrate seamlessly into an Enterprise Data Architecture that supports all storage mediums.
  • #13 Generally, when we talk about memory we refer to what is readily available: volatile memory. If your server goes down, the data stored in that server's RAM is lost unless it has also been written to durable storage like disk. Trading off data loss for speed is, in most use cases, unacceptable. A good in-memory solution needs to provide fault tolerance and synchronize with durable storage, and, just as importantly, do so simply and reliably (which often isn't the case for some solutions, like external distributed caches).
  • #14 As fast as RAM is, it doesn't remedy bad design. More importantly, any in-memory computing technology shouldn't introduce new bottlenecks into the architecture or limit your data architecture's ability to address the biggest performance bottlenecks in your system. For instance: Does your in-memory computing solution require you to move large volumes of data around, and if so, is that creating bottlenecks in other ways? How does your solution bring data into RAM: is there an efficient caching algorithm, and is relevant data selected and filtered efficiently? How is your data processed in RAM: is there an efficient algorithm, or is it introducing inefficiencies and new performance bottlenecks by shuffling data unnecessarily across a distributed system?
  • #15 So now that we understand the challenges and core requirements around introducing in-memory technologies into your Enterprise Data Architecture, let's look at how MongoDB fits into the big picture and what it can offer in this area.
  • #19 Let's home in on the product catalog and customer session management parts of the system, where the problem is clearest. The customer session management component is key to driving customer experience features like personalization, and effective personalization needs to be based on a full picture of the customer. Realistically, in an enterprise, customer touchpoints and information are siloed across many systems, and rarely is there one place where an operational system can get everything it needs to know about the customer. Likewise, information about products will be siloed: perhaps some is stored within the eCommerce platform, but it likely has to be synchronized with external systems like PIMs and supplier systems. Additionally, a modern platform should keep availability up to date as part of product search, so problems aren't caused downstream in order fulfillment. Finally, business analysts need to analyze the same data sources. Consolidating these systems isn't realistic, so integration is necessary, and ideally it shouldn't involve heavy redundancy, for instance across operational and BI environments. Federated access to these systems isn't an option on many fronts due to performance and scale, and sufficient integration of data into the DW via traditional ETL is a huge effort and likely too slow to make happen.
  • #32 This component would be well served by MongoDB, and in fact, is one of the most common use cases for MongoDB.
  • #33 This component would be well served by MongoDB, and in fact, is one of the most common use cases for MongoDB.