Performing Real-Time Analytics
with In-Memory Data Grids
Copyright © 2013 by ScaleOut Software, Inc.
Cloud Expo
June 10, 2013
Mikhail Sobolev (sobolev@scaleoutsoftware.com)
David Brinker (daveb@scaleoutsoftware.com)
2 ScaleOut Software, Inc.
• What is an In-Memory Data Grid (IMDG)?
• Top Benefits of IMDGs
• The Need for Real-Time Analytics
• Example: A Platform for Managing Hedging Strategies
• Using an IMDG to Perform Real-Time Analysis
• Benchmark Results
• Integrating an IMDG into Hadoop
Agenda
• Dr. Mikhail Sobolev, Lead Java Architect
• Ph.D. from Moscow Institute of Physics and Technology
• Research and consulting focus in parallel computing
• Responsible for development of scalable software services in Java
• David Brinker, COO
• 20 years of software business and executive management experience
• Mentor Graphics, Cadence, Webridge
• Company: ScaleOut Software
• Develops and markets IMDG products
• Founded in September 2003
• Offices in Bellevue, WA and Beaverton, OR
• Eight years of market experience on Windows
& Linux
About the Speakers
• ScaleOut StateServer®
• Flagship product
• IMDG middleware for Windows
and Linux
• Industry-leading performance and ease of use
• ScaleOut GeoServer® adds
• WAN-based data replication for DR
• Breakthrough technology for global
data access
• ScaleOut Analytics Server™ adds
• Real-time data analysis for operational data
• Comprehensive management tools
• ScaleOut hServer™ adds
• First step toward real-time analytics for Hadoop
• Accelerates data access and execution.
ScaleOut Software Products
[Diagram: ScaleOut StateServer In-Memory Data Grid, shown as four Grid Service instances]
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
• Stores collections of Java/.NET
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
What is an In-Memory Data Grid?
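The create/read/update/delete access pattern and the cluster behavior described on this slide can be sketched in a few lines. This is only a toy model under stated assumptions, not ScaleOut's API: the class `TinyGrid`, its method names, and the "one replica on the next server" placement rule are all hypothetical, and plain dictionaries stand in for each server's memory.

```python
import zlib


class TinyGrid:
    """Toy grid (hypothetical): objects are hash-partitioned across servers,
    and each object is also copied to one replica server."""

    def __init__(self, server_names):
        self.names = list(server_names)
        # One dict per server stands in for that server's memory.
        self.stores = {name: {} for name in self.names}

    def _owners(self, key):
        # A stable hash maps the same key to the same servers every time.
        i = zlib.crc32(key.encode("utf-8")) % len(self.names)
        return self.names[i], self.names[(i + 1) % len(self.names)]

    def create(self, key, obj):
        for name in self._owners(key):
            self.stores[name][key] = obj

    def update(self, key, obj):
        if self.read(key) is None:
            raise KeyError(key)
        self.create(key, obj)

    def read(self, key):
        # Try the primary first, then fall back to the replica.
        for name in self._owners(key):
            if key in self.stores[name]:
                return self.stores[name][key]
        return None

    def delete(self, key):
        for name in self._owners(key):
            self.stores[name].pop(key, None)

    def fail_server(self, name):
        # Simulate a server failure by discarding its memory.
        self.stores[name] = {}


grid = TinyGrid(["server-1", "server-2", "server-3"])
grid.create("strategy:energy", {"target": 0.25})
primary, _ = grid._owners("strategy:energy")
grid.fail_server(primary)
print(grid.read("strategy:energy"))  # still readable from the replica
```

Adding a server name to the list adds storage capacity in this model; a real grid would also rebalance existing objects, which the sketch leaves out.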
Scaling Data Access Using an IMDG
Example: Cloud-Hosted App
• Application runs as multiple virtual
servers (VS).
• Application instances store and
retrieve LOB data from cloud-based
file system or database.
• Applications need fast, scalable
storage for live data.
• In-memory data grid runs as
multiple virtual servers to provide
“elastic” in-memory storage for
live data.
• As a “vertical” storage tier:
• Runs as middleware software.
• Adds missing storage layer to boost
performance.
• Uses out-of-process memory.
• Avoids repeated trips to a backing store.
Where IMDGs Are Deployed
[Diagram labels, shown for two servers: Processor Cache; L2 Cache; Application Memory (“In-Process”); Backing Storage]
• As a “horizontal” storage tier:
• Allows data sharing among servers.
• Scales performance & capacity.
• Adds high availability.
• Can be used independently of backing
storage.
[Diagram labels, shown for two servers: In-Memory Data Grid (“Out-of-Process”)]
• IMDG incorporates a client-side in-process
cache (“near cache”):
• Transparent to the application
• Holds recently accessed data
• Boosts performance:
• Eliminates repeated network data transfers &
deserialization
• Reduces access times to near “in-process”
latency
• Is automatically updated if the grid is
updated
• Supports various coherency models
(coherent, polled, event-driven)
The Secret to Fast Access Time
[Diagram labels: Application Memory (“In-Process”); Client-side Cache (“In-Process”); In-Memory Data Grid (“Out-of-Process”)]
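A client-side cache of this kind can be sketched as follows. The sketch assumes a version-check coherency scheme, which is one plausible reading of the "coherent" model and not necessarily how the product implements it; `GridServer`, `NearCacheClient`, and the `fetches` counter are hypothetical names, and `pickle` stands in for whatever serialization a real grid uses.

```python
import pickle


class GridServer:
    """Stand-in for the out-of-process grid: stores serialized bytes and a version."""

    def __init__(self):
        self._data = {}  # key -> (version, serialized bytes)

    def put(self, key, obj):
        version = self._data.get(key, (0, b""))[0] + 1
        self._data[key] = (version, pickle.dumps(obj))

    def version(self, key):
        return self._data[key][0]

    def fetch(self, key):
        return self._data[key]


class NearCacheClient:
    """Keeps deserialized objects in process and refetches only when stale."""

    def __init__(self, server):
        self.server = server
        self._cache = {}   # key -> (version, deserialized object)
        self.fetches = 0   # counts full transfers plus deserializations

    def read(self, key):
        current = self.server.version(key)  # small check instead of a full transfer
        cached = self._cache.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]
        version, blob = self.server.fetch(key)
        self.fetches += 1
        obj = pickle.loads(blob)
        self._cache[key] = (version, obj)
        return obj


server = GridServer()
client = NearCacheClient(server)
server.put("quote:XYZ", {"price": 10})
client.read("quote:XYZ")
client.read("quote:XYZ")            # served from the in-process cache
server.put("quote:XYZ", {"price": 11})
print(client.read("quote:XYZ"), client.fetches)
```

The second read avoids both the data transfer and the deserialization; only the update forces a new fetch.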
• IMDGs enable seamless data access across on-premise sites and
cloud-based deployments:
• Automatically access
remote data as needed.
• Efficiently manage
WAN bandwidth.
• Enable full data
coherency across sites.
• Support multiple usage
models:
• Replication for DR
• Remote access
• Synchronized read/write
Global Data Integration
• IMDG bridges on-premise and cloud-based in-memory storage of
Web session state.
• IMDG automatically migrates session-state objects into the cloud
on demand.
• This enables seamless access to data across multiple sites.
Example: Web Farm Cloud-Bursting
An in-memory data grid is middleware software that provides:
1. Fast access time for fast-changing, “live” data
2. Scalable throughput and storage capacity to match a
growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or
network link) fails
4. Shared access to data
across the server farm
5. Global data access across
multiple sites and the cloud
6. And … fast data analysis
for quickly and easily mining
data using “map/reduce”
Top Benefits of IMDGs
[Chart: Access Latency vs. Throughput for Grid and DBMS; annotations: “Faster”, “Scales”]
• Traditional “big data” analysis
platforms analyze offline data:
• Example: Hadoop
• Very large, static datasets
• Data is often copied from other
disk-based storage systems to a
distributed file system for analysis.
• IMDGs store and analyze online data:
• Fast-changing, operational data
• Data storage is memory-based.
• Data motion is minimized for fast,
continuous analysis.
IMDGs Analyze Live Data
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
Online Systems Need Real-Time Analysis
A platform for managing hedging strategies:
• A hedge fund manages a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains a list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• Challenge: The hedge fund must be able to quickly update and
analyze its hedging strategies and provide alerts to traders.
Example in Financial Services
• Deliver a stream of alerts to traders
within a few seconds.
• Enable the trader to examine strategy details in real time:
The Result: Real-Time Alerts
• The IMDG holds the set of strategy objects as an in-memory collection.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated map/reduce
analysis on hedging
strategies every
second.
• Each analysis iteration both updates
and analyzes every strategy object.
• The IMDG collects alerts after each
analysis and delivers them to the
trader.
The Solution: Real-Time Analytics
Using an IMDG
• Analyze every selected strategy object in parallel within the IMDG:
• Update the strategy’s positions with latest market prices.
• Evaluate the strategy’s rules to see if a trade is needed.
• Example: Alert if current allocation exceeds target threshold.
• Generate an alert if holdings need to be changed.
• Merge the results across all strategy objects to create a set of
alerts.
The Analysis Algorithm
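The algorithm on this slide maps naturally onto an "analyze each object, then merge" pair of functions. The sketch below is illustrative only: the `Strategy` fields, the `tolerance` parameter, and the rule "alert when the current allocation exceeds the target by more than the tolerance" are assumptions chosen to match the slide's example, not the presenters' actual data model.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Strategy:
    name: str
    shares: dict             # ticker -> number of shares held
    targets: dict            # ticker -> target fraction of total value
    tolerance: float = 0.05  # hypothetical threshold above the target
    prices: dict = field(default_factory=dict)  # last known price per ticker


def analyze(strategy, market_prices):
    """Map step: update one strategy's prices and return its alerts."""
    for ticker in strategy.shares:
        if ticker in market_prices:
            strategy.prices[ticker] = market_prices[ticker]
    values = {t: n * strategy.prices[t] for t, n in strategy.shares.items()}
    total = sum(values.values())
    if total == 0:
        return []
    alerts = []
    for ticker, value in values.items():
        allocation = value / total
        if allocation > strategy.targets[ticker] + strategy.tolerance:
            alerts.append((strategy.name, ticker, round(allocation, 3)))
    return alerts


def merge(alerts_a, alerts_b):
    """Reduce step: combine two alert lists into one."""
    return alerts_a + alerts_b


def run_analysis(strategies, market_prices):
    return reduce(merge, (analyze(s, market_prices) for s in strategies), [])


tech = Strategy("tech", {"AAA": 10, "BBB": 10}, {"AAA": 0.5, "BBB": 0.5},
                prices={"AAA": 10.0, "BBB": 10.0})
print(run_analysis([tech], {"AAA": 30.0}))
```

Because `merge` is associative, the alert lists can be combined in any grouping, which is what lets a grid merge per server first and across servers afterwards.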
Shipping Analysis Code to the IMDG
• IMDG creates Java or .NET execution environment for analysis:
• Spans all IMDG servers.
• Ensures tight integration with memory-based data storage.
• IMDG client ships jars/assemblies to IMDG servers for execution:
• Keeps development model simple.
• Optionally allows pre-staging for multiple runs to shorten startup time.
• Optionally allows automatic re-staging if code changes between runs.
• Client starts analysis:
• Sends invocation to
the IMDG.
• IMDG returns
analysis results.
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the
collection with a parallel query run on all grid servers.
• Note: Query spec matches data’s object-oriented properties.
Running the Analysis
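Step 1 can be pictured as the same property filter running on every server's partition at once. In this minimal sketch, each partition is a plain dictionary, objects are dictionaries of properties, and the function names `query_partition` and `parallel_query` are hypothetical; a thread pool stands in for the grid servers.

```python
from concurrent.futures import ThreadPoolExecutor


def query_partition(partition, **criteria):
    """Return the keys of objects on one server whose properties match all criteria."""
    return [key for key, obj in partition.items()
            if all(obj.get(prop) == want for prop, want in criteria.items())]


def parallel_query(partitions, **criteria):
    """Run the same query on every partition concurrently and combine the keys."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_server = pool.map(lambda p: query_partition(p, **criteria), partitions)
        return sorted(key for keys in per_server for key in keys)


partitions = [
    {"s1": {"sector": "energy", "active": True},
     "s2": {"sector": "tech", "active": True}},
    {"s3": {"sector": "energy", "active": False},
     "s4": {"sector": "energy", "active": True}},
]
print(parallel_query(partitions, sector="energy", active=True))
```

The query is expressed in terms of object properties (`sector`, `active`) rather than storage keys, which mirrors the note about query specs matching the data's object-oriented properties.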
• Step 2: The IMDG automatically schedules analysis operations
across all grid servers and cores.
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
Running the Analysis: Step 2
• File-based map/reduce must move data to memory for analysis:
• IMDG’s memory-based computation engine analyzes data in place:
IMDG Minimizes Data Motion
[Diagram labels: File System / Database holding data items (D); Server Memory; three M/R Servers, each marked E; In-Memory Data Grid with three Grid Servers, each marked E and holding data items (D)]
• Step 3: The IMDG automatically merges all analysis results.
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
trader’s display as one
object.
Running the Analysis: Step 3
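The local analysis in Step 2 and the two-level merge in Step 3 can be sketched together. This is a simplified model, not the product's scheduler: each inner list stands for the objects stored on one grid server, a thread pool stands in for the servers, and `analyze_server` and `run_on_grid` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def analyze_server(local_objects, analyze, merge, empty):
    """Analyze one server's locally stored objects and merge the results locally."""
    return reduce(merge, (analyze(obj) for obj in local_objects), empty)


def run_on_grid(servers, analyze, merge, empty):
    """Run the per-server work in parallel, then merge across servers."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(
            lambda objs: analyze_server(objs, analyze, merge, empty), servers))
    # Final merge: one partial result per server becomes one combined result.
    return reduce(merge, partials, empty)


# Toy workload: "alert" on any value above 5.
servers = [[1, 8, 3], [9, 2], []]
result = run_on_grid(
    servers,
    analyze=lambda x: [x] if x > 5 else [],
    merge=lambda a, b: a + b,
    empty=[],
)
print(result)
```

Only one partial result per server crosses the network in this model, so the amount of data moved depends on the number of servers and the size of the results, not on the size of the data set.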
Running a similar analysis algorithm (stock back-testing) within an
IMDG:
• IMDG hosted in Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles continuous stream of updates (1.1 GB/s) while
performing real-time analysis on live data.
• Entire data set analyzed in
4.1 seconds (250 GB/s).
• IMDG scales linearly by
adding servers as
workload grows.
Benchmark Results
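The headline figures can be cross-checked with simple arithmetic. The calculation below assumes that "1 TB" means 1024 GB and that data and load are spread evenly over the 75 servers; neither assumption is stated on the slide, and the per-server numbers are derived here rather than reported by the presenters.

```python
SERVERS = 75
DATA_GB = 1024           # assumption: 1 TB taken as 1024 GB
SCAN_SECONDS = 4.1
UPDATE_GB_PER_S = 1.1

aggregate_gb_per_s = DATA_GB / SCAN_SECONDS           # whole-grid analysis rate
per_server_gb = DATA_GB / SERVERS                     # data held per server
per_server_gb_per_s = aggregate_gb_per_s / SERVERS    # analysis rate per server
per_server_update_mb_per_s = UPDATE_GB_PER_S * 1024 / SERVERS

print(f"aggregate analysis rate: {aggregate_gb_per_s:.0f} GB/s")
print(f"data per server:         {per_server_gb:.1f} GB")
print(f"analysis per server:     {per_server_gb_per_s:.1f} GB/s")
print(f"updates per server:      {per_server_update_mb_per_s:.1f} MB/s")
```

Under these assumptions the aggregate rate comes out at roughly 250 GB/s, consistent with the slide, and each server scans a little over 3 GB/s of its own memory.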
• Typically used for very large, static, offline datasets
• Data is held on disk in a file system (HDFS) or DBMS
• Data is often copied from other disk-based storage systems to
HDFS for analysis.
Problem: Hadoop Cannot Efficiently
Perform Real-Time Analytics
Comparison of IMDGs and Hadoop
                       IMDG                          Hadoop
Data set size          Gigabytes to terabytes        Terabytes to petabytes
Data repository        In-memory                     File / database
Data view              Queried object collection     File-based key/value pairs
Development time       Low                           High
Automatic scalability  Yes                           Application dependent
Best use               Real-time analysis of live,   Batch analysis of large,
                       memory-based data             static datasets
I/O overhead           Low                           High
Cluster mgt.           Simple                        Complex
High availability      Memory-based                  File-based
• Survey result from Strata 2013: 93% of Hadoop users would
benefit from real-time data analytics.
• Strategy: Integrate IMDG into Hadoop.
• How:
• Stage data in IMDG for fast access.
• Thereby allow updates to data during
Hadoop execution.
• Automatically retrieve
data from HDFS as
necessary.
• Enable unchanged
Hadoop program
structure.
• Combine scalability
of Hadoop map/reduce
and IMDG.
Enabling Hadoop to Perform
Real-Time Analysis
• IMDG adds Hadoop grid record
reader for accessing key/value
pairs held in the IMDG.
• Hadoop programs can optionally
output results to IMDG with grid
record writer.
• Applications can access and update
key/value pairs as live data during
analysis.
• Grid record reader and writer
optimize access to key/value pairs
to eliminate network overhead.
Accessing IMDG Data in Hadoop
• IMDG adds wrapper for HDFS record reader to cache HDFS data
during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent runs.
• Wrapper accesses IMDG to
store and retrieve data
with minimum network
overhead.
• Useful in multiple “what-if”
analyses on one data set
• Tests with the TeraSort
benchmark have
demonstrated 11X
lower access latency
than HDFS without the IMDG.
Using IMDG as an HDFS Cache
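The caching wrapper described on this slide follows a read-through pattern: the first run reads from HDFS and stores what it read, and later runs are served from the grid. The sketch below shows only that pattern and does not use the Hadoop `RecordReader` API; `CachingRecordReader`, the `read_from_hdfs` callable, and the dictionary standing in for the grid are all hypothetical.

```python
class CachingRecordReader:
    """Wraps a slow record source; the first full pass fills the grid,
    and later passes over the same split are served from it."""

    def __init__(self, read_from_hdfs, grid):
        self.read_from_hdfs = read_from_hdfs  # callable: split_id -> iterable of (key, value)
        self.grid = grid                      # dict standing in for the IMDG

    def records(self, split_id):
        if split_id in self.grid:
            # Cache hit: no trip to the underlying file system.
            yield from self.grid[split_id]
            return
        cached = []
        for record in self.read_from_hdfs(split_id):
            cached.append(record)
            yield record
        # Store the split only after it has been read completely.
        self.grid[split_id] = cached


hdfs_reads = []

def fake_hdfs(split_id):
    hdfs_reads.append(split_id)  # record each trip to the slow store
    return [("a", 1), ("b", 2)]

reader = CachingRecordReader(fake_hdfs, {})
first_run = list(reader.records("split-0"))
second_run = list(reader.records("split-0"))
print(first_run == second_run, hdfs_reads)
```

The calling code iterates over `records()` the same way on every run, which reflects the goal of leaving the Hadoop program's structure unchanged while repeated "what-if" runs skip the file system.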
• IMDGs use in-memory storage to scale access to data for
applications which process live, fast-changing data.
• IMDGs can be deployed in the cloud and provide global data
integration across sites.
• Many applications need to
perform real-time analytics
on live data.
• IMDGs can meet this need,
delivering results in seconds
instead of minutes or hours.
• Hadoop was not designed for
real-time analytics, but…
• IMDGs can enable Hadoop to accelerate access to data.
Summary
In-Memory Data Grids for
Server Farms & Cloud Computing
www.scaleoutsoftware.com

