The Rise of Streaming SQL
Sriskandarajah Suhothayan
Director, WSO2
What is Streaming Data?
A series of events/data having the same schema/format
appearing continuously
Coke 24 Fanta 14 Sprite 20 Coke 4
<coke>24</coke> <fanta>14</fanta> <sprite>20</sprite> <coke>4</coke>
Almost All Data is Streaming!
All data is generated one by one,
hence batch data is at one point streaming
● Logs
● Transaction data
● Sensor data
● Traffic data
Data is streaming at
the source!
Streaming Data Processing
● Process data at the source or process before we store
● Identify insights in real-time and act immediately
● Reduce unnecessary data storage and batch processing
[Diagram: Logs, Sensors, Devices, Apps, Services → Stream Processing → Alerts, Dashboards, Services, Databases]
Streaming Data Processing Operations
● Event driven architecture
● Streaming data integration
● Streaming data preprocessing
● Data store integration
● Service integration
● Streaming data summarization
● KPI analysis and alerts
● Event correlation
● Pattern matching
● Trend analysis
● Real-time prediction
● Streaming machine learning
● … more
Stream Processing Market
Positives
● Analytics and machine learning use cases shifting to stream processing
● Positive trends
○ Microservices and observability
○ Rise of IoT
○ Security analytics
○ ETL and messaging
Negatives
● Lack of proficient developers is slowing it down
● Success depends on the success of the analytics and integration market
● Market size
○ 300 ~ 500 million having 30%
Building Streaming Apps
1. Code it yourself
+ Customized for your requirement
− A lot of glue code needs to be written
2. Stream Processors
+ Code only actors and data handlers
+ Can scale and handle failure
− Hard to maintain and change
3. Graphical Tools
+ Good for novice users & can visualize the topology
− Inefficient for advanced users
4. Streaming SQL
+ Good for advanced users
+ Easier to understand and faster to implement
− Not easy to visualize the topology
History of Stream Processing
Databases: Users query when they need data
Active Databases: Users want to act when data meets a condition
TelegraphCQ (based on PostgreSQL):
Long-running continuous queries over data streams
Complex Event Processing:
Detects complex event patterns and correlations;
runs on 1 or 2 nodes & is not scalable
E.g. SASE, Esper, Cayuga, and Siddhi (powers WSO2 SP),
Apama, IBM InfoSphere
Stream Processing:
Scalable processing of data using a graph of actors;
runs on many nodes & scales
E.g. Aurora, PIPES, STREAM, Borealis (academic)
Niche Applications:
Stock markets, monitoring and alerts, & surveillance
Stream Processing Enters Big Data:
Yahoo S4 (2010) and Twitter Storm (2011), both donated to Apache
Described as “like Hadoop, but in real-time”
Wide adoption and visibility:
Spark Streaming, Samza, Flink
Big Data Switched to SQL:
From code-based MapReduce
Stream Processing + CEP Merge:
Support SQL over many nodes in real-time
Streaming SQL:
Apache Storm, Apache Flink, WSO2 SP, Apache Kafka (KSQL), Apache Samza and Calcite
Streaming SQL
SQL vs Streaming SQL
SQL
● Works on a finite data table
● Queries run over static data
● Synchronous response
Streaming SQL
● Works on an infinite data table == data stream
● Data runs over static queries
● Asynchronous response
[Diagram: in SQL, a query runs over stored data; in streaming SQL, data flows through a standing query]
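The contrast above can be sketched in a few lines of Python. This is only an illustration of the idea, not Siddhi or any product API: the function names `batch_query` and `streaming_query`, the threshold, and the use of `itertools.cycle` to stand in for an endless stream are all assumptions made for the example.

```python
import itertools

def batch_query(table, threshold):
    """SQL style: the query runs once over a finite table and returns one answer."""
    return [row for row in table if row[1] >= threshold]

def streaming_query(stream, threshold):
    """Streaming SQL style: the query stays fixed and events flow through it.
    Results are emitted as matching events arrive, so the input may never end."""
    for event in stream:
        if event[1] >= threshold:
            yield event

# Finite table, using the sample events from the opening slide.
table = [("Coke", 24), ("Fanta", 14), ("Sprite", 20), ("Coke", 4)]
batch_result = batch_query(table, 20)

# Hypothetical endless stream: the same events repeated forever.
endless_stream = itertools.cycle(table)
# The standing query never finishes; we only take the first three results.
first_three = list(itertools.islice(streaming_query(endless_stream, 20), 3))

print(batch_result)
print(first_three)
```

The batch function needs the whole table before it can answer, while the generator answers per event and works even though `endless_stream` never terminates.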
Siddhi Streaming SQL Overview
@app:name('Sweet-Factory-Analytics')

-- Source/Sink & Streams
@source(type = 'mqtt', …, @map(type = 'json', …))
define stream SweetProductionStream(name string, amount double);

-- Queries
from SweetProductionStream[amount < 100 and name == 'candy']
select name, sum(amount) as cost
group by name
insert into LowCostCandyProductionStream;

-- Tables
@store(type = 'rdbms', …)
@primaryKey('id')
@Index('amount')
define table ProductionTable(name string, cost double);
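To make the query's behavior concrete, here is a rough Python equivalent of the filter, group by, and running sum. It is a sketch under assumptions, not how Siddhi executes the query: events are assumed to be `(name, amount)` tuples, the function name `low_cost_candy` is made up, and the sum is assumed to accumulate over all events seen so far because the query defines no window.

```python
from collections import defaultdict

def low_cost_candy(events):
    """Mimics: from SweetProductionStream[amount < 100 and name == 'candy']
               select name, sum(amount) as cost group by name"""
    totals = defaultdict(float)  # running sum per name (the group-by state)
    for name, amount in events:
        # Filter: only low-amount candy events pass.
        if amount < 100 and name == "candy":
            totals[name] += amount
            # One output event per matching input event.
            yield (name, totals[name])

# Hypothetical input events.
production = [
    ("candy", 40.0),
    ("toffee", 30.0),   # dropped: not candy
    ("candy", 150.0),   # dropped: amount is not below 100
    ("candy", 25.0),
]
output = list(low_cost_candy(production))
print(output)
```

The dictionary `totals` is the state the query has to keep between events, which is why state handling shows up later as a challenge.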
Challenges
In streaming SQL
● Not easy to visualize the topology
In stream processing
● Inability to handle state
● Needs multiple nodes
● Does not support online machine learning
● Does not support long running aggregates in real-time
WSO2 Stream Processor
How Does WSO2 Stream Processor Solve Them?
Challenge: Not Easy to Visualize Topology
● Graphical stream SQL query editor
● Drag & drop support
● Switch to source & design
Challenge: Handle State & Need for Multi Nodes
● 2 node minimum HA
○ Process up to 100k events/sec
○ While most other stream processing systems need around 5+ nodes
● Scale more with Kafka
● Incremental state persistence and recovery
[Diagram labels: Event Sources; Stream Processor (×2); Siddhi App (×6); Event Store; Dashboard, Notification, Invocation, Data Source]
Challenge: Lack of Knowledge About Future
Running PMML models for predictions
● Build PMML models via Apache Spark MLlib, H2O.ai, R or Python
● Load the built PMML model into Siddhi and predict in real-time
Supporting native prediction models:
● Spark MLlib models and Java-based TensorFlow models
Online learning and predictions
● Regression analytics
● Markov models
● Anomaly detection
● K-Means clustering
● … more
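As an illustration of what online learning on a stream means, the sketch below flags anomalies while updating its model one event at a time. It is not one of the Siddhi extensions listed on the slide: the class name, the 3-standard-deviation rule, the warm-up length, and the sample readings are all assumptions, and the running statistics use Welford's method.

```python
import math

class OnlineAnomalyDetector:
    """Keeps a running mean and variance (Welford's method) and flags values
    far from the mean. State is constant-size, so no history is stored."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # allowed distance, in standard deviations
        self.warmup = warmup        # events to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations from the mean

    def update(self, x):
        """Score x against the model learned so far, then learn from x."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                is_anomaly = True
        # Welford update: fold the new value into mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

# Hypothetical sensor readings with one obvious spike.
readings = [10, 11, 9, 10, 12, 10, 50, 11]
detector = OnlineAnomalyDetector()
flags = [detector.update(r) for r in readings]
print(flags)
```

Note that this sketch also learns from the values it flags, which widens the variance after a spike; a production detector might choose to exclude them.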
Challenge: Cannot Run Long Running Aggregates
● Incremental aggregation
○ Aggregation for every second, minute, hour, … , year
● Built on top of architecture
● No big data storage is necessary
● Current values in memory and others on disk
● Executed in a single query
[Diagram: per-second values roll up into the current minute and the current hour (Sec → Min → Hour)]
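The roll-up idea can be sketched as follows: only the current bucket of each granularity is held in memory, and a finished bucket is folded into the next coarser level. This is a simplified illustration, not Siddhi's implementation. It assumes events arrive in timestamp order, uses integer epoch seconds, stops at hours, and keeps finished buckets in a list where a real system would write them to disk; all names are hypothetical.

```python
# Granularities from finest to coarsest, with bucket width in seconds.
LEVELS = [("sec", 1), ("min", 60), ("hour", 3600)]

class IncrementalSum:
    def __init__(self):
        # current[i] holds [bucket_start, running_sum] for level i, or None.
        self.current = [None] * len(LEVELS)
        # Finished buckets per level (stand-in for on-disk storage).
        self.closed = {name: [] for name, _ in LEVELS}

    def add(self, timestamp, value):
        """Feed one event; assumes timestamps never go backwards."""
        self._add(0, timestamp, value)

    def _add(self, level, timestamp, value):
        name, width = LEVELS[level]
        start = timestamp - timestamp % width
        bucket = self.current[level]
        if bucket is not None and bucket[0] != start:
            # The bucket is complete: store it and roll it up one level.
            self.closed[name].append(tuple(bucket))
            if level + 1 < len(LEVELS):
                self._add(level + 1, bucket[0], bucket[1])
            bucket = None
        if bucket is None:
            bucket = [start, 0.0]
            self.current[level] = bucket
        bucket[1] += value

agg = IncrementalSum()
for ts, amount in [(0, 5), (0, 3), (1, 2), (61, 4), (3700, 1)]:
    agg.add(ts, amount)

print(agg.closed["sec"])
print(agg.closed["min"])
print(agg.current)
```

Because each level is built from the level below it, a query for yearly totals never has to rescan raw events, which is the point of the "no big data storage is necessary" bullet.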
When to Use WSO2 Stream Processor
1. Start with 2 nodes and scale without changing queries
2. Detect complex event patterns over time
3. Run machine learning models to perform online learning
4. Fuse data in motion and data at rest
5. Perform aggregations from seconds to years
6. Let end users tweak queries
7. Achieve real-time ETL
8. Run rule-based decision making
9. … more
THANK YOU
wso2.com
