Getting Insights from IoT data
with Apache Spark &
Apache Bahir
Luciano Resende
June 20th, 2018
1
About me - Luciano Resende
2
Data Science Platform Architect – IBM – CODAIT
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache
Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
4
Open Source Leadership & Contributions
IBM generated open source innovation
• 137 Code Open (dWO) projects w/1000+ GitHub projects
• 4 graduates: Node-RED, OpenWhisk, SystemML,
Blockchain fabric to full open governance in the last year
• developer.ibm.com/code/open/code/
Community
• IBM focused on 18 strategic communities
• Drive open governance in “Centers of Gravity”
• IBM Leaders drive key technologies and assure freedom
of action
The IBM OS Way is now open sourced
• Training, Recognition, Tooling
• Organization, Consuming, Contributing
2018 / © 2018 IBM Corporation
5
IBM’s history of strong AI leadership
1997: Deep Blue
• Deep Blue became the first machine to beat a world chess
champion in tournament play
2011: Jeopardy!
• Watson beat two top
Jeopardy! champions
1968, 2001: A Space Odyssey
• IBM was a technical
advisor
• HAL is “the latest in
machine intelligence”
2018: Open Tech, AI & emerging
standards
• New IBM centers of gravity for AI
• OS projects increasing exponentially
• Emerging global standards in AI
Center for Open Source
Data and AI Technologies
CODAIT
codait.org
codait (French)
= coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
6
Agenda
7
Internet of Things
IoT Use Cases
IoT Design Patterns
Apache Spark & Apache Bahir
Live Demo – Anomaly Detection
Summary
References
Q&A
Internet of
Things - IoT
8
What is THE INTERNET OF THINGS (IoT)?
THE TERM IoT WAS COINED BY
KEVIN ASHTON IN 1999. It’s more than
just machine-to-machine communications.
It’s about ecosystems of devices that form
relevant connections to people and other
devices to exchange data. It’s a useful term
to describe the world of connected and
wearable devices that’s emerging.
9
IoT - INTERACTIONS BETWEEN multiple ENTITIES
10
control
observe
inform
command
actuate
inform
PEOPLE
THINGS SOFTWARE
Some IoT EXAMPLES
11
SMART HOMES
TRANSPORT
WEARABLES INDUSTRY
DISPLAYS HEALTH
From thermostats
to smart switches
and remote
controls to security
systems
Self-driving cars,
drones for
delivering goods,
etc.
Smartwatches,
and other devices
enabling control
and providing
monitoring
capabilities
Robotics, sensors
for predicting
quality, failures,
etc
Not only VR, but
many new displays
enabling gesture
controls, haptic
interfaces, etc
Connected health,
in partnership with
wearables for
monitoring health
metrics, among
other examples
Some IoT PATTERNS
12
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
IoT Applications
13
The Weather Company
The Weather Company data
- Surface observations
- precipitation
- radar
- satellite
- personal weather stations
- lightning sources
- data collected from planes
every day
- etc.
Home Automation & Security
- Multiple connected or
standalone devices
- Controlled by Voice
- Amazon Echo (Alexa)
- Google Home
- Apple HomePod (Siri)
15
TESLA connected cars
CONNECTED VEHICLES ARE ONE
EXAMPLE OF THE IoT. It’s not
just about Google Maps in cars.
When Tesla found a software
fault with their vehicles, rather
than issuing an expensive and
damaging recall, they simply
updated the cars’ operating
system over the air.
[http://www.wired.com/2014/02
/teslas-air-fix-best-example-
yet-internet-things/]
16
AMAZON Go
AMAZON GO – No lines, no
checkout, just grab and go
17
Industrial Internet of Things
18
- Smart factory
- Predictive and remote
maintenance
- Smart metering and
smart grid
- Industrial security
- Industrial heating,
ventilation and air
conditioning
- Asset tracking and
smart logistics
Reference: https://www.iiconsortium.org/
IoT
Design Patterns
19
LAMBDA Architecture
Lambda architecture is a data-
processing architecture designed to
handle massive quantities of data by
taking advantage of both batch- and
stream-processing methods. It
attempts to balance latency,
throughput, and fault tolerance by
using batch processing to provide
comprehensive, accurate views of
historical data, while simultaneously
using real-time stream processing to
provide views of online data.
20
Images: https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
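The serving idea behind Lambda — answer queries by merging a complete-but-stale batch view with a fresh-but-partial real-time view — can be sketched in a few lines. A toy Python illustration (sketched in Python for brevity; all names and numbers are made up):

```python
# Toy sketch of Lambda-style serving:
# - batch_view: complete but stale (recomputed by the batch layer)
# - realtime_view: fresh but partial (maintained by the speed layer)
batch_view = {"sensor-1": 100, "sensor-2": 250}  # counts up to the last batch run
realtime_view = {"sensor-1": 7}                  # counts since the last batch run

def query(key):
    # Merge both views at query time to get a complete, up-to-date answer
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

Periodically the batch layer recomputes `batch_view` from the full history and the speed layer's `realtime_view` is reset.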
KAPPA Architecture
21
Images: https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
The Kappa architecture simplifies
the Lambda architecture by removing
the batch layer and replacing it with a
streaming layer.
REAL-TIME Data Processing best practice
22
Pub/Sub Component Data Processor Data Storage
One recommendation for processing
massive quantities of streaming data
is to add a queue component in front
of the data processing. This enables a
more fault-tolerant solution: in
conjunction with state management,
the runtime application can fail and
subsequently resume processing
from the same data point.
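A toy Python sketch of this recommendation (illustrative names only; in practice the queue would be a durable broker such as MQTT or Kafka, and the offset would be persisted):

```python
# Toy sketch: a consumer that commits its offset after each event, so a
# restart resumes from the same data point instead of reprocessing.
queue = ["e1", "e2", "e3", "e4", "e5"]  # stand-in for a durable pub/sub log
checkpoint = 0                          # last committed offset (persisted in practice)
processed = []

def run_once(fail_at=None):
    """Process events from the last checkpoint; stop at fail_at to simulate a crash."""
    global checkpoint
    i = checkpoint
    while i < len(queue) and i != fail_at:
        processed.append(queue[i])
        i += 1
        checkpoint = i  # commit the offset after each processed event

run_once(fail_at=3)  # "crash" after three events
run_once()           # restart: resumes at offset 3, nothing is reprocessed
```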
Building an IoT
Application
23
MQTT – IoT Connectivity Protocol
24
• Constrained devices
• Low bandwidth connection
• Intermittent connections
MQTT – IoT Connectivity Protocol
25
Connect
+
Publish
+
Subscribe
~1990
IBM / Eurotech
2010
Published
2011
Eclipse M2M / Paho
2014
OASIS
Open spec
40+ client
implementations
Minimal
overhead
Tiny
Clients
(Java 170KB)
History
Header
2-4 bytes
(publish)
14 bytes
(connect)
V5
May 2018
MQTT – Quality of Service
26
MQTT
Broker
QoS0
QoS1
QoS2
At most once
At least once
Exactly once
• No connection failover
• Never duplicate
• Has connection failover
• Can duplicate
• Has connection failover
• Never duplicate
MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
27
Apache Spark
28
Apache Spark Introduction
29
Spark Core
Spark
SQL
Spark
Streaming
Spark
ML
Spark
GraphX
executes SQL
statements
performs
streaming
analytics using
micro-batches
common
machine
learning and
statistical
algorithms
distributed
graph
processing
framework
general compute engine, handles
distributed task dispatching,
scheduling and basic I/O
functions
large variety of data sources
and formats can be supported,
both on-premises and in the cloud
BigInsights
(HDFS)
Cloudant
dashDB
SQL
DB
Apache Spark Evolution
30
Apache Spark – Spark SQL
31
Spark
SQL
Unified data access APIs: query
structured data sets with SQL or
Dataset/DataFrame APIs
Fast, familiar query language across
all of your enterprise data
RDBMS
Data Sources
Structured
Streaming
Data Sources
Apache Spark – Spark SQL
32
You can run SQL statements with the SparkSession.sql(…) interface:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resulting Dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame / Dataset[Row]
ds.show() displays the rows
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…):
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // needed for the .as[Bank] conversion
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column from the Dataset
bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
33
Apache Spark – Spark SQL
You can also configure a specific data source with specific options:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // needed for the .as[Bank] conversion
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true") // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
34
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
35
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Considers the data stream as an
unbounded table
Apache Spark – Spark SQL Structured Streaming
SQL regular APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
36
Structured Streaming APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.writeStream
  .format("json")
  .start("dest-path")
Apache Spark – Spark Streaming
37
Spark
Streaming
Micro-batch event processing for
near-real time analytics
e.g. Internet of Things (IoT) devices,
Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel process
programming required
Apache Spark – Spark Streaming
Also known as discretized stream or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
38
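The micro-batching idea can be sketched outside of Spark; this toy Python illustration (sketched in Python for brevity; names are made up) buckets a timestamped event stream into fixed 2-second batches, much as a DStream slices a continuous stream into a sequence of small RDDs:

```python
# Toy sketch of micro-batching: bucket timestamped events into
# fixed-width windows (2 seconds here), the way a DStream slices a
# continuous stream into a sequence of small RDDs.
from collections import defaultdict

def micro_batches(events, batch_ms):
    batches = defaultdict(list)
    for ts_ms, payload in events:
        batches[ts_ms // batch_ms].append(payload)  # batch index = floor(ts / width)
    return dict(batches)

events = [(100, "a"), (1900, "b"), (2100, "c")]
batches = micro_batches(events, batch_ms=2000)  # {0: ["a", "b"], 1: ["c"]}
```

Each batch is then processed as one small, ordinary dataset, which is what lets Spark reuse its batch engine for streaming.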
Apache Spark – Spark Streaming
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils // from Apache Bahir

val sparkConf = new SparkConf()
  .setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
39
Apache Bahir
40
Apache Bahir Project
Apache Bahir provides
extensions to multiple
distributed analytic platforms,
extending their reach with a
diversity of streaming
connectors and SQL data
sources.
Currently, Bahir provides
extensions for Apache
Spark and Apache Flink.
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark
Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
42
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub
43
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-mqtt_2.11</artifactId>
<version>2.2.0</version>
</dependency>
44
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- Spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
- Spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
45
Live Demo
46
IoT Analytics – Anomaly Detection
The demo environment
https://github.com/lresende/bahir-iot-demo
47
Node.js Web app
Simulates Elevator IoT devices
Elevator simulator
Metrics:
• Weight
• Speed
• Power
• Temperature
• System
MQTT
Mosquitto
MQTT
JavaScript Client
MQTT
Streaming Connector
IoT Analytics – Anomaly Detection
The Moving Z-Score model scores
anomalies in a univariate sequential dataset,
often a time series.
The moving Z-score is a very simple model
for measuring the anomalousness of each
point in a sequential dataset like a time
series. Given a window size w, the moving
Z-score is the number of standard
deviations each observation is away from
the mean, where the mean and standard
deviation are computed only over the
previous w observations.
48
Reference: https://turi.com/learn/userguide/anomaly_detection/moving_zscore.html
Reference: https://www.isixsigma.com/tools-templates/statistical-analysis/improved-forecasting-moving-averages-and-z-scores/
Z-Score
Moving Z-Score
Moving mean and moving standard deviation
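A minimal Python sketch of the moving Z-score as defined above (the demo itself runs on Spark; this just shows the math, using the population standard deviation over the previous w observations):

```python
import math

def moving_z_scores(xs, w):
    """Z-score of each point against the mean/std of the previous w points."""
    scores = []
    for i, x in enumerate(xs):
        if i < w:
            scores.append(0.0)  # not enough history yet
            continue
        window = xs[i - w:i]  # only the previous w observations
        mean = sum(window) / w
        var = sum((v - mean) ** 2 for v in window) / w
        std = math.sqrt(var)
        scores.append(0.0 if std == 0 else (x - mean) / std)
    return scores
```

Points whose |z| exceeds a chosen threshold (commonly 3) are flagged as anomalies.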
Summary
49
Summary – Takeaway points
IoT Design Patterns
- Lambda and Kappa Architecture
- Real Time data processing best practice
Apache Spark
- IoT Analytics Runtime with support for “Continuous Applications”
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Using Spark and Bahir to start processing IoT data in near real time using Spark Streaming
- Anomaly detection using the Moving Z-Score model
50
Join the Apache
Bahir community
51
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
52
Image source: http://az616578.vo.msecnd.net/files/2016/03/21/6359412499310138501557867529_thank-you-1400x800-c-default.gif
53
May 17, 2018 / © 2018 IBM Corporation
CALL FOR CODE KEYNOTE
With Jeffrey Borek, 10:00 AM
54
March 30, 2018 / © 2018 IBM Corporation
