Getting Insights from IoT data
with Apache Spark &
Apache Bahir
Luciano Resende
June 20th, 2018
1
About me - Luciano Resende
2
Data Science Platform Architect – IBM – CODAIT
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache
Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
4
Open Source Leadership & Contributions
IBM generated open source innovation
• 137 Code Open (dWO) projects w/1000+ GitHub projects
• 4 graduates: Node-RED, OpenWhisk, SystemML,
Blockchain fabric to full open governance in the last year
• developer.ibm.com/code/open/code/
Community
• IBM focused on 18 strategic communities
• Drive open governance in “Centers of Gravity”
• IBM Leaders drive key technologies and assure freedom
of action
The IBM OS Way is now open sourced
• Training, Recognition, Tooling
• Organization, Consuming, Contributing
2018 / © 2018 IBM Corporation
5
IBM’s history of strong AI leadership
1997: Deep Blue
• Deep Blue became the first machine to beat a world chess
champion in tournament play
2011: Jeopardy!
• Watson beat two top
Jeopardy! champions
1968, 2001: A Space Odyssey
• IBM was a technical
advisor
• HAL is “the latest in
machine intelligence”
2018: Open Tech, AI & emerging
standards
• New IBM centers of gravity for AI
• OS projects increasing exponentially
• Emerging global standards in AI
Center for Open Source
Data and AI Technologies
CODAIT
codait.org
codait (French)
= coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
6
Agenda
7
Internet of Things
IoT Use Cases
IoT Design Patterns
Apache Spark & Apache Bahir
Live Demo – Anomaly Detection
Summary
References
Q&A
Internet of
Things - IoT
8
What is THE INTERNET OF THINGS (IoT)?
THE TERM IoT WAS COINED BY
KEVIN ASHTON IN 1999. It’s more than
just machine-to-machine communications.
It’s about ecosystems of devices that form
relevant connections to people and other
devices to exchange data. It’s a useful term
to describe the world of connected and
wearable devices that’s emerging.
9
IoT - INTERACTIONS BETWEEN multiple ENTITIES
10
control
observe
inform
command
actuate
inform
PEOPLE
THINGS SOFTWARE
Some IoT EXAMPLES
11
SMART HOMES
TRANSPORT
WEARABLES INDUSTRY
DISPLAYS HEALTH
From thermostats
to smart switches
and remote
controls to security
systems
Self-driving cars,
drones for
delivering goods,
etc.
Smartwatches,
and other devices
enabling control
and providing
monitoring
capabilities
Robotics, sensors
for predicting
quality, failures,
etc
Not only VR, but
many new displays
enabling gesture
controls, haptic
interfaces, etc
Connected health,
in partnership with
wearables for
monitoring health
metrics, among
other examples
Some IoT PATTERNS
12
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
IoT Applications
13
The Weather Company
The Weather Company data
- Surface observations
- precipitation
- radar
- satellite
- personal weather stations
- lightning sources
- data collected from planes
every day
- etc.
Home Automation & Security
- Multiple connected or
standalone devices
- Controlled by Voice
- Amazon Echo (Alexa)
- Google Home
- Apple HomePod (Siri)
15
TESLA connected cars
CONNECTED VEHICLES ARE ONE
EXAMPLE OF THE IoT. It’s not
just about Google Maps in cars.
When Tesla found a software
fault with their vehicles, rather
than issuing an expensive and
damaging recall, they simply
updated the cars’ operating
system over the air.
[http://www.wired.com/2014/02
/teslas-air-fix-best-example-
yet-internet-things/]
16
AMAZON Go
AMAZON GO – No lines, no
checkout, just grab and go
17
Industrial Internet of Things
18
- Smart factory
- Predictive and remote
maintenance
- Smart metering and
smart grid
- Industrial security
- Industrial heating,
ventilation and air
conditioning
- Asset tracking and
smart logistics
Reference: https://www.iiconsortium.org/
IoT
Design Patterns
19
LAMBDA Architecture
Lambda architecture is a data-
processing architecture designed to
handle massive quantities of data by
taking advantage of both batch- and
stream-processing methods. It
attempts to balance latency,
throughput, and fault tolerance by
using batch processing to provide
comprehensive, accurate views of
historical data, while simultaneously
using real-time stream processing to
provide views of online data.
20
Images: https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
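The serving idea behind Lambda — answer queries by merging a complete-but-stale batch view with a fresh-but-partial real-time view — can be sketched in a few lines. A toy Python illustration (sketched in Python for brevity; all names and numbers are made up):

```python
# Toy sketch of Lambda-style serving:
# - batch_view: complete but stale (recomputed by the batch layer)
# - realtime_view: fresh but partial (maintained by the speed layer)
batch_view = {"sensor-1": 100, "sensor-2": 250}  # counts up to the last batch run
realtime_view = {"sensor-1": 7}                  # counts since the last batch run

def query(key):
    # Merge both views at query time to get a complete, up-to-date answer
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

Periodically the batch layer recomputes `batch_view` from the full history and the speed layer's `realtime_view` is reset.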
KAPPA Architecture
21
Images: https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry
The Kappa architecture simplifies
the Lambda architecture by removing
the batch layer and replacing it with a
streaming layer.
REAL-TIME Data Processing best practice
22
Pub/Sub Component Data Processor Data Storage
One recommendation for processing
massive quantities of streaming data
is to add a queue component in front
of the data processing. This enables a
more fault-tolerant solution: in
conjunction with state management,
the runtime application can fail and
subsequently resume processing
from the same data point.
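A toy Python sketch of this recommendation (illustrative names only; in practice the queue would be a durable broker such as MQTT or Kafka, and the offset would be persisted):

```python
# Toy sketch: a consumer that commits its offset after each event, so a
# restart resumes from the same data point instead of reprocessing.
queue = ["e1", "e2", "e3", "e4", "e5"]  # stand-in for a durable pub/sub log
checkpoint = 0                          # last committed offset (persisted in practice)
processed = []

def run_once(fail_at=None):
    """Process events from the last checkpoint; stop at fail_at to simulate a crash."""
    global checkpoint
    i = checkpoint
    while i < len(queue) and i != fail_at:
        processed.append(queue[i])
        i += 1
        checkpoint = i  # commit the offset after each processed event

run_once(fail_at=3)  # "crash" after three events
run_once()           # restart: resumes at offset 3, nothing is reprocessed
```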
Building an IoT
Application
23
MQTT – IoT Connectivity Protocol
24
• Constrained devices
• Low bandwidth connection
• Intermittent connections
MQTT – IoT Connectivity Protocol
25
Connect
+
Publish
+
Subscribe
~1990
IBM / Eurotech
2010
Published
2011
Eclipse M2M / Paho
2014
OASIS
Open spec
40+ client
implementations
Minimal
overhead
Tiny
Clients
(Java 170KB)
History
Header
2-4 bytes
(publish)
14 bytes
(connect)
V5
May 2018
MQTT – Quality of Service
26
MQTT
Broker
QoS0
QoS1
QoS2
At most once
At least once
Exactly once
• No connection failover
• Never duplicate
• Has connection failover
• Can duplicate
• Has connection failover
• Never duplicate
MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
27
Apache Spark
28
Apache Spark Introduction
29
Spark Core
Spark
SQL
Spark
Streaming
Spark
ML
Spark
GraphX
executes SQL
statements
performs
streaming
analytics using
micro-batches
common
machine
learning and
statistical
algorithms
distributed
graph
processing
framework
general compute engine, handles
distributed task dispatching,
scheduling and basic I/O
functions
large variety of data sources
and formats can be supported,
both on-premises and in the cloud
BigInsights
(HDFS)
Cloudant
dashDB
SQL
DB
Apache Spark Evolution
30
Apache Spark – Spark SQL
31
Spark
SQL
Unified data access APIs: query
structured data sets with SQL or
Dataset/DataFrame APIs
Fast, familiar query language across
all of your enterprise data
RDBMS
Data Sources
Structured
Streaming
Data Sources
Apache Spark – Spark SQL
32
You can run SQL statements with the SparkSession.sql(…) interface:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resulting Dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame / Dataset[Row]
ds.show() displays the rows
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…):
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // needed for the .as[Bank] conversion
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column from the Dataset
bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
33
Apache Spark – Spark SQL
You can also configure a specific data source with specific options:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // needed for the .as[Bank] conversion
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true") // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select("age").show() // returns all rows of column "age" from this dataset
34
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
35
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Considers the data stream as an
unbounded table
Apache Spark – Spark SQL Structured Streaming
SQL regular APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
36
Structured Streaming APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.writeStream
  .format("json")
  .start("dest-path")
Apache Spark – Spark Streaming
37
Spark
Streaming
Micro-batch event processing for
near-real time analytics
e.g. Internet of Things (IoT) devices,
Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel process
programming required
Apache Spark – Spark Streaming
Also known as discretized stream or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
38
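The micro-batching idea can be sketched outside of Spark; this toy Python illustration (sketched in Python for brevity; names are made up) buckets a timestamped event stream into fixed 2-second batches, much as a DStream slices a continuous stream into a sequence of small RDDs:

```python
# Toy sketch of micro-batching: bucket timestamped events into
# fixed-width windows (2 seconds here), the way a DStream slices a
# continuous stream into a sequence of small RDDs.
from collections import defaultdict

def micro_batches(events, batch_ms):
    batches = defaultdict(list)
    for ts_ms, payload in events:
        batches[ts_ms // batch_ms].append(payload)  # batch index = floor(ts / width)
    return dict(batches)

events = [(100, "a"), (1900, "b"), (2100, "c")]
batches = micro_batches(events, batch_ms=2000)  # {0: ["a", "b"], 1: ["c"]}
```

Each batch is then processed as one small, ordinary dataset, which is what lets Spark reuse its batch engine for streaming.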
Apache Spark – Spark Streaming
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils // from Apache Bahir

val sparkConf = new SparkConf()
  .setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
39
Apache Bahir
40
Apache Bahir Project
Apache Bahir provides
extensions to multiple
distributed analytic platforms,
extending their reach with a
diversity of streaming
connectors and SQL data
sources.
Currently, Bahir provides
extensions for Apache
Spark and Apache Flink.
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark
Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
42
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub
43
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-mqtt_2.11</artifactId>
<version>2.2.0</version>
</dependency>
44
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- Spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
- Spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
45
Live Demo
46
IoT Analytics – Anomaly Detection
The demo environment
https://github.com/lresende/bahir-iot-demo
47
Node.js Web app
Simulates Elevator IoT devices
Elevator simulator
Metrics:
• Weight
• Speed
• Power
• Temperature
• System
MQTT
Mosquitto
MQTT
JavaScript Client
MQTT
Streaming Connector
IoT Analytics – Anomaly Detection
The Moving Z-Score model scores
anomalies in a univariate sequential dataset,
often a time series.
The moving Z-score is a very simple model
for measuring the anomalousness of each
point in a sequential dataset like a time
series. Given a window size w, the moving
Z-score is the number of standard
deviations each observation is away from
the mean, where the mean and standard
deviation are computed only over the
previous w observations.
48
Reference: https://turi.com/learn/userguide/anomaly_detection/moving_zscore.html
Reference: https://www.isixsigma.com/tools-templates/statistical-analysis/improved-forecasting-moving-averages-and-z-scores/
Z-Score
Moving Z-Score
Moving mean and moving standard deviation
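A minimal Python sketch of the moving Z-score as defined above (the demo itself runs on Spark; this just shows the math, using the population standard deviation over the previous w observations):

```python
import math

def moving_z_scores(xs, w):
    """Z-score of each point against the mean/std of the previous w points."""
    scores = []
    for i, x in enumerate(xs):
        if i < w:
            scores.append(0.0)  # not enough history yet
            continue
        window = xs[i - w:i]  # only the previous w observations
        mean = sum(window) / w
        var = sum((v - mean) ** 2 for v in window) / w
        std = math.sqrt(var)
        scores.append(0.0 if std == 0 else (x - mean) / std)
    return scores
```

Points whose |z| exceeds a chosen threshold (commonly 3) are flagged as anomalies.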
Summary
49
Summary – Takeaway points
IoT Design Patterns
- Lambda and Kappa Architecture
- Real Time data processing best practice
Apache Spark
- IoT Analytics Runtime with support for “Continuous Applications”
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Using Spark and Bahir to start processing IoT data in near real time using Spark Streaming
- Anomaly detection using the Moving Z-Score model
50
Join the Apache
Bahir community
51
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
52
Image source: http://az616578.vo.msecnd.net/files/2016/03/21/6359412499310138501557867529_thank-you-1400x800-c-default.gif
53
May 17, 2018 / © 2018 IBM Corporation
CALL FOR CODE KEYNOTE
With Jeffrey Borek, 10:00 AM
54
March 30, 2018 / © 2018 IBM Corporation
