BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Ingesting and Processing IoT Data -
using MQTT, Kafka Connect and KSQL
Guido Schmutz
Kafka Summit 2018 – 16.10.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Introduction
2. IoT Logistics use case – Kafka Ecosystem "in Action"
3. Stream Data Integration – IoT Device to Kafka over MQTT
4. Stream Analytics with KSQL
5. Summary
Introduction
Reference Architecture for Data Analytics Solutions
[Architecture diagram: bulk sources (DB, File, Extract) and event sources (Location, IoT Data, Mobile Apps, Social, Telemetry) feed via File Import / SQL Import, Change Data Capture, Data Flow and an Event Hub into Big Data storage (Raw, Refined, Results) with Parallel Processing on a Hadoop cluster; a Stream Processor with State provides Stream Analytics; Microservices, Enterprise Apps, BI Tools, an Enterprise Data Warehouse (via SQL Export) and a Search/Explore service consume the results; an Edge Node with Rules, Event Hub and Storage sits close to the sources]
[The same Reference Architecture for Data Analytics Solutions diagram, repeated]
Two Types of Stream Processing (from Gartner)

Stream Data Integration
• Primarily covers streaming ETL
• Integration of data sources and data sinks
• Filter and transform data
• (Enrich data)
• Route data

Stream Analytics
• Analytics use cases
• Calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events => used to be CEP)
• Complex events may signify threats or opportunities that require a response
Stream Data Integration and Stream Analytics with Kafka
[Diagram: Source Connector → Kafka Broker (topic trucking_driver) → Sink Connector, with a Stream Processing component reading from and writing back to the broker]
Unified Architecture for Modern Data Analytics Solutions
[Same architecture diagram as above, now shown as one unified picture combining the batch side (Hadoop cluster / Big Data storage with Parallel Processing) and the streaming side (Event Hub, Stream Processor, Stream Analytics), serving Microservices, Enterprise Apps, BI Tools, the Enterprise Data Warehouse and Search/Explore]
Various IoT Data Protocols
• MQTT (Message Queue Telemetry Transport)
• CoAP
• AMQP
• DDS (Data Distribution Service)
• STOMP
• REST
• WebSockets
• …
IoT Logistics use case – Kafka
Ecosystem "in Action"
Demo - IoT Logistics Use Case
Trucks are sending driving info and geo-position data in one single message (Position & Driving Info)
Test data generator originally by Hortonworks

{
  "timestamp": 1537343400827,
  "truckId": 87,
  "driverId": 13,
  "routeId": 987179512,
  "eventType": "Normal",
  "latitude": 38.65,
  "longitude": -90.21,
  "correlationId": "-3208700263746910537"
}
Stream Data Integration – IoT
Device to Kafka over MQTT
Stream Data Integration
[Diagram as above: Source Connector → Kafka Broker (trucking_driver) → Sink Connector / Stream Processing]
(I) IoT Device sends data via MQTT
MQTT (Message Queue Telemetry Transport)
• Pub/Sub architecture with a message broker
• Built-in retry / QoS mechanism
• Last Will and Testament (LWT)
• Not all MQTT brokers are scalable / available
• Does not provide state (history)
[Diagram: the truck publishes the Position & Driving Info JSON shown above to the MQTT topic truck/nn/position]
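For illustration, a minimal device-side publish could look like the following Eclipse Paho (Java) sketch; the broker URI, client id and QoS are assumptions based on the demo setup.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class TruckPositionPublisher {
    public static void main(String[] args) throws MqttException {
        // Assumed Mosquitto broker and client id from the demo setup
        MqttClient client = new MqttClient("tcp://mosquitto:1883", "truck-87");
        client.connect();

        // One combined Position & Driving Info message, as in the demo
        String payload = "{\"timestamp\":1537343400827,\"truckId\":87,\"driverId\":13,"
                + "\"routeId\":987179512,\"eventType\":\"Normal\","
                + "\"latitude\":38.65,\"longitude\":-90.21,"
                + "\"correlationId\":\"-3208700263746910537\"}";

        MqttMessage message = new MqttMessage(payload.getBytes());
        message.setQos(0);                       // fire-and-forget
        client.publish("truck/87/position", message);
        client.disconnect();
    }
}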
MQTT to Kafka using Confluent MQTT Connector
IoT device sends data via MQTT – how do we get the data into Kafka?
[Diagram: MQTT topic truck/nn/position → ? → Kafka topic truck_position, carrying the Position & Driving Info JSON shown above]
2 Ways for MQTT with the Confluent Streaming Platform

Confluent MQTT Connector (Preview)
• Pull-based
• Integrates with (existing) MQTT servers
• Can be used both as a Source and a Sink
• Output is an envelope with all of the properties of the incoming message
  • Value: body of the MQTT message
  • Key: the MQTT topic the message was written to
• Can consume multiple MQTT topics and write to one single Kafka topic
• RegexRouter SMT can be used to change topic names

Confluent MQTT Proxy
• Push-based
• Enables MQTT clients to use the MQTT protocol to publish data directly to Kafka
• MQTT Proxy is stateless and independent of other instances
• Simple mapping scheme of MQTT topics to Kafka topics based on regular expressions
• Reduced lag in message publishing compared to traditional MQTT brokers
(II) MQTT to Kafka using Confluent MQTT Connector
[Diagram: MQTT topic truck/nn/position → "mqtt to kafka" connector → Kafka topic truck_position → kafkacat; the payload is the Position & Driving Info JSON shown above]
Confluent MQTT Connector
Currently available as a Preview on Confluent Hub
Set up plugin.path to specify the additional folder

confluent-hub install confluentinc/kafka-connect-mqtt:1.0.0-preview

plugin.path=/usr/share/java,/etc/kafka-connect/custom-plugins,/usr/share/confluent-hub-components
Create an instance of Confluent MQTT Connector

#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "name": "mqtt-source",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position",
    "mqtt.clean.session.enabled": "true",
    "mqtt.connect.timeout.seconds": "30",
    "mqtt.keepalive.interval.seconds": "60",
    "mqtt.qos": "0"
  }
}'
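The diagram above uses kafkacat to inspect the result; a plain Java consumer does the same job. A minimal sketch, assuming the demo's broker address and the truck_position topic.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TruckPositionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092");   // assumed broker from the demo
        props.put("group.id", "truck-position-inspector");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Key = original MQTT topic, value = MQTT message body (connector envelope)
            consumer.subscribe(Collections.singletonList("truck_position"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}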
(III) MQTT to Kafka using Confluent MQTT Proxy
[Diagram: trucks publish Position & Driving Info and Engine Metrics via MQTT to the MQTT Proxy, which writes them to the Kafka topics truck position and engine metrics, each read by a console consumer]
Configure and Start MQTT Proxy

kafka-mqtt.properties:
topic.regex.list=truck_position:.*position,engine_metric:.*engine_metric
listeners=0.0.0.0:1883
bootstrap.servers=PLAINTEXT://broker-1:9092
confluent.topic.replication.factor=1

bin/kafka-mqtt-start kafka-mqtt.properties
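With the proxy running, an unchanged MQTT client can publish straight at the proxy listener, and the topic.regex.list above maps the MQTT topic to a Kafka topic. A minimal Paho (Java) sketch; the proxy host name and the engine-metrics payload are assumptions.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class EngineMetricsToProxy {
    public static void main(String[] args) throws MqttException {
        // The client speaks plain MQTT, but the endpoint is the MQTT Proxy, not a broker
        MqttClient client = new MqttClient("tcp://mqtt-proxy:1883", "truck-87-engine");
        client.connect();

        // Hypothetical engine-metrics payload for illustration
        MqttMessage message = new MqttMessage("{\"truckId\":87,\"rpm\":2200,\"oilTemp\":92}".getBytes());
        // ".*engine_metric" in topic.regex.list routes this to the Kafka topic engine_metric
        client.publish("truck/87/engine_metric", message);
        client.disconnect();
    }
}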
MQTT Connector vs. MQTT Proxy

MQTT Connector
• Pull-based
• Use existing MQTT infrastructures
• Bi-directional

MQTT Proxy
• Push-based
• Does not provide all MQTT functionality
• Only uni-directional

[Diagram: in REGION-1 DC and REGION-2 DC, position and driving-info messages flow from the MQTT topics truck/nn/position and truck/nn/driving info via "mqtt to kafka" into the Kafka topics truck position and truck driving info, which are consolidated at the Headquarter DC]
(IV) MQTT to Kafka using StreamSets Data Collector
[Diagram: MQTT topic truck/nn/position → "mqtt to kafka" pipeline → Kafka topic truck_position → console consumer; the payload is the Position & Driving Info JSON shown above]
MQTT to Kafka using StreamSets Data Collector
Wait … there is more ….
[Diagram: Position & Driving Info messages flow from the MQTT topic truck/nn/position through the MQTT Proxy ("mqtt to kafka") into the Kafka topics truck_driving_info and truck_position, each read by a console consumer – what about some analytics?]
Stream Analytics with KSQL
Stream Analytics
[Diagram as above: Source Connector → Kafka Broker (trucking_driver) → Sink Connector, with the Stream Processing component now in focus]
KSQL - Terminology

Stream
• "History"
• An unbounded sequence of structured data ("facts")
• Facts in a stream are immutable: new facts can be inserted to a stream, existing facts can never be updated or deleted
• Streams can be created from a Kafka topic or derived from an existing stream

Table
• "State"
• A view of a stream, or another table, and represents a collection of evolving facts
• Facts in a table are mutable: new facts can be inserted to the table, existing facts can be updated or deleted
• Tables can be created from a Kafka topic or derived from existing streams and tables

Enables stream processing with zero coding required
The simplest way to process streams of data in real-time
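The same stream/table duality exists in the Kafka Streams Java API, where a topic can be read either as a KStream ("history") or as a KTable ("state"). A minimal sketch, with topic names borrowed from the demo and serde configuration omitted.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamTableDuality {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "History": every position event is a new immutable fact
        KStream<String, String> positions = builder.stream("truck_position");

        // "State": only the latest fact per key is kept
        KTable<String, String> drivers = builder.table("truck_driver");

        // ... build and start a KafkaStreams instance with this topology
    }
}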
(V) Create STREAM on truck_position and use it in KSQL CLI
[Diagram: MQTT topic truck/nn/position → "mqtt-to-kafka" → Kafka topic truck-position → KSQL Stream, queried interactively from the KSQL CLI; the payload is the Position & Driving Info JSON shown above]
Create a STREAM on truck_driving_info

ksql> CREATE STREAM truck_driving_info_s
        (ts VARCHAR,
         truckId VARCHAR,
         driverId BIGINT,
         routeId BIGINT,
         eventType VARCHAR,
         latitude DOUBLE,
         longitude DOUBLE,
         correlationId VARCHAR)
      WITH (kafka_topic='truck_driving_info',
            value_format='JSON');

 Message
----------------
 Stream created
Create a STREAM on truck_driving_info
ksql> describe truck_position_s;
Field | Type
---------------------------------
ROWTIME | BIGINT
ROWKEY | VARCHAR(STRING)
TS | VARCHAR(STRING)
TRUCKID | VARCHAR(STRING)
DRIVERID | BIGINT
ROUTEID | BIGINT
EVENTTYPE | VARCHAR(STRING)
LATITUDE | DOUBLE
LONGITUDE | DOUBLE
CORRELATIONID | VARCHAR(STRING)
KSQL - SELECT
Selects rows from a KSQL stream or table
Result of this statement will not be persisted in a Kafka topic and will only be printed out
in the console
from_item is one of the following: stream_name, table_name
SELECT select_expr [, ...]
FROM from_item
[ LEFT JOIN join_table ON join_criteria ]
[ WINDOW window_expression ]
[ WHERE condition ]
[ GROUP BY grouping_expression ]
[ HAVING having_expression ]
[ LIMIT count ];
Use SELECT to browse from Stream
ksql> SELECT * FROM truck_driving_info_s;
1539711991642 | truck/24/position | null | 24 | 10 | 1198242881 | Normal | 36.84 | -94.83 | -6187001306629414077
1539711991691 | truck/26/position | null | 26 | 13 | 1390372503 | Normal | 42.04 | -88.02 | -6187001306629414077
1539711991882 | truck/66/position | null | 66 | 22 | 1565885487 | Normal | 38.33 | -94.35 | -6187001306629414077
1539711991902 | truck/22/position | null | 22 | 26 | 1198242881 | Normal | 36.73 | -95.01 | -6187001306629414077

ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
1539712101614 | truck/67/position | null | 67 | 11 | 160405074 | Lane Departure | 38.98 | -92.53 | -6187001306629414077
1539712116450 | truck/18/position | null | 18 | 25 | 987179512 | Overspeed | 40.76 | -88.77 | -6187001306629414077
1539712120102 | truck/31/position | null | 31 | 12 | 927636994 | Unsafe following distance | 38.22 | -91.18 | -6187001306629414077
(VI) – CREATE AS … SELECT …
[Diagram: truck/nn/position → "mqtt-to-kafka" → truck-position Stream → "detect_dangerous_driving" → Dangerous-driving Stream; the payload is the Position & Driving Info JSON shown above]
CREATE STREAM … AS SELECT …
Creates a new KSQL stream along with the corresponding Kafka topic and streams the result of the SELECT query as a changelog into the topic
WINDOW clause can only be used if the from_item is a stream
CREATE STREAM stream_name
[WITH ( property_name = expression [, ...] )]
AS SELECT select_expr [, ...]
FROM from_stream [ LEFT | FULL | INNER ]
JOIN [join_table | join_stream]
[ WITHIN [(before TIMEUNIT, after TIMEUNIT) | N TIMEUNIT] ] ON join_criteria
[ WHERE condition ]
[PARTITION BY column_name];
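For comparison, roughly the same "derive a new stream from a query" idea in the Kafka Streams Java API: filter the position stream and write the result to a new topic. A sketch only; the naive string match stands in for real JSON parsing and the application config is omitted.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class DangerousDrivingTopology {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> positions = builder.stream("truck_position");

        // Roughly: SELECT * FROM truck_position_s WHERE eventtype != 'Normal'
        positions
            .filter((key, value) -> !value.contains("\"eventType\":\"Normal\""))
            .to("dangerous_driving");   // result topic, like the KSQL CREATE STREAM ... AS SELECT
    }
}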
INSERT INTO … SELECT …
Stream the result of the SELECT query into an existing stream and its underlying topic
The schema and partitioning column produced by the query must match the stream's schema and key
If the schema and partitioning column are incompatible with the stream, then the statement will return an error
stream_name and from_item must both refer to a Stream. Tables are not supported!

CREATE STREAM stream_name ...;

INSERT INTO stream_name
  SELECT select_expr [, ...]
  FROM from_stream
  [ WHERE condition ]
  [ PARTITION BY column_name ];
CREATE AS … SELECT …

ksql> CREATE STREAM dangerous_driving_s
      WITH (kafka_topic='dangerous_driving_s',
            value_format='JSON')
      AS SELECT * FROM truck_position_s
      WHERE eventtype != 'Normal';

 Message
----------------------------
 Stream created and running

ksql> select * from dangerous_driving_s;
1539712399201 | truck/67/position | null | 67 | 11 | 160405074 | Unsafe following distance | 38.65 | -90.21 | -6187001306629414077
1539712416623 | truck/67/position | null | 67 | 11 | 160405074 | Unsafe following distance | 39.1 | -94.59 | -6187001306629414077
1539712430051 | truck/18/position | null | 18 | 25 | 987179512 | Lane Departure | 35.1 | -90.07 | -6187001306629414077
Windowing
Streams are unbounded; we need some meaningful time frames to do computations (i.e. aggregations)
Computations over events are done using windows of data
Windows are tracked per unique key
[Diagram: Fixed Window, Sliding Window and Session Window over a stream of data along the time axis]
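The 30-second tumbling window used in the next KSQL example looks like this in the Kafka Streams Java API. A minimal sketch; it assumes the record key already carries the eventType and leaves serde configuration out.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class DangerousDrivingCount {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> dangerous = builder.stream("dangerous_driving");

        // Count events per key in 30-second tumbling windows
        KTable<Windowed<String>, Long> counts = dangerous
            .groupByKey()   // assumes the record key is the eventType
            .windowedBy(TimeWindows.of(Duration.ofSeconds(30)))
            .count();
    }
}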
(VII) Aggregate and Window
[Diagram: truck/nn/position → "mqtt-to-kafka" → truck-position Stream → "detect_dangerous_driving" → Dangerous-driving Stream → "count_by_eventType" → Dangerous-driving-count Table; the payload is the Position & Driving Info JSON shown above]
SELECT COUNT … GROUP BY

ksql> CREATE TABLE dangerous_driving_count AS
      SELECT eventType, count(*) nof
      FROM dangerous_driving_s
      WINDOW TUMBLING (SIZE 30 SECONDS)
      GROUP BY eventType;

 Message
----------------------------
 Table created and running

ksql> SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS'), eventType, nof
      FROM dangerous_driving_count;

2018-10-16 05:12:19.408 | Unsafe following distance | 1
2018-10-16 05:12:38.926 | Unsafe following distance | 1
2018-10-16 05:12:39.615 | Unsafe tail distance | 1
2018-10-16 05:12:43.155 | Overspeed | 1
Joining
Stream to Static (Table) Join, Stream to Stream Join (one window join), Stream to Stream Join (two window join)
[Diagram: the three join types (Stream-to-Static Join, Stream-to-Stream Join with one window, Stream-to-Stream Join with two windows) along the time axis]
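The stream-to-static (table) join performed in KSQL on the following slides maps to a KStream-KTable join in the Kafka Streams Java API. A minimal sketch; keys, serdes and the value-joiner logic are simplified assumptions.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichWithDriver {
    public static void build(StreamsBuilder builder) {
        // Stream of dangerous driving events, assumed keyed by driverId
        KStream<String, String> dangerous = builder.stream("dangerous_driving");
        // Table of drivers, keyed by id (changelog of the truck_driver topic)
        KTable<String, String> drivers = builder.table("truck_driver");

        // Left join: each event is enriched with the current driver record (or null)
        KStream<String, String> enriched =
            dangerous.leftJoin(drivers, (event, driver) -> event + " | " + driver);

        enriched.to("dangerous_driving_and_driver");
    }
}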
(VIII) – Join Table to enrich with Driver data
[Diagram: driver records (e.g. 27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00) are loaded via "jdbc-to-kafka" into the truck-driver Table; "join dangerous-driving & driver" combines the Dangerous-driving Stream with this Table into a Dangerous-driving & driver Stream, alongside the existing count_by_eventType / Dangerous-driving-count Table]
Example driver record: {"id":27,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
Join Table to enrich with Driver data

#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "jdbc-driver-source",
  "config": {
    "connector.class": "JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db/sample?user=sample&password=sample",
    "mode": "timestamp",
    "timestamp.column.name": "last_update",
    "table.whitelist": "driver",
    "validate.non.null": "false",
    "topic.prefix": "truck_",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "name": "jdbc-driver-source",
    "transforms": "createKey,extractInt",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractInt.field": "id"
  }
}'
Create Table with Driver State
ksql> CREATE TABLE driver_t 
(id BIGINT, 
first_name VARCHAR, 
last_name VARCHAR, 
available VARCHAR) 
WITH (kafka_topic='truck_driver', 
value_format='JSON', 
key='id');
Message
----------------
Table created
Create Table with Driver State

ksql> CREATE STREAM dangerous_driving_and_driver_s
      WITH (kafka_topic='dangerous_driving_and_driver_s',
            value_format='JSON', partitions=8)
      AS SELECT driverId, first_name, last_name, truckId, routeId, eventtype, latitude, longitude
      FROM truck_position_s
      LEFT JOIN driver_t
      ON truck_position_s.driverId = driver_t.id;

 Message
----------------------------
 Stream created and running

ksql> select * from dangerous_driving_and_driver_s;
1539713095921 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Lane Departure | 39.01 | -93.85
1539713113254 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Unsafe following distance | 39.0 | -93.65
(IX) – Custom UDF for calculating Geohash
[Diagram: same pipeline as before (truck-position Stream, Dangerous-driving Stream, truck-driver Table, Dangerous-driving & driver Stream, Dangerous-driving-count Table), extended with "dangerous driving by geo" producing a dangerous-driving-geohash Stream]
Custom UDF for calculating Geohashes
Geohash is a geocoding which encodes a geographic location into a short string of letters and digits
Hierarchical spatial data structure which subdivides space into buckets of grid shape

Length | Area (width x height)
1      | 5,009.4km x 4,992.6km
2      | 1,252.3km x 624.1km
3      | 156.5km x 156km
4      | 39.1km x 19.5km
5      | 4.9km x 4.9km
12     | 3.7cm x 1.9cm

ksql> SELECT latitude, longitude,
      geohash(latitude, longitude, 4)
      FROM dangerous_driving_s;
38.31 | -91.07 | 9yz1
37.7 | -92.61 | 9ywn
34.78 | -92.31 | 9ynm
42.23 | -91.78 | 9zw8xw
...

http://geohash.gofreerange.com/
Add a UDF sample
Geohash and join to some important messages for drivers

@UdfDescription(name = "geohash",
                description = "returns the geohash for a given LatLong")
public class GeoHashUDF {

  @Udf(description = "encode lat/long to geohash of specified length.")
  public String geohash(final double latitude, final double longitude, int length) {
    return GeoHash.encodeHash(latitude, longitude, length);
  }

  @Udf(description = "encode lat/long to geohash.")
  public String geohash(final double latitude, final double longitude) {
    return GeoHash.encodeHash(latitude, longitude);
  }
}
Summary
Two ways to bring in MQTT data => MQTT Connector or MQTT Proxy
KSQL is another way to work with data in Kafka => you can (re)use some of your SQL knowledge
• Similar semantics to SQL, but for queries on continuous, streaming data
Well-suited for structured data (hence the "S" in KSQL)
There is more
• Stream to Stream Join
• REST API for executing KSQL
• Avro Format & Schema Registry
• Using Kafka Connect to write results to data stores
• …
Choosing the Right API

Consumer / Producer API
• Java, C#, C++, Scala, Python, Node.js, Go, PHP …
• subscribe(), poll(), send(), flush()
• Anything Kafka

Kafka Streams
• Fluent Java API
• mapValues(), filter(), flush()
• Stream Analytics

KSQL
• SQL dialect
• SELECT … FROM …, JOIN ... WHERE, GROUP BY
• Stream Analytics

Kafka Connect
• Declarative
• Configuration, REST API
• Out-of-the-box connectors
• Stream Integration

Flexibility ←→ Simplicity
Source: adapted from Confluent
Technology on its own won't help you.
You need to know how to use it properly.
