BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Ingesting and Processing IoT Data -
using MQTT, Kafka Connect and KSQL
Guido Schmutz
Kafka Summit 2018 – 16.10.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Introduction
2. IoT Logistics use case – Kafka Ecosystem "in Action"
3. Stream Data Integration – IoT Device to Kafka over MQTT
4. Stream Analytics with KSQL
5. Summary
Introduction
Reference Architecture for Data Analytics Solutions
[Architecture diagram: bulk sources (DB, File, Extract) and event sources (Location, IoT Data, Mobile Apps, Social, Telemetry) feed via File Import / SQL Import, Change Data Capture, Data Flow and an Event Hub into Big Data storage (Raw, Refined, Results) with Parallel Processing on a Hadoop cluster; a Stream Processor with State provides Stream Analytics; Microservices, Enterprise Apps, BI Tools, an Enterprise Data Warehouse (via SQL Export) and a Search/Explore service consume the results; an Edge Node with Rules, Event Hub and Storage sits close to the sources]
[The same Reference Architecture for Data Analytics Solutions diagram, repeated]
Two Types of Stream Processing (from Gartner)

Stream Data Integration
• Primarily covers streaming ETL
• Integration of data sources and data sinks
• Filter and transform data
• (Enrich data)
• Route data

Stream Analytics
• Analytics use cases
• Calculating aggregates and detecting patterns to generate higher-level, more relevant summary information (complex events => used to be CEP)
• Complex events may signify threats or opportunities that require a response
Stream Data Integration and Stream Analytics with Kafka
[Diagram: Source Connector → Kafka Broker (topic trucking_driver) → Sink Connector, with a Stream Processing component reading from and writing back to the broker]
Unified Architecture for Modern Data Analytics Solutions
[Same architecture diagram as above, now shown as one unified picture combining the batch side (Hadoop cluster / Big Data storage with Parallel Processing) and the streaming side (Event Hub, Stream Processor, Stream Analytics), serving Microservices, Enterprise Apps, BI Tools, the Enterprise Data Warehouse and Search/Explore]
Various IoT Data Protocols
• MQTT (Message Queue Telemetry Transport)
• CoAP
• AMQP
• DDS (Data Distribution Service)
• STOMP
• REST
• WebSockets
• …
IoT Logistics use case – Kafka
Ecosystem "in Action"
Demo - IoT Logistics Use Case
Trucks are sending driving info and geo-position data in one single message (Position & Driving Info)
Test data generator originally by Hortonworks

{
  "timestamp": 1537343400827,
  "truckId": 87,
  "driverId": 13,
  "routeId": 987179512,
  "eventType": "Normal",
  "latitude": 38.65,
  "longitude": -90.21,
  "correlationId": "-3208700263746910537"
}
Stream Data Integration – IoT
Device to Kafka over MQTT
Stream Data Integration
[Diagram as above: Source Connector → Kafka Broker (trucking_driver) → Sink Connector / Stream Processing]
(I) IoT Device sends data via MQTT
MQTT (Message Queue Telemetry Transport)
• Pub/Sub architecture with a message broker
• Built-in retry / QoS mechanism
• Last Will and Testament (LWT)
• Not all MQTT brokers are scalable / available
• Does not provide state (history)
[Diagram: the truck publishes the Position & Driving Info JSON shown above to the MQTT topic truck/nn/position]
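For illustration, a minimal device-side publish could look like the following Eclipse Paho (Java) sketch; the broker URI, client id and QoS are assumptions based on the demo setup.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class TruckPositionPublisher {
    public static void main(String[] args) throws MqttException {
        // Assumed Mosquitto broker and client id from the demo setup
        MqttClient client = new MqttClient("tcp://mosquitto:1883", "truck-87");
        client.connect();

        // One combined Position & Driving Info message, as in the demo
        String payload = "{\"timestamp\":1537343400827,\"truckId\":87,\"driverId\":13,"
                + "\"routeId\":987179512,\"eventType\":\"Normal\","
                + "\"latitude\":38.65,\"longitude\":-90.21,"
                + "\"correlationId\":\"-3208700263746910537\"}";

        MqttMessage message = new MqttMessage(payload.getBytes());
        message.setQos(0);                       // fire-and-forget
        client.publish("truck/87/position", message);
        client.disconnect();
    }
}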
MQTT to Kafka using Confluent MQTT Connector
IoT device sends data via MQTT – how do we get the data into Kafka?
[Diagram: MQTT topic truck/nn/position → ? → Kafka topic truck_position, carrying the Position & Driving Info JSON shown above]
2 Ways for MQTT with the Confluent Streaming Platform

Confluent MQTT Connector (Preview)
• Pull-based
• Integrates with (existing) MQTT servers
• Can be used both as a Source and a Sink
• Output is an envelope with all of the properties of the incoming message
  • Value: body of the MQTT message
  • Key: the MQTT topic the message was written to
• Can consume multiple MQTT topics and write to one single Kafka topic
• RegexRouter SMT can be used to change topic names

Confluent MQTT Proxy
• Push-based
• Enables MQTT clients to use the MQTT protocol to publish data directly to Kafka
• MQTT Proxy is stateless and independent of other instances
• Simple mapping scheme of MQTT topics to Kafka topics based on regular expressions
• Reduced lag in message publishing compared to traditional MQTT brokers
(II) MQTT to Kafka using Confluent MQTT Connector
[Diagram: MQTT topic truck/nn/position → "mqtt to kafka" connector → Kafka topic truck_position → kafkacat; the payload is the Position & Driving Info JSON shown above]
Confluent MQTT Connector
Currently available as a Preview on Confluent Hub
Set up plugin.path to specify the additional folder

confluent-hub install confluentinc/kafka-connect-mqtt:1.0.0-preview

plugin.path=/usr/share/java,/etc/kafka-connect/custom-plugins,/usr/share/confluent-hub-components
Create an instance of Confluent MQTT Connector

#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "name": "mqtt-source",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position",
    "mqtt.clean.session.enabled": "true",
    "mqtt.connect.timeout.seconds": "30",
    "mqtt.keepalive.interval.seconds": "60",
    "mqtt.qos": "0"
  }
}'
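The diagram above uses kafkacat to inspect the result; a plain Java consumer does the same job. A minimal sketch, assuming the demo's broker address and the truck_position topic.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TruckPositionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092");   // assumed broker from the demo
        props.put("group.id", "truck-position-inspector");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Key = original MQTT topic, value = MQTT message body (connector envelope)
            consumer.subscribe(Collections.singletonList("truck_position"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}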
(III) MQTT to Kafka using Confluent MQTT Proxy
[Diagram: trucks publish Position & Driving Info and Engine Metrics via MQTT to the MQTT Proxy, which writes them to the Kafka topics truck position and engine metrics, each read by a console consumer]
Configure and Start MQTT Proxy

kafka-mqtt.properties:
topic.regex.list=truck_position:.*position,engine_metric:.*engine_metric
listeners=0.0.0.0:1883
bootstrap.servers=PLAINTEXT://broker-1:9092
confluent.topic.replication.factor=1

bin/kafka-mqtt-start kafka-mqtt.properties
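With the proxy running, an unchanged MQTT client can publish straight at the proxy listener, and the topic.regex.list above maps the MQTT topic to a Kafka topic. A minimal Paho (Java) sketch; the proxy host name and the engine-metrics payload are assumptions.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class EngineMetricsToProxy {
    public static void main(String[] args) throws MqttException {
        // The client speaks plain MQTT, but the endpoint is the MQTT Proxy, not a broker
        MqttClient client = new MqttClient("tcp://mqtt-proxy:1883", "truck-87-engine");
        client.connect();

        // Hypothetical engine-metrics payload for illustration
        MqttMessage message = new MqttMessage("{\"truckId\":87,\"rpm\":2200,\"oilTemp\":92}".getBytes());
        // ".*engine_metric" in topic.regex.list routes this to the Kafka topic engine_metric
        client.publish("truck/87/engine_metric", message);
        client.disconnect();
    }
}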
MQTT Connector vs. MQTT Proxy

MQTT Connector
• Pull-based
• Use existing MQTT infrastructures
• Bi-directional

MQTT Proxy
• Push-based
• Does not provide all MQTT functionality
• Only uni-directional

[Diagram: in REGION-1 DC and REGION-2 DC, position and driving-info messages flow from the MQTT topics truck/nn/position and truck/nn/driving info via "mqtt to kafka" into the Kafka topics truck position and truck driving info, which are consolidated at the Headquarter DC]
(IV) MQTT to Kafka using StreamSets Data Collector
[Diagram: MQTT topic truck/nn/position → "mqtt to kafka" pipeline → Kafka topic truck_position → console consumer; the payload is the Position & Driving Info JSON shown above]
MQTT to Kafka using StreamSets Data Collector
Wait … there is more ….
[Diagram: Position & Driving Info messages flow from the MQTT topic truck/nn/position through the MQTT Proxy ("mqtt to kafka") into the Kafka topics truck_driving_info and truck_position, each read by a console consumer – what about some analytics?]
Stream Analytics with KSQL
Stream Analytics
[Diagram as above: Source Connector → Kafka Broker (trucking_driver) → Sink Connector, with the Stream Processing component now in focus]
KSQL - Terminology

Stream
• "History"
• An unbounded sequence of structured data ("facts")
• Facts in a stream are immutable: new facts can be inserted to a stream, existing facts can never be updated or deleted
• Streams can be created from a Kafka topic or derived from an existing stream

Table
• "State"
• A view of a stream, or another table, and represents a collection of evolving facts
• Facts in a table are mutable: new facts can be inserted to the table, existing facts can be updated or deleted
• Tables can be created from a Kafka topic or derived from existing streams and tables

Enables stream processing with zero coding required
The simplest way to process streams of data in real-time
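The same stream/table duality exists in the Kafka Streams Java API, where a topic can be read either as a KStream ("history") or as a KTable ("state"). A minimal sketch, with topic names borrowed from the demo and serde configuration omitted.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamTableDuality {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "History": every position event is a new immutable fact
        KStream<String, String> positions = builder.stream("truck_position");

        // "State": only the latest fact per key is kept
        KTable<String, String> drivers = builder.table("truck_driver");

        // ... build and start a KafkaStreams instance with this topology
    }
}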
(V) Create STREAM on truck_position and use it in KSQL CLI
[Diagram: MQTT topic truck/nn/position → "mqtt-to-kafka" → Kafka topic truck-position → KSQL Stream, queried interactively from the KSQL CLI; the payload is the Position & Driving Info JSON shown above]
Create a STREAM on truck_driving_info

ksql> CREATE STREAM truck_driving_info_s
        (ts VARCHAR,
         truckId VARCHAR,
         driverId BIGINT,
         routeId BIGINT,
         eventType VARCHAR,
         latitude DOUBLE,
         longitude DOUBLE,
         correlationId VARCHAR)
      WITH (kafka_topic='truck_driving_info',
            value_format='JSON');

 Message
----------------
 Stream created
Create a STREAM on truck_driving_info
ksql> describe truck_position_s;
Field | Type
---------------------------------
ROWTIME | BIGINT
ROWKEY | VARCHAR(STRING)
TS | VARCHAR(STRING)
TRUCKID | VARCHAR(STRING)
DRIVERID | BIGINT
ROUTEID | BIGINT
EVENTTYPE | VARCHAR(STRING)
LATITUDE | DOUBLE
LONGITUDE | DOUBLE
CORRELATIONID | VARCHAR(STRING)
KSQL - SELECT
Selects rows from a KSQL stream or table
Result of this statement will not be persisted in a Kafka topic and will only be printed out
in the console
from_item is one of the following: stream_name, table_name
SELECT select_expr [, ...]
FROM from_item
[ LEFT JOIN join_table ON join_criteria ]
[ WINDOW window_expression ]
[ WHERE condition ]
[ GROUP BY grouping_expression ]
[ HAVING having_expression ]
[ LIMIT count ];
Use SELECT to browse from Stream
ksql> SELECT * FROM truck_driving_info_s;
1539711991642 | truck/24/position | null | 24 | 10 | 1198242881 | Normal | 36.84 | -94.83 | -6187001306629414077
1539711991691 | truck/26/position | null | 26 | 13 | 1390372503 | Normal | 42.04 | -88.02 | -6187001306629414077
1539711991882 | truck/66/position | null | 66 | 22 | 1565885487 | Normal | 38.33 | -94.35 | -6187001306629414077
1539711991902 | truck/22/position | null | 22 | 26 | 1198242881 | Normal | 36.73 | -95.01 | -6187001306629414077

ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
1539712101614 | truck/67/position | null | 67 | 11 | 160405074 | Lane Departure | 38.98 | -92.53 | -6187001306629414077
1539712116450 | truck/18/position | null | 18 | 25 | 987179512 | Overspeed | 40.76 | -88.77 | -6187001306629414077
1539712120102 | truck/31/position | null | 31 | 12 | 927636994 | Unsafe following distance | 38.22 | -91.18 | -6187001306629414077
(VI) – CREATE AS … SELECT …
[Diagram: truck/nn/position → "mqtt-to-kafka" → truck-position Stream → "detect_dangerous_driving" → Dangerous-driving Stream; the payload is the Position & Driving Info JSON shown above]
CREATE STREAM … AS SELECT …
Creates a new KSQL stream along with the corresponding Kafka topic and streams the result of the SELECT query as a changelog into the topic
WINDOW clause can only be used if the from_item is a stream
CREATE STREAM stream_name
[WITH ( property_name = expression [, ...] )]
AS SELECT select_expr [, ...]
FROM from_stream [ LEFT | FULL | INNER ]
JOIN [join_table | join_stream]
[ WITHIN [(before TIMEUNIT, after TIMEUNIT) | N TIMEUNIT] ] ON join_criteria
[ WHERE condition ]
[PARTITION BY column_name];
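For comparison, roughly the same "derive a new stream from a query" idea in the Kafka Streams Java API: filter the position stream and write the result to a new topic. A sketch only; the naive string match stands in for real JSON parsing and the application config is omitted.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class DangerousDrivingTopology {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> positions = builder.stream("truck_position");

        // Roughly: SELECT * FROM truck_position_s WHERE eventtype != 'Normal'
        positions
            .filter((key, value) -> !value.contains("\"eventType\":\"Normal\""))
            .to("dangerous_driving");   // result topic, like the KSQL CREATE STREAM ... AS SELECT
    }
}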
INSERT INTO … SELECT …
Stream the result of the SELECT query into an existing stream and its underlying topic
The schema and partitioning column produced by the query must match the stream's schema and key
If the schema and partitioning column are incompatible with the stream, then the statement will return an error
stream_name and from_item must both refer to a Stream. Tables are not supported!

CREATE STREAM stream_name ...;

INSERT INTO stream_name
  SELECT select_expr [, ...]
  FROM from_stream
  [ WHERE condition ]
  [ PARTITION BY column_name ];
CREATE AS … SELECT …

ksql> CREATE STREAM dangerous_driving_s
      WITH (kafka_topic='dangerous_driving_s',
            value_format='JSON')
      AS SELECT * FROM truck_position_s
      WHERE eventtype != 'Normal';

 Message
----------------------------
 Stream created and running

ksql> select * from dangerous_driving_s;
1539712399201 | truck/67/position | null | 67 | 11 | 160405074 | Unsafe following distance | 38.65 | -90.21 | -6187001306629414077
1539712416623 | truck/67/position | null | 67 | 11 | 160405074 | Unsafe following distance | 39.1 | -94.59 | -6187001306629414077
1539712430051 | truck/18/position | null | 18 | 25 | 987179512 | Lane Departure | 35.1 | -90.07 | -6187001306629414077
Windowing
Streams are unbounded; we need some meaningful time frames to do computations (i.e. aggregations)
Computations over events are done using windows of data
Windows are tracked per unique key
[Diagram: Fixed Window, Sliding Window and Session Window over a stream of data along the time axis]
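The 30-second tumbling window used in the next KSQL example looks like this in the Kafka Streams Java API. A minimal sketch; it assumes the record key already carries the eventType and leaves serde configuration out.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class DangerousDrivingCount {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> dangerous = builder.stream("dangerous_driving");

        // Count events per key in 30-second tumbling windows
        KTable<Windowed<String>, Long> counts = dangerous
            .groupByKey()   // assumes the record key is the eventType
            .windowedBy(TimeWindows.of(Duration.ofSeconds(30)))
            .count();
    }
}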
(VII) Aggregate and Window
[Diagram: truck/nn/position → "mqtt-to-kafka" → truck-position Stream → "detect_dangerous_driving" → Dangerous-driving Stream → "count_by_eventType" → Dangerous-driving-count Table; the payload is the Position & Driving Info JSON shown above]
SELECT COUNT … GROUP BY

ksql> CREATE TABLE dangerous_driving_count AS
      SELECT eventType, count(*) nof
      FROM dangerous_driving_s
      WINDOW TUMBLING (SIZE 30 SECONDS)
      GROUP BY eventType;

 Message
----------------------------
 Table created and running

ksql> SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS'), eventType, nof
      FROM dangerous_driving_count;

2018-10-16 05:12:19.408 | Unsafe following distance | 1
2018-10-16 05:12:38.926 | Unsafe following distance | 1
2018-10-16 05:12:39.615 | Unsafe tail distance | 1
2018-10-16 05:12:43.155 | Overspeed | 1
Joining
Stream to Static (Table) Join, Stream to Stream Join (one window join), Stream to Stream Join (two window join)
[Diagram: the three join types (Stream-to-Static Join, Stream-to-Stream Join with one window, Stream-to-Stream Join with two windows) along the time axis]
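The stream-to-static (table) join performed in KSQL on the following slides maps to a KStream-KTable join in the Kafka Streams Java API. A minimal sketch; keys, serdes and the value-joiner logic are simplified assumptions.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichWithDriver {
    public static void build(StreamsBuilder builder) {
        // Stream of dangerous driving events, assumed keyed by driverId
        KStream<String, String> dangerous = builder.stream("dangerous_driving");
        // Table of drivers, keyed by id (changelog of the truck_driver topic)
        KTable<String, String> drivers = builder.table("truck_driver");

        // Left join: each event is enriched with the current driver record (or null)
        KStream<String, String> enriched =
            dangerous.leftJoin(drivers, (event, driver) -> event + " | " + driver);

        enriched.to("dangerous_driving_and_driver");
    }
}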
(VIII) – Join Table to enrich with Driver data
[Diagram: driver records (e.g. 27, Walter, Ward, Y, 24-JUL-85, 2017-10-02 15:19:00) are loaded via "jdbc-to-kafka" into the truck-driver Table; "join dangerous-driving & driver" combines the Dangerous-driving Stream with this Table into a Dangerous-driving & driver Stream, alongside the existing count_by_eventType / Dangerous-driving-count Table]
Example driver record: {"id":27,"firstName":"Walter","lastName":"Ward","available":"Y","birthdate":"24-JUL-85","last_update":1506923052012}
Join Table to enrich with Driver data

#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "jdbc-driver-source",
  "config": {
    "connector.class": "JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db/sample?user=sample&password=sample",
    "mode": "timestamp",
    "timestamp.column.name": "last_update",
    "table.whitelist": "driver",
    "validate.non.null": "false",
    "topic.prefix": "truck_",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "name": "jdbc-driver-source",
    "transforms": "createKey,extractInt",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractInt.field": "id"
  }
}'
Create Table with Driver State
ksql> CREATE TABLE driver_t 
(id BIGINT, 
first_name VARCHAR, 
last_name VARCHAR, 
available VARCHAR) 
WITH (kafka_topic='truck_driver', 
value_format='JSON', 
key='id');
Message
----------------
Table created
Create Table with Driver State

ksql> CREATE STREAM dangerous_driving_and_driver_s
      WITH (kafka_topic='dangerous_driving_and_driver_s',
            value_format='JSON', partitions=8)
      AS SELECT driverId, first_name, last_name, truckId, routeId, eventtype, latitude, longitude
      FROM truck_position_s
      LEFT JOIN driver_t
      ON truck_position_s.driverId = driver_t.id;

 Message
----------------------------
 Stream created and running

ksql> select * from dangerous_driving_and_driver_s;
1539713095921 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Lane Departure | 39.01 | -93.85
1539713113254 | 11 | 11 | Micky | Isaacson | 67 | 160405074 | Unsafe following distance | 39.0 | -93.65
(IX) – Custom UDF for calculating Geohash
[Diagram: same pipeline as before (truck-position Stream, Dangerous-driving Stream, truck-driver Table, Dangerous-driving & driver Stream, Dangerous-driving-count Table), extended with "dangerous driving by geo" producing a dangerous-driving-geohash Stream]
Custom UDF for calculating Geohashes
Geohash is a geocoding which encodes a geographic location into a short string of letters and digits
Hierarchical spatial data structure which subdivides space into buckets of grid shape

Length | Area (width x height)
1      | 5,009.4km x 4,992.6km
2      | 1,252.3km x 624.1km
3      | 156.5km x 156km
4      | 39.1km x 19.5km
5      | 4.9km x 4.9km
12     | 3.7cm x 1.9cm

ksql> SELECT latitude, longitude,
      geohash(latitude, longitude, 4)
      FROM dangerous_driving_s;
38.31 | -91.07 | 9yz1
37.7 | -92.61 | 9ywn
34.78 | -92.31 | 9ynm
42.23 | -91.78 | 9zw8xw
...

http://geohash.gofreerange.com/
Add a UDF sample
Geohash and join to some important messages for drivers

@UdfDescription(name = "geohash",
                description = "returns the geohash for a given LatLong")
public class GeoHashUDF {

  @Udf(description = "encode lat/long to geohash of specified length.")
  public String geohash(final double latitude, final double longitude, int length) {
    return GeoHash.encodeHash(latitude, longitude, length);
  }

  @Udf(description = "encode lat/long to geohash.")
  public String geohash(final double latitude, final double longitude) {
    return GeoHash.encodeHash(latitude, longitude);
  }
}
Summary
Two ways to bring in MQTT data => MQTT Connector or MQTT Proxy
KSQL is another way to work with data in Kafka => you can (re)use some of your SQL knowledge
• Similar semantics to SQL, but for queries on continuous, streaming data
Well-suited for structured data (hence the "S" in KSQL)
There is more
• Stream to Stream Join
• REST API for executing KSQL
• Avro Format & Schema Registry
• Using Kafka Connect to write results to data stores
• …
Choosing the Right API

Consumer / Producer API
• Java, C#, C++, Scala, Python, Node.js, Go, PHP …
• subscribe(), poll(), send(), flush()
• Anything Kafka

Kafka Streams
• Fluent Java API
• mapValues(), filter(), flush()
• Stream Analytics

KSQL
• SQL dialect
• SELECT … FROM …, JOIN ... WHERE, GROUP BY
• Stream Analytics

Kafka Connect
• Declarative
• Configuration, REST API
• Out-of-the-box connectors
• Stream Integration

Flexibility ←→ Simplicity
Source: adapted from Confluent
Technology on its own won't help you.
You need to know how to use it properly.
