Modern Data Flow
Data Pipelines Done Right
Kai Waehner
Field CTO
kai.waehner@confluent.io
linkedin.com/in/kaiwaehner
@KaiWaehner
kai-waehner.de
2
Huge Growth in the Community
>100,000+
Organizations
Using Kafka
>41,000
Kafka Meetup
Attendees
>32,000
Stack Overflow
Questions
>12,000
Jiras for
Apache Kafka
Huge Growth in Usage
[Chart: Active Monthly Unique Users of the Kafka Java Client Library, JAN 2017 to JAN 2022, y-axis from 0 to 800,000. Source: Sonatype]
Growing Demand
31,000 Open Job Listings Request Kafka Skills
Real-time Data beats Slow Data.
Logistics: Real-time sensor diagnostics, Delivery planning, Estimated time of arrival updates
Payment: Fraud detection, Risk systems, Mobile applications / customer experience
Retail: Real-time inventory, Real-time point of sale reporting, Personalization
Sales: Real-time recommendations, Personalized coupon feed, Pay by walking out
Major New Platform Around Data in Motion
Data Streaming as Central Nervous System
[Diagram labels: MES, ERP, Sensors, Mobile, Customer 360, Real-time Alerting System, Data warehouse, Supplier, Alert, Forecast, Inventory, Customer, Order]
Credit Card Fraud Detection
Deidentification
Personal Recommendations
Shipment Tracking / Alerting
JIT Inventory Logistics
Geofencing
Route Optimization
Payment Verification
Dynamic Pricing
APT Detection
Customer 360
Cloud DWH Ingestion
Clickstream
Log Aggregation
Real-Time Analytics Ingestion
Real-Time Analytics
Change Data Capture
Complex Event Processing
Hybrid / Multicloud Data Flow
SaaS App Integration
Streaming ETL
ML Feature Pipelines
Customer Retention / Loyalty
Fleet Management
Credit Card Fraud Detection
Customer Retention / Loyalty
Deidentification
Payment Verification
Fleet Management
Shipment Tracking / Alerting
JIT Inventory Logistics
Geofencing
Route Optimization
Personal Recommendations
Dynamic Pricing
APT Detection
Real-Time Analytics
Hybrid / Multicloud Data Flow
Customer 360
Clickstream
ML Feature Pipelines
Cloud DWH Ingestion
Real-Time Analytics Ingestion
Change Data Capture
Log Aggregation
Complex Event Processing
Streaming ETL
SaaS App Integration
Applications Pipelines
Streaming Platform
Data Pipeline Technologies Old and New
AWS Glue
Today’s data pipelines: Kafka is necessary but not sufficient.
Principles of Modern Data Flow
Governed & Observable
Streaming
Decentralized
Declarative
Developer-Oriented
Principles of Modern Data Flow: Streaming
Reality is Real-Time
Would you blindly cross the street with traffic information that is 5 minutes old?
Real-time Data beats Slow Data.
Manufacturing
Sensor diagnostics
MES/ERP Integration
Reporting
Edge Computing
Condition Monitoring
Predictive Maintenance
Quality Assurance
Logistics
Supply Chain
Inventory management
Track & Trace
Context-specific routing
Cybersecurity
Threat detection
Intrusion detection
Incident response
Military decisions
Taxis become Software
2 min
You Can (Increasingly) Have Your Cake and Eat It Too
Batch Streaming
Streaming is a Generalization of Batch
Streaming
Batch
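One way to picture "streaming is a generalization of batch" is a batch job that is just the streaming logic run over a bounded input. The sketch below is a toy illustration in plain Python, not Kafka code; the function names and the sample `orders` list are assumptions made up for this example.

```python
from typing import Iterable, Iterator


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming view: emit an updated total after every event."""
    total = 0
    for value in events:
        total += value
        yield total


def batch_total(events: Iterable[int]) -> int:
    """Batch view: run the same streaming logic over a bounded input
    and keep only the final result."""
    result = 0
    for result in running_totals(events):
        pass
    return result


orders = [5, 10, 20]  # hypothetical bounded data set
stream_results = list(running_totals(orders))  # one result per event
batch_result = batch_total(orders)             # only the last result
```

The batch answer is simply the last value the stream would have produced, which is why a streaming engine can also serve batch-style workloads.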
Pipeline for Legacy and Modern APIs @ Allianz
Principles of Modern Data Flow: Decentralized
Centralized Data Warehouse ETL
Classic ETL
Classic Data Warehousing
Central Nervous System
Publish/Subscribe, in the Large
Customer Topic
Analysis
Customer Support System
CRM
Personalization & Marketing Systems
Kafka + Schema ≈ REST + JSON
Decentralized Data Streaming
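The publish/subscribe idea behind a shared customer topic can be sketched with a tiny in-memory bus. This is a hedged illustration only: `TopicBus`, its methods, the topic name, and the choice of CRM as publisher are all assumptions for the example and are not the Kafka client API.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TopicBus:
    """Toy in-memory stand-in for a topic-based broker (hypothetical)."""

    def __init__(self) -> None:
        self._log: Dict[str, List[dict]] = defaultdict(list)
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Append to the topic log, then fan out to every subscriber.
        self._log[topic].append(event)
        for handler in self._subscribers[topic]:
            handler(event)


bus = TopicBus()
analysis_inbox: List[dict] = []
support_inbox: List[dict] = []

# Consumers subscribe independently; the producer does not know about them.
bus.subscribe("customer", analysis_inbox.append)
bus.subscribe("customer", support_inbox.append)

# Assumed producer: the CRM publishes a customer change once.
bus.publish("customer", {"id": 42, "email": "jane@example.com"})
```

Each consuming team owns its own subscription, so new consumers can be added without changing the producer, which is the decentralization point the slides make.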
Data Mesh @ Raiffeisen Bank International
Principles of Modern Data Flow: Declarative
Say what you mean.
“Everything should be made as simple as possible, but no simpler.”
Albert Einstein
SQL provides this for Data at Rest.
SQL provides this for Data in Motion.
Pages Per Minute
Sessions
Errors Per Minute
Connector
-- Number of errors per minute: WHERE keeps status codes > 400,
-- and HAVING keeps only windows where the count > 5.
CREATE TABLE errors_per_min_alert WITH (KAFKA_TOPIC = 'errors_per_min_alert') AS
  SELECT
    status AS k1,
    AS_VALUE(status) AS status,
    WINDOWSTART AS EVENT_TS,
    COUNT(*) AS errors
  FROM clickstream WINDOW HOPPING (SIZE 60 SECOND, ADVANCE BY 20 SECOND)
  WHERE status > 400
  GROUP BY status
  HAVING COUNT(*) > 5 AND COUNT(*) IS NOT NULL;

-- Enriched user details table:
-- aggregate (COUNT with GROUP BY) over a tumbling window into a TABLE.
CREATE TABLE user_ip_activity WITH (KEY_FORMAT = 'JSON', KAFKA_TOPIC = 'user_ip_activity') AS
  SELECT
    username AS k1,
    ip AS k2,
    city AS k3,
    AS_VALUE(username) AS username,
    WINDOWSTART AS EVENT_TS,
    AS_VALUE(ip) AS ip,
    AS_VALUE(city) AS city,
    COUNT(*) AS count
  FROM user_clickstream WINDOW TUMBLING (SIZE 60 SECOND)
  GROUP BY username, ip, city
  HAVING COUNT(*) > 1;
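To make the hopping-window semantics of the first query concrete, here is a rough Python imitation of it over a finite list of events. It is a sketch under stated assumptions: timestamps are plain integer seconds, windows are aligned to multiples of the advance, negative window starts are skipped, and all names are invented for the example. The real engine evaluates the query continuously rather than over a list.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

SIZE = 60       # window size in seconds (SIZE 60 SECOND)
ADVANCE = 20    # hop in seconds (ADVANCE BY 20 SECOND)
THRESHOLD = 5   # HAVING COUNT(*) > 5


def window_starts(ts: int) -> List[int]:
    """Start times of every hopping window that contains timestamp ts."""
    last = (ts // ADVANCE) * ADVANCE
    first = last - SIZE + ADVANCE
    return [start for start in range(first, last + 1, ADVANCE) if start >= 0]


def errors_per_window(events: Iterable[Tuple[int, int]]) -> Dict[Tuple[int, int], int]:
    """Count events per (status, window start), mirroring WHERE, GROUP BY, HAVING."""
    counts: Counter = Counter()
    for ts, status in events:
        if status > 400:                      # WHERE status > 400
            for start in window_starts(ts):   # one event lands in several windows
                counts[(status, start)] += 1  # GROUP BY status, per window
    return {key: n for key, n in counts.items() if n > THRESHOLD}


# Hypothetical input: six 404s within a few seconds, plus unrelated traffic.
events = [(t, 404) for t in range(40, 46)] + [(41, 200), (42, 500)]
alerts = errors_per_window(events)
```

Because the windows overlap, the same burst of errors is reported in three windows (starting at 0, 20, and 40 seconds), which is the difference between `HOPPING` and the `TUMBLING` window used in the second query.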
Fraud Detection @ Kakao Games
Detecting fraud and operational anomalies with 300+ patterns through KSQL, including bonus abuse, multiple account usage, account takeover, chargeback fraud, affiliate fraud...
Principles of Modern Data Flow: Developer-Oriented
Data Pipeline Technologies Old and New
AWS Glue
Code is King
Development is About Evolution
Open Platforms Win
Variety of streaming APIs @ BigCommerce
Bot
Filter
Confluent Stream Designer
Principles of Modern Data Flow: Governed & Observable
Two Conflicting Pressures On Organizations
Unlock the data to enable innovation
Lock up the data to keep it safe
Three Key Tools For Governing Data Flows
Catalog: Where to get the data streams
Lineage: How the data streams got here
Schema: What the data streams look like
The catalog makes your streams discoverable.
Schemas Become the “API” for Data Flow
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "id": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "name": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "type": "number",
      "minimum": 0,
      "exclusiveMinimum": true
    }
  },
  "required": ["id", "name", "price"]
}
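To show what "schema as API" means for a producer or consumer, the rules in the Product schema above can be checked by hand in a few lines of Python. This is only a sketch: `validate_product` is a hypothetical helper written for this example, and a real pipeline would more likely rely on a JSON Schema validator or a schema registry than on hand-written checks.

```python
from numbers import Real
from typing import List


def validate_product(record: dict) -> List[str]:
    """Return a list of violations of the Product schema (empty if valid)."""
    errors: List[str] = []

    # "required": ["id", "name", "price"]
    for field in ("id", "name", "price"):
        if field not in record:
            errors.append(f"missing required field: {field}")

    # "id": integer (bool is excluded because Python treats it as an int)
    if "id" in record:
        value = record["id"]
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append("id must be an integer")

    # "name": string
    if "name" in record and not isinstance(record["name"], str):
        errors.append("name must be a string")

    # "price": number, minimum 0 with exclusiveMinimum, so strictly > 0
    if "price" in record:
        price = record["price"]
        if isinstance(price, bool) or not isinstance(price, Real):
            errors.append("price must be a number")
        elif price <= 0:
            errors.append("price must be greater than 0")

    return errors


ok = validate_product({"id": 1, "name": "Widget", "price": 9.99})
bad = validate_product({"id": "1", "name": "Widget", "price": 0})
```

Rejecting the bad record at the producer side keeps malformed events out of the stream, so every downstream consumer can rely on the same contract.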
Lineage lets you observe your real-time data flow graph.
These governance features must be designed for a decentralized, streaming, developer-oriented world.
Governance @ Raiffeisen Bank International (RBI)
Principles of Modern Data Flow
Governed & Observable
Streaming
Decentralized
Declarative
Developer-Oriented
Ungoverned & Dark
Batch
Centralized
Infrastructure-Heavy
GUI-Oriented
Governed & Observable
Streaming
Decentralized
Declarative
Developer-Oriented
Thank You
