© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eric Johnson
Senior Developer Advocate - Serverless
AWS
@edjgeek
Big “Serverless” Data
Powering Big Data with Serverless
Background Image by Эдуард Ризванов from Pixabay
Who am I?
• Sr. Developer Advocate – Serverless, AWS
• Serverless / Tooling / Automation Geek
• Software Architect / Solutions Architect
• Husband to Brigitte
• Father to Noah, Jake, Owen,
Sophie Anne, & Gracie Mae
• Music lover
• Pizza / Diet Dr. Pepper fanatic
Why are
we here?
Serverless in big data processing
Amazon Kinesis
Video Streams
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Analytics
Amazon Athena
AWS Lambda
Amazon Simple
Storage Service
Amazon DynamoDB
Understanding the role Serverless plays in Big Data
Agenda
Ingestion
Real-time processing
Real-time analytics
Post processing
What is serverless?
No infrastructure provisioning,
no management
Automatic scaling
Pay for value
Highly available and secure
Ingestion
Ingesting data at scale
Video Ingestion
Amazon Kinesis Video Streams
Data Ingestion
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Video ingestion
• Fully managed infrastructure
that scales to load
• Offers SDK in C++ and Java
• Supports live and on-demand
playback of streams
• Durable storage using
Amazon S3
• Works with many forms of
time encoded data
• Supports multiple time code
based formats
Amazon Kinesis
Video Streams
Data ingestion – Kinesis Data Streams
• Uses shards to scale
• 1 MB/second or 1,000 records/second
ingress per shard
• 2 MB/second/shard egress
• Works with Kinesis Data
Analytics
• Can support connected
consumers for enhanced
fanout
• Can store data up to 168
hours (7 days)
Amazon Kinesis
Data Streams
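The per-shard limits above translate directly into a shard-count estimate. A minimal sketch, assuming the limits quoted on this slide still apply and treating 1 MB as 1024 KB; the function name and the example workloads are hypothetical:

```python
import math

# Per-shard limits as quoted on the slide (assumed for this sketch).
INGRESS_MB_PER_SEC = 1.0
INGRESS_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def estimate_shards(records_per_sec, avg_record_kb, consumers=1):
    """Return the smallest shard count that satisfies every per-shard limit."""
    ingress_mb = records_per_sec * avg_record_kb / 1024.0
    by_bytes_in = ingress_mb / INGRESS_MB_PER_SEC
    by_records_in = records_per_sec / INGRESS_RECORDS_PER_SEC
    # Without enhanced fan-out, all consumers share the egress limit.
    by_bytes_out = ingress_mb * consumers / EGRESS_MB_PER_SEC
    return max(1, math.ceil(max(by_bytes_in, by_records_in, by_bytes_out)))


if __name__ == "__main__":
    # Many small records: the records/second limit dominates.
    print(estimate_shards(records_per_sec=5000, avg_record_kb=0.5))
    # Larger records read by three consumers: shared egress dominates.
    print(estimate_shards(records_per_sec=1000, avg_record_kb=4, consumers=3))
```

Whichever limit is hit first decides the shard count, so it is worth checking bytes in, records in, and bytes out separately.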
Data ingestion – Kinesis Firehose
• Auto-scales to meet load
• Different regions have different
capacity
• US East: 5,000 records/second, 2,000
transactions/second, and 5 MiB/second.
• Works with Kinesis Data
Analytics
• Can transform data before
delivery to target
• Stores data up to 24 hours
on failed delivery
Amazon Kinesis
Firehose
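"Can transform data before delivery to target" is done by attaching a Lambda function to the delivery stream. A minimal sketch of such a function, assuming the incoming records are JSON objects; the `processed` field is a hypothetical enrichment, not something from the talk:

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: enrich JSON records, flag anything malformed."""
    output = []
    for record in event['records']:
        try:
            payload = json.loads(base64.b64decode(record['data']))
            if not isinstance(payload, dict):
                raise ValueError('expected a JSON object')
        except ValueError:
            # Return the original data and mark the record as failed.
            output.append({'recordId': record['recordId'],
                           'result': 'ProcessingFailed',
                           'data': record['data']})
            continue
        payload['processed'] = True  # hypothetical enrichment
        # Newline-delimited JSON keeps records separable at the target.
        line = (json.dumps(payload) + '\n').encode('utf-8')
        output.append({'recordId': record['recordId'],
                       'result': 'Ok',
                       'data': base64.b64encode(line).decode('utf-8')})
    return {'records': output}
```

Every incoming `recordId` must come back exactly once, with a result of `Ok`, `Dropped`, or `ProcessingFailed`.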
Data ingestion – Kinesis Firehose
Data Sources
• Firehose PUT APIs
• Amazon Kinesis Agent
• AWS IoT
• CloudWatch Logs
• CloudWatch Events
Targets
• Amazon S3
• Amazon Redshift
• Amazon Elasticsearch Service
Kinesis Data Streams vs. Kinesis Firehose
[Diagram: two panels, "Kinesis Data Stream" and "Kinesis Firehose", each showing data producers sending records to the service]
Kinesis Firehose
[Diagram: data producers sending records to Amazon Kinesis Data Firehose]
Use Kinesis Firehose when you need:
• Ability to transform data in the stream
• Auto scaling for unpredictable load
• Multiple targets for final data
Kinesis Data Streams
[Diagram: data producers sending records to an Amazon Kinesis data stream]
Use Kinesis Data Streams when:
• You have semi-predictable traffic
• You need to perform real-time action on
data in the stream
Real-time
processing
Kinesis Data Stream + Lambda
[Diagram: Data Producers, Amazon Kinesis Data Stream, three Lambda functions, Amazon DynamoDB, Amazon Kinesis Data Stream, AWS IoT Core]
The Lambda service handles
intermittent polling
via the GetRecords API
All applications share 2 MB/second/shard egress
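On the function side, each invocation receives a batch of base64-encoded records. A minimal sketch of a handler, assuming JSON payloads; the optional `table` argument and the `pk` attribute name are hypothetical and only illustrate the DynamoDB write shown in the diagram:

```python
import base64
import json
from decimal import Decimal


def decode_records(event):
    """Yield (partition_key, payload) pairs from a Kinesis-triggered event."""
    for record in event['Records']:
        kinesis = record['kinesis']
        # DynamoDB rejects Python floats, so parse them as Decimal.
        payload = json.loads(base64.b64decode(kinesis['data']),
                             parse_float=Decimal)
        yield kinesis['partitionKey'], payload


def handler(event, context, table=None):
    """Decode each record and store it if a DynamoDB table is supplied."""
    count = 0
    for key, payload in decode_records(event):
        if table is not None:
            # boto3 Table.put_item; the item layout is an assumption.
            table.put_item(Item={'pk': key, **payload})
        count += 1
    return {'processed': count}
```

In a deployed function the table would come from `boto3.resource('dynamodb').Table(...)`; passing it in keeps the sketch runnable without AWS access.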
Kinesis Data Stream + Enhanced Fanout + Lambda
[Diagram: Data Producers, Amazon Kinesis Data Stream, three Lambda functions, Amazon DynamoDB, Amazon Kinesis Data Stream, AWS IoT Core]
Functions triggered
by consumers
Each consumer gets its own
2 MB/second/shard
egress
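The difference between shared and dedicated egress is simple arithmetic. A small sketch using the 2 MB/second/shard figure from the slides; the function name and the example numbers are hypothetical:

```python
SHARD_EGRESS_MB_PER_SEC = 2.0  # per-shard egress quoted on the slides


def per_consumer_egress(shards, consumers, enhanced_fan_out):
    """Return the read throughput (MB/second) available to each consumer."""
    total = shards * SHARD_EGRESS_MB_PER_SEC
    if enhanced_fan_out:
        # Each registered consumer gets the full per-shard egress.
        return total
    # Standard consumers split the per-shard egress between them.
    return total / consumers


if __name__ == "__main__":
    print(per_consumer_egress(4, 3, enhanced_fan_out=False))
    print(per_consumer_egress(4, 3, enhanced_fan_out=True))
```

With several consumers on one stream, the shared model shrinks each consumer's share as consumers are added, while enhanced fan-out keeps it constant.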
Video Processing
[Diagram: Amazon Kinesis Video Streams connected to Amazon Rekognition Video, Amazon SageMaker, and an S3 bucket]
• Real-time analysis and machine learning
• HLS-compatible live or on-demand playback
• Near real-time processing
Real-time
analytics
Amazon Kinesis Data Analytics
• Built-in functions to filter,
aggregate, and transform
streaming data
• Processes streaming data with
sub-second latencies
• Build SQL queries that
perform joins, aggregations
over time windows, and filters
• Includes open source libraries
based on Apache Flink that
enable you to build an
application in hours instead of
months
Amazon Kinesis
Data Analytics
Real-time analytics
Amazon Kinesis
Data Stream
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Analytics
Stream source can be
Kinesis Data Stream
or Firehose
Inside Kinesis Data Analytics
Stream data
-- Create Fail Stream --
CREATE OR REPLACE STREAM "FAIL_STREAM" (
    sensorId INT,
    currentTemperature INT,
    status VARCHAR(10)
);
-- Pump rows whose status contains FAIL into FAIL_STREAM --
CREATE OR REPLACE PUMP "FAIL_STREAM_PUMP" AS INSERT INTO "FAIL_STREAM"
    SELECT STREAM "sensorId", "currentTemperature", "status"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" SIMILAR TO '%FAIL%';
-- Create Warn Stream --
CREATE OR REPLACE STREAM "WARN_STREAM" (
    sensorId INT,
    currentTemperature INT,
    status VARCHAR(10)
);
-- Pump rows whose status contains WARN into WARN_STREAM --
CREATE OR REPLACE PUMP "WARN_STREAM_PUMP" AS INSERT INTO "WARN_STREAM"
    SELECT STREAM "sensorId", "currentTemperature", "status"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" SIMILAR TO '%WARN%';
FAIL_STREAM
WARN_STREAM
Use SQL or Apache Flink to filter data
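The two pumps are plain filters on the `status` column. For readers who do not know streaming SQL, a rough Python equivalent of the same routing, as a sketch only; it runs over an in-memory list rather than a stream, and the function name is hypothetical:

```python
def route_by_status(records):
    """Split sensor records into FAIL and WARN lists, like the two pumps."""
    fail_stream, warn_stream = [], []
    for record in records:
        status = record.get('status', '')
        # SIMILAR TO '%FAIL%' matches any status containing FAIL.
        if 'FAIL' in status:
            fail_stream.append(record)
        # The pumps are independent, so both tests are applied to every row.
        if 'WARN' in status:
            warn_stream.append(record)
    return fail_stream, warn_stream


if __name__ == "__main__":
    sample = [
        {'sensorId': 1, 'currentTemperature': 70, 'status': 'OK'},
        {'sensorId': 2, 'currentTemperature': 120, 'status': 'FAIL'},
        {'sensorId': 3, 'currentTemperature': 95, 'status': 'WARN'},
    ]
    print(route_by_status(sample))
```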
FAIL_STREAM
AWS Lambda
• Alert
• Diagnose
• Remediate
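A Lambda function attached as the FAIL_STREAM output could handle the alert step. A minimal sketch, assuming the output is delivered as JSON and that the function must acknowledge each `recordId` with `Ok` or `DeliveryFailed`; the `alert` callback and the message text are hypothetical, and keys are lower-cased because unquoted SQL column names may arrive upper-cased:

```python
import base64
import json


def handler(event, context, alert=print):
    """Raise an alert for every failed-sensor record and acknowledge it."""
    results = []
    for record in event['records']:
        try:
            raw = json.loads(base64.b64decode(record['data']))
            # Normalize key case so sensorId and SENSORID both work.
            reading = {key.lower(): value for key, value in raw.items()}
            message = 'Sensor {} failed at temperature {}'.format(
                reading['sensorid'], reading['currenttemperature'])
            alert(message)
            result = 'Ok'
        except (ValueError, KeyError, AttributeError):
            # Unreadable or incomplete record: report it as not delivered.
            result = 'DeliveryFailed'
        results.append({'recordId': record['recordId'], 'result': result})
    return {'records': results}
```

In practice `alert` would publish to a notification or ticketing service; diagnosis and remediation would follow the same per-record pattern.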
WARN_STREAM
Amazon Kinesis
Data Stream
• Dashboards
• Consumer
response
Real-time analytics
[Diagram: Amazon Kinesis Data Stream and Amazon Kinesis Data Firehose feed Amazon Kinesis Data Analytics; FAIL_STREAM goes to AWS Lambda and WARN_STREAM goes to an Amazon Kinesis Data Stream]
What about the raw data?
Real-time analytics
[Diagram: the pipeline from the previous slide, plus an additional Amazon Kinesis Data Firehose and a Raw Data Archive]
Post processing
Serverless data storage
Amazon Simple
Storage Service
Amazon
DynamoDB
Amazon
Timestream
Amazon Quantum
Ledger Database
Amazon
CloudWatch
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Analytics
How you need to process
your data determines
where to store it
Serverless storage options
Amazon Simple Storage Service
• Object storage
• Unstructured data
Amazon DynamoDB
• NoSQL
• Key value or document data
Amazon Timestream
• Time series database
Amazon Quantum Ledger Database
• Immutable and transparent
• Cryptographically verifiable
Amazon CloudWatch
• Structured data
• Alerting built in
Post processing – Serverless Tools
Amazon Athena
Query S3 data with standard
SQL expressions
Amazon S3 Select
Retrieve subsets of object data,
instead of the entire object.
AWS Glue
Extract, transform, and load
(ETL) service that works across
multiple services.
AWS Glue
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless services]
Other non-serverless services
• MariaDB
• Microsoft SQL Server
• MySQL
• Oracle
• PostgreSQL
Critical data can be stored in many places
AWS Glue
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless services, plus a Crawler and the Data Catalog]
What the crawler is doing
• Classifies data to determine the format, schema,
and associated properties of the raw data
• Groups data into tables or partitions – Data is
grouped based on crawler heuristics.
• Writes metadata to the Data Catalog
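Setting up a crawler comes down to naming the data stores to scan and the catalog database to write to. A minimal sketch using the boto3 Glue calls `create_crawler` and `start_crawler`; the crawler name, role ARN, database, and paths are hypothetical, and the client is passed in so the sketch runs without AWS access:

```python
def build_crawler_request(name, role_arn, database,
                          s3_paths=(), dynamodb_tables=()):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        'Name': name,
        'Role': role_arn,
        'DatabaseName': database,  # Data Catalog database for the metadata
        'Targets': {
            'S3Targets': [{'Path': path} for path in s3_paths],
            'DynamoDBTargets': [{'Path': table} for table in dynamodb_tables],
        },
    }


def create_and_start_crawler(glue, **kwargs):
    """Create the crawler, start one run, and return the request used."""
    request = build_crawler_request(**kwargs)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request['Name'])
    return request
```

With real credentials the client would be `boto3.client('glue')`; the crawler then classifies the data, groups it into tables, and writes the metadata as listed above.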
This catalog contains metadata about the
data stores. How do I get the data itself in
a meaningful way?
Enter: Amazon Athena
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless services, with the Crawler and Data Catalog, now joined by Amazon Athena]
Amazon Athena
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless services (now including Amazon Aurora), with the Crawler, Data Catalog, and Amazon Athena]
Athena queries Glue Data Catalog
Glue returns data from data source
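Running a query from code follows a start, poll, fetch pattern. A minimal sketch using the boto3 Athena calls `start_query_execution`, `get_query_execution`, and `get_query_results`; the function name, database, and output location are hypothetical, and only the first page of results is read:

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run one query and return the first page of rows as lists of strings."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )
    query_id = started['QueryExecutionId']
    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(poll_seconds)
    if state != 'SUCCEEDED':
        raise RuntimeError('Athena query ended in state ' + state)
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    # Missing cells have no VarCharValue and come back as None.
    return [[cell.get('VarCharValue') for cell in row['Data']] for row in rows]
```

With real credentials the client would be `boto3.client('athena')`, and for SELECT statements the first row returned is typically the column header.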
Question
I have HUGE compressed CSV files
stored on Amazon S3.
How do I get small bits of data without
reading the entire file?
Enter: Amazon S3 Select
import boto3

# Create an S3 client using the default credential chain.
s3 = boto3.client('s3')

# Run a SQL expression against one CSV object; only matching rows come back.
r = s3.select_object_content(
    Bucket='jbarr-us-west-2',
    Key='sample-data/airportCodes.csv',
    ExpressionType='SQL',
    Expression="select * from s3object s where s.\"Country (Name)\" like '%United States%'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
    OutputSerialization={'CSV': {}},
)

# The payload is an event stream: Records events carry data, Stats events carry metrics.
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
Before S3 Select
[Diagram: Lambda function reads from a bucket; entire file returned]
After S3 Select
[Diagram: Lambda function reads from a bucket; parsed value returned]
Up to 400% faster
and 80% cheaper
Questions?
https://pixabay.com/illustrations/questions-font-who-what-how-why-2245264/
Eric Johnson
@edjgeek
Image Source: https://pixabay.com/illustrations/thank-you-polaroid-letters-2490552/
