Cobus Bernard
Sr Developer Advocate
Amazon Web Services
Getting Started with Data Lakes on AWS
@cobusbernard
cobusbernard
cobusbernard
Agenda
What is a Data Lake
Storing data in S3
Steps to build a Data Lake
AWS Lake Formation
Demo
Q&A
A data lake is a centralised repository that allows you to store all your structured and unstructured data at any scale.
Why data lakes?
Data lakes provide:
• Relational and non-relational data
• Scale-out to EBs (1 EB = 1,024 PB = 1,048,576 TB)
• A diverse set of analytics and machine learning tools
• Work on data without any data movement
• Designed for low-cost storage and analytics
[Architecture diagram: OLTP, ERP, CRM, and LOB systems feed a data warehouse for business intelligence, while data from devices, web, sensors, and social lands in the data lake; a catalog sits over the data lake, which serves machine learning, DW queries, big data processing, interactive, and real-time workloads.]
Build a secure data lake on Amazon S3
Amazon S3 Block Public Access
• Controls public access
• Across AWS accounts & individual S3 bucket levels
• Specify any type of public permissions via ACL or policy

Amazon S3 Object Lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance

Amazon S3 object tags
• Access control, lifecycle policies, analysis
• Classify data, filter objects
• Define replication policies

Amazon S3 access points
• Multi-tenant bucket
• Dedicated access points
• Custom permissions from a virtual private cloud (VPC)

Amazon FSx for Lustre (file system access to data in S3)
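As a rough illustration of the first two controls above, the boto3 sketch below turns on S3 Block Public Access for a bucket and classifies an uploaded object with tags. The bucket name, key, and tag values are placeholders for the example.

```python
# Minimal sketch: lock down a data lake bucket and tag an object,
# assuming a hypothetical bucket named "my-data-lake" already exists.
import boto3

s3 = boto3.client("s3")

# S3 Block Public Access: reject any public ACLs or public bucket policies.
s3.put_public_access_block(
    Bucket="my-data-lake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# S3 object tags: classify data on upload so lifecycle rules,
# replication, and access policies can filter on the tags.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/logs/2020/06/01/events.json",
    Body=b'{"event": "example"}',
    Tagging="classification=raw&team=analytics",
)
```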
Choosing the right data lake storage class
Select storage class by data pipeline stage
Raw data: Amazon S3 Standard
• Small log files
• Overwrites if synced
• Short lived
• Moved & deleted
• Batched & archived

ETL: Amazon S3 Standard
• Data churn
• Small intermediates
• Multiple transforms
• Deletes <30 days
• Output to data lake

Production data lake: Amazon S3 Intelligent-Tiering
• Optimized sizes (MBs)
• Many users
• Unpredictable access
• Long-lived assets
• Hot to cool

Online cool data: Amazon S3 Standard-Infrequent Access and One Zone-IA
• Replicated DR data
• Infrequently accessed
• Infrequent queries
• ML model training

Historical data: Amazon S3 Glacier or S3 Glacier Deep Archive
• Historical assets
• ML model training
• Compliance/Audit
• Data protection
• Planned restores
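One way to put these stages into practice is with S3 lifecycle rules. The sketch below is a minimal example, assuming a hypothetical bucket "my-data-lake" with prefixes for ETL scratch space, production data, and historical data; the prefixes and day counts are illustrative, not prescriptive.

```python
# Minimal sketch: lifecycle rules that roughly follow the pipeline stages above.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                # Production data lake: let Intelligent-Tiering handle hot-to-cool moves.
                "ID": "production-to-intelligent-tiering",
                "Filter": {"Prefix": "datalake/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                # Historical data: archive after 90 days, deep archive after a year.
                "ID": "historical-to-glacier",
                "Filter": {"Prefix": "historical/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            },
            {
                # ETL intermediates: short-lived, so expire them after 30 days.
                "ID": "expire-etl-scratch",
                "Filter": {"Prefix": "etl-tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    },
)
```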
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
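For step 1, a minimal sketch of setting up the storage layer on AWS is shown below: create an S3 bucket and switch on default encryption. The bucket name and Region are placeholders.

```python
# Minimal sketch of step 1 (set up storage): one bucket with default encryption.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

s3.create_bucket(
    Bucket="my-data-lake",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Encrypt every object by default with SSE-S3 (AES-256).
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```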
Data preparation accounts for ~80% of the work
How data scientists spend their time:
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other
Sample of steps required:
• Find sources
• Create Amazon Simple Storage Service (Amazon S3) locations
• Configure access policies
• Map tables to Amazon S3 locations
• ETL jobs to load and clean data
• Create metadata access policies
• Configure access from analytics services

Rinse and repeat for other data sets, users, and end-services.

And more:
• Manage and monitor ETL jobs
• Update the metadata catalog as data changes
• Update policies across services as users and permissions change
• Manually maintain cleansing scripts
• Create audit processes for compliance
• …

Manual | Error-prone | Time-consuming
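To make the "manual" point concrete, here is a rough boto3 sketch of just one of these steps, mapping a table to an S3 location in the AWS Glue Data Catalog. Every name in it (database, table, columns, path) is hypothetical, and each of the other steps needs similar hand-crafted work.

```python
# Minimal sketch: register a database and a table over an S3 location by hand.
import boto3

glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "weblogs"})

glue.create_table(
    DatabaseName="weblogs",
    TableInput={
        "Name": "access_logs",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "request_ts", "Type": "timestamp"},
                {"Name": "path", "Type": "string"},
                {"Name": "status", "Type": "int"},
            ],
            # Map the table to its Amazon S3 location.
            "Location": "s3://my-data-lake/raw/access_logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": "\t"},
            },
        },
    },
)
```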
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
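As a sketch of the "define once, enforce across services" idea, the boto3 call below grants one role SELECT on one catalog table through Lake Formation; integrated services such as Athena, Redshift Spectrum, and EMR then honour that grant. The account ID, role, database, and table names are placeholders.

```python
# Minimal sketch: grant a principal SELECT on a catalog table via Lake Formation.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/data-analyst"
    },
    Resource={"Table": {"DatabaseName": "weblogs", "Name": "access_logs"}},
    Permissions=["SELECT"],
)
```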
How it works
Register existing data or import new
• Amazon S3 forms the storage layer for Lake Formation
• Register existing S3 buckets that contain your data
• Ask Lake Formation to create required S3 buckets and import data into them
• Data is stored in your account. You have direct access to it. No lock-in.
[Diagram: Lake Formation sits on top of the data lake storage in S3 and provides data import, crawlers, ML-based data prep, a data catalog, and access control.]
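Registering an existing bucket is a single API call. The sketch below uses a placeholder S3 path and the service-linked role for simplicity.

```python
# Minimal sketch: register an existing S3 location with Lake Formation
# so it becomes part of the data lake.
import boto3

lf = boto3.client("lakeformation")

lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake/raw/",
    UseServiceLinkedRole=True,
)
```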
Easily load data to your data lake
• Blueprints ingest data from sources such as logs and databases (DBs)
• Loads can be one-shot or incremental
[Diagram: blueprints drive Lake Formation's data import, crawlers, and ML-based data prep to land data in the data lake storage, governed by the data catalog and access control.]
With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data

Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
Blueprints build on AWS Glue
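Underneath a blueprint are standard AWS Glue building blocks. As a rough illustration (not the blueprint API itself), the sketch below creates and starts a Glue crawler that discovers the schema of a source path and writes it to the Data Catalog; the crawler name, IAM role, database, and S3 path are all placeholders.

```python
# Minimal sketch: one of the Glue pieces a blueprint drives - a scheduled crawler.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ingest-access-logs",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="weblogs",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/access_logs/"}]},
    # Run daily so new partitions are picked up incrementally.
    Schedule="cron(0 2 * * ? *)",
)

glue.start_crawler(Name="ingest-access-logs")
```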
Useful links
https://aws.training
https://aws.amazon.com/training/path-databases
https://aws.amazon.com/training/path-advanced-networking
https://bit.ly/aws-office-hours
https://bit.ly/africa-virtual-day
https://bit.ly/emea-summit
https://bit.ly/cobus-youtube
Thank you!
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cobus Bernard
Sr Developer Advocate
Amazon Web Services
@cobusbernard
cobusbernard
cobusbernard

Editor's Notes

  • #4 That’s why many customers are moving to a data lake architecture. A data lake is an architectural approach that helps you manage multiple data types from a wide variety of sources, both structured and unstructured, through a unified set of tools, so the data is readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know beforehand what questions you want to ask of your data. The same data can then be worked on with batch, interactive, online, search, in-memory, and other processing engines.
  • #5 Data lakes allow you to break down data silos and bring data into a single central repository. You can store a wide variety of data formats, at any scale and at low cost. Data lakes provide you a single source of truth and allow you access to the same data using a variety of analytics and machine-learning tools.
  • #9 It turns out there are a lot of steps involved in building data lakes. 1/ Set up storage – Data lakes hold a massive amount of data. Before doing anything else, customers need to set up storage to hold all of that data. If they are using AWS, they configure S3 buckets and partitions. If they are doing this on-prem, they acquire hardware and set up large disk arrays to hold all of the data for their data lake. 2/ Move data – Customers need to connect to different data sources on-premises, in the cloud, and on IoT devices. Then they need to collect and organize the relevant data sets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. Customers do this today with a collection of file transfer and ETL tools, like AWS Glue. 3/ Clean and prepare data – Next, that data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. Customers need to clean, de-duplicate, and match related records. Today this is done using rigid and complex SQL statements that only go so far and are difficult to maintain. This process of collecting, cleaning, and transforming the incoming data is complex and must be manually monitored to avoid errors. 4/ Configure and enforce policies – Sensitive data must be secured according to compliance requirements. This means creating and applying data access, protection, and compliance policies to make sure you are meeting required standards. For example, restricting access to personally identifiable information (PII) at the table, column, or row level, encrypting all data, and keeping audit logs of who is accessing the data. Today customers use access control lists on S3 buckets, or they use third-party encryption and access control software to secure the data. And for every analytics service that needs to access the data, customers need to create and maintain data access, protection, and compliance policies for each one. For example, if you are running analysis against your data lake using Redshift and Athena, you need to set up access control rules for each of these services. 5/ Make it easy to find data – Different people in your organization, like analysts and data scientists, may have trouble finding and trusting data sets in the data lake. You need to make it easy for those end-users to find relevant and trusted data. To do this you must clearly label the data in a catalog of the data lake and provide users with the ability to access and analyze this data without making requests to IT. Each of these steps involves a lot of work because today much of it is done manually. Customers can spend months building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and machine learning. With all these steps, a fully productive data lake can take months to implement. TRANSITION: We’ve learned from the tens of thousands of customers running analytics on AWS that most customers who want to do analytics want to build a data lake, and many of them want this to be easier and faster than it is today.
  • #10 A recent study by CrowdFlower surveyed ~80 data scientists about their jobs. It found that data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis. They also found that data preparation was the least enjoyable part of their work! https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6493d6c76f63 The question we have to ask ourselves is whether we can make data preparation easier. Can we minimize the time that people spend collecting data sets and cleaning/organizing their data?
  • #12 Lake Formation automates many of the steps we discussed, allowing customers to get started with just a few clicks from a single, unified dashboard. 1/ Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster. 2/ Enforce security policies across multiple services: Once your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications. 3/ Gain and manage new insights: With Lake Formation you build a data catalog that describes available data sets and their appropriate business uses. This makes your users more productive by helping them find the right data set to analyze. By providing a catalog of your data and consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to combine multiple analytic tools, like Athena, Redshift, and EMR, across diverse data sets.
  • #13 With just a few clicks, you can set up your data lake on Amazon S3 and start ingesting data that is readily queryable. To get started, you go to the Lake Formation dashboard in the AWS console, add your data sources, and then Lake Formation will crawl those sources and move the data into your new Amazon S3 data lake. Lake Formation uses machine learning to automatically lay out the data in Amazon S3 partitions, change it into formats for faster analytics, like Apache Parquet and ORC, and also de-duplicate and find matching records to increase data quality. From a single screen you set up all of the permissions for your data lake, and they will be enforced across all analytics and machine learning services accessing this data (Amazon Redshift, Amazon Athena, and Amazon EMR). This reduces the hassle of re-defining policies across multiple services and provides consistent enforcement and compliance of those policies.
  • #17 Blueprints heavily leverage the functionality in AWS Glue: We use Glue crawlers and connections to connect to and discover the raw data that needs to be ingested. We use Glue code-gen and jobs to generate the ingest code to bring that data into the data lake. We leverage the Data Catalog for organizing the metadata. We have added a workflow construct to stitch together crawlers and jobs, and to allow monitoring of individual workflows. They’re a natural extension of the AWS Glue capabilities.