Cobus Bernard
Sr Developer Advocate
Amazon Web Services
Getting Started with Data Lakes on AWS
@cobusbernard
cobusbernard
cobusbernard
Agenda
What is a Data Lake
Storing data in S3
Steps to build a Data Lake
AWS Lake Formation
Demo
Q&A
A data lake is a centralised repository that allows you to store all your structured and unstructured data at any scale.
Why data lakes?
Data lakes provide:
• Relational and non-relational data
• Scale-out to EBs (1 EB = 1,024 PB = 1,048,576 TB)
• A diverse set of analytics and machine learning tools
• Work on data without any data movement
• Designed for low-cost storage and analytics
[Architecture diagram: OLTP, ERP, CRM, and LOB systems feed a data warehouse for business intelligence, while data from devices, web, sensors, and social lands in the data lake; a catalog sits over the data lake, which serves machine learning, DW queries, big data processing, interactive, and real-time workloads.]
Build a secure data lake on Amazon S3
Amazon S3 Block Public Access
• Controls public access
• Across AWS accounts & individual S3 bucket levels
• Specify any type of public permissions via ACL or policy

Amazon S3 Object Lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance

Amazon S3 object tags
• Access control, lifecycle policies, analysis
• Classify data, filter objects
• Define replication policies

Amazon S3 access points
• Multi-tenant bucket
• Dedicated access points
• Custom permissions from a virtual private cloud (VPC)

Amazon FSx for Lustre (file system access to data in S3)
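As a rough illustration of the first two controls above, the boto3 sketch below turns on S3 Block Public Access for a bucket and classifies an uploaded object with tags. The bucket name, key, and tag values are placeholders for the example.

```python
# Minimal sketch: lock down a data lake bucket and tag an object,
# assuming a hypothetical bucket named "my-data-lake" already exists.
import boto3

s3 = boto3.client("s3")

# S3 Block Public Access: reject any public ACLs or public bucket policies.
s3.put_public_access_block(
    Bucket="my-data-lake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# S3 object tags: classify data on upload so lifecycle rules,
# replication, and access policies can filter on the tags.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/logs/2020/06/01/events.json",
    Body=b'{"event": "example"}',
    Tagging="classification=raw&team=analytics",
)
```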
Choosing the right data lake storage class
Select storage class by data pipeline stage
Raw data: Amazon S3 Standard
• Small log files
• Overwrites if synced
• Short lived
• Moved & deleted
• Batched & archived

ETL: Amazon S3 Standard
• Data churn
• Small intermediates
• Multiple transforms
• Deletes <30 days
• Output to data lake

Production data lake: Amazon S3 Intelligent-Tiering
• Optimized sizes (MBs)
• Many users
• Unpredictable access
• Long-lived assets
• Hot to cool

Online cool data: Amazon S3 Standard-Infrequent Access and One Zone-IA
• Replicated DR data
• Infrequently accessed
• Infrequent queries
• ML model training

Historical data: Amazon S3 Glacier or S3 Glacier Deep Archive
• Historical assets
• ML model training
• Compliance/Audit
• Data protection
• Planned restores
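One way to put these stages into practice is with S3 lifecycle rules. The sketch below is a minimal example, assuming a hypothetical bucket "my-data-lake" with prefixes for ETL scratch space, production data, and historical data; the prefixes and day counts are illustrative, not prescriptive.

```python
# Minimal sketch: lifecycle rules that roughly follow the pipeline stages above.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                # Production data lake: let Intelligent-Tiering handle hot-to-cool moves.
                "ID": "production-to-intelligent-tiering",
                "Filter": {"Prefix": "datalake/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                # Historical data: archive after 90 days, deep archive after a year.
                "ID": "historical-to-glacier",
                "Filter": {"Prefix": "historical/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            },
            {
                # ETL intermediates: short-lived, so expire them after 30 days.
                "ID": "expire-etl-scratch",
                "Filter": {"Prefix": "etl-tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    },
)
```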
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
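For step 1, a minimal sketch of setting up the storage layer on AWS is shown below: create an S3 bucket and switch on default encryption. The bucket name and Region are placeholders.

```python
# Minimal sketch of step 1 (set up storage): one bucket with default encryption.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

s3.create_bucket(
    Bucket="my-data-lake",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Encrypt every object by default with SSE-S3 (AES-256).
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```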
Data preparation accounts for ~80% of the work
How data scientists spend their time:
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other
Sample of steps required:
• Find sources
• Create Amazon Simple Storage Service (Amazon S3) locations
• Configure access policies
• Map tables to Amazon S3 locations
• ETL jobs to load and clean data
• Create metadata access policies
• Configure access from analytics services

Rinse and repeat for other data sets, users, and end-services.

And more:
• Manage and monitor ETL jobs
• Update the metadata catalog as data changes
• Update policies across services as users and permissions change
• Manually maintain cleansing scripts
• Create audit processes for compliance
• …

Manual | Error-prone | Time-consuming
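To make the "manual" point concrete, here is a rough boto3 sketch of just one of these steps, mapping a table to an S3 location in the AWS Glue Data Catalog. Every name in it (database, table, columns, path) is hypothetical, and each of the other steps needs similar hand-crafted work.

```python
# Minimal sketch: register a database and a table over an S3 location by hand.
import boto3

glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "weblogs"})

glue.create_table(
    DatabaseName="weblogs",
    TableInput={
        "Name": "access_logs",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "request_ts", "Type": "timestamp"},
                {"Name": "path", "Type": "string"},
                {"Name": "status", "Type": "int"},
            ],
            # Map the table to its Amazon S3 location.
            "Location": "s3://my-data-lake/raw/access_logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": "\t"},
            },
        },
    },
)
```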
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
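As a sketch of the "define once, enforce across services" idea, the boto3 call below grants one role SELECT on one catalog table through Lake Formation; integrated services such as Athena, Redshift Spectrum, and EMR then honour that grant. The account ID, role, database, and table names are placeholders.

```python
# Minimal sketch: grant a principal SELECT on a catalog table via Lake Formation.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/data-analyst"
    },
    Resource={"Table": {"DatabaseName": "weblogs", "Name": "access_logs"}},
    Permissions=["SELECT"],
)
```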
How it works
Register existing data or import new
• Amazon S3 forms the storage layer for Lake Formation
• Register existing S3 buckets that contain your data
• Ask Lake Formation to create required S3 buckets and import data into them
• Data is stored in your account. You have direct access to it. No lock-in.
[Diagram: Lake Formation sits on top of the data lake storage in S3 and provides data import, crawlers, ML-based data prep, a data catalog, and access control.]
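Registering an existing bucket is a single API call. The sketch below uses a placeholder S3 path and the service-linked role for simplicity.

```python
# Minimal sketch: register an existing S3 location with Lake Formation
# so it becomes part of the data lake.
import boto3

lf = boto3.client("lakeformation")

lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake/raw/",
    UseServiceLinkedRole=True,
)
```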
Easily load data to your data lake
• Blueprints ingest data from sources such as logs and databases (DBs)
• Loads can be one-shot or incremental
[Diagram: blueprints drive Lake Formation's data import, crawlers, and ML-based data prep to land data in the data lake storage, governed by the data catalog and access control.]
With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data

Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
Blueprints build on AWS Glue
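Underneath a blueprint are standard AWS Glue building blocks. As a rough illustration (not the blueprint API itself), the sketch below creates and starts a Glue crawler that discovers the schema of a source path and writes it to the Data Catalog; the crawler name, IAM role, database, and S3 path are all placeholders.

```python
# Minimal sketch: one of the Glue pieces a blueprint drives - a scheduled crawler.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ingest-access-logs",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="weblogs",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/access_logs/"}]},
    # Run daily so new partitions are picked up incrementally.
    Schedule="cron(0 2 * * ? *)",
)

glue.start_crawler(Name="ingest-access-logs")
```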
Useful links
https://aws.training
https://aws.amazon.com/training/path-databases
https://aws.amazon.com/training/path-advanced-networking
https://bit.ly/aws-office-hours
https://bit.ly/africa-virtual-day
https://bit.ly/emea-summit
https://bit.ly/cobus-youtube
Thank you!
© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cobus Bernard
Sr Developer Advocate
Amazon Web Services
@cobusbernard
cobusbernard
cobusbernard

Editor's Notes

  • #4 That’s why many customers are moving to a data lake architecture. A data lake is an architectural approach that helps you manage multiple data types from a wide variety of sources, both structured and unstructured, through a unified set of tools, so the data is readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know beforehand what questions you want to ask of your data. The same data can then be worked on with batch, interactive, online, search, in-memory, and other processing engines.
  • #5 Data lakes allow you to break down data silos and bring data into a single central repository. You can store a wide variety of data formats, at any scale and at low cost. Data lakes provide you a single source of truth and allow you access to the same data using a variety of analytics and machine-learning tools.
  • #9 It turns out there are a lot of steps involved in building data lakes. 1/ Set up storage – Data lakes hold a massive amount of data. Before doing anything else, customers need to set up storage to hold all of that data. If they are using AWS, they configure S3 buckets and partitions. If they are doing this on-prem, they acquire hardware and set up large disk arrays to hold all of the data for their data lake. 2/ Move data – Customers need to connect to different data sources on-premises, in the cloud, and on IoT devices. Then they need to collect and organize the relevant data sets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. Customers do this today with a collection of file transfer and ETL tools, like AWS Glue. 3/ Clean and prepare data – Next, that data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. Customers need to clean, de-duplicate, and match related records. Today this is done using rigid and complex SQL statements that only go so far and are difficult to maintain. This process of collecting, cleaning, and transforming the incoming data is complex and must be manually monitored to avoid errors. 4/ Configure and enforce policies – Sensitive data must be secured according to compliance requirements. This means creating and applying data access, protection, and compliance policies to make sure you are meeting required standards. For example, restricting access to personally identifiable information (PII) at the table, column, or row level, encrypting all data, and keeping audit logs of who is accessing the data. Today customers use access control lists on S3 buckets, or they use third-party encryption and access control software to secure the data. And for every analytics service that needs to access the data, customers need to create and maintain data access, protection, and compliance policies for each one. For example, if you are running analysis against your data lake using Redshift and Athena, you need to set up access control rules for each of these services. 5/ Make it easy to find data – Different people in your organization, like analysts and data scientists, may have trouble finding and trusting data sets in the data lake. You need to make it easy for those end-users to find relevant and trusted data. To do this you must clearly label the data in a catalog of the data lake and provide users with the ability to access and analyze this data without making requests to IT. Each of these steps involves a lot of work because today much of it is done manually. Customers can spend months building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and machine learning. With all these steps, a fully productive data lake can take months to implement. TRANSITION: We’ve learned from the tens of thousands of customers running analytics on AWS that most customers who want to do analytics want to build a data lake, and many of them want this to be easier and faster than it is today.
  • #10 A recent study by CrowdFlower surveyed ~80 data scientists about their jobs. It found that data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis. They also found that data preparation was the least enjoyable part of their work! https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6493d6c76f63 The question we have to ask ourselves is whether we can make data preparation easier. Can we minimize the time that people spend collecting data sets and cleaning/organizing their data?
  • #12 Lake Formation automates many of the steps we discussed, allowing customers to get started with just a few clicks from a single, unified dashboard. 1/ Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster. 2/ Enforce security policies across multiple services: Once your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications. 3/ Gain and manage new insights: With Lake Formation you build a data catalog that describes available data sets and their appropriate business uses. This makes your users more productive by helping them find the right data set to analyze. By providing a catalog of your data and consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to combine multiple analytic tools, like Athena, Redshift, and EMR, across diverse data sets.
  • #13 With just a few clicks, you can set up your data lake on Amazon S3 and start ingesting data that is readily queryable. To get started, you go to the Lake Formation dashboard in the AWS console, add your data sources, and then Lake Formation will crawl those sources and move the data into your new Amazon S3 data lake. Lake Formation uses machine learning to automatically lay out the data in Amazon S3 partitions, change it into formats for faster analytics, like Apache Parquet and ORC, and also de-duplicate and find matching records to increase data quality. From a single screen you set up all of the permissions for your data lake, and they will be enforced across all analytics and machine learning services accessing this data (Amazon Redshift, Amazon Athena, and Amazon EMR). This reduces the hassle of re-defining policies across multiple services and provides consistent enforcement and compliance of those policies.
  • #17 Blueprints heavily leverage the functionality in AWS Glue: We use Glue crawlers and connections to connect to and discover the raw data that needs to be ingested. We use Glue code-gen and jobs to generate the ingest code to bring that data into the data lake. We leverage the Data Catalog for organizing the metadata. We have added a workflow construct to stitch together crawlers and jobs, and to allow monitoring of individual workflows. They’re a natural extension of the AWS Glue capabilities.