This document discusses ETL patterns in the cloud using Azure Data Factory. It covers topics like ETL vs ELT, the importance of scale and flexible schemas in cloud ETL, and how Azure Data Factory supports workflows, templates, and integration with on-premises and cloud data. It also provides examples of nightly ETL data flows, handling schema drift, loading dimensional models, and data science scenarios using Azure data services.
Introduction to ETL in the cloud with Azure Data Factory; key factors for success, including scheduling, performance, scale, and flexible schema.
Description of Azure Data Factory's workflow pipelines, control flow, built-in source control, and integration runtime capabilities.
Details on SSIS package deployment, execution, and monitoring in Azure Data Factory environment.
Overview of ADF mapping data flows enabling zero-code data transformation at scale, and various transformation capabilities. Guided experience for building data transformation workflows; debug, monitor, and ensure data quality in data flows.
Exploration of cloud ETL patterns with ADF, modernization of enterprise data warehouses, and handling flexible schemas.
ETL Patterns in the Cloud with
Azure Data Factory
Mark Kromer
Senior Program Manager
Microsoft Azure Data Management
@kromerbigdata
ETL Patterns in the Cloud
Important factors for success
1. What is ETL?
• More than Extract, Transform, Load
• Scheduling, Monitoring, Maintenance, Source Control, CI/CD, Operationalize
2. Platform as a Service (ADF) vs. Infrastructure as a Service (IaaS/SSIS)
• Self-managed vs. Provider-Managed
3. ELT or ETL?
• Difference is primarily highly-parsed semantics
• However: In the cloud, the common pattern is to stage data in low-cost storage
4. Not typically performant to process data in-flight
• Particularly crossing boundaries (on-prem, vnets, data centers, regions)
5. Scale is very important in Cloud ETL
• Cloud projects assume elastic scale. ETL is not immune to this expectation.
6. Flexible Schema is very important in Cloud ETL
• Assume “Big Data tenets” aka “data chaos”: Your data sources will change shape, size and volume. Often!
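The flexible-schema point above can be illustrated with a small sketch. It is not Azure Data Factory code; it only shows one way to tolerate sources that change shape. The column names in `KNOWN_COLUMNS`, the `_drifted` key, and the function `conform_rows` are hypothetical names chosen for this example.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical target columns; these are assumptions, not taken from the slides.
KNOWN_COLUMNS = ("id", "name", "amount")


def conform_rows(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map each incoming row onto the known columns and keep drifted columns.

    Known columns missing from a row become None. Columns the target does
    not know about are preserved under "_drifted" instead of being dropped.
    """
    conformed = []
    for row in rows:
        out = {col: row.get(col) for col in KNOWN_COLUMNS}
        out["_drifted"] = {k: v for k, v in row.items() if k not in KNOWN_COLUMNS}
        conformed.append(out)
    return conformed


batch = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "region": "west"},  # new column, missing amount
]
result = conform_rows(batch)
```

Keeping unknown columns rather than failing the load means a source that adds a field does not break the nightly run; the extra data can be inspected or promoted to a real column later.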
Quickly get started with building data integration solutions. Avoid building the same workflows
repeatedly; simply instantiate a template. Improve developer productivity and reduce
development time for repeat processes.
Use Templates to quickly get started
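As a rough illustration of "instantiate a template", the sketch below fills placeholders in a parameterized pipeline definition. The template shape, the placeholder names, and the `instantiate` helper are all hypothetical; this is not the Azure Data Factory template schema or SDK.

```python
from typing import Any, Dict

# Hypothetical template shape for illustration only.
COPY_TEMPLATE = {
    "name": "{pipeline_name}",
    "activities": [
        {"type": "Copy", "source": "{source}", "sink": "{sink}"},
    ],
}


def instantiate(template: Any, params: Dict[str, str]) -> Any:
    """Return a new structure with {placeholders} in every string filled in.

    The template itself is left unmodified so it can be reused.
    A placeholder with no matching parameter raises KeyError.
    """
    if isinstance(template, str):
        return template.format(**params)
    if isinstance(template, list):
        return [instantiate(item, params) for item in template]
    if isinstance(template, dict):
        return {key: instantiate(value, params) for key, value in template.items()}
    return template


pipeline = instantiate(
    COPY_TEMPLATE,
    {"pipeline_name": "nightly_sales", "source": "onprem_sql", "sink": "blob_staging"},
)
```

The same template can be instantiated again with different parameters, which is the productivity gain the slide describes for repeat processes.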
Secure data platform enabling Analytics and Insights on Microsoft 365.
Data access @ Scale
Dataset based access rather than
real time API based access
Granular Request/Consent
enabling Data Privacy
Row and column level scoping
with advanced filtering capability
Data Governance & Security
Control and visibility over your
data throughout its entire lifecycle
Microsoft Graph Data Connect
Azure Data Factory Service
Cloud Apps, Svcs & Data
On Premises Apps & Data
UX & SDK
Azure (US West)
Public Internet Border
HP Inc (Global)
HP Prod Firewall Border
HP Hadoop Cluster
Integration Fabric On-prem
Data Factory
(Orchestration
micro-service)
443
Storage (Azure)
443
SQL Data
Warehouse
UAM Server
443
ADF Foo On-prem
“IR”
Customer 1
Customer 1 firewall border
Azure Data Factory “Integration Runtime” deployed on premises for
transformation and then moved to cloud
What is ADF Mapping Data Flow?
Transform Data, At Scale, in the Cloud,
Zero-Code
Cloud-first, scale-out ELT
Code-free dataflow pipelines
Serverless scale-out transformation
execution engine
Maximum Productivity for Data
Engineers
Does NOT require understanding of Spark /
Scala / Python / Java
Resilient Data Transformation Flows
Built for big data scenarios with
unstructured data requirements
Operationalize with Data Factory
scheduling, control flow and monitoring
Code-free Data Transformation At Scale
Does not require understanding of Spark, Big Data Execution
Engines, Clusters, Scala, Python …
Focus on building business logic and data transformation
Data cleansing
Aggregation
Data conversions
Data prep
Data exploration
ADF Data Flow Workstream
Stage Data in Azure
(ADLS, Blob, SQL
DB/DW)
Transform Data in
Visual Data Flow
Land Data in Azure
Staging Area (ADLS,
Blob, SQL DB/DW)
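The stage, transform, land workstream above can be sketched in plain Python. Local temporary files stand in for the Azure staging and landing areas, and the column names and function names are assumptions made for this example; in ADF the transform step would be built visually in a Mapping Data Flow rather than written as code.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def stage(rows: List[Dict[str, str]], staging_dir: Path) -> Path:
    """Write the raw rows unchanged to a staging file."""
    path = staging_dir / "raw.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def transform(staged: Path) -> Dict[str, float]:
    """Cleanse customer names and total the amount per customer."""
    totals: Dict[str, float] = {}
    with staged.open(newline="") as f:
        for row in csv.DictReader(f):
            name = row["customer"].strip().lower()
            totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals


def land(totals: Dict[str, float], landing_dir: Path) -> Path:
    """Write the transformed result to the landing area, sorted by customer."""
    path = landing_dir / "totals.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer", "total"])
        for name in sorted(totals):
            writer.writerow([name, totals[name]])
    return path


def run_workstream(rows: List[Dict[str, str]]) -> List[str]:
    """Run stage -> transform -> land and return the landed file's lines."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        landed = land(transform(stage(rows, base)), base)
        return landed.read_text().splitlines()


sample = [
    {"customer": "Acme", "amount": "10.5"},
    {"customer": " acme ", "amount": "4.5"},
    {"customer": "Zed", "amount": "1"},
]
lines = run_workstream(sample)
```

Staging the raw data before transforming it keeps the untouched source available for reprocessing, which matches the low-cost staging pattern described earlier in the deck.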
Build your logical data flows adding data
transformations in a guided experience
Microsoft Azure Data Factory Continues to Extend Data Flow
Library with a Rich Set of Transformations and Expression
Functions
MODEL & SERVE
Azure Analysis Services
Azure SQL Data Warehouse
Power BI
Modernize your enterprise data warehouse at scale
AZURE DATA FACTORY
On-premises data
Oracle, SQL, Teradata,
fileshares, SAP
Cloud data
Azure, AWS, GCP
SaaS data
Salesforce, Workday,
Dynamics
INGEST STORE PREP & TRAIN
Azure Data Factory Azure Blob Storage
Azure Databricks
Polybase
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure SQL Database and Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.
Orchestrate with Azure Data Factory
Lift your SQL Server Integration Services (SSIS) packages to Azure
On-premises data sources
SQL DB Managed Instance
SQL Server
VNET
Azure Data Factory
SSIS Cloud ETL
SSIS Integration Runtime
Cloud data sources
Cloud
On-premises
Microsoft
SQL Server
Integration Services
Author, orchestrate and monitor with Azure Data Factory
Hybrid and Multi-Cloud Data Integration
Azure Data Factory
PaaS Data Integration
DATA SCIENCE
AND MACHINE
LEARNING
MODELS
ANALYTICAL
DASHBOARDS
USING POWER BI
DATA DRIVEN
APPLICATIONS
On-Prem SaaS Apps Public Cloud