Build data quality rules and data cleansing into your data pipelines
Principal Product Manager
Microsoft Data Integration
Data Quality for Data Warehouse Scenarios
• Verify data types and lengths
• How should I handle NULLs?
• Domain value constraints (e.g., US states)
• Single source of truth (master data)
• Late arriving dimensions
• Reference data lookups
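The warehouse-style rules above can be sketched as row-level checks. This is a minimal, illustrative sketch in plain Python (the column names `customer_id`, `name`, `state`, the length limit, and the truncated state list are all assumptions, not from the slides):

```python
# Hypothetical row-level checks covering type/length verification,
# a NULL policy, and a domain value constraint.
US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}  # truncated example domain

def check_row(row: dict) -> list[str]:
    """Return a list of data quality violations for one record."""
    errors = []
    # Verify data types and lengths
    if not isinstance(row.get("customer_id"), int):
        errors.append("customer_id: expected int")
    name = row.get("name")
    if name is not None and len(name) > 50:
        errors.append("name: exceeds max length 50")
    # NULL handling policy: name is required
    if name is None:
        errors.append("name: NULL not allowed")
    # Domain value constraint (e.g., US states)
    if row.get("state") not in US_STATES:
        errors.append("state: not a valid US state code")
    return errors

print(check_row({"customer_id": 1, "name": "Ada", "state": "CA"}))  # []
print(check_row({"customer_id": "x", "name": None, "state": "ZZ"}))
```

In a pipeline, rows with a non-empty error list would typically be routed to a quarantine sink rather than loaded into the warehouse.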
Data Quality for Data Science Scenarios
• De-duplicate data
• Descriptive data statistics (length, type, mean, median, stddev, …)
• Frequency distribution
• How do I handle missing values?
• Enumerations (turn codes into categorical data)
• Value replacement
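Three of the data science concerns above — descriptive statistics, frequency distributions, and missing values — can be sketched with the standard library alone (the `ages` data and the median-imputation policy are illustrative assumptions):

```python
import statistics
from collections import Counter

ages = [34, 28, None, 45, 28, 39, None, 51]

# How do I handle missing values? Here: impute with the median of known values.
known = [a for a in ages if a is not None]
imputed = [a if a is not None else statistics.median(known) for a in ages]

# Descriptive statistics (length, mean, median, stddev)
print(len(imputed), statistics.mean(imputed), statistics.median(imputed),
      round(statistics.stdev(imputed), 2))

# Frequency distribution
print(Counter(imputed))
```

Median imputation is only one policy; depending on the model, dropping the rows or imputing a sentinel value may be more appropriate.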
Replacing Values
• iif(length(title) == 0, toString(null()), title)
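The data flow expression above maps empty titles to NULL and passes everything else through. A minimal Python sketch of the same value-replacement rule (the column name `title` comes from the slide; the function name is hypothetical):

```python
def replace_empty(title):
    """Mirror iif(length(title) == 0, toString(null()), title):
    map empty strings to None, pass all other values through unchanged."""
    return None if title is not None and len(title) == 0 else title

print(replace_empty(""))      # None
print(replace_empty("Dr."))   # Dr.
```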
Splitting Data Based on Values
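A conditional split routes rows down different branches based on a predicate. A minimal sketch of the idea (the `amount` column and the positive/non-positive rule are illustrative assumptions):

```python
rows = [{"id": 1, "amount": 250}, {"id": 2, "amount": -10}, {"id": 3, "amount": 0}]

# Split the stream into branches the way a conditional-split transformation would:
valid   = [r for r in rows if r["amount"] > 0]
invalid = [r for r in rows if r["amount"] <= 0]   # route to a quarantine sink

print(len(valid), len(invalid))  # 1 2
```

Each branch can then flow to its own sink, e.g. clean rows to the warehouse and failing rows to an errors table for review.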
Data Profiling
• Summary statistics describing the shape, size, and content of your data
• Helps you understand what is inside your data
• As a data engineer, you have the responsibility to provide proper data for analytics and models
• For big data sets, use a sampling / good-enough approach
• View data statistics at any step in your data transformation
• How to persist your data profile stats: https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
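The sampling / good-enough approach above can be sketched as follows: profile a random sample instead of the full dataset, and keep the summary statistics as a small dictionary (the synthetic data and sample sizes are illustrative assumptions):

```python
import random, statistics

# Pretend this is a big dataset; profile a sample rather than every row.
random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]
sample = random.sample(population, 5_000)   # "good enough" sampling approach

profile = {
    "count": len(sample),
    "min": min(sample),
    "max": max(sample),
    "mean": statistics.mean(sample),
    "stddev": statistics.stdev(sample),
}
print(profile)
```

Persisting a profile like this after each transformation step lets you track how the shape of the data changes through the pipeline.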
Pattern Matching
With very wide datasets or datasets where you cannot
anticipate all column names, generate pattern-matching rules
to apply data quality rules en masse without the need to name
each column and describe each rule individually.
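The idea of pattern-matched rules can be sketched as: match column names against a pattern and apply one rule to every match, instead of naming each column. This is an illustrative sketch (the `addr_` prefix, the trim rule, and the row data are assumptions):

```python
import re

row = {"addr_line1": "  12 Main St ", "addr_line2": " Apt 4 ", "amount": 10}

# One pattern-matched rule: trim every string column whose name starts
# with "addr_", without listing those columns individually.
pattern = re.compile(r"^addr_")
cleaned = {k: (v.strip() if pattern.match(k) and isinstance(v, str) else v)
           for k, v in row.items()}
print(cleaned)  # {'addr_line1': '12 Main St', 'addr_line2': 'Apt 4', 'amount': 10}
```

The same shape works for type-based patterns (e.g., "apply this rule to every string column"), which is how very wide or schema-drifting datasets stay manageable.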
Enumerations / Lookups
• Replace coded values with categorical strings, or vice versa
• Hard-code rules inside a case() statement, or use a Join or Lookup method against a reference file or table
• Models and analytics will sometimes require codes or string literal values
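A minimal sketch of the enumeration pattern: a reference mapping turns codes into categorical strings, and an inverted mapping turns them back when a model needs the codes (the status codes and the `orders` data are hypothetical):

```python
# Reference lookup: turn coded values into categorical strings (and back).
STATUS_CODES = {1: "active", 2: "suspended", 3: "closed"}  # hypothetical reference table
REVERSE = {v: k for k, v in STATUS_CODES.items()}

orders = [{"id": 10, "status": 1}, {"id": 11, "status": 3}]

# Forward: codes -> categories (what a Lookup/Join against a reference table does)
for o in orders:
    o["status"] = STATUS_CODES.get(o["status"], "unknown")
print(orders)

# Backward, when a model needs the codes again:
codes = [REVERSE[o["status"]] for o in orders]
print(codes)  # [1, 3]
```

In a pipeline the mapping would usually live in a reference file or table and be joined in, rather than hard-coded, so it can change without redeploying the flow.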
Data De-Duplication and Distinct Rows
• Use this pattern to eliminate duplicate rows from your data
• You pick a heuristic to use during duplicate matching
• You can tag rows and/or remove duplicate rows
• Use exact matching and/or fuzzy matching
• Available as the Dedupe Pipeline pipeline template
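A minimal sketch of the tag-and/or-remove idea, using a case-insensitive (name, email) key as the chosen matching heuristic (the key choice and the sample rows are illustrative assumptions):

```python
# Tag and remove duplicates using a chosen matching heuristic:
# here, case-insensitive (name, email) defines the duplicate key.
rows = [
    {"name": "Ann Lee", "email": "ann@x.com"},
    {"name": "ann lee", "email": "ANN@x.com"},   # duplicate under the heuristic
    {"name": "Bob Wu",  "email": "bob@x.com"},
]

def dedupe_key(row):
    return (row["name"].lower(), row["email"].lower())

seen, distinct = set(), []
for r in rows:
    k = dedupe_key(r)
    r["is_duplicate"] = k in seen     # tag the row...
    if k not in seen:
        distinct.append(r)            # ...and/or keep only the first occurrence
    seen.add(k)

print(len(distinct))  # 2
```

Swapping `dedupe_key` for a fuzzy key (e.g., a phonetic code) turns this exact-match dedupe into a fuzzy-match one.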
Fuzzy Lookups & Joins
• Sometimes when performing inline lookups, you don't have exact matches when looking for references
• Fuzzy lookups with Soundex help find matches based on phonetic algorithms
• Very useful in data lake scenarios where joins and lookups run against data that is not normalized or cleaned
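To make the phonetic idea concrete, here is a simplified Soundex sketch (it omits the full algorithm's special handling of "h" and "w" as separators, so treat it as illustrative rather than a reference implementation):

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter + up to three digit codes.
    (Ignores the full algorithm's special h/w separator rule.)"""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    first = name[0].upper()
    digits = [codes.get(c, "") for c in name]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:          # skip vowels, collapse repeated codes
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]

# Names that don't match exactly still match phonetically:
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```

A fuzzy join then compares `soundex(left.name) == soundex(right.name)` instead of the raw strings, which is why it tolerates the misspellings common in un-normalized lake data.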
Metadata Validation Rules
• Transform data conditionally based on metadata traits
• Create and manage metadata quality rules
• Manipulate column properties of source data
• https://www.youtube.com/watch?v=E_UD3R-VpYE
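The metadata-rule idea can be sketched as validating rows against an expected schema: column names and types as metadata, checked before the data is transformed (the schema and function below are hypothetical, not from the slides):

```python
# Validate data against metadata rules: expected column names and types.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "state": str}  # hypothetical metadata

def validate_metadata(row: dict) -> list[str]:
    """Return a list of metadata rule violations for one record."""
    issues = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in row:
            issues.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            issues.append(f"{col}: expected {typ.__name__}, "
                          f"got {type(row[col]).__name__}")
    return issues

print(validate_metadata({"order_id": 1, "amount": 9.99, "state": "WA"}))  # []
print(validate_metadata({"order_id": "1", "amount": 9.99}))
```

A conditional transformation would then branch on the result, e.g. coercing types where possible and quarantining rows that still fail.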
Assertions
• Useful for building data quality pipelines
• Log failed assertions
• Fail fast
• Built-in expectations
• Expect true
• Expect unique
• Expect exists
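The three built-in expectations above (expect true, expect unique, expect exists) can be sketched as assertions over a dataset, with failures logged and an optional fail-fast at the end (the sample rows and rule choices are illustrative assumptions):

```python
# Assert-style expectations over a dataset: log failures, optionally fail fast.
rows = [{"id": 1, "qty": 2}, {"id": 2, "qty": 0}, {"id": 2, "qty": 5}]

failures = []

# Expect true: every qty is positive
for r in rows:
    if not r["qty"] > 0:
        failures.append(f"expect_true failed: qty > 0 for {r}")

# Expect unique: id has no duplicates
ids = [r["id"] for r in rows]
if len(ids) != len(set(ids)):
    failures.append("expect_unique failed: id")

# Expect exists: every row has an id
if not all("id" in r for r in rows):
    failures.append("expect_exists failed: id")

for f in failures:
    print(f)                              # log failed assertions
print(len(failures), "assertion(s) failed")  # or raise here to fail fast
```

Logging everything before deciding whether to abort lets one pipeline run surface all of its data quality problems at once instead of one per run.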
