Build data quality rules and data cleansing into your data pipelines
Principal Product Manager
Microsoft Data Integration
Data Quality for Data Warehouse Scenarios
• Verify data types and lengths
• How should I handle NULLs?
• Domain value constraints (e.g., US states)
• Single source of truth (master data)
• Late arriving dimensions
• Reference data lookups
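The warehouse-style rules above can be sketched as row-level checks. This is a minimal, illustrative sketch in plain Python (the column names `customer_id`, `name`, `state`, the length limit, and the truncated state list are all assumptions, not from the slides):

```python
# Hypothetical row-level checks covering type/length verification,
# a NULL policy, and a domain value constraint.
US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}  # truncated example domain

def check_row(row: dict) -> list[str]:
    """Return a list of data quality violations for one record."""
    errors = []
    # Verify data types and lengths
    if not isinstance(row.get("customer_id"), int):
        errors.append("customer_id: expected int")
    name = row.get("name")
    if name is not None and len(name) > 50:
        errors.append("name: exceeds max length 50")
    # NULL handling policy: name is required
    if name is None:
        errors.append("name: NULL not allowed")
    # Domain value constraint (e.g., US states)
    if row.get("state") not in US_STATES:
        errors.append("state: not a valid US state code")
    return errors

print(check_row({"customer_id": 1, "name": "Ada", "state": "CA"}))  # []
print(check_row({"customer_id": "x", "name": None, "state": "ZZ"}))
```

In a pipeline, rows with a non-empty error list would typically be routed to a quarantine sink rather than loaded into the warehouse.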
Data Quality for Data Science Scenarios
• De-duplicate data
• Descriptive data statistics (length, type, mean, median, stddev, …)
• Frequency distribution
• How do I handle missing values?
• Enumerations (turn codes into categorical data)
• Value replacement
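Three of the data science concerns above — descriptive statistics, frequency distributions, and missing values — can be sketched with the standard library alone (the `ages` data and the median-imputation policy are illustrative assumptions):

```python
import statistics
from collections import Counter

ages = [34, 28, None, 45, 28, 39, None, 51]

# How do I handle missing values? Here: impute with the median of known values.
known = [a for a in ages if a is not None]
imputed = [a if a is not None else statistics.median(known) for a in ages]

# Descriptive statistics (length, mean, median, stddev)
print(len(imputed), statistics.mean(imputed), statistics.median(imputed),
      round(statistics.stdev(imputed), 2))

# Frequency distribution
print(Counter(imputed))
```

Median imputation is only one policy; depending on the model, dropping the rows or imputing a sentinel value may be more appropriate.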
Replacing Values
• iif(length(title) == 0, toString(null()), title)
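The data flow expression above maps empty titles to NULL and passes everything else through. A minimal Python sketch of the same value-replacement rule (the column name `title` comes from the slide; the function name is hypothetical):

```python
def replace_empty(title):
    """Mirror iif(length(title) == 0, toString(null()), title):
    map empty strings to None, pass all other values through unchanged."""
    return None if title is not None and len(title) == 0 else title

print(replace_empty(""))      # None
print(replace_empty("Dr."))   # Dr.
```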
Splitting Data Based on Values
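A conditional split routes rows down different branches based on a predicate. A minimal sketch of the idea (the `amount` column and the positive/non-positive rule are illustrative assumptions):

```python
rows = [{"id": 1, "amount": 250}, {"id": 2, "amount": -10}, {"id": 3, "amount": 0}]

# Split the stream into branches the way a conditional-split transformation would:
valid   = [r for r in rows if r["amount"] > 0]
invalid = [r for r in rows if r["amount"] <= 0]   # route to a quarantine sink

print(len(valid), len(invalid))  # 1 2
```

Each branch can then flow to its own sink, e.g. clean rows to the warehouse and failing rows to an errors table for review.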
Data Profiling
• Summary statistics describing the shape, size, and content of your data
• Helps you understand what is inside your data
• As a data engineer, you have the responsibility to provide proper data for analytics and models
• For big data sets, use a sampling / good-enough approach
• View data statistics at any step in your data transformation
• How to persist your data profile stats: https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
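The sampling / good-enough approach above can be sketched as follows: profile a random sample instead of the full dataset, and keep the summary statistics as a small dictionary (the synthetic data and sample sizes are illustrative assumptions):

```python
import random, statistics

# Pretend this is a big dataset; profile a sample rather than every row.
random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]
sample = random.sample(population, 5_000)   # "good enough" sampling approach

profile = {
    "count": len(sample),
    "min": min(sample),
    "max": max(sample),
    "mean": statistics.mean(sample),
    "stddev": statistics.stdev(sample),
}
print(profile)
```

Persisting a profile like this after each transformation step lets you track how the shape of the data changes through the pipeline.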
Pattern Matching
With very wide datasets or datasets where you cannot
anticipate all column names, generate pattern-matching rules
to apply data quality rules en masse without the need to name
each column and describe each rule individually.
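The idea of pattern-matched rules can be sketched as: match column names against a pattern and apply one rule to every match, instead of naming each column. This is an illustrative sketch (the `addr_` prefix, the trim rule, and the row data are assumptions):

```python
import re

row = {"addr_line1": "  12 Main St ", "addr_line2": " Apt 4 ", "amount": 10}

# One pattern-matched rule: trim every string column whose name starts
# with "addr_", without listing those columns individually.
pattern = re.compile(r"^addr_")
cleaned = {k: (v.strip() if pattern.match(k) and isinstance(v, str) else v)
           for k, v in row.items()}
print(cleaned)  # {'addr_line1': '12 Main St', 'addr_line2': 'Apt 4', 'amount': 10}
```

The same shape works for type-based patterns (e.g., "apply this rule to every string column"), which is how very wide or schema-drifting datasets stay manageable.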
Enumerations / Lookups
• Replace coded values with categorical strings, or vice versa
• Hard-code rules inside a case() statement, or use a Join or Lookup method against a reference file or table
• Models and analytics will sometimes require codes or string literal values
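A minimal sketch of the enumeration pattern: a reference mapping turns codes into categorical strings, and an inverted mapping turns them back when a model needs the codes (the status codes and the `orders` data are hypothetical):

```python
# Reference lookup: turn coded values into categorical strings (and back).
STATUS_CODES = {1: "active", 2: "suspended", 3: "closed"}  # hypothetical reference table
REVERSE = {v: k for k, v in STATUS_CODES.items()}

orders = [{"id": 10, "status": 1}, {"id": 11, "status": 3}]

# Forward: codes -> categories (what a Lookup/Join against a reference table does)
for o in orders:
    o["status"] = STATUS_CODES.get(o["status"], "unknown")
print(orders)

# Backward, when a model needs the codes again:
codes = [REVERSE[o["status"]] for o in orders]
print(codes)  # [1, 3]
```

In a pipeline the mapping would usually live in a reference file or table and be joined in, rather than hard-coded, so it can change without redeploying the flow.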
Data De-Duplication and Distinct Rows
• Use this pattern to eliminate duplicate rows from your data
• You pick a heuristic to use during duplicate matching
• You can tag rows and/or remove duplicate rows
• Use exact matching and/or fuzzy matching
• Available as the Dedupe Pipeline pipeline template
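A minimal sketch of the tag-and/or-remove idea, using a case-insensitive (name, email) key as the chosen matching heuristic (the key choice and the sample rows are illustrative assumptions):

```python
# Tag and remove duplicates using a chosen matching heuristic:
# here, case-insensitive (name, email) defines the duplicate key.
rows = [
    {"name": "Ann Lee", "email": "ann@x.com"},
    {"name": "ann lee", "email": "ANN@x.com"},   # duplicate under the heuristic
    {"name": "Bob Wu",  "email": "bob@x.com"},
]

def dedupe_key(row):
    return (row["name"].lower(), row["email"].lower())

seen, distinct = set(), []
for r in rows:
    k = dedupe_key(r)
    r["is_duplicate"] = k in seen     # tag the row...
    if k not in seen:
        distinct.append(r)            # ...and/or keep only the first occurrence
    seen.add(k)

print(len(distinct))  # 2
```

Swapping `dedupe_key` for a fuzzy key (e.g., a phonetic code) turns this exact-match dedupe into a fuzzy-match one.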
Fuzzy Lookups & Joins
• Sometimes when performing inline lookups, you don't have exact matches when looking for references
• Fuzzy lookups with Soundex help find matches based on phonetic algorithms
• Very useful in data lake scenarios where joins and lookups run against data that is not normalized or cleaned
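To make the phonetic idea concrete, here is a simplified Soundex sketch (it omits the full algorithm's special handling of "h" and "w" as separators, so treat it as illustrative rather than a reference implementation):

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter + up to three digit codes.
    (Ignores the full algorithm's special h/w separator rule.)"""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    first = name[0].upper()
    digits = [codes.get(c, "") for c in name]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:          # skip vowels, collapse repeated codes
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]

# Names that don't match exactly still match phonetically:
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```

A fuzzy join then compares `soundex(left.name) == soundex(right.name)` instead of the raw strings, which is why it tolerates the misspellings common in un-normalized lake data.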
Metadata Validation Rules
• Transform data conditionally based on metadata traits
• Create and manage metadata quality rules
• Manipulate column properties of source data
• https://www.youtube.com/watch?v=E_UD3R-VpYE
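The metadata-rule idea can be sketched as validating rows against an expected schema: column names and types as metadata, checked before the data is transformed (the schema and function below are hypothetical, not from the slides):

```python
# Validate data against metadata rules: expected column names and types.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "state": str}  # hypothetical metadata

def validate_metadata(row: dict) -> list[str]:
    """Return a list of metadata rule violations for one record."""
    issues = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in row:
            issues.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            issues.append(f"{col}: expected {typ.__name__}, "
                          f"got {type(row[col]).__name__}")
    return issues

print(validate_metadata({"order_id": 1, "amount": 9.99, "state": "WA"}))  # []
print(validate_metadata({"order_id": "1", "amount": 9.99}))
```

A conditional transformation would then branch on the result, e.g. coercing types where possible and quarantining rows that still fail.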
Assertions
• Useful for building data quality pipelines
• Log failed assertions
• Fail fast
• Built-in expectations
• Expect true
• Expect unique
• Expect exists
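The three built-in expectations above (expect true, expect unique, expect exists) can be sketched as assertions over a dataset, with failures logged and an optional fail-fast at the end (the sample rows and rule choices are illustrative assumptions):

```python
# Assert-style expectations over a dataset: log failures, optionally fail fast.
rows = [{"id": 1, "qty": 2}, {"id": 2, "qty": 0}, {"id": 2, "qty": 5}]

failures = []

# Expect true: every qty is positive
for r in rows:
    if not r["qty"] > 0:
        failures.append(f"expect_true failed: qty > 0 for {r}")

# Expect unique: id has no duplicates
ids = [r["id"] for r in rows]
if len(ids) != len(set(ids)):
    failures.append("expect_unique failed: id")

# Expect exists: every row has an id
if not all("id" in r for r in rows):
    failures.append("expect_exists failed: id")

for f in failures:
    print(f)                              # log failed assertions
print(len(failures), "assertion(s) failed")  # or raise here to fail fast
```

Logging everything before deciding whether to abort lets one pipeline run surface all of its data quality problems at once instead of one per run.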
