Data Quality:
Principles, Approaches, and Best Practices
Carl Anderson
carl.anderson@weightwatchers.com
WW – the new Weight Watchers
1/3 of business leaders frequently make decisions with data they don’t trust [IBM]
Bad data costs the economy $100s BN / year [TDWI]
Data Science
Business
Intelligence
Engineering
Data Strategy
About Me
Big data:
● Food
● Activity
● Exercises
● Challenges
● Social network
● Workshops
● Personal Coaches
● CRM
● Fulfillment
● Meal kits
● Supermarket foods
● E-commerce
● Cruises
...for 56 years
2017: fill lake with data; provide analysts access
2019: upstream control and governance
Data Entry Transformation 1 Transformation 2
Inaccurate
(GIGO)
Missing
Defaults
Dropped
records
Truncation
Encoding
changes
Data type
change
Stale
3rd party
Disagree
In General, What Can Go Wrong?
Shape
change
Dupes
Dupes
Facets of Data Quality
● Accurate: data entry issues, stale data, default dates...
● Coherent: referential integrity, connect the dots
● Complete: missing data, duplicates
● Consistent: same values across systems, e.g. same address
● Defined: data dictionaries, business glossary, provenance, schema
● Timely: latency
● Trust: analysts willing to use data (NPS)
Accurate
% records quarantined
% records in range
% records matching
Coherent
% records missing entity ID
% records missing foreign key
Complete
% records dupes
% records missing
% records complete
% fields complete
Consistent % records consistent
Defined
% tables defined
% fields defined
% dimensions defined
% measures defined
Timely
Mean time to arrival
95th percentile time to arrival
Volume Number of Records
Trust NPS
“If you can't measure it, you
can't improve it”
- Peter Drucker
Data Quality
Scorecard
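As an illustration of how the "Complete" rows of the scorecard could be computed, here is a minimal Python sketch. It is not the WW implementation: the function name, the record layout (a list of dicts), and the rule that `None` or an empty string counts as missing are all assumptions made for the example.

```python
from typing import Any


def completeness_metrics(
    records: list[dict[str, Any]], key: str, fields: list[str]
) -> dict[str, float]:
    """Return completeness percentages for one batch of records.

    Hypothetical helper: `key` identifies a record (repeats count as dupes),
    and `fields` are the columns expected to be filled in.
    """
    total = len(records)
    if total == 0:
        return {
            "pct_records_dupes": 0.0,
            "pct_records_complete": 0.0,
            "pct_fields_complete": 0.0,
        }

    seen_keys = set()
    dupes = 0
    complete_records = 0
    filled_fields = 0

    for rec in records:
        # A record is a dupe if its key value was already seen in this batch.
        k = rec.get(key)
        if k in seen_keys:
            dupes += 1
        else:
            seen_keys.add(k)

        # Assumption: None and "" both mean "missing".
        filled = sum(1 for f in fields if rec.get(f) not in (None, ""))
        filled_fields += filled
        if filled == len(fields):
            complete_records += 1

    return {
        "pct_records_dupes": 100.0 * dupes / total,
        "pct_records_complete": 100.0 * complete_records / total,
        "pct_fields_complete": 100.0 * filled_fields / (total * len(fields)),
    }
```

Running such a function per source and per load would give one number per scorecard row that can be charted over time.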
Facet: Accuracy
Source teams then:
● Publish Schema
● Data did not always match schema
● Hard to trust
● Hard to automate
● No accountability
Source teams now (WIP):
● Publish Schema
● Adhere to Schema
● Field Ranges
Data team superpowers:
1. Auto consumption
2. Auto checks
3. Quarantine
4. Reporting
Accurate
% records quarantined
% records in range
% records matching
Facet: Defined
Table-level data dictionaries
Business-level data dictionary
(Business Glossary)
https://medium.com/@leapingllamas
Facet: Defined. Flow from master
1. Data catalog is master for table-level definitions and business glossary
2. Mapping table from master to BI tool: here, Looker dimensions and measures
3. Tool compares master to BI tool, updates/injects, and creates pull request
4. Manually reviewed and merged
5. Master definitions appear to users
Open sourcing: https://github.com/ww-tech/lookml-tools
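The compare-and-update step of this flow might look something like the sketch below. It does not use or reproduce the lookml-tools API; the function name and the plain-dict representation of the master definitions, the mapping table, and the BI tool's current descriptions are all assumptions made for illustration.

```python
def plan_description_updates(
    master: dict[str, str],
    mapping: dict[str, str],
    bi_descriptions: dict[str, str],
) -> dict[str, str]:
    """Return {bi_field: master_definition} for every field that needs a change.

    master:          master key -> definition held in the data catalog
    mapping:         master key -> BI field (dimension or measure) name
    bi_descriptions: BI field name -> description currently in the BI tool
    """
    updates = {}
    for master_key, bi_field in mapping.items():
        definition = master.get(master_key)
        if definition is None:
            # Nothing defined in the master yet, so there is nothing to inject.
            continue
        if bi_descriptions.get(bi_field) != definition:
            # Covers both a stale description and a missing one.
            updates[bi_field] = definition
    return updates


# Hypothetical example data.
master = {"member.signup_date": "Date the member first subscribed."}
mapping = {"member.signup_date": "members.signup_date"}
current = {"members.signup_date": "signup"}
proposed = plan_description_updates(master, mapping, current)
```

In a real pipeline the resulting changes would be written to the BI tool's source files and submitted as a pull request for manual review, as the slide describes.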
Facet: Defined. Style Guide
Open sourcing: https://github.com/ww-tech/lookml-tools
LookML
linter
Defined
% tables defined
% fields defined
Facet: Defined
+
LookML
updater
LookML
linter
Defined
% dimensions defined
% measures defined
Facet: Trust
Trust NPS
In April 2019, we surveyed data-related NPS with analysts, data scientists, and some decision makers and execs
We asked:
● NPS data: would you recommend our data to a friend?
● NPS infrastructure: would you recommend our infrastructure (Looker, BigQuery, etc.) to a friend?
● NPS support: would you recommend CIE’s support to a friend?
We will resurvey at end of 2019
Easy to lose trust. Hard to regain!
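For reference, a Net Promoter Score is conventionally computed from 0 to 10 ratings as the percentage of promoters (9 or 10) minus the percentage of detractors (0 to 6). The slides do not say which scale the survey used, so the sketch below assumes the standard convention; the sample scores are invented for illustration and are not survey results.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Standard NPS: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("at least one survey response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Invented responses, for illustration only.
example_scores = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
example_nps = net_promoter_score(example_scores)  # 4 promoters, 3 detractors
```

Computing one score per question (data, infrastructure, support) keeps the three aspects of trust separate, so a later resurvey can show which one moved.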
1 Accurate
% records quarantined
% records in range
% records matching
2 Coherent
% records missing entity ID
% records missing foreign key
3 Complete
% records dupes
% records missing
% records complete
% fields complete
4 Consistent % records consistent
5 Defined
% tables defined
% fields defined
% dimensions defined
% measures defined
6 Timely
Mean time to arrival
95th percentile time to arrival
7 Volume Number of Records
8 Trust NPS
“If you can't measure it, you
can't improve it”
- Peter Drucker
Data Quality
Scorecard
Reference Data
Server logs
Metadata
Schema
Data catalog +
lookml-tools
Survey
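The two "Timely" metrics, mean and 95th percentile time to arrival, could be derived from logged timestamps along these lines. This is a sketch under assumptions: each record is taken to carry a creation time and an arrival time, and the percentile uses the simple nearest-rank method, which may differ from what a given BI tool or database computes.

```python
from datetime import datetime, timedelta
from statistics import mean


def arrival_delays_seconds(pairs: list[tuple[datetime, datetime]]) -> list[float]:
    """Delay in seconds between (created_at, arrived_at) for each record."""
    return [(arrived - created).total_seconds() for created, arrived in pairs]


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical timestamps: 20 records arriving 1 to 20 minutes after creation.
base = datetime(2019, 4, 1)
pairs = [(base, base + timedelta(minutes=m)) for m in range(1, 21)]
delays = arrival_delays_seconds(pairs)
mean_delay = mean(delays)
p95_delay = percentile_nearest_rank(delays, 95)
```

Reporting the 95th percentile alongside the mean matters because a handful of very late records can be hidden by an otherwise healthy average.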
Integrate into normal workflows
Our engineers work in Slack, so let them do data quality work there too
Integrate into team culture
Agile BI engineering team
● BI engineering teams set aside 10% of time for explicit data quality work
● Expect DQ dashboards for all new sources
● Weekly data quality meetings
● Now proactive, rather than reactive or retrospective
Data Quality is a Shared Responsibility
Adhere to
Schema
Automated
consumption
DQ Dashboards
Subscribe /
Report
Value Ranges Automated checks
Data
dictionaries
Investigate Investigate
Data dictionaries
+ glossary
Investigate
Single Source of Truth
Investigate
Data Catalog
Data
dictionaries
docs
schema
Monitor / investigate
What Questions Do You Have For Me?
Carl Anderson
carl.anderson@weightwatchers.com
@leapingllamas
https://medium.com/ww-tech-blog
We are hiring:
BI engineers, engineers, and data scientists for our Toronto office (a few blocks away).
Find our booth in recruiting hall.
