Building Data Quality Pipelines with Apache Spark and Delta Lake
Sandy May & Darren Fuller
Lead Data Engineers
Elastacloud
Speaker Bio
Sandy May - @spark_spartan
Databricks Champion
Data Science London Co-Organizer
Tech speaker across the UK
Passionate about Apache Spark, Databricks, AI, Data Security and Reporting platforms in Microsoft Azure
Speaker Bio
Darren Fuller - @dazfuller
Databricks Champion
Tech speaker across the UK
Passions include Apache Spark, Microsoft Azure, Raspberry Pi
Agenda
Sandy May
What is the problem? What do we need? How can we make it easy to use?
Darren Fuller
How can we investigate? Where do we go from here? What have we learnt?
Data Quality Overview
What is the problem?
• Harvard Business Review suggested Dirty Data cost US companies $3 trillion in 2017
• Business data is hard to clean generically; it often requires domain knowledge
• Dirty Data can be frustrating for Data Scientists and BI Engineers
• In the worst case, Dirty Data can produce incorrect reports and predictions, leading to potentially significant losses
Should we Build or Buy?

Build
§ Own the IP
§ Prioritise the features you want
§ Built for your use case
§ No licence fees
§ Use your core technology

Buy
§ May have a track record
§ Bugs fixed by the vendor
§ Features not thought about by the business
§ Service Level Agreements
Key Design Decisions
▪ Support to run cross-cloud
▪ Use native tools in Azure and AWS
▪ Easy for SQL devs to write rules (see the sketch below)
▪ Single reporting platform
▪ Capability to reuse custom business rules
▪ Run as part of our data ingestion pipelines with Delta Lake
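The SQL-first rule approach above can be illustrated in PySpark. This is a minimal sketch only: the rule names, predicates, and the `/mnt/lake/raw/orders` path are hypothetical assumptions, not taken from the deck; it simply shows how SQL-literate developers could express rules as plain predicates and have Spark evaluate them row by row.

```python
from functools import reduce

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rule set: name -> SQL predicate every good row must satisfy
rules = {
    "customer_id_present": "customer_id IS NOT NULL",
    "amount_non_negative": "amount >= 0",
    "country_code_valid": "length(country_code) = 2",
}

# Assumed input path; any Spark-readable source works the same way
df = spark.read.format("delta").load("/mnt/lake/raw/orders")

# Evaluate each rule as its own boolean column so failures stay attributable
for name, predicate in rules.items():
    df = df.withColumn(f"rule_{name}", F.expr(predicate))

# A row is valid only if every rule passed
df = df.withColumn(
    "is_valid",
    reduce(lambda a, b: a & b, [F.col(f"rule_{n}") for n in rules]),
)
```

Keeping one column per rule, rather than a single pass/fail flag, is what makes the single reporting platform possible: failure counts can be aggregated per rule, per table, per load.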
Enterprise Data Warehouse
Let’s Build it!
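The live demo isn't reproduced here, but a sketch of the kind of ingestion-time flow it walks through might look like the following: validate incoming rows, quarantine failures, and land clean rows in a curated Delta table. All paths and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed landing-zone input
incoming = (spark.read.format("csv")
    .option("header", "true")
    .load("/mnt/lake/landing/orders"))

# A single combined predicate stands in for the full rule set from the
# previous sketch; column names are illustrative
validated = incoming.withColumn(
    "is_valid",
    F.col("customer_id").isNotNull() & (F.col("amount").cast("double") >= 0),
)

# Route failing rows to a quarantine table, stamped for the reporting layer
(validated.filter(~F.col("is_valid"))
    .withColumn("loaded_at", F.current_timestamp())
    .write.format("delta").mode("append")
    .save("/mnt/lake/quarantine/orders"))

# Land passing rows in the curated zone, dropping the helper column
(validated.filter(F.col("is_valid"))
    .drop("is_valid")
    .write.format("delta").mode("append")
    .save("/mnt/lake/curated/orders"))
```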
Summing Up
Conclusions
▪ Building can be quick and effective
▪ Prioritise your own business needs; you know your data best
▪ Can be used as a stopgap while you stand up an off-the-shelf product
▪ Easy to run as part of ingestion pipelines
▪ Business value in reports and reuse of rules
▪ Use Delta Lake for Schema Evolution (see the sketch below)
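On the last point, schema evolution in Delta Lake is enabled per write with the `mergeSchema` option. A minimal sketch, assuming an illustrative table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A new batch carrying an extra column ("channel") not yet in the table
df = spark.createDataFrame(
    [("c-001", 42.0, "web")],
    ["customer_id", "amount", "channel"],
)

(df.write
    .format("delta")
    .option("mergeSchema", "true")  # evolve the table schema on append
    .mode("append")
    .save("/mnt/lake/curated/orders"))  # illustrative table path
```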
Thanks for listening!
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
