Building Data Quality pipelines with Apache Spark and Delta Lake
This document summarizes a presentation on building data quality pipelines with Apache Spark and Delta Lake, emphasizing the significant cost of dirty data to businesses. The speakers walk through the key design decisions behind a system that meets specific business needs while remaining easy for developers to use, and conclude with the benefits of building a custom solution over buying an off-the-shelf product, particularly for improving data ingestion processes.
Speaker Bio
Sandy May - @spark_spartan
Databricks Champion
Data Science London Co-Organizer
Tech speaker across the UK
Passionate about Apache Spark, Databricks, AI, Data Security and Reporting platforms in Microsoft Azure
Speaker Bio
Darren Fuller - @dazfuller
Databricks Champion
Tech speaker across the UK
Passions include Apache Spark, Microsoft Azure, Raspberry Pi
Agenda
Sandy May
What is the problem? What do we need? How can we make it easy to use?
Darren Fuller
How can we investigate? Where do we go from here? What have we learnt?
What is the problem?
• Harvard Business Review suggested Dirty Data cost US companies $3 trillion in 2017
• Business data is hard to clean generically; it often requires domain knowledge
• Dirty Data can be frustrating for Data Scientists and BI Engineers
• In the worst case, Dirty Data can produce incorrect reports and predictions, leading to potentially significant losses
Should we Build or Buy?
Build
§ Own the IP
§ Prioritise the features you want
§ Built for your use case
§ No licence fees
§ Use your core technology

Buy
§ May have a track record
§ Bugs fixed by the vendor
§ Features the business had not thought about
§ Service Level Agreements
Key Design Decisions
▪ Support to run cross-cloud
▪ Use Native tools in Azure and AWS
▪ Easy for SQL Devs to write rules
▪ Single Reporting Platform
▪ Capability to reuse custom business rules
▪ Run as part of our Data Ingestion Pipelines with Delta Lake (see the sketch after this list)
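As an illustration of the last three points, here is a minimal PySpark sketch (not taken from the talk; the rule definitions, column names and storage paths are assumptions). Rules are written as plain SQL expressions, applied during ingestion, and failing rows are routed to a quarantine Delta table for investigation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rules: each one is a plain SQL boolean expression that a SQL
# developer could write, and that can be stored as configuration for reuse.
rules = {
    "customer_id_not_null": "customer_id IS NOT NULL",
    "amount_positive": "amount > 0",
    "valid_country": "country IN ('GB', 'US', 'DE')",
}

# Read the incoming batch (path and schema are assumptions for illustration).
df = spark.read.parquet("/mnt/landing/transactions/")

# A row is valid only if every rule evaluates to true.
all_rules = " AND ".join(f"({expr})" for expr in rules.values())
checked = df.withColumn("is_valid", F.expr(all_rules))

valid = checked.filter("is_valid = true").drop("is_valid")
invalid = checked.filter("is_valid IS NULL OR is_valid = false").drop("is_valid")

# For the quarantine table, record which individual rules failed so the
# failures can be surfaced in a reporting platform.
failure_flags = [
    F.when(~F.coalesce(F.expr(expr), F.lit(False)), F.lit(name))
    for name, expr in rules.items()
]
invalid = invalid.withColumn("failed_rules", F.concat_ws(",", *failure_flags))

# Valid rows continue into the curated Delta table; failing rows are
# quarantined for investigation (table locations are assumptions).
valid.write.format("delta").mode("append").save("/mnt/curated/transactions")
invalid.write.format("delta").mode("append").save("/mnt/quarantine/transactions")
```

Because the rules are just strings, they can live in configuration or a metadata table and be reused across datasets, which is what makes the approach accessible to SQL developers and keeps custom business rules shareable.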
Conclusions
▪ Building can be quick and effective
▪ Prioritise your own business needs; you know your data best
▪ Can be used as a stop-gap while you stand up a service for an off-the-shelf product
▪ Easy to run as part of ingestion pipelines
▪ Business value in reports and reuse of rules
▪ Use Delta Lake for Schema Evolution (see the sketch below)
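On the schema evolution point, a minimal sketch of how Delta Lake handles it (paths and column details are assumptions): enabling the mergeSchema option lets an append add newly arrived columns to the table instead of failing the write.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A later batch arrives with extra columns the target table has not seen
# before (the path is an assumption for illustration).
new_batch = spark.read.parquet("/mnt/landing/transactions_v2/")

# With mergeSchema enabled, Delta Lake adds the new columns to the table
# schema rather than rejecting the append; existing rows read back as null
# for the newly added columns.
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/curated/transactions"))
```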