Building Data Quality pipelines with Apache Spark and Delta Lake
This document summarizes a presentation on building data quality pipelines with Apache Spark and Delta Lake, emphasizing the significant cost of dirty data to businesses. The speakers walk through the key design decisions behind a system that meets specific business needs while remaining easy for developers to use, and conclude with the benefits of building a custom solution over buying an off-the-shelf product, particularly for improving data ingestion processes.
Speaker Bio
Sandy May - @spark_spartan
Databricks Champion
Data Science London Co-Organizer
Tech speaker across the UK
Passionate about Apache Spark, Databricks, AI, Data Security and Reporting platforms in Microsoft Azure
Speaker Bio
Darren Fuller - @dazfuller
Databricks Champion
Tech speaker across the UK
Passions include Apache Spark, Microsoft Azure, Raspberry Pi
Agenda
Sandy May
What is the problem? What do we need? How can we make it easy to use?
Darren Fuller
How can we investigate? Where do we go from here? What have we learnt?
What is the problem?
• Harvard Business Review suggested Dirty Data cost US companies $3 trillion in 2017
• Business data is hard to clean generically; it often requires domain knowledge
• Dirty Data can be frustrating for Data Scientists and BI Engineers
• In the worst case, Dirty Data can produce incorrect reports and predictions, leading to potentially significant losses
Should we Build or Buy?
Build
§ Own the IP
§ Prioritise the features you want
§ Built for your use case
§ No licence fees
§ Use your core technology

Buy
§ May have a track record
§ Bugs fixed by the vendor
§ Features the business had not thought about
§ Service Level Agreements
Key Design Decisions
▪ Support to run cross-cloud
▪ Use Native tools in Azure and AWS
▪ Easy for SQL Devs to write rules
▪ Single Reporting Platform
▪ Capability to reuse custom business rules
▪ Run as part of our Data Ingestion Pipelines with Delta Lake (see the sketch after this list)
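As an illustration of the last three points, here is a minimal PySpark sketch (not taken from the talk; the rule definitions, column names and storage paths are assumptions). Rules are written as plain SQL expressions, applied during ingestion, and failing rows are routed to a quarantine Delta table for investigation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rules: each one is a plain SQL boolean expression that a SQL
# developer could write, and that can be stored as configuration for reuse.
rules = {
    "customer_id_not_null": "customer_id IS NOT NULL",
    "amount_positive": "amount > 0",
    "valid_country": "country IN ('GB', 'US', 'DE')",
}

# Read the incoming batch (path and schema are assumptions for illustration).
df = spark.read.parquet("/mnt/landing/transactions/")

# A row is valid only if every rule evaluates to true.
all_rules = " AND ".join(f"({expr})" for expr in rules.values())
checked = df.withColumn("is_valid", F.expr(all_rules))

valid = checked.filter("is_valid = true").drop("is_valid")
invalid = checked.filter("is_valid IS NULL OR is_valid = false").drop("is_valid")

# For the quarantine table, record which individual rules failed so the
# failures can be surfaced in a reporting platform.
failure_flags = [
    F.when(~F.coalesce(F.expr(expr), F.lit(False)), F.lit(name))
    for name, expr in rules.items()
]
invalid = invalid.withColumn("failed_rules", F.concat_ws(",", *failure_flags))

# Valid rows continue into the curated Delta table; failing rows are
# quarantined for investigation (table locations are assumptions).
valid.write.format("delta").mode("append").save("/mnt/curated/transactions")
invalid.write.format("delta").mode("append").save("/mnt/quarantine/transactions")
```

Because the rules are just strings, they can live in configuration or a metadata table and be reused across datasets, which is what makes the approach accessible to SQL developers and keeps custom business rules shareable.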
Conclusions
▪ Building can be quick and effective
▪ Prioritise your own business needs; you know your data best
▪ Can be used as a stop-gap while you stand up a service for an off-the-shelf product
▪ Easy to run as part of ingestion pipelines
▪ Business value in reports and reuse of rules
▪ Use Delta Lake for Schema Evolution (see the sketch below)
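On the schema evolution point, a minimal sketch of how Delta Lake handles it (paths and column details are assumptions): enabling the mergeSchema option lets an append add newly arrived columns to the table instead of failing the write.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A later batch arrives with extra columns the target table has not seen
# before (the path is an assumption for illustration).
new_batch = spark.read.parquet("/mnt/landing/transactions_v2/")

# With mergeSchema enabled, Delta Lake adds the new columns to the table
# schema rather than rejecting the append; existing rows read back as null
# for the newly added columns.
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/curated/transactions"))
```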