End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta


Kim Hammar, Logical Clocks AB (@KimHammar1)
Jim Dowling, Logical Clocks AB (@jim_dowling)
End-to-End ML Pipelines with Databricks Delta and Hopsworks Feature Store
#UnifiedDataAnalytics #SparkAISummit

Machine Learning in the Abstract

Where does the Data come from?
“Data is the hardest part of ML and the most important piece to get
right. Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.” [Uber on Michelangelo]

Data comes from the Feature Store

How do we feed the Feature Store?

Outline
1. Hopsworks
2. Databricks Delta
3. Hopsworks Feature Store
4. Demo
5. Summary

Hopsworks
[Figure: Hopsworks platform architecture. Labels recovered from the diagram:]
• Inputs and outputs: Datasources, Applications, API, Dashboards
• Pipeline stages: Data Preparation & Ingestion, Experimentation & Model Training, Deploy & Productionalize
• Processing: Batch, Streaming, Distributed ML & DL
• Frameworks and tools: Apache Beam, Apache Spark, Apache Flink, Pip, Conda, TensorFlow, scikit-learn, Keras, Jupyter Notebooks, TensorBoard, Kubernetes
• Services: Hopsworks Feature Store, Model Serving, Model Monitoring, Kafka + Spark Streaming
• Orchestration in Airflow
• Filesystem and Metadata storage: HopsFS

Next-Gen Data Lakes
Data Lakes are starting to resemble databases:
– Apache Hudi, Delta, and Apache Iceberg add:
• ACID transactional layers on top of the data lake
• Indexes to speed up queries (data skipping)
• Incremental Ingestion (late data, delete existing records)
• Time-travel queries
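The data-skipping idea in the list above can be illustrated with a small sketch. It assumes each data file carries min/max statistics for one column and that the query is a simple range filter; the `FileStats` class, the file names, and the `day` column are hypothetical and not taken from Delta, Hudi, or Iceberg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file column statistics, as a table format might keep in its metadata."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Return paths of files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical file listing with min/max of an integer "day" column.
files = [
    FileStats("part-0.parquet", 1, 10),
    FileStats("part-1.parquet", 11, 20),
    FileStats("part-2.parquet", 21, 30),
]

# A filter such as "day BETWEEN 12 AND 15" can skip every file whose range cannot match.
selected = files_to_scan(files, 12, 15)
print(selected)
```

Skipping only helps when the data is laid out so that file ranges rarely overlap, which is one motivation for clustering data on the filtered columns.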

Problems: No Incremental Updates, No rollback on failure, No Time-Travel, No Isolation.

Solution: Incremental ETL with ACID Transactions

Upsert & Time Travel Example

Delta Lake by Databricks
• Delta Lake is a Transactional Layer that sits on top of your Data Lake:
– ACID Transactions with Optimistic Concurrency Control
– Log-Structured Storage
– Open Format (Parquet-based storage)
– Time-travel
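How log-structured storage gives upserts and time travel can be sketched in plain Python. This is a toy in-memory model, not the Delta Lake API: the `VersionedTable` class and its methods are hypothetical, and a real table format stores commits as files in a transaction log rather than as Python dicts.

```python
class VersionedTable:
    """Toy log-structured table: every commit appends a batch of upserts to a log."""

    def __init__(self):
        self._log = []  # list of commits; each commit maps primary key -> row

    def upsert(self, rows):
        """Append one commit containing `rows` (key -> row) and return its version."""
        self._log.append(dict(rows))
        return len(self._log) - 1

    def read(self, version=None):
        """Replay the log up to `version` (inclusive); the latest version if None."""
        if version is None:
            version = len(self._log) - 1
        state = {}
        for commit in self._log[: version + 1]:
            state.update(commit)  # later commits overwrite rows with the same key
        return state


table = VersionedTable()
v0 = table.upsert({1: {"name": "alice", "city": "Stockholm"}})
v1 = table.upsert({
    1: {"name": "alice", "city": "Palo Alto"},  # update of an existing key
    2: {"name": "bob", "city": "Kista"},        # insert of a new key
})

latest = table.read()             # current state after both commits
as_of_v0 = table.read(version=v0) # time travel: state as of the first commit
print(latest)
print(as_of_v0)
```

Because commits are only ever appended, older versions stay readable until they are explicitly cleaned up, which is why time travel falls out of the log design.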

Optimistic Concurrency Control

Mutual Exclusion for Writers
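A minimal sketch of optimistic concurrency control with mutual exclusion for writers, under stated assumptions: each commit is a numbered log entry, and storage offers an atomic "write only if this version does not exist yet" operation, modeled here with a lock. The names `TransactionLog`, `try_commit`, and `commit_with_retry` are hypothetical. The sketch also retries blindly, whereas a real implementation would first check whether the winning commit actually conflicts with the pending one.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer has already committed the targeted version."""


class TransactionLog:
    def __init__(self):
        self._entries = {}             # version -> commit payload
        self._lock = threading.Lock()  # stands in for an atomic put-if-absent in storage

    def latest_version(self):
        with self._lock:
            return max(self._entries, default=-1)

    def try_commit(self, version, payload):
        """Atomically record `payload` as `version`; fail if that version already exists."""
        with self._lock:
            if version in self._entries:
                raise CommitConflict(f"version {version} already committed")
            self._entries[version] = payload


def commit_with_retry(log, payload, max_attempts=5):
    """Optimistic writer: read the latest version, try to write the next one, retry on conflict."""
    for _ in range(max_attempts):
        target = log.latest_version() + 1
        try:
            log.try_commit(target, payload)
            return target
        except CommitConflict:
            continue
    raise RuntimeError("gave up after repeated conflicts")


log = TransactionLog()
first = commit_with_retry(log, "writer A: add file-1")

# Writer B planned its commit against an empty log, so it also targets version 0 and loses.
try:
    log.try_commit(0, "writer B: add file-2")
    conflict_detected = False
except CommitConflict:
    conflict_detected = True

# After the conflict, writer B re-reads the log and commits as the next version.
second = commit_with_retry(log, "writer B: add file-2")
print(first, conflict_detected, second)
```

No locks are held while a writer prepares its data; exclusion is only needed for the short step that claims the next version number.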

Scalable Metadata Management

Other Frameworks: Apache Hudi, Apache Iceberg
• Hudi was developed by Uber for their Hadoop Data Lake (HDFS first, then S3 support)
• Iceberg was developed by Netflix with S3 as the target storage layer
• All three frameworks (Delta, Hudi, Iceberg) share the goals of adding ACID updates, incremental ingestion, and efficient queries.

Next-Gen Data Lakes Compared
Feature | Delta | Hudi | Iceberg
Incremental Ingestion | Spark | Spark | Spark
ACID updates | HDFS, S3* | HDFS | S3, HDFS
File Formats | Parquet | Avro, Parquet | Parquet, ORC
Data Skipping (File-Level Indexes) | Min-Max Stats + Z-Order Clustering* | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering
Concurrency Control | Optimistic | Optimistic | Optimistic
Data Validation | Expectations (coming soon) | In Hopsworks | N/A
Merge-on-Read | No | Yes (coming soon) | No
Schema Evolution | Yes | Yes | Yes
File I/O Cache | Yes* | No | No
Cleanup | Manual | Automatic, Manual | No
Compaction | Manual | Automatic | No
*Databricks version only (not open-source)

How can a Feature Store leverage Log-Structured Storage (e.g., Delta or Hudi or Iceberg)?

[Figure: Hopsworks Feature Store architecture. Labels recovered from the diagram:]
• Layers: Feature Mgmt, Storage, Access
• Feature Mgmt: Statistics, Discovery, Online Features, Offline Features, Feature CRUD, Access Control, Time Travel, Data Validation
• Storage: MySQL Cluster (Metadata, Online Features), Apache Hive Columnar DB (Offline Features), Training Data (S3, HDFS), Models, HopsFS, External DB
• Access: Pandas or PySpark DataFrame, JDBC (SAS, R, etc)
• Users and clients: Data Scientist, Data Engineer, Online Apps, Batch Apps
• Feature Data Ingestion; Feature Defn "select .."
• Discover features, create training data, save models, read online/offline/on-demand features, historical feature values.
• Feature CRUD: Add/remove features, access control, feature data validation.
AWS Sagemaker and Databricks Integration
• Computation engine (Spark)
• Incremental ACID Ingestion
• Time-Travel
• Data Validation
• On-Demand or Cached Features
• Online or Offline Features

Incremental Feature Engineering with Hudi
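Incremental feature engineering can be sketched as "pull only what was committed since the last run, then fold it into the feature". The example below is a plain-Python illustration under stated assumptions, not the Hudi API: the commit timeline, the `incremental_pull` helper, and the per-customer running total are all hypothetical.

```python
from collections import defaultdict

# Hypothetical commit timeline: (commit_time, records); a record is (customer_id, amount).
commits = [
    (1, [("c1", 10.0), ("c2", 5.0)]),
    (2, [("c1", 7.5)]),
    (3, [("c2", 2.5), ("c3", 1.0)]),
]


def incremental_pull(commits, after_commit):
    """Return the records of commits newer than `after_commit`, plus the new checkpoint."""
    new_records = []
    checkpoint = after_commit
    for commit_time, records in commits:
        if commit_time > after_commit:
            new_records.extend(records)
            checkpoint = max(checkpoint, commit_time)
    return new_records, checkpoint


def update_feature(totals, records):
    """Fold new records into a running per-customer total (the feature)."""
    for customer_id, amount in records:
        totals[customer_id] += amount
    return totals


totals = defaultdict(float)

# First run: only commits 1 and 2 exist yet.
batch1, checkpoint = incremental_pull(commits[:2], after_commit=0)
update_feature(totals, batch1)

# Second run: commit 3 has arrived; only its records are read and processed.
batch2, checkpoint = incremental_pull(commits, after_commit=checkpoint)
update_feature(totals, batch2)
print(dict(totals), checkpoint)
```

The stored checkpoint is what makes the job incremental: each run does work proportional to the new data rather than to the full table.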

Point-in-Time Correct Feature Data
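Point-in-time correctness means that a training row for an event at time t only sees feature values that were known at or before t, so no future information leaks into training data. A minimal sketch, assuming a per-entity feature history sorted by timestamp; the data, the entity ids, and the `feature_as_of` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history per entity: list of (event_time, value), sorted by time.
feature_history = {
    "c1": [(1, 0.2), (5, 0.6), (9, 0.9)],
    "c2": [(3, 0.4)],
}


def feature_as_of(history, entity_id, event_time):
    """Latest feature value for `entity_id` with timestamp <= event_time, else None."""
    rows = history.get(entity_id, [])
    times = [t for t, _ in rows]
    idx = bisect_right(times, event_time)  # number of rows with time <= event_time
    return rows[idx - 1][1] if idx > 0 else None


# Label events: (entity_id, event_time, label).
labels = [("c1", 6, 1), ("c1", 0, 0), ("c2", 3, 1)]

# Join each label with the feature value that was valid at the label's event time.
training_rows = [
    (entity, t, feature_as_of(feature_history, entity, t), label)
    for entity, t, label in labels
]
print(training_rows)
```

The label at time 6 for c1 gets the value written at time 5, not the later value from time 9, and the label at time 0 gets no value because none existed yet.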

Feature Time Travel with Hudi and Hopsworks Feature Store

Demo: Hopsworks Featurestore + Databricks Platform

Summary
• Delta, Hudi, Iceberg bring Reliability, Upserts & Time-Travel to
Data Lakes
– Functionalities that are well suited for Feature Stores
• Hopsworks Feature Store builds on Hudi/Hive and is the world’s
first open-source Feature Store (released 2018)
• The Hopsworks Platform also supports End-to-End ML pipelines
using the Feature Store and Spark/Beam/Flink, TensorFlow/PyTorch,
and Airflow

Thank you!
470 Ramona St, Palo Alto
Kista, Stockholm
https://www.logicalclocks.com
Register for a free account at
www.hops.site
Twitter
@logicalclocks
@hopsworks
GitHub
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops

• Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso,
Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis,
Robin Andersson, Alex Ormenisan, Rasmus Toivonen, Steffen Grohsschmiedt, and Moritz
Meister
