End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
The document discusses the importance of data in machine learning and presents an overview of tools like Hopsworks, Databricks Delta, and various feature stores. It highlights advancements in data lakes with ACID transactions, incremental ingestion, and efficient querying using frameworks such as Delta, Hudi, and Iceberg. The Hopsworks feature store is emphasized as the world's first open-source feature store, supporting end-to-end ML pipelines with reliable and timely data access.
Introduction of WIFI details, presenters Kim Hammar and Jim Dowling, topic focus on ML Pipelines with Databricks and Hopsworks.
Explains the significance of data in ML, stating it's the hardest part, with modelers focusing on feature selection and transformation.
Introduction to data sourcing from the Feature Store, emphasizing its vital role in ML operations.
An outline of the presentation covering Hopsworks, Databricks Delta, Feature Store, demo, and summary.
Describes the ecosystem including data sources and applications, and how Hopsworks integrates various technologies like Apache Beam, Spark, and TensorFlow.
Defines new characteristics of Data Lakes, like ACID transactional layers, and solutions for issues like incremental updates and rollback failures.
Details on Upsert and Time Travel functionalities in data management, providing examples of how these concepts work.
Discusses Delta Lake's transactional layer, ACID transactions, open format storage, and time-travel capabilities.
Covers optimistic concurrency control, mutual exclusion, retrial strategies, and how scalable metadata management works.
Comparison of Delta, Hudi, and Iceberg frameworks, highlighting their common goals of reliable updates and storage efficiency.
Discusses how Feature Stores can use log-structured storage, and the integration with Databricks for incremental ACID ingestion and data validation.
Explains incremental feature engineering and point-in-time correct data with examples using Hudi and Hopsworks.
Demonstrates the integration of Hopsworks Feature Store and Databricks platform in action.
Summarizes key functionalities of Delta, Hudi, Iceberg for data lakes and introduces Hopsworks as an open-source feature store supporting end-to-end ML.
Provides company information, resources for further reading, and thanks to team members involved in the project.
Kim Hammar, Logical Clocks AB KimHammar1
Jim Dowling, Logical Clocks AB jim_dowling
End-to-End ML Pipelines
with Databricks Delta and
Hopsworks Feature Store
#UnifiedDataAnalytics #SparkAISummit
Where does the Data come from?
“Data is the hardest part of ML and the most important piece to get
right. Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.” [Uber on Michelangelo]
Next-Gen Data Lakes
Data Lakes are starting to resemble databases:
– Apache Hudi, Delta, and Apache Iceberg add:
• ACID transactional layers on top of the data lake
• Indexes to speed up queries (data skipping)
• Incremental Ingestion (late data, delete existing records)
• Time-travel queries
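To make upserts and time-travel queries concrete, here is a minimal pure-Python sketch of a log-structured table. It is not the Hudi, Delta, or Iceberg API; the class `VersionedTable` and its methods are hypothetical names chosen for illustration, and a real system would store commits as files on the data lake rather than in memory.

```python
from typing import Dict, List, Optional


class VersionedTable:
    """Toy log-structured table: every commit appends an immutable batch of changes."""

    def __init__(self) -> None:
        # Each log entry maps primary key -> row, or None for a delete (tombstone).
        self._log: List[Dict[str, Optional[dict]]] = []

    def upsert(self, rows: Dict[str, dict]) -> int:
        """Insert new keys or overwrite existing ones; returns the new version number."""
        self._log.append(dict(rows))
        return len(self._log)

    def delete(self, keys: List[str]) -> int:
        """Record tombstones for the given keys; returns the new version number."""
        self._log.append({key: None for key in keys})
        return len(self._log)

    def snapshot(self, version: Optional[int] = None) -> Dict[str, dict]:
        """Replay the log up to `version` (default: latest) to rebuild the table state."""
        upto = len(self._log) if version is None else version
        state: Dict[str, dict] = {}
        for commit in self._log[:upto]:
            for key, row in commit.items():
                if row is None:
                    state.pop(key, None)
                else:
                    state[key] = row
        return state


table = VersionedTable()
v1 = table.upsert({"u1": {"clicks": 3}, "u2": {"clicks": 7}})
v2 = table.upsert({"u1": {"clicks": 5}, "u3": {"clicks": 1}})  # update u1, insert u3
v3 = table.delete(["u2"])                                      # delete an existing record

latest = table.snapshot()              # current state
as_of_v1 = table.snapshot(version=v1)  # time-travel query: state after the first commit
```

Because commits are never modified in place, reading an older version only requires replaying a shorter prefix of the log.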
Delta Lake by Databricks
• Delta Lake is a Transactional Layer that sits on top of your Data Lake:
– ACID Transactions with Optimistic Concurrency Control
– Log-Structured Storage
– Open Format (Parquet-based storage)
– Time-travel
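The idea behind optimistic concurrency control on a transaction log can be sketched as follows. This is a simplified illustration, not Delta Lake's implementation: `TransactionLog`, `try_commit`, and `commit_with_retry` are hypothetical names, the in-process lock stands in for whatever atomic primitive the log store provides, and any intervening commit is treated as a conflict, whereas real systems typically apply finer-grained conflict checks.

```python
import threading
from typing import Callable, List


class CommitConflict(Exception):
    """Raised when another writer committed after our snapshot was read."""


class TransactionLog:
    def __init__(self) -> None:
        self._entries: List[str] = []
        # Assumption: stands in for an atomic "create if absent" on the log store.
        self._lock = threading.Lock()

    def version(self) -> int:
        return len(self._entries)

    def try_commit(self, read_version: int, entry: str) -> int:
        """Append `entry` only if nobody has committed since `read_version`."""
        with self._lock:
            if len(self._entries) != read_version:
                raise CommitConflict(
                    f"expected version {read_version}, found {len(self._entries)}"
                )
            self._entries.append(entry)
            return len(self._entries)


def commit_with_retry(
    log: TransactionLog, make_entry: Callable[[int], str], max_attempts: int = 3
) -> int:
    """Re-read the latest version and retry when a commit loses the race."""
    for _ in range(max_attempts):
        read_version = log.version()
        try:
            return log.try_commit(read_version, make_entry(read_version))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


log = TransactionLog()
stale_version = log.version()               # writer A reads version 0
log.try_commit(log.version(), "writer-B")   # writer B commits first
try:
    log.try_commit(stale_version, "writer-A")  # A's commit is based on a stale read
    conflict = False
except CommitConflict:
    conflict = True
final_version = commit_with_retry(log, lambda v: f"writer-A@{v}")
```

Writers never block each other while preparing data; the only serialized step is the short check-and-append on the log.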
Other Frameworks: Apache Hudi, Apache Iceberg
• Hudi was developed by Uber for their Hadoop Data Lake (HDFS first, then S3 support)
• Iceberg was developed by Netflix with S3 as target storage layer
• All three frameworks (Delta, Hudi, Iceberg) have common goals of adding ACID updates, incremental ingestion, and efficient queries.
Next-Gen Data Lakes Compared
| Feature | Delta | Hudi | Iceberg |
|---|---|---|---|
| Incremental Ingestion | Spark | Spark | Spark |
| ACID updates | HDFS, S3* | HDFS | S3, HDFS |
| File Formats | Parquet | Avro, Parquet | Parquet, ORC |
| Data Skipping (File-Level Indexes) | Min-Max Stats + Z-Order Clustering* | File-Level Max-Min stats + Bloom Filter | File-Level Max-Min Filtering |
| Concurrency Control | Optimistic | Optimistic | Optimistic |
| Data Validation | Expectations (coming soon) | In Hopsworks | N/A |
| Merge-on-Read | No | Yes (coming soon) | No |
| Schema Evolution | Yes | Yes | Yes |
| File I/O Cache | Yes* | No | No |
| Cleanup | Manual | Automatic, Manual | No |
| Compaction | Manual | Automatic | No |

*Databricks version only (not open-source)
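File-level min-max statistics, which all three frameworks use for data skipping, can be illustrated with a short sketch. The `FileStats` record and `files_to_scan` function are hypothetical names, and the file paths and value ranges are made-up illustration data; real engines read such statistics from table metadata or file footers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_to_scan(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


stats = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A query such as "WHERE value BETWEEN 150 AND 220" can skip part-0 entirely.
selected = files_to_scan(stats, lo=150, hi=220)
```

Skipping works best when data is clustered so that each file covers a narrow value range, which is the motivation for techniques such as Z-order clustering.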
How can a Feature Store leverage Log-Structured Storage (e.g., Delta, Hudi, or Iceberg)?
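One way to picture the answer: a commit log lets a feature pipeline pull only the rows that changed since its last run (incremental feature engineering) and lets training jobs read feature values as they were at a given time (point-in-time correct data). The sketch below is a pure-Python illustration under the assumption that commits are ordered by timestamp; `incremental_pull` and `as_of` are hypothetical helpers, not Hudi, Delta, or Hopsworks functions.

```python
from typing import Dict, List, Tuple

# Assumption: each commit is (commit timestamp, {entity key: feature value}),
# and the log is ordered by timestamp.
Commit = Tuple[int, Dict[str, float]]


def incremental_pull(log: List[Commit], since: int) -> Dict[str, float]:
    """Latest value per key among commits made strictly after `since`."""
    changed: Dict[str, float] = {}
    for ts, rows in log:
        if ts > since:
            changed.update(rows)
    return changed


def as_of(log: List[Commit], ts: int) -> Dict[str, float]:
    """Feature values as they were at time `ts` (later commits are ignored)."""
    state: Dict[str, float] = {}
    for commit_ts, rows in log:
        if commit_ts <= ts:
            state.update(rows)
    return state


log = [
    (10, {"u1": 1.0, "u2": 2.0}),
    (20, {"u1": 1.5}),
    (30, {"u3": 3.0}),
]

new_rows = incremental_pull(log, since=10)  # only what changed after the last run
features_at_15 = as_of(log, 15)             # what a label created at t=15 may see
```

Reading features with `as_of` at the label's timestamp avoids leaking values that were written after the event being predicted.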
Hopsworks Feature Store
[Figure: Hopsworks Feature Store architecture, organized into Feature Mgmt, Storage, and Access.
• Feature Mgmt: Statistics, Discovery, Access Control, Time Travel, Data Validation.
• Feature Data Ingestion: Data Engineers ingest from a Pandas or PySpark DataFrame, or from an External DB with a Feature Defn ("select .."). Feature CRUD: add/remove features, access control, feature data validation.
• Storage: MySQL Cluster (Metadata, Online Features), Apache Hive Columnar DB (Offline Features), HopsFS. Outputs: Training Data (S3, HDFS) and Models.
• Access: Data Scientists discover features, create training data, save models, and read online/offline/on-demand features and historical feature values. Online Apps read Online Features; Batch Apps read Offline Features; JDBC access (SAS, R, etc).]
AWS Sagemaker and Databricks Integration
• Computation engine (Spark)
• Incremental ACID Ingestion
• Time-Travel
• Data Validation
• On-Demand or Cached Features
• Online or Offline Features
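Data validation combined with all-or-nothing ingestion can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not the Hopsworks validation API: `validate`, `ingest`, the rule names, and the `age`/`income` features are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Rule = Callable[[Row], bool]


def validate(rows: List[Row], rules: Dict[str, Rule]) -> List[str]:
    """Return one message per (row, rule) violation; an empty list means valid."""
    failures: List[str] = []
    for i, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                failures.append(f"row {i} failed '{name}'")
    return failures


def ingest(feature_group: List[Row], batch: List[Row], rules: Dict[str, Rule]) -> bool:
    """All-or-nothing: append the batch only if every row passes every rule."""
    if validate(batch, rules):
        return False
    feature_group.extend(batch)
    return True


# Hypothetical rules for two hypothetical features.
rules: Dict[str, Rule] = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "income_non_negative": lambda r: r["income"] >= 0,
}

feature_group: List[Row] = []
good_batch = [{"age": 34, "income": 52000.0}, {"age": 61, "income": 48000.0}]
bad_batch = [{"age": 29, "income": 31000.0}, {"age": 999, "income": 10.0}]

ok_good = ingest(feature_group, good_batch, rules)  # accepted as a whole
ok_bad = ingest(feature_group, bad_batch, rules)    # rejected as a whole
```

Rejecting the whole batch keeps partially valid data out of the feature group, mirroring the atomicity that an ACID ingestion layer provides.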
Summary
• Delta, Hudi, and Iceberg bring Reliability, Upserts & Time-Travel to Data Lakes
– Functionalities that are well suited for Feature Stores
• Hopsworks Feature Store builds on Hudi/Hive and is the world’s first open-source Feature Store (released 2018)
• The Hopsworks Platform also supports End-to-End ML pipelines using the Feature Store and Spark/Beam/Flink, TensorFlow/PyTorch, and Airflow
Thank you!
470 Ramona St, Palo Alto
Kista, Stockholm
https://www.logicalclocks.com
Register for a free account at
www.hops.site
Twitter
@logicalclocks
@hopsworks
GitHub
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
References
• Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/
• Python-First ML Pipelines with Hopsworks
https://hops.readthedocs.io/en/latest/hopsml/hopsML.html
• Hopsworks white paper.
https://www.logicalclocks.com/whitepapers/hopsworks
• HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases.
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
• Open Source:
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
• Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso,
Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis,
Robin Andersson, Alex Ormenisan, Rasmus Toivonen, Steffen Grohsschmiedt, and Moritz
Meister