[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
This document provides an overview of the Databricks platform. It discusses how Databricks combines features of data warehouses and data lakes to create a "data lakehouse" that supports both business intelligence/reporting and data science/machine learning use cases. Key components of the Databricks platform include Apache Spark, Delta Lake, MLflow, Jupyter notebooks, and Delta Live Tables. The platform aims to unify data engineering, data warehousing, streaming, and data science tasks on a single open-source platform.
Overview of the company and its mission focused on data-driven business decisions, highlighting global reach and experience in data and analytics.
Presentation of comprehensive cloud data and analytics solutions, including data management, governance, and engineering.
Overview of Databricks and its foundation, emphasizing its historical significance and leadership in the data lakehouse concept.
Databricks' expanding customer base and workforce, highlighting its significant funding and market presence.
Fundamentals of the data lakehouse, merging strengths of data warehouses and lakes while addressing shortcomings.
Stats highlighting failure rates in data projects and issues related to governance, silos, and data access.
Advantages of the Databricks Lakehouse, focusing on unifying data usage and ensuring consistent performance across platforms.
Insights into platform integrations, computing resources, and technical foundations of Databricks, including key components like Apache Spark and Delta Lake.
Detailed overview of essential tools like Apache Spark, Jupyter Notebooks, Delta Lake, and MLflow, explaining their functionality and purpose.
Presentation on SQL capabilities and various Databricks functions, concluding with an interactive Q&A session.
Final remarks and opening the floor for audience questions, allowing for engagement and clarification on topics discussed.
inteligencija.com
We are a Data & Analytics consulting company committed to delivering great solutions and products that enable our clients to unlock hidden opportunities within data, become data-driven, and make better business decisions.
Our goal is to enable data-driven business decisions.
Offices in the UK, Sweden, Austria, Slovenia and Croatia
200+ employees
20 years in Data & Analytics
250+ projects
100+ clients on 5 continents
We deliver E2E Cloud Data & Analytics solutions
Data Strategy & Governance: Discover opportunities for data monetization, assess organizational maturity, evaluate architectural options, define a migration-to-cloud strategy, plan and prioritize projects, and estimate costs. Implement practices, concepts and processes dedicated to leveraging data as a valuable asset.
Data Management: Design data models, improve data quality and master data, protect data, manage the whole data supply chain, and make data available for any relevant business need.
Data Science & Analytics: Utilize data and answer business questions through reporting, self-service BI and data visualization. Use machine learning algorithms to uncover unseen patterns, insights and trends in data and derive meaningful information.
Performance Management: Automate budgeting and forecasting, financial consolidation, and performance management reporting.
Data Engineering: Collect and store data at scale, from multiple sources and formats, and make it reliable and consistent for analysis.
The story about Databricks
• The team that built Apache Spark founded Databricks in 2013
• They started several OSS projects:
• Apache Spark
• Delta Lake
• MLflow
• Invented the Data Lakehouse concept
• Named a Leader by Gartner in both:
• Cloud Database Management Systems
• Data Science and Machine Learning Platforms
Data Lakehouse Concept
• Marries Data Warehouses and Data Lakes
• Data Warehouses
• Built for efficient BI and reporting
• But:
• Poor support for unstructured data, data science and
streaming
• Closed formats
• Expensive to scale
Data Lakehouse Concept
• Data Lakes
• Store any kind of data
• Cheap storage
• Allow for exploratory data analysis and streaming use cases
• However:
• Complex to set up
• Poor BI performance
• Often devolve into data swamps
Gartner insights
• 85% of Big Data and Data Science projects fail
• $3.9T business value created by AI in 2022 (by the 15% ?)
• Why do Data Science projects fail?
• Recent MIT Technology Review survey of 600 C-level
executives:
“72 percent of the technology executives we surveyed for this study say that, should their
companies fail to achieve their AI goals, data issues are more likely than not to be the reason.
Improving processing speeds, governance, and quality of data, as well as its sufficiency for
models, are the main data imperatives to ensure AI can be scaled, say the survey
respondents.”
The usual problems
• Ill-defined use cases
• Data warehouses and data lakes in separate silos:
• Data often duplicated and/or difficult to access (formats,
interfaces)
• Difficult to consolidate security models
• Difficult to apply governance
Databricks Lakehouse Platform - benefits
• Unifies Data Warehouse and AI use cases on a single
platform
• Built on open source and open standards
• Consistent across cloud providers (Azure, AWS, GCP)
• Provides ACID transactions
• Schema enforcement capabilities
• In one platform:
• Data Warehousing
• Data Engineering
• Data Streaming
• Data Science and ML
• Data Governance
Apache Spark
• General-purpose, distributed data processing engine
• Efficient and fast
• Spark SQL, Spark Streaming, Spark ML
• APIs in Java, Scala, Python, R
• Widely used today – ubiquitous
• Databricks provides Photon execution engine on top
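The Spark SQL module mentioned above can query data files in place. A minimal illustrative query — the path and column names are hypothetical:

```sql
-- Spark SQL can query a Parquet directory directly, without registering a table first.
-- The path and column names below are hypothetical.
SELECT status, COUNT(*) AS event_count
FROM parquet.`/mnt/raw/events`
GROUP BY status
ORDER BY event_count DESC;
```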
Jupyter notebooks
• Web-based, interactive and collaborative
• Databricks supports Python, SQL, R and Scala
• Can also serve as documentation (can be exported to
HTML, PDF, etc.)
• Can be executed as jobs in Databricks and organized in
Pipelines
• In Databricks, notebooks are attached to clusters
Delta Lake
• Data storage framework built on top of Parquet
• Provides ACID transactions; upserts (MERGE statements)
and deletes
• Schema enforcement
• Time travel
• Scalable metadata handling
• Unifies streaming and batch processing
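The upsert and time-travel features above can be sketched in Databricks SQL — table and column names here are hypothetical:

```sql
-- Upsert: merge incoming changes into a Delta table (hypothetical tables/columns)
MERGE INTO customers AS t
USING customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);

-- Time travel: read the table as it was at an earlier version or timestamp
SELECT * FROM customers VERSION AS OF 12;
SELECT * FROM customers TIMESTAMP AS OF '2022-11-01';
```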
Delta Live Tables
• Framework for building data processing pipelines
• You define transformations and DLT manages:
• Orchestration
• Cluster management
• Monitoring
• Data quality (Expectations)
• Error handling
• Can perform CDC with APPLY CHANGES INTO .. FROM ..
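As a sketch, a DLT table with an Expectation, plus the CDC statement above, might look like this in SQL — table, column, and constraint names are hypothetical:

```sql
-- Declare a live table with a data-quality expectation;
-- rows violating the constraint are dropped and counted in pipeline metrics
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);

-- CDC: declare the target, then apply inserts/updates/deletes from a change feed
CREATE OR REFRESH STREAMING LIVE TABLE orders;

APPLY CHANGES INTO LIVE.orders
FROM STREAM(LIVE.orders_cdc)
KEYS (order_id)
SEQUENCE BY change_timestamp;
```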
MLflow
• Framework for managing machine learning lifecycles
• MLflow Tracking – tracks experiments and runs,
parameters, metrics
• MLflow Models – storage format for describing models of
different “flavors” (e.g. sklearn, keras, xgboost, etc.)
• MLflow Projects – package code in a format to reproduce
runs on different platforms
• Model registry – manage models in a central repository