inteligencija.com
Overview of the Databricks Platform
Petar Zečević
Poslovna inteligencija
We are a Data & Analytics consulting company committed to delivering great solutions and products that enable our clients to unlock hidden opportunities within data, become data-driven and make better business decisions.
Our goal is to enable data-driven business decisions
• Offices in the UK, Sweden, Austria, Slovenia and Croatia
• 200+ employees
• 20 years in Data & Analytics
• 250+ projects
• 100+ clients on 5 continents
We deliver E2E Cloud Data & Analytics solutions
• Data Strategy & Governance: Discover opportunities for data monetization, assess organizational maturity, evaluate architectural options, define a cloud migration strategy, plan and prioritize projects and estimate costs. Implement practices, concepts and processes dedicated to leveraging data as a valuable asset.
• Data Management: Design data models, improve data quality and master data, protect data, manage the whole data supply chain and make data available for any relevant business need.
• Data Science & Analytics: Utilize data and answer business questions through reporting, self-service BI and data visualization. Use machine learning algorithms to uncover unseen patterns, insights and trends in data and derive meaningful information.
• Performance Management: Automate budgeting and forecasting, financial consolidation and performance management reporting.
• Data Engineering: Collect and store data at scale, from multiple sources and formats, and make it reliable and consistent for analysis.
Databricks Lakehouse Platform
The story about Databricks
• The team who built Apache Spark founded Databricks in 2013
• They started several OSS projects:
• Apache Spark
• Delta Lake
• MLflow
• Invented the Data Lakehouse concept
• Named a leader by Gartner in both
• Database Management Systems
• 7000+ customers
• 3000+ employees
• Received more than $3B in funding
Data Lakehouse Concept
• Marries Data Warehouses and Data Lakes
• Data Warehouses
• Built for efficient BI and reporting
• But:
• Poor support for unstructured data, data science and streaming
• Closed formats
• Expensive to scale
Data Lakehouse Concept
• Data Lakes
• Store any kind of data
• Cheap storage
• Allow for exploratory data analysis and streaming use cases
• However:
• Complex to set up
• Poor BI performance
• Often devolve into data swamps
Gartner insights
• 85% of Big Data and Data Science projects fail
• $3.9T business value created by AI in 2022 (by the 15% ?)
• Why do Data Science projects fail?
• Recent MIT Technology Review survey of 600 C-level executives:
“72% of the technology executives we surveyed for this study say that, should their companies fail to achieve their AI goals, data issues are more likely than not to be the reason. Improving processing speeds, governance, and quality of data, as well as its sufficiency for models, are the main data imperatives to ensure AI can be scaled, say the survey respondents.”
The usual problems
• Ill-defined use cases
• Data warehouses and data lakes in separate silos:
• Data often duplicated and/or difficult to access (formats, interfaces)
• Difficult to consolidate security models
• Difficult to apply governance
Databricks Lakehouse Platform - benefits
• Unifies Data Warehouse and AI use cases on a single platform
• Built on open source and open standards
• Consistent across cloud providers (Azure, AWS, GCP)
• Provides ACID transactions
• Schema enforcement capabilities
• In one platform:
• Data Warehousing
• Data Engineering
• Data Streaming
• Data Science and ML
• Data Governance
Platform Integrations
Computing resources
• Clusters
• One or more VM instances running Spark components: Driver and Executors
• Required for running notebooks, jobs, pipelines, …
• All-purpose clusters and job clusters
• SQL Warehouses (formerly "SQL Endpoints")
• Optimized for BI workloads
• Required for running anything in SQL Workspace
• For exploring data, running queries, alerts, …
Accounts for cloud resources
Databricks Lakehouse Platform – technical foundations
• Apache Spark
• Delta Lake and Delta Live Tables
• MLflow
• Jupyter Notebooks
• Jobs and Pipelines
Apache Spark
• General-purpose, distributed data processing engine
• Efficient and fast
• Spark SQL, Spark Streaming, Spark ML
• APIs in Java, Scala, Python, R
• Widely used today – ubiquitous
• Databricks provides the Photon execution engine on top
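The "distributed data processing" idea can be sketched in plain Python, without Spark: split the data into partitions, process each partition independently, then combine the partial results. This is only a conceptual illustration; the helper `count_words`, the sample lines and the two-partition split are assumptions made up for this sketch, and a thread pool stands in for Spark's executors.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(partition):
    """Count words in one partition (the per-executor 'map' step)."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


# Hypothetical sample data, split into two partitions.
lines = [
    "Spark is fast",
    "Spark is general purpose",
    "Delta Lake builds on Parquet",
]
partitions = [lines[i::2] for i in range(2)]

# Each partition is processed independently, as executors would do.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, partitions))

# The partial results are merged into one answer (the 'reduce' step).
totals = reduce(lambda a, b: a + b, partial_counts, Counter())
```

In Spark the same pattern is expressed declaratively (for example as a group-by and count), and the engine decides how to partition and schedule the work.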
Jupyter notebooks
• Web-based, interactive and collaborative
• Databricks supports Python, SQL, R and Scala
• Can also serve as documentation (can be exported to HTML, PDF, etc.)
• Can be executed as jobs in Databricks and organized in Pipelines
• In Databricks, notebooks are attached to clusters
Delta Lake
• Data storage framework built on top of Parquet
• Provides ACID transactions; upserts (MERGE statements) and deletes
• Schema enforcement
• Time travel
• Scalable metadata handling
• Unifies streaming and batch processing
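To make the features above concrete, here is a toy, in-memory model of a versioned table in plain Python. It is not the Delta Lake API: the class `ToyDeltaTable` and its methods are hypothetical names invented for this sketch, and it stores a full snapshot per version, whereas Delta Lake keeps a transaction log over Parquet files. It only illustrates how schema enforcement, atomic upserts and time travel behave from the user's point of view.

```python
from copy import deepcopy


class ToyDeltaTable:
    """Conceptual stand-in for a versioned table; NOT the Delta Lake API."""

    def __init__(self, schema, key):
        self.schema = schema      # column name -> expected Python type
        self.key = key            # column used to match rows in merge()
        self.versions = [{}]      # version 0 is the empty table

    def _check(self, row):
        # Schema enforcement: reject rows with wrong columns or types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise ValueError(f"column {col!r} expects {typ.__name__}")

    def merge(self, rows):
        # Validate the whole batch first so a bad batch commits nothing.
        for row in rows:
            self._check(row)
        snapshot = deepcopy(self.versions[-1])
        for row in rows:
            # Upsert: update the row if the key exists, otherwise insert it.
            snapshot[row[self.key]] = dict(row)
        self.versions.append(snapshot)
        return len(self.versions) - 1   # the new version number

    def delete(self, key_value):
        snapshot = deepcopy(self.versions[-1])
        snapshot.pop(key_value, None)
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version=None):
        # Time travel: read any committed version; default is the latest.
        snap = self.versions[-1 if version is None else version]
        return sorted(snap.values(), key=lambda r: r[self.key])


table = ToyDeltaTable(schema={"id": int, "city": str}, key="id")
v1 = table.merge([{"id": 1, "city": "Zagreb"}, {"id": 2, "city": "Vienna"}])
v2 = table.merge([{"id": 2, "city": "Ljubljana"}, {"id": 3, "city": "London"}])
```

Reading `table.read(version=v1)` still returns the rows as they were before the second merge, while `table.read()` returns the latest state. A batch containing a row of the wrong type raises an error and leaves the version history unchanged.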
Delta Live Tables
• Framework for building data processing pipelines
• You define transformations and DLT manages:
• Orchestration
• Cluster management
• Monitoring
• Data quality (Expectations)
• Error handling
• Can perform CDC with APPLY CHANGES INTO ... FROM ...
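The data quality bullet can be illustrated with a small plain-Python sketch of the "expectation" idea: each expectation is a named rule, rows that break any rule are dropped, and violations are counted per rule for monitoring. The function `apply_expectations` and the sample `orders` are hypothetical and exist only for this sketch; in DLT itself, expectations are declared on the table definition (for example with the `@dlt.expect_or_drop` decorator in Python) and the framework collects the metrics.

```python
def apply_expectations(rows, expectations):
    """Keep rows that pass every expectation; count failures per expectation.

    Mimics the idea of 'expect or drop'; this is not the DLT API.
    """
    passed = []
    failures = {name: 0 for name in expectations}
    for row in rows:
        ok = True
        for name, predicate in expectations.items():
            if not predicate(row):
                failures[name] += 1
                ok = False
        if ok:
            passed.append(row)
    return passed, failures


# Hypothetical input rows with two quality problems.
orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": None, "amount": 35.5},
    {"order_id": 3, "amount": -10.0},
]

expectations = {
    "valid_order_id": lambda r: r["order_id"] is not None,
    "non_negative_amount": lambda r: r["amount"] >= 0,
}

clean, failures = apply_expectations(orders, expectations)
```

Here `clean` holds only the first order, and `failures` records one violation for each rule.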
MLflow
• Framework for managing machine learning lifecycles
• MLflow Tracking – tracks experiments and runs, parameters, metrics
• MLflow Models – storage format for describing models of different “flavors” (e.g. sklearn, keras, xgboost etc.)
• MLflow Projects – package code in a format to reproduce runs on different platforms
• Model registry – manage models in a central repository
SQL Editor
Jupyter Notebooks
Delta Live Tables
Questions?

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
