Azure Databricks
for Machine Learning
Mark Tabladillo Ph.D.
Cloud Solution Architect
My Story
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Data Lakes
Operational databases
Hybrid
Data warehouses
Data Lakes
Operational databases
Social | LOB | Graph | IoT | Image | CRM
T H E M O D E R N D A T A E S T A T E
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Operational databases
Hybrid
Data warehouses
Operational databases
SQL Server Azure Data Services
AI built-in | Most secure | Lowest TCO
Industry leader 2 years in a row
#1 TPC-H performance
T-SQL query over any data
70% faster than Aurora
2x the global reach of Redshift
No Limits Analytics with 99.9% SLA
Easiest lift and shift
with no code changes
Social | LOB | Graph | IoT | Image | CRM
T H E M I C R O S O F T O F F E R I N G
Data lakes
Big Data & Advanced Analytics
in Azure
Prep & Train | Model & Serve
Databricks
HDInsight
Data Lake Analytics
Custom
apps
Sensors
and devices
Store
Blobs
Data Lake
Ingest
Data Factory
(Data movement, pipelines & orchestration)
Machine
Learning
Cosmos DB
SQL Data
Warehouse
Analysis Services
Event Hub
IoT Hub
SQL Database
Analytical dashboards
Predictive apps
Operational reports
Intelligence
B I G D A T A & A D V A N C E D A N A L Y T I C S A T A G L A N C E
Business
apps
SQL | Kafka
Azure Databricks
Powered by Apache Spark
Why Spark?
Open-source data processing engine built
around speed, ease of use, and sophisticated
analytics
In-memory engine that is up to 100 times faster
than Hadoop MapReduce
Largest open-source data project with 1000+
contributors
Highly extensible, with support for Scala, Java,
and Python alongside Spark SQL, GraphX, Spark
Streaming, and the Machine Learning Library (MLlib)
What is Azure Databricks?
A fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
REST APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Enhance Productivity
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Build on a secure & trusted cloud | Scale without limits
Azure Databricks
Azure Databricks Runtime for Machine Learning
- Pre-installed packages for machine learning
such as TensorFlow, Keras, Horovod, and XGBoost
- Pre-configured HorovodEstimator for
seamless integration of Horovod with
Spark DataFrames
- Support for GPU-enabled VMs for specialized
compute for your deep learning needs
- Multi-GPU training of deep neural networks
using Horovod
- Unlock complex machine learning and deep
learning scenarios with a few lines of code
Features for Machine Learning
A Z U R E D A T A B R I C K S N O T E B O O K S O V E R V I E W
Notebooks are a popular way to develop and run Spark applications
▪ Notebooks are not only for authoring Spark applications but
can be run/executed directly on clusters
• Run the current cell with Shift+Enter
▪ Notebooks support fine-grained permissions, so they can be
securely shared with colleagues for collaboration (see the
following slide for details on permissions and abilities)
▪ Notebooks are well-suited for prototyping, rapid
development, exploration, discovery, and iterative
development
Notebooks typically consist of code, data, visualizations, comments, and notes
M I X I N G L A N G U A G E S I N N O T E B O O K S
You can mix multiple languages in the same notebook
Normally a notebook is associated with a specific language. However, with Azure Databricks notebooks, you can
mix multiple languages in the same notebook. This is done using the language magic command:
• %python Allows you to execute Python code in a notebook (even if that notebook's default language is not Python).
• %sql Allows you to execute SQL code in a notebook (even if that notebook's default language is not SQL).
• %r Allows you to execute R code in a notebook (even if that notebook's default language is not R).
• %scala Allows you to execute Scala code in a notebook (even if that notebook's default language is not Scala).
• %sh Allows you to execute shell commands in your notebook.
• %fs Allows you to use Databricks Utilities (dbutils) filesystem commands.
• %md Allows you to include rendered Markdown.
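As a sketch of how this looks in practice: each magic command must be the first line of its own notebook cell, and these cells only run inside a Databricks notebook (the temporary view name `nums` below is illustrative):

```
%md
### Build a small table in Python, then query it in SQL

%python
df = spark.range(10)
df.createOrReplaceTempView("nums")

%sql
SELECT COUNT(*) AS n FROM nums

%scala
val n = spark.table("nums").count()
```

The shared Spark session is what makes this work: a temp view registered in one language's cell is visible to the others.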
N O T E B O O K O P E R A T I O N S A N D A C C E S S C O N T R O L
You can create a new notebook from the Workspace or the
folder drop down menu (see previous slides)
From a notebook’s drop down menu you can:
▪ Clone the notebook
▪ Rename or delete the notebook
▪ Move the notebook to another location
▪ Export a notebook to save it and its contents as a
Databricks archive, IPython notebook, HTML, or
source code file
▪ Set permissions for the notebook. As with Workspaces,
you can set five levels of permissions: No Permissions, Can
Manage, Can Read, Can Edit, and Can Run
▪ You can also set permissions from the notebook UI itself by
selecting the menu option
V I S U A L I Z A T I O N
Azure Databricks supports a number of visualization plot types out of the box
▪ All notebooks, regardless of their language,
support Databricks visualizations.
▪ When you run the notebook, the visualizations
are rendered in place inside the notebook
▪ The visualizations are written in HTML.
• You can save the HTML of the entire notebook by
exporting to HTML.
• If you use Matplotlib, the plots are rendered as
images, so you can right-click and download
the image
▪ You can change the plot type just by picking
from the selection
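As a minimal sketch of the Matplotlib path: in an Azure Databricks notebook the figure below would render as an image directly in the cell output (and built-in visualizations come from calling `display()` on a DataFrame and picking a plot type). This standalone version uses the non-interactive Agg backend so it also runs outside a notebook; the category data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; inside a notebook this is handled for you
import matplotlib.pyplot as plt

# Toy data standing in for query results.
categories = ["A", "B", "C"]
counts = [3, 7, 5]

# Build a simple bar chart; in a notebook this renders inline as an image.
fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_xlabel("category")
ax.set_ylabel("count")
fig.savefig("counts.png")  # outside a notebook, save the image to a file instead
```

Because Matplotlib output is a rasterized image in the notebook, it can be right-clicked and downloaded as described above.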
L I B R A R I E S O V E R V I E W
Enables external code to be imported and stored in a Workspace
D A T A B R I C K S F I L E S Y S T E M ( D B F S )
A distributed file system that is a layer over Azure Blob Storage
Azure Blob Storage
Python Scala CLI dbutils
DBFS
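As a sketch of how the front ends above reach the same storage (runnable only inside a Databricks notebook, since `dbutils` and the `%fs` magic are provided by that environment; the paths are illustrative):

```
# Python (dbutils)
dbutils.fs.ls("/")                           # list the DBFS root
dbutils.fs.put("/tmp/hello.txt", "hi")       # write a small file

# %fs magic — equivalent notebook shorthand for the calls above
# %fs ls /

# Spark reads DBFS paths directly, backed by Azure Blob Storage
df = spark.read.text("dbfs:/tmp/hello.txt")
```

The point of the layer is that notebooks, the CLI, and Spark jobs all see one filesystem namespace regardless of which blob containers back it.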
S P A R K M A C H I N E L E A R N I N G ( M L ) O V E R V I E W
▪ Offers a set of parallelized machine learning algorithms (see next
slide)
▪ Supports Model Selection (hyperparameter tuning) using Cross
Validation and Train-Validation Split.
▪ Supports Java, Scala, or Python apps using the DataFrame-based API (as
of Spark 2.0). Benefits include:
• A uniform API across ML algorithms and across multiple languages
• Facilitates ML pipelines (enables combining multiple algorithms into a
single pipeline)
• Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd-party libraries supported include H2O Sparkling Water, scikit-learn,
and XGBoost
Enables Parallel, Distributed ML for large datasets on Spark Clusters
M M L S P A R K
Microsoft Machine Learning Library for Apache Spark (MMLSpark) lets you
easily create scalable machine learning models for large datasets.
It includes integration of SparkML pipelines with the Microsoft Cognitive
Toolkit and OpenCV, enabling you to:
▪ Ingress and pre-process image data
▪ Featurize images and text using pre-trained deep learning models
▪ Train and score classification and regression models using implicit
featurization
S P A R K M L A L G O R I T H M S
Spark ML
Algorithms
D E E P L E A R N I N G
▪ Supports Deep Learning Libraries/frameworks including:
• Microsoft Cognitive Toolkit (CNTK).
o An article explains how to install CNTK on Azure Databricks.
• TensorFlowOnSpark
• BigDL
▪ Offers Spark Deep Learning Pipelines, a suite of tools for working with
and processing images using deep learning, including transfer learning. It
includes high-level APIs for common aspects of deep learning so they
can be done efficiently in a few lines of code:
Azure Databricks supports and integrates with a number of Deep Learning libraries and frameworks to
make it easy to build and deploy Deep Learning applications
Distributed Hyperparameter Tuning
Transfer Learning
S P A R K R O V E R V I E W
An R package that provides a lightweight frontend to use Apache Spark from R
▪ Provides a distributed DataFrame implementation that supports operations like selection,
filtering, and aggregation (similar to R data frames and dplyr)
▪ Supports distributed machine learning using Spark MLlib
▪ R programs can connect to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs
Use Cases
Modern Big Data Warehouse
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Azure Databricks
(Spark)
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Advanced Analytics on Big Data
Web & mobile apps
Azure Databricks
(Spark MLlib,
SparkR, sparklyr)
Azure Cosmos DB
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Real-time analytics on Big Data
Unstructured data
Azure storage
Polybase
Azure SQL Data Warehouse
Azure HDInsight
(Kafka)
Azure Databricks
(Spark)
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Demo
How to get started
Engage Microsoft experts for a workshop to help identify
high impact scenarios
Already using Azure? Try Azure Databricks now, or
create a free Azure account to start using Azure Databricks
Learn more about Azure Databricks www.azure.com/databricks
How to get started
Connect with
Mark
Tabladillo
LinkedIn
Twitter @marktabnet
Abstract
• This presentation focuses on the value proposition for Azure Databricks for Data
Science. First, the talk includes an overview of the merits of Azure Databricks and
Spark. Second, the talk includes demos of data science on Azure Databricks. Finally,
the presentation includes some ideas for data science production.
201905 Azure Databricks for Machine Learning