Azure Databricks
for Machine Learning
Mark Tabladillo Ph.D.
Cloud Solution Architect
My Story
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Data Lakes
Operational databases
Hybrid
Data warehouses
Data Lakes
Operational databases
Social | LOB | Graph | IoT | Image | CRM
T H E M O D E R N D A T A E S T A T E
Security and performance | Flexibility of choice | Reason over any data, anywhere
Data warehouses
Operational databases
Hybrid
Data warehouses
Operational databases
SQL Server Azure Data Services
AI built-in | Most secure | Lowest TCO
Industry leader 2 years in a row
#1 TPC-H performance
T-SQL query over any data
70% faster than Aurora
2x the global reach of Redshift
No Limits Analytics with 99.9% SLA
Easiest lift and shift
with no code changes
Social | LOB | Graph | IoT | Image | CRM
T H E M I C R O S O F T O F F E R I N G
Data lakes
Big Data & Advanced Analytics
in Azure
Prep & Train | Model & Serve
Databricks
HDInsight
Data Lake Analytics
Custom
apps
Sensors
and devices
Store
Blobs
Data Lake
Ingest
Data Factory
(Data movement, pipelines & orchestration)
Machine
Learning
Cosmos DB
SQL Data
Warehouse
Analysis Services
Event Hub
IoT Hub
SQL Database
Analytical dashboards
Predictive apps
Operational reports
Intelligence
B I G D A T A & A D V A N C E D A N A L Y T I C S A T A G L A N C E
Business
apps
SQL | Kafka
Azure Databricks
Powered by Apache Spark
Why Spark?
Open-source data processing engine built
around speed, ease of use, and sophisticated
analytics
In-memory engine that is up to 100 times faster
than Hadoop MapReduce
Largest open-source data project with 1000+
contributors
Highly extensible, with support for Scala, Java,
and Python alongside Spark SQL, GraphX, Spark
Streaming, and the Machine Learning Library (MLlib)
What is Azure Databricks?
A fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
REST APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Enhance Productivity
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Build on a secure & trusted cloud | Scale without limits
Azure Databricks
Azure Databricks Runtime for Machine Learning
- Pre-installed packages for machine learning
such as TensorFlow, Keras, Horovod, and XGBoost
- Pre-configured HorovodEstimator for
seamless integration of Horovod with
Spark DataFrames
- Support for GPU-enabled VMs for specialized
compute for your deep learning needs
- Multi-GPU training of deep neural networks
using Horovod
- Unlock complex machine learning and deep
learning scenarios with a few lines of code
Features for Machine Learning
A Z U R E D A T A B R I C K S N O T E B O O K S O V E R V I E W
Notebooks are a popular way to develop and run Spark applications
▪ Notebooks are not only for authoring Spark applications but
can be run/executed directly on clusters
• Run the current cell with Shift+Enter
▪ Notebooks support fine-grained permissions, so they can be
securely shared with colleagues for collaboration (see the
following slide for details on permissions and abilities)
▪ Notebooks are well-suited for prototyping, rapid
development, exploration, discovery, and iterative
development
Notebooks typically consist of code, data, visualizations, comments, and notes
M I X I N G L A N G U A G E S I N N O T E B O O K S
You can mix multiple languages in the same notebook
Normally a notebook is associated with a specific language. However, with Azure Databricks notebooks, you can
mix multiple languages in the same notebook. This is done using the language magic command:
• %python Allows you to execute Python code in a notebook (even if that notebook's default language is not Python).
• %sql Allows you to execute SQL code in a notebook (even if that notebook's default language is not SQL).
• %r Allows you to execute R code in a notebook (even if that notebook's default language is not R).
• %scala Allows you to execute Scala code in a notebook (even if that notebook's default language is not Scala).
• %sh Allows you to execute shell commands in your notebook.
• %fs Allows you to use Databricks Utilities (dbutils) filesystem commands.
• %md Allows you to include rendered Markdown.
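As a sketch of how this looks in practice: each magic command must be the first line of its own notebook cell, and these cells only run inside a Databricks notebook (the temporary view name `nums` below is illustrative):

```
%md
### Build a small table in Python, then query it in SQL

%python
df = spark.range(10)
df.createOrReplaceTempView("nums")

%sql
SELECT COUNT(*) AS n FROM nums

%scala
val n = spark.table("nums").count()
```

The shared Spark session is what makes this work: a temp view registered in one language's cell is visible to the others.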
N O T E B O O K O P E R A T I O N S A N D A C C E S S C O N T R O L
You can create a new notebook from the Workspace or the
folder drop down menu (see previous slides)
From a notebook’s drop down menu you can:
▪ Clone the notebook
▪ Rename or delete the notebook
▪ Move the notebook to another location
▪ Export a notebook to save it and its contents as a
Databricks archive, IPython notebook, HTML, or
source code file
▪ Set permissions for the notebook. As with Workspaces,
you can set five levels of permissions: No Permissions, Can
Manage, Can Read, Can Edit, and Can Run
▪ You can also set permissions from the notebook UI itself by
selecting the menu option
V I S U A L I Z A T I O N
Azure Databricks supports a number of visualization plot types out of the box
▪ All notebooks, regardless of their language,
support Databricks visualizations.
▪ When you run the notebook, the visualizations
are rendered in place inside the notebook
▪ The visualizations are written in HTML.
• You can save the HTML of the entire notebook by
exporting to HTML.
• If you use Matplotlib, the plots are rendered as
images, so you can right-click and download
the image
▪ You can change the plot type just by picking
from the selection
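As a minimal sketch of the Matplotlib path: in an Azure Databricks notebook the figure below would render as an image directly in the cell output (and built-in visualizations come from calling `display()` on a DataFrame and picking a plot type). This standalone version uses the non-interactive Agg backend so it also runs outside a notebook; the category data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; inside a notebook this is handled for you
import matplotlib.pyplot as plt

# Toy data standing in for query results.
categories = ["A", "B", "C"]
counts = [3, 7, 5]

# Build a simple bar chart; in a notebook this renders inline as an image.
fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_xlabel("category")
ax.set_ylabel("count")
fig.savefig("counts.png")  # outside a notebook, save the image to a file instead
```

Because Matplotlib output is a rasterized image in the notebook, it can be right-clicked and downloaded as described above.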
L I B R A R I E S O V E R V I E W
Enables external code to be imported and stored in a Workspace
D A T A B R I C K S F I L E S Y S T E M ( D B F S )
A distributed file system that is a layer over Azure Blob Storage
Azure Blob Storage
Python Scala CLI dbutils
DBFS
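As a sketch of how the front ends above reach the same storage (runnable only inside a Databricks notebook, since `dbutils` and the `%fs` magic are provided by that environment; the paths are illustrative):

```
# Python (dbutils)
dbutils.fs.ls("/")                           # list the DBFS root
dbutils.fs.put("/tmp/hello.txt", "hi")       # write a small file

# %fs magic — equivalent notebook shorthand for the calls above
# %fs ls /

# Spark reads DBFS paths directly, backed by Azure Blob Storage
df = spark.read.text("dbfs:/tmp/hello.txt")
```

The point of the layer is that notebooks, the CLI, and Spark jobs all see one filesystem namespace regardless of which blob containers back it.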
S P A R K M A C H I N E L E A R N I N G ( M L ) O V E R V I E W
▪ Offers a set of parallelized machine learning algorithms (see next
slide)
▪ Supports Model Selection (hyperparameter tuning) using Cross
Validation and Train-Validation Split.
▪ Supports Java, Scala, or Python apps using the DataFrame-based API (as
of Spark 2.0). Benefits include:
• A uniform API across ML algorithms and across multiple languages
• Facilitates ML pipelines (enables combining multiple algorithms into a
single pipeline)
• Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd-party libraries supported include H2O Sparkling Water, scikit-learn,
and XGBoost
Enables Parallel, Distributed ML for large datasets on Spark Clusters
M M L S P A R K
Microsoft Machine Learning Library for Apache Spark (MMLSpark) lets you
easily create scalable machine learning models for large datasets.
It includes integration of SparkML pipelines with the Microsoft Cognitive
Toolkit and OpenCV, enabling you to:
▪ Ingress and pre-process image data
▪ Featurize images and text using pre-trained deep learning models
▪ Train and score classification and regression models using implicit
featurization
S P A R K M L A L G O R I T H M S
Spark ML
Algorithms
D E E P L E A R N I N G
▪ Supports Deep Learning Libraries/frameworks including:
• Microsoft Cognitive Toolkit (CNTK).
o An article explains how to install CNTK on Azure Databricks.
• TensorFlowOnSpark
• BigDL
▪ Offers Spark Deep Learning Pipelines, a suite of tools for working with
and processing images using deep learning, including transfer learning. It
includes high-level APIs for common aspects of deep learning so they
can be done efficiently in a few lines of code:
Azure Databricks supports and integrates with a number of Deep Learning libraries and frameworks to
make it easy to build and deploy Deep Learning applications
Distributed Hyperparameter Tuning
Transfer Learning
S P A R K R O V E R V I E W
An R package that provides a lightweight frontend to use Apache Spark from R
▪ Provides a distributed DataFrame implementation that supports operations like selection,
filtering, and aggregation (similar to R data frames and dplyr)
▪ Supports distributed machine learning using Spark MLlib
▪ R programs can connect to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs
Use Cases
Modern Big Data Warehouse
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Azure Databricks
(Spark)
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Advanced Analytics on Big Data
Web & mobile apps
Azure Databricks
(Spark MLlib,
SparkR, sparklyr)
Azure Cosmos DB
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Real-time analytics on Big Data
Unstructured data
Azure storage
Polybase
Azure SQL Data Warehouse
Azure HDInsight
(Kafka)
Azure Databricks
(Spark)
Analytical dashboards
Ingest | Store | Prep & Train | Model & Serve | Intelligence
Demo
How to get started
Engage Microsoft experts for a workshop to help identify
high impact scenarios
Already using Azure? Try Azure Databricks now, or
create a free Azure account to start using Azure Databricks
Learn more about Azure Databricks www.azure.com/databricks
How to get started
Connect with
Mark
Tabladillo
LinkedIn
Twitter @marktabnet
Abstract
• This presentation focuses on the value proposition for Azure Databricks for Data
Science. First, the talk includes an overview of the merits of Azure Databricks and
Spark. Second, the talk includes demos of data science on Azure Databricks. Finally,
the presentation includes some ideas for data science production.
201905 Azure Databricks for Machine Learning