Combining Machine Learning Frameworks with Apache Spark

Tim Hunter
Hadoop Summit
June 2016

About me
Apache Spark contributor (since Spark 0.6)
Software Engineer @ Databricks
Ph.D. in Machine Learning @ UC Berkeley

About Databricks
Founded by the team that created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment

Apache Spark
The most active open-source project in big data

Large-scale machine learning on Apache Spark
Spark MLlib

MLlib’s Mission
MLlib’s mission is to make practical machine
learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows

Algorithm Coverage
• Classification
– Logistic regression
– Naive Bayes
– Streaming logistic regression
– Linear SVMs
– Decision trees
– Random forests
– Gradient-boosted trees
– Multilayer perceptron
• Regression
– Ordinary least squares
– Ridge regression
– Lasso
– Isotonic regression
– Decision trees
– Random forests
– Gradient-boosted trees
– Streaming linear methods
– Generalized Linear Models
• Frequent itemsets
– FP-growth
– PrefixSpan
• Clustering
– Gaussian mixture models
– K-Means
– Streaming K-Means
– Latent Dirichlet Allocation
– Power Iteration Clustering
– Bisecting K-Means
• Statistics
– Pearson correlation
– Spearman correlation
– Online summarization
– Chi-squared test
– Kernel density estimation
– Kolmogorov–Smirnov test
– Online hypothesis testing
– Survival analysis
• Linear algebra
– Local dense & sparse vectors & matrices
– Normal equation for least squares
– Distributed matrices: block-partitioned matrix, row matrix, indexed row matrix, coordinate matrix
– Matrix decompositions
• Recommendation
– Alternating Least Squares
• Feature extraction & selection
– Word2Vec
– Chi-Squared selection
– Hashing term frequency
– Inverse document frequency
– Normalizer
– Standard scaler
– Tokenizer
– One-Hot Encoder
– StringIndexer
– VectorIndexer
– VectorAssembler
– Binarizer
– Bucketizer
– ElementwiseProduct
– PolynomialExpansion
– Quantile discretizer
– SQL transformer
• Model import/export
• Pipelines
List based on Spark 2.0

Outline
• ML workflows are complex
• Distributing single-machine ML frameworks
• Embedding with Spark
• Unified cross-language ML pipelines with MLlib

ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters
• Usually, each step of a pipeline is easiest in one particular framework

ML Workflows are Complex
[Figure: workflow diagram with boxes Datasource 1–3, Extract features (two boxes), Feature transform 1–3, Train model 1–2, Ensemble, and Evaluate]

Existing tools
• Scikit-learn
– Excellent documentation
– Standard for Python
• R
– Lots of packages available
• Pandas
– Very easy to use
• A lot of investment in tooling and education
– How to integrate big data with these tools?

Common misconceptions
• Spark is for big data only
• Spark can only work with dedicated, distributed
libraries

Spark as a scheduler
• A lot of tasks in ML are “embarrassingly parallel”
• Use Spark for data management and for scheduling
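As an illustration of the "embarrassingly parallel" shape, here is a minimal sketch in plain Python. It is not Spark code: a thread pool stands in for the cluster scheduler, and the function names (partition, partial_squared_error, mean_squared_error) are hypothetical. Each partition is processed by an independent task, and the partial results are combined at the end, which is the same map-then-reduce pattern Spark schedules across machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_partitions):
    """Split rows into num_partitions round-robin chunks."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_squared_error(chunk):
    """Independent task: (sum of squared errors, row count) for one chunk."""
    total = sum((label - prediction) ** 2 for label, prediction in chunk)
    return total, len(chunk)


def mean_squared_error(rows, num_partitions=4):
    """Schedule one task per partition, then combine the partial results."""
    chunks = partition(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_squared_error, chunks))
    total = sum(t for t, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Toy (label, prediction) pairs; every prediction is off by exactly 1.
rows = [(float(i), float(i) + 1.0) for i in range(10)]
mse = mean_squared_error(rows)
```

No task needs the result of another task, so the work can be spread over as many workers as there are partitions.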

One example: learning digits
• Learning task: given a set of images, recognize the digits
• Standard benchmark dataset in computer vision, built by NIST

Training Deep Learning algorithms
• Training a neural network is hard:
– It is a sequential procedure (images are presented one after the other to learn from)
– It can be sensitive to noise and to the order of the images: robustness analysis is critical
– Tuning the training parameters (descent rate, batch sizes, etc.) is very important. Otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.
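To make the sensitivity to the descent rate concrete, here is a small worked example in plain Python. It is only a sketch under simplifying assumptions: it minimizes the one-dimensional function f(x) = x**2 rather than training a neural network, and the function name gradient_descent is hypothetical. Because x**2 has a single minimum, the example shows the "too slow" and "diverges" cases, not the local-minimum case.

```python
def gradient_descent(rate, steps=50, start=5.0):
    """Minimize f(x) = x**2 with a fixed step size and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x          # derivative of x**2
        x = x - rate * gradient     # sequential update: each step needs the last
    return x


too_small = gradient_descent(rate=0.01)   # still far from the minimum at 0
reasonable = gradient_descent(rate=0.4)   # ends very close to 0
too_large = gradient_descent(rate=1.1)    # every step overshoots; x grows
```

Each update multiplies x by (1 - 2 * rate), so the rate alone decides whether the iterate shrinks slowly, shrinks fast, or grows without bound.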

TensorFlow as a training library
• Many algorithms and libraries have been proposed for this task; we choose TensorFlow, from Google:
– Popular choice for neural network training and deep learning
– Competitive performance
– Easy to experiment with
– Python interface makes it easy to integrate with Spark

Distributing TensorFlow computations
• Even if TF is used as a single-machine library, we get speedups from Spark
[Figure: Distributed Cross Validation. Model #1, Model #2, Model #3, ... Training → Best Model]
Distributing TensorFlow computations
[Figure: Distributed Cross Validation. Model #1 through Model #6, ... Training → Best Model]
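The pattern in these diagrams, one independent single-machine training job per parameter combination followed by picking the best model, can be sketched as follows. This is plain Python with a thread pool standing in for Spark, and train_and_score is a hypothetical placeholder that returns a made-up validation error instead of training a TensorFlow model. As an assumption not taken from the slides, in PySpark the same map could be written along the lines of sc.parallelize(grid).map(train_and_score).collect(), where sc is a SparkContext.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical single-machine training job.

    A real job would train a model here; this toy version returns a
    made-up validation error that is smallest at (0.1, 64).
    """
    rate, layer_size = params
    error = abs(rate - 0.1) + abs(layer_size - 64) / 100.0
    return params, error


def grid_search(rates, layer_sizes, max_workers=4):
    """Train one model per parameter combination and keep the best."""
    grid = list(product(rates, layer_sizes))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scored = list(pool.map(train_and_score, grid))
    best_params, best_error = min(scored, key=lambda item: item[1])
    return best_params, best_error, scored


best_params, best_error, scored = grid_search(
    rates=[0.01, 0.1, 1.0], layer_sizes=[32, 64, 128]
)
```

Each training job stays sequential internally; only the search over parameter combinations is distributed.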

Results
• Running a 2-layer neural network, and testing for different update rates and different layer sizes
[Figure: bar chart comparing 1 node, 2 nodes, and 13 nodes; axis ticks from 0 to 12000]

Embedding deep learning in Spark
• The best-known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark helps iterate quickly and find a good set of parameters

Managing ML workflows with Spark

A data scientist’s wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings

Example: sentiment analysis
Given a review (text), predict the user’s rating.
Data from https://snap.stanford.edu/data/web-Amazon.html

ML Workflow
Load data → Extract features → Train model → Evaluate
Review: This product doesn't seem to be made to last… Rating: 2
feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0
Regression: (review: String) => Double

Load Data
Data sources for DataFrames, built-in and external: JSON, JDBC, LIBSVM, and more …
Load data → Extract features → Train model → Evaluate

Extract Features
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Prediction: 3.0
Load data → Tokenizer → Hashed term frequency → Train model → Evaluate

Extract Features
words: [this, product, doesn't, seem, to, …]
feature_vector: [0.1 -1.3 0.23 … -0.74]
Prediction: 3.0
Load data → Tokenizer → Hashed term frequency → Linear regression → Evaluate
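The two feature-extraction steps, tokenizing and hashed term frequency, can be sketched in plain Python as below. This is an illustration of the hashing trick, not MLlib's implementation: the function names tokenize and hashing_tf, the whitespace tokenization, the choice of zlib.crc32, and the vector size of 16 are all assumptions made for the example. Spark's own HashingTF uses a different hash function, so bucket indices will not match.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Map words to a fixed-length vector of term counts.

    Each word is hashed to a bucket index; colliding words share a bucket.
    zlib.crc32 is used because it is stable across runs, unlike hash().
    """
    vector = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = tokenize("This product doesn't seem to be made to last")
features = hashing_tf(words)
```

The vector length is fixed in advance, so no vocabulary has to be built or shared between machines before features can be computed.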

Our ML workflow
[Figure: Cross Validation wraps Feature Extraction → Model Training; regularization parameter: {0.0, 0.1, ...}]

Cross validation
[Figure: Cross Validation. Feature Extraction → Model #1, Model #2, Model #3, ... Training → Best Model]
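Choosing a regularization parameter by cross validation can be sketched in plain Python as follows. The sketch makes several assumptions that are not from the slides: a one-dimensional ridge regression without intercept in place of the full pipeline, three folds, noise-free toy data, and hypothetical function names (fit_ridge, mean_squared_error, cross_validate).

```python
def fit_ridge(points, reg_param):
    """Closed-form 1-D ridge regression without intercept: y ~ w * x."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + reg_param)


def mean_squared_error(weight, points):
    """Average squared prediction error of y ~ weight * x on points."""
    return sum((y - weight * x) ** 2 for x, y in points) / len(points)


def cross_validate(points, reg_params, num_folds=3):
    """Return (best_reg_param, {reg_param: average validation error})."""
    folds = [points[i::num_folds] for i in range(num_folds)]
    scores = {}
    for reg_param in reg_params:
        errors = []
        for held_out in range(num_folds):
            train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
            weight = fit_ridge(train, reg_param)
            errors.append(mean_squared_error(weight, folds[held_out]))
        scores[reg_param] = sum(errors) / num_folds
    best = min(scores, key=scores.get)
    return best, scores


# Noise-free toy data on the line y = 2x, so no regularization is needed.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
best_reg_param, scores = cross_validate(data, reg_params=[0.0, 0.1, 1.0])
```

Every (parameter, fold) pair is an independent fit, which is why the training boxes in the diagram can run side by side.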

MLlib in production
ML Persistence

• Only distribute as needed
• Easily switch between local & distributed settings

DataFrame-based API for MLlib
a.k.a. “Pipelines” API, with utilities for constructing ML Pipelines
In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib

Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model

Why ML persistence?
Prototype (Python/R) → Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Re-implement Pipeline for production (Java) → Deploy Pipeline
• Extra implementation work
• Different code paths
• Synchronization overhead

With ML persistence...
Prototype (Python/R) → Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java):
Model.load("s3n://…")
Deploy in production

ML persistence status
[Figure: example Pipeline with stages Text preprocessing, Feature generation, Generalized Linear Regression, Model tuning]
[Table: rows Model and Pipeline; columns Unfitted (“recipe”) and Fitted (“result”). One cell is annotated “Supported in MLlib’s RDD-based API”.]
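The distinction between an unfitted "recipe" and a fitted "result" can be sketched with a toy pipeline in plain Python. All class names here (Tokenizer, VocabularyCounter, VocabularyModel, Pipeline, PipelineModel) are hypothetical stand-ins written for this example; they borrow the fit/transform naming but are not MLlib's classes.

```python
class Tokenizer:
    """A stage with nothing to learn: fitting returns the stage itself."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [text.lower().split() for text in rows]


class VocabularyCounter:
    """Unfitted stage ("recipe"): learns a vocabulary when fitted."""

    def fit(self, rows):
        vocabulary = sorted({word for words in rows for word in words})
        return VocabularyModel(vocabulary)


class VocabularyModel:
    """Fitted stage ("result"): holds the learned vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[float(words.count(w)) for w in self.vocabulary] for words in rows]


class Pipeline:
    """Unfitted pipeline: an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)   # feed transformed rows to the next stage
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


recipe = Pipeline([Tokenizer(), VocabularyCounter()])
result = recipe.fit(["good product", "bad product"])
vectors = result.transform(["good good product"])
```

The recipe carries only configuration, while the result also carries learned state (here, the vocabulary), so the two need to be persisted differently.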

ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
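The idea of splitting a saved model into metadata and model data can be sketched in plain Python as below. This is not MLlib's on-disk layout: the file names, the metadata fields, and the function names save_model and load_model are hypothetical, and JSON stands in for Parquet so that the sketch needs no extra dependencies.

```python
import json
import os
import tempfile


def save_model(path, class_name, params, coefficients):
    """Write metadata and model data to separate files under path."""
    os.makedirs(path, exist_ok=True)
    metadata = {"class": class_name, "params": params}
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump(metadata, f)
    # Assumption: JSON stands in for Parquet to keep the sketch dependency-free.
    with open(os.path.join(path, "data.json"), "w") as f:
        json.dump({"coefficients": coefficients}, f)


def load_model(path):
    """Read back what save_model wrote; any language with JSON could do this."""
    with open(os.path.join(path, "metadata.json")) as f:
        metadata = json.load(f)
    with open(os.path.join(path, "data.json")) as f:
        data = json.load(f)
    return metadata, data["coefficients"]


model_dir = os.path.join(tempfile.mkdtemp(), "linear_regression")
save_model(model_dir, "LinearRegressionModel", {"regParam": 0.1}, [0.1, -1.3, 0.23])
metadata, coefficients = load_model(model_dir)
```

Because the format is language-neutral, a model saved from one language binding can be loaded from another, which is the point of an exchangeable data format.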

• Directly apply learned pipelines
• Use MLlib as export format
• Built-in Spark conversions
• Easy to distribute the most common ML tasks

What’s next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles
GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)

Get started
• Get involved via roadmap JIRA (SPARK-15581) + mailing lists
• Download notebook for this talk
http://dbricks.co/1UfvAH9
• ML persistence blog post
http://databricks.com/blog/2016/05/31
Try out the Apache Spark 2.0 preview release:
http://databricks.com/try

Thank you!
spark.apache.org
spark-packages.org
databricks.com
