Combining Machine Learning Frameworks with Apache Spark
This document discusses combining machine learning frameworks with Apache Spark. It provides an overview of Apache Spark and MLlib, describes how to distribute TensorFlow computations using Spark, and discusses managing machine learning workflows with Spark through features like cross-validation, persistence, and distributed data sources. The goal is to make machine learning easy to use, scalable, and easy to integrate into existing workflows.
About me
Apache Spark contributor (since Spark 0.6)
Software Engineer @ Databricks
Ph.D. in Machine Learning @ UC Berkeley
About Databricks
Founded by the team who created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
MLlib’s Mission
MLlib's mission is to make practical machine learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows
Algorithm Coverage
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
• Generalized Linear Models
Frequent itemsets
• FP-growth
• PrefixSpan
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
• Bisecting K-Means
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
• Kolmogorov–Smirnov test
• Online hypothesis testing
• Survival analysis
Linear algebra
• Local dense & sparse vectors & matrices
• Normal equation for least squares
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-Hot Encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
• Quantile discretizer
• SQL transformer
Model import/export
Pipelines
List based on Spark 2.0
Outline
• ML workflows are complex
• Distributing single-machine ML frameworks: embedding with Spark
• Unified cross-language ML pipelines with MLlib
ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters
• Usually, each step of a pipeline is easier with one framework
ML Workflows are Complex
[Diagram: Datasource 1–3 → Extract features (×2) → Feature transforms 1–3 → Train model 1 & 2 → Ensemble → Evaluate]
Existing tools
• Scikit-learn
– Excellent documentation
– Standard for Python
• R
– Lots of packages available
• Pandas
– Very easy to use
• A lot of investment in tooling and education
– How to integrate big data with these tools?
Spark as a scheduler
• A lot of tasks in ML are "embarrassingly parallel"
• Use Spark for data management and for scheduling
One example: learning digits
• Learning task: given a set of images, recognize the digits
• Standard benchmark dataset in computer vision built by NIST
Training Deep Learning algorithms
• Training a neural network is hard:
• It is a sequential procedure (present one image after the other to learn from)
• It can be sensitive to noise and to the order of images: robustness analysis is critical
• Tuning the training parameters (descent rate, batch sizes, etc.) is very important; otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.
TensorFlow as a training library
• Many algorithms have been presented for this task; we choose TensorFlow, from Google:
• Popular choice for neural network training and deep learning
• Competitive performance
• Easy to experiment with
• Python interface makes it easy to integrate with Spark
Distributing TensorFlow computations
• Even if TF is used as a single-machine library, we get speedups from Spark (see the sketch below)
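A minimal sketch of that pattern in PySpark: the parameter grid values and the train_and_evaluate helper are illustrative stand-ins for the single-machine TensorFlow training code, not part of MLlib or TensorFlow.

import itertools
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hyperparameter grid: (learning rate, hidden-layer size); values are illustrative.
grid = list(itertools.product([0.001, 0.01, 0.1], [64, 128, 256]))

def train_and_evaluate(params):
    learning_rate, hidden_size = params
    # In the real workflow, this is where one single-machine TensorFlow
    # training + validation run would execute on an executor; a dummy
    # score stands in here so the sketch stays self-contained.
    score = 1.0 / (1.0 + learning_rate * hidden_size)
    return score, params

# One Spark task per parameter setting; keep the setting with the best score.
best_score, (best_rate, best_size) = (sc.parallelize(grid, len(grid))
                                        .map(train_and_evaluate)
                                        .max(key=lambda x: x[0]))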
Distributed Cross Validation
[Diagram: Model #1, Model #2, Model #3, ... trained in parallel; the results are compared to select the best model]
• Running a 2-layer neural network and testing different update rates and different layer sizes
[Bar chart: 1 node vs. 2 nodes vs. 13 nodes; vertical axis 0–12,000]
Embedding deep learning in Spark
• The best-known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark can help iterate fast and find a good set of parameters
A data scientist's wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
ML Workflow
[Diagram: Load data → Extract features → Train model → Evaluate]
Review: "This product doesn't seem to be made to last…"  Rating: 2
feature_vector: [0.1 -1.3 0.23 … -0.74]  rating: 2.0
Regression: (review: String) => Double
Load Data
Data sources for DataFrames, built-in and external: JSON, JDBC, LIBSVM, and more…
[Workflow: Load data → Extract features → Train model → Evaluate]
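For example, with a SparkSession (Spark 2.0), all of these sources are read through the same DataFrame API; the application name, paths, URL, and table name below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-data").getOrCreate()

# Built-in JSON source (placeholder path).
reviews = spark.read.json("/data/reviews.json")

# External database over JDBC (placeholder URL and table; requires the JDBC driver).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost/shop")
          .option("dbtable", "orders")
          .load())

# LIBSVM files, a common format for labeled feature vectors.
vectors = spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")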
Extract Features
Review: "This product doesn't seem to be made to last…"  Rating: 2
Tokenizer → words: [this, product, doesn't, seem, to, …]
Hashed Term Frequency → feature_vector: [0.1 -1.3 0.23 … -0.74]
Linear regression → Prediction: 3.0
[Workflow: Load data → Extract features → Train model → Evaluate]
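Expressed as code, this feature-extraction-plus-model workflow is a small pyspark.ml Pipeline. A DataFrame named reviews with a string column "review" and a double column "rating" is assumed; the column names are illustrative.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

# Review text -> words -> hashed term-frequency vector -> linear regression.
tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="rating")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(reviews)           # fits the whole workflow at once
predictions = model.transform(reviews)  # adds a "prediction" column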
Our ML Workflow
[Diagram: Feature Extraction → Model Training, wrapped in Cross Validation over the regularization parameter {0.0, 0.1, ...}]
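A sketch of that cross-validation step, reusing the pipeline and lr objects from the previous snippet. The grid values follow the slide; the evaluator choice and the fold count are assumptions.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Try several values of the regularization parameter (extend the list as needed).
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.1])
              .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=RegressionEvaluator(labelCol="rating"),
                    numFolds=3)

# Trains one model per parameter value and fold, then keeps the best setting.
cv_model = cv.fit(reviews)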
A data scientist's wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings
DataFrame-based API for MLlib
a.k.a. the "Pipelines" API, with utilities for constructing ML Pipelines
In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted on by the community
• org.apache.spark.ml, pyspark.ml
The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
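In Python, the split is visible in the import paths; for example, logistic regression exists in both packages:

# DataFrame-based ("Pipelines") API: the primary API from Spark 2.0 on.
from pyspark.ml.classification import LogisticRegression

# RDD-based API: maintenance mode (bug fixes only, no new features).
from pyspark.mllib.classification import LogisticRegressionWithLBFGS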
Why ML persistence?
[Diagram: Data Science (Python/R) prototypes and creates a model; Software Engineering re-implements the model for production (Java) and deploys it]
Why ML persistence?
[Diagram: Data Science (Python/R) prototypes and creates a Pipeline (extract raw features, transform features, select key features, fit multiple models, combine results to make a prediction); Software Engineering re-implements the Pipeline for production (Java) and deploys it]
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
[Diagram: Data Science (Python/R) prototypes, creates a Pipeline, and persists the model or Pipeline: model.save("s3n://..."); Software Engineering loads the Pipeline in Scala/Java: Model.load("s3n://…") and deploys it in production]
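In PySpark, saving and loading look like the following; the S3 path is a placeholder, and cv_model refers to the fitted cross-validation model from the earlier snippet:

from pyspark.ml import PipelineModel

# Data-science side: persist the best fitted pipeline
# (metadata is stored as JSON, model data as Parquet).
cv_model.bestModel.save("s3n://bucket/reviews-model")

# Production side (the same call exists in Scala/Java):
# load the pipeline and apply it directly to new data.
loaded = PipelineModel.load("s3n://bucket/reviews-model")
scored = loaded.transform(reviews)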
ML persistence status
[Diagram: an example workflow (text preprocessing, feature generation, generalized linear regression) shown as unfitted ("recipe") and fitted ("result") Models and Pipelines, together with model tuning; a label marks the subset supported in MLlib's RDD-based API]
ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
A data scientist's wish list:
• Run original code on a production environment
– Directly apply learned pipelines
– Use MLlib as export format
• Use distributed data sources
– Built-in Spark conversions
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
– Easy to distribute the most common ML tasks
What’s next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles
GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)
Get started
• Get involved via the roadmap JIRA (SPARK-15581) + mailing lists
• Download notebook for this talk
http://dbricks.co/1UfvAH9
• ML persistence blog post
http://databricks.com/blog/2016/05/31
Try out the Apache Spark
2.0 preview release:
http://databricks.com/try
#13 Spark is a very simple tool to accelerate your computations, even if you do not have big data
Spark integrates well with existing libraries
- easy to use as a simple scheduler
- easy to write small bindings for the critical parts of libraries
#22 I am demonstrating from Databricks
Mounted S3 buckets
Parquet data files
#23 - Be able to extract some smaller amount of data from the production storage system
Slowly start distributing the system: piece by piece
Easily switch between local and distributed
Keep familiar APIs or even the same tools
Show how the distribution happens
#29 Model training / tuning
Regularization: parameter that controls how the linear model does on unseen data
There is no single good value for the regularization parameter.
One common method to find one is to try out different values.
This technique is called cross-validation (CV): you split your training data into two sets, one used to learn the model parameters with a given regularization value, and another used to evaluate how well we are doing with that value.
#33 - Be able to extract some smaller amount of data from the production storage system
Slowly start distributing the system: piece by piece
Easily switch between local and distributed
Keep familiar APIs or even the same tools
Show how the distribution happens