A full Machine learning pipeline
in Scikit-learn vs Scala-Spark:
pros and cons
Jose Quesada and David Anderson
@quesada, @alpinegizmo, @datascienceret
Why this talk?
• How do you get from a single-machine workload to a fully distributed one?
• Answer: Spark machine learning
• Is there something I'm missing out on by staying with Python?
About Data Science Retreat (DSR)
• Mentors are world-class: CTOs, library authors, inventors, founders of fast-growing companies, etc.
• DSR accepts fewer than 5% of applications
• Strong focus on commercial awareness
• 5 years of working experience on average
• 30+ partner companies in Europe
DSR participants do a portfolio project
Why is DSR talking about Scala/Spark?
• They are behind Scala
• IBM is behind this
• They hired us to make training materials
Source: Spark 2015 infographic
[Chart: mindshare among ‘data science badasses’ (subjective) over time]
Scala
“Scala offers the easiest refactoring experience that I've ever had due
to the type system.”
Jacob, Coursera engineer
Spark
• Basically distributed Scala
• API
• Scala, Java, Python, and R bindings
• Libraries
• SQL, streams, graph processing, machine learning
• One of the most active open source projects
“Spark will inevitably become the de-facto Big Data framework
for Machine Learning and Data Science.”
Dean Wampler, Lightbend
All under one roof (big win)
Source: Spark 2015 infographic
[Diagram: Spark Core, with Spark SQL, Spark Streaming, Spark.ml (machine learning), and GraphX (graphs) layered on top]
Spark Programming Model
[Diagram: input flows to a Driver (SparkContext), which coordinates the Workers]
Data is partitioned; code is sent to the data
[Diagram: each Worker holds a partition of the data; the Driver / SparkContext ships code to the Workers]
Example: word count
Input lines: "hello world", "foo bar", "foo foo bar", "bye world"
Data is immutable, and is partitioned across the cluster
Example: word count
We get things done by creating new, transformed copies of the data. In parallel.
[Diagram: each input line is split into words, and each word is mapped to a (word, 1) pair: (hello, 1), (world, 1), (foo, 1), (bar, 1), ...]
Example: word count
Some operations require a shuffle to group data together
[Diagram: the (word, 1) pairs are shuffled and reduced by key, giving (hello, 1), (foo, 3), (bar, 2), (bye, 1), (world, 2)]
Example: word count

lines = sc.textFile(input)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)

The chained transformations are pipelined into the same Python executor.
Nothing happens until the last line: saveAsTextFile is an "action" that forces evaluation of the RDD.
RDD – Resilient Distributed Dataset
• An immutable, partitioned collection of elements that can be
operated on in parallel
• Lazy
• Fault-tolerant
PySpark RDD Execution Model
Whenever you provide a lambda to operate on an RDD:
• Each Spark worker forks a Python worker
• Data is serialized and piped to those Python workers
Impact of this execution model
• Worker overhead (forking, serialization)
• The cluster manager isn't aware of Python's memory needs
• Very confusing error messages
Spark DataFrames (and Datasets)
• Based on RDDs, but tabular; something like SQL tables
• Not Pandas
• Rescues Python from serialization overhead
• df.filter(df["color"] == "red") vs. rdd.filter(lambda x: x.color == "red") (Scala sketch below)
• Processed entirely in the JVM
• Python UDFs and maps still require serialization and piping to Python
• Can write (and register) Scala code, and then call it from Python
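To make the DataFrame/RDD contrast concrete, here is a minimal Scala sketch (Spark 2.x API; the SparkSession, file name, and schema are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("df-example").getOrCreate()
val df = spark.read.json("items.json")  // assumed columns: color (string), price (double)

// DataFrame filter: a declarative expression that Spark optimizes and runs in the JVM
val reds = df.filter(col("color") === "red")

// RDD-style filter: an opaque function that Spark cannot inspect or optimize
val redsRdd = df.rdd.filter(row => row.getAs[String]("color") == "red")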
DataFrame execution: unified across languages
[Diagram: Python, Java/Scala, and R DataFrames all feed a single Logical Plan, which drives Execution]
• API wrappers create a logical plan (a DAG)
• Catalyst optimizes the plan; Tungsten compiles the plan into executable code (see the explain() example below)
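A quick way to see this machinery from any binding is explain(), which prints the logical plans Catalyst produced and the physical plan that will be executed (df is the illustrative DataFrame from the sketch above):

// Extended explain: parsed, analyzed, and optimized logical plans, plus the physical plan
df.filter(col("color") === "red").select("price").explain(true)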
DataFrame performance
ML Workflow
Data Ingestion → Data Cleaning / Feature Engineering → Model Training → Testing and Validation → Deployment
Machine learning with scikit-learn
• Easy to use
• Rich ecosystem
• Limited to one machine (but see sparkit-learn package)
Machine learning with Hadoop (in short: NO)
• Each iteration is a new M/R job
• Each job must store data in HDFS – lots of overhead
How Spark killed Hadoop MapReduce
• Far easier to program
• More cost-effective: the same tasks run much faster on less hardware
• Can do real-time processing as well as batch processing
• Can do ML, graphs
Machine learning with Spark
• Spark was designed for ML workloads
• Caching (reuse data)
• Accumulators (keep state across iterations); both sketched after this list
• Functional, lazy, fault-tolerant
• Many popular algorithms are supported out of the box
• Simple to productionize models
• MLlib is the RDD-based API (the past); spark.ml is the DataFrame-based API (the future)
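As a sketch of the two features named above, caching and accumulators, here is a small Scala example; sc is an existing SparkContext, points.txt is an invented input file, and the API shown is Spark 2.x:

// Parse the input once and keep it in memory for reuse across iterations
val points = sc.textFile("points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

// Accumulator: a counter the workers update and the driver reads
val badRecords = sc.longAccumulator("badRecords")
val clean = points.filter { p =>
  val ok = p.length == 2
  if (!ok) badRecords.add(1L)
  ok
}

val kept = clean.count()   // action: triggers the computation (and the accumulator updates)
println(s"kept $kept rows, dropped ${badRecords.value} malformed rows")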
Spark is an ecosystem of ML frameworks
• Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop)
• MLlib
• Spark.ml
• SystemML (IBM)
• KeystoneML
Spark.ml – the basics
• DataFrame: ML requires DataFrames holding feature vectors
• Transformer: transforms one DataFrame into another
• Estimator: fit on a DataFrame; produces a Transformer
• Pipeline: chain of Transformers and Estimators (sketch after this list)
• Parameter: there is a unified API for specifying parameters
• Evaluator: computes a metric on predictions, used for model selection
• CrossValidator: model selection via grid search
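A minimal Scala sketch of how these pieces fit together (the column names, the trainingDF/testDF DataFrames, and the choice of logistic regression are just for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("label_str").setOutputCol("label")
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2"))
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// A Pipeline chains Transformers and Estimators, and is itself an Estimator
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)        // fitting returns a PipelineModel (a Transformer)
val predictions = model.transform(testDF)

This mirrors scikit-learn's Pipeline: fit on the training set, then transform new data.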
Machine Learning scaling challenges that Spark solves
• Hyper-parameter tuning
• ETL/feature engineering
• Model
Q: Hardest scaling problem in data science?
A: Adding people
• Spark.ml has a clean architecture and APIs that should encourage
code sharing and reuse
• Good first step: can you refactor some ETL code as a Transformer?
• Don't see much sharing of components happening yet
• Entire libraries, yes; components, not so much
• Perhaps because Spark has been evolving so quickly
• E.g., a pull request implementing non-linear SVMs has been stuck for a year
Structured types in Spark
                  SQL        DataFrames      DataSets (Java/Scala only)
Syntax errors     Runtime    Compile time    Compile time
Analysis errors   Runtime    Runtime         Compile time
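A small Scala sketch of that last column, compile-time checking with Datasets (the Item case class and the df DataFrame are assumed for illustration; a SparkSession named spark must be in scope for the implicits):

import spark.implicits._

case class Item(color: String, price: Double)

val items = df.as[Item]                                  // DataFrame -> Dataset[Item]
val cheapReds = items.filter(i => i.color == "red" && i.price < 10.0)
// items.filter(i => i.colour == "red")                  // typo: caught at compile time, not at runtime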
User experience: Spark.ml vs. scikit-learn
Indexing categorical features
• You are responsible for identifying and indexing categorical features
val color_indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("color_index")
  .fit(dataset)

val status_indexer = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("status_index")
  .fit(dataset)
Assembling features
• You must gather all of your features into one Vector, using a
VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("color_index", "status_index", ...))
.setOutputCol("features")
Spark.ml – Scikit-learn: Pipelines (good news!)
• Spark ML and scikit-learn: same approach
• Chain together Estimators and Transformers
• Support non-linear pipelines (must be a DAG)
• Unify parameter passing
• Support for cross-validation and grid search (see the sketch after this list)
• Can write your own custom pipeline stages
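A sketch of grid search with CrossValidator, reusing the pipeline and lr estimator from the earlier pipeline sketch (the grid values and fold count are arbitrary):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Parameters are addressed through the unified Param API, here lr.regParam
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                        // the whole pipeline is tuned as one estimator
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)                 // trains one model per fold x grid point on the cluster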
Spark.ml just like scikit-learn
Transformer | Description | scikit-learn equivalent
Binarizer | Threshold numerical feature to binary | Binarizer
Bucketizer | Bucket numerical features into ranges | (none)
ElementwiseProduct | Scale each feature/column separately | (none)
HashingTF | Hash text/data to vector; scale by term frequency | FeatureHasher
IDF | Scale features by inverse document frequency | TfidfTransformer
Normalizer | Scale each row to unit norm | Normalizer
OneHotEncoder | Encode k-category feature as binary features | OneHotEncoder
PolynomialExpansion | Create higher-order features | PolynomialFeatures
RegexTokenizer | Tokenize text using regular expressions | (part of text methods)
StandardScaler | Scale features to 0 mean and/or unit variance | StandardScaler
StringIndexer | Convert string feature to 0-based indices | LabelEncoder
Tokenizer | Tokenize text on whitespace | (part of text methods)
VectorAssembler | Concatenate feature vectors | FeatureUnion
VectorIndexer | Identify categorical features, and index | (none)
Word2Vec | Learn vector representation of words | (none)
Spark.ml – Scikit-learn: NLP tasks (thumbs up)
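For instance, the text transformers in the table above chain into the familiar bag-of-words/TF-IDF recipe (a sketch; docs is an assumed DataFrame with a "text" column):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 18)
val idf = new IDF().setInputCol("tf").setOutputCol("features")   // IDF is an Estimator: it must be fit

val tf = hashingTF.transform(tokenizer.transform(docs))
val tfidf = idf.fit(tf).transform(tf)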
Graph stuff (GraphX, GraphFrames; not great)
• Extremely easy to run monster algorithms in a cluster
• GraphX has no Python API
• GraphFrames are cool, and should provide access to the graph tools in Spark from Python
• In practice, it didn't work too well
Things we liked in Spark ML
• Architecture encourages building reusable pieces
• Type safety, plus types are driving optimizations
• Model fitting returns an object that transforms the data
• Uniform way of passing parameters
• It's interesting to use the same platform for ETL and model fitting
• Very easy to parallelize ETL and grid search, or work with huge models
Disappointments using Spark ML
• Feature indexing and assembly can become tedious
• Surprised by the maximum depth limit for trees: 30
• Data exploration and visualization aren't easy in Scala
• Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
What is new for machine learning in Spark 2.0
• The DataFrame-based API becomes the primary ML API: with Spark 2.0, the spark.ml package, with its "pipeline" APIs, emerges as the primary machine learning API. The original spark.mllib package is preserved, but future development will focus on the DataFrame-based API.
• Machine learning pipeline persistence: users can now save and load machine learning pipelines and models across all programming languages supported by Spark (sketch below).
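A sketch of pipeline persistence (Spark 2.0 API; model is the fitted PipelineModel from the earlier sketch and the path is illustrative):

import org.apache.spark.ml.PipelineModel

model.write.overwrite().save("/models/demo-pipeline")   // persists all stages and their parameters
val reloaded = PipelineModel.load("/models/demo-pipeline")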
What is new for data structures in Spark 2.0
• Unified API for streams and static data: infinite datasets get the same interface as DataFrames (sketch below)
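A minimal Structured Streaming sketch (Spark 2.0 API): the same DataFrame operations applied to an unbounded source; spark is the SparkSession from the earlier sketch, and the socket host/port are illustrative.

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()     // same DataFrame API as for static data

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()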
What have Spark and Scala ever given us?
… Other than distributed dataframes,
distributed machine learning,
easy distributed grid search,
distributed SQL,
distributed stream analysis,
more performance than MapReduce,
an easier programming model,
and easier deployment …
What have Spark and Scala ever given us?
Reminder: 25 videos explaining ML on Spark
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-scala-and-spark
Thank you for your attention!
@quesada, @datascienceret

Editor's Notes

  • #3 Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala.
  • #6 Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
  • #7 You attend a Retreat, not a training
  • #15 A talk should give you a superpower. - Am I missing out?
  • #17 redo the diagram
  • #26 Fault-tolerant: missing partitions can be recomputed by using the lineage graph to rerun operations.
  • #27 When using Python, the SparkContext in Python is basically a proxy. py4j is used to launch a JVM and create a native Spark context, and py4j manages communication between the Python and Java SparkContext objects. In the workers, some operations can be executed directly in the JVM. But, for example, if you've implemented a map function in Python, a Python process is forked to execute this user-supplied mapping. Each thread in the Spark worker will have its own Python sub-process. When the Python wrapper calls the underlying Spark code written in Scala running on a JVM, the translation between two different environments and languages can be a source of bugs and issues.
  • #28 Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
  • #37 Just one Map/Reduce step, but many algorithms are iterative. Disk-based → long startup times. ------- Spark is a wholesale replacement for MapReduce that leverages lessons learned from MapReduce. The Hadoop community realized that a replacement for MR was needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors have followed suit. When it comes to one-pass ETL-like jobs, for example data transformation or data integration, MapReduce is ideal: this is what it was designed for. Advantages for Hadoop: security, staffing.
  • #38 sample use case for accumulators: gradient descent
  • #48 Spark.ml departs from scikit-learn quite a bit
  • #50 Good
  • #51 from https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-apache-spark-1-4.html