Presenter: Andrey Vykhodtsev
Andrey.vykhodtsev@si.ibm.com
*collective work, see slide credits
 Two meetup groups
 Close, but different topics
 Run by me
 I don’t have to be the presenter all the time
 Propose your agenda
 Not a Big Data introduction
 Visit our next Big Data Essentials meetup instead: http://www.meetup.com/Big-Data-Developers-in-Slovenia/events/223871144/
 Not for people without a technical background (sorry)
 Not a thorough use case discussion
 Just a technical overview of the technology for beginners
 A general-purpose distributed computing engine suitable for large-scale machine learning and data processing tasks
NOT SO GOOD
 Not the first computing engine
 MapReduce
 MPI
 Not one of a kind
 Flink
 Not so old (mature)

GOOD
 Developing very fast
 Rapidly growing community
 Backed by major vendors
 Innovation
 Designed for iterative data analysis at large scale (supersedes MapReduce)
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
 A Big Data/DWH developer
 A Data Scientist
 An Analytics Architect
 A CxO of an IT company
[Diagram: Statistician / Business Analyst / Software Engineer]
IT WORDS
 Data processing/Transformation
 Machine Learning
 Social Network Analysis
 Streaming/Microbatching

BUSINESS WORDS
 Segmentation
 Campaign response prediction
 Churn avoidance
 CTR prediction
 Behavioral analysis
 Genomics
 ….
 Open Source SystemML
 Educate One Million Data Professionals
 Establish Spark Technology Center
 Founding Member of AMPLab
 Contributing to the Core
 Port many existing applications onto Spark
 Develop applications using Spark
 Distributed platform for thousands of nodes
 Data storage and computation framework
 Open source
 Runs on commodity hardware
 Flexible – everything is loosely coupled
 Driving principles
 Files are stored across the entire cluster
 Programs are brought to the data, not the data to the program
 Distributed file system (DFS) stores blocks across the whole cluster
 Blocks of a single file are distributed across the cluster
 A given block is typically replicated as well for resiliency
 Just like a regular file system, the contents of a file are up to the application
 Unlike a regular file system, you can ask it “where does each block of my file live?”
[Diagram: a FILE split into BLOCKS distributed across the cluster]
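That block-location query is what makes it possible to bring the program to the data. As a hedged illustration (not from the slides), a minimal Scala sketch using the Hadoop FileSystem API; the path is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask HDFS where each block of a file lives
val fs = FileSystem.get(new Configuration())   // picks up the cluster config
val status = fs.getFileStatus(new Path("/sparkdata/sparkQuotes.txt"))
for (block <- fs.getFileBlockLocations(status, 0, status.getLen)) {
  // offset and length of the block, plus the datanodes holding a replica
  println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
}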
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Content of Input Documents:
  Hello World Bye World
  Hello IBM

Map 1 emits:
  <Hello, 1>
  <World, 1>
  <Bye, 1>
  <World, 1>

Map 2 emits:
  <Hello, 1>
  <IBM, 1>

Reduce (final output):
  <Bye, 1>
  <IBM, 1>
  <Hello, 2>
  <World, 2>
 Spark brings two significant value-adds:
 It brings to MapReduce the same added value that databases (and parallel databases) brought to query processing:
 Let the app developer focus on the WHAT (they need to ask) and let the system figure out the HOW (it should be done)
 Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept)
 Let the system deal with performance (as part of the HOW)
 Leveraging memory (bufferpools; caching RDDs in memory)
 Maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
 Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)
 It is one general-purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
 Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
 Fast
 Leverages aggressively cached in-memory distributed computing, with dedicated Executor processes kept ready even when no jobs are running
 Faster than MapReduce
 General purpose
 Covers a wide range of workloads
 Provides SQL, streaming and complex analytics
 Flexible and easier to use than MapReduce
 Spark is written in Scala, an object-oriented, functional programming language
 Scala, Python and Java APIs
 Scala and Python interactive shells
 Runs on Hadoop, Mesos, standalone or in the cloud
[Chart: Logistic regression in Hadoop and Spark]
[Diagram: Spark Stack]
WordCount:
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
 Spark is versatile and flexible:
 Can run on YARN / HDFS, but also standalone or on Mesos
 The Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
 Normally you code stuff up in one of these languages:
 Scala
 Python
 Java
 I like Python, but in some cases it is slower
 With DataFrames, there is no difference (more later)
 One of the shells
 Scala shell (spark-shell)
 Python shell
 Code it in an editor and submit with spark-submit (see the example after this list)
 Use a “notebook” (Jupyter, Zeppelin)
 My preferred method. More later
 Enable your IDE to run Spark
 PyCharm
 IntelliJ IDEA
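As a hedged illustration of the spark-submit route (the script name is hypothetical; --master and --name are standard flags):

# Submit a Python application to a local 4-core master
spark-submit --master local[4] --name word-count my_script.py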
 Jupyter
 Zeppelin
 Scala
 Incubated
 Many others
 Spark Notebook
 ISpark
 Databricks Cloud
 IBM Spark aaS
 IBM Data Scientist Workbench
 Initialize context
 Read data
 Run stuff
 Transformations
 Actions
 Caching
 More later
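A minimal, hedged Scala sketch of that typical shape (standalone app; the file path and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}

// 1. Initialize the context
val conf = new SparkConf().setAppName("typical-shape").setMaster("local[*]")
val sc = new SparkContext(conf)

// 2. Read data
val lines = sc.textFile("README.md")

// 3. Run stuff: transformations are lazy...
val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
words.cache()                 // ...and caching marks the RDD for reuse

// 4. Actions kick off the computation
println(words.count())

sc.stop()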
GOOD STUFF
 Full API exposed
 Concise language
 Documentation is way better
 Faster if you use plain RDDs
 Build tools and dependency tracking

NOT SO GOOD STUFF
 Not so many additional libraries compared to Python
 Pandas
 Matplotlib
 Harder to run in a “notebook”*
 Harder to learn
*At the moment
 Scala Crash Course, Holden Karau, DataBricks: http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
 Martin Odersky’s “Functional Programming in Scala” course
 Books
 Scala for the Impatient
 Scala by Example
GOOD STUFF
 Clean & clear language
 Easy to learn
 Lots of libraries
 Pandas
 scikit-learn
 matplotlib
 Easy to run in a “notebook”

NOT SO GOOD STUFF
 Slower
 Interpreted language
 Not all API functions exposed
 Streaming
 Sometimes behaves differently
 I think coding in Java for Spark is terrible
 But if you like it messy, there is nobody to stop you :)
 A way to connect to the Spark engine
 Initialized with all runtime parameters
 For example, memory parameters
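A hedged sketch of initializing a context with runtime parameters (the values and app name are illustrative; spark.executor.memory is a standard configuration key):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g")   // illustrative memory setting
val sc = new SparkContext(conf)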
 Resilient Distributed Dataset
 An abstraction over a generic data collection
 Integers
 Strings
 PairRDD: <key, value> pairs (supports additional operations)
 A single logical entity that is, under the hood, a distributed collection
[Diagram: an RDD of names (Mokhtar, Jacques, Dirk, Cindy, Dan, Susan, Dirk, Frank, Jacques) spread across Partition 1, Partition 2 and Partition 3]
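As a hedged illustration of the extra operations a PairRDD offers (the data is made up):

// reduceByKey is available only on RDDs of <key, value> pairs
val sales = sc.parallelize(List(("apples", 3), ("pears", 1), ("apples", 2)))
val totals = sales.reduceByKey(_ + _)   // ("apples", 5), ("pears", 1)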
 You have to pay attention to what kind of operation you are running
 Transformations
 Do not do anything until an action is called
 Actions
 Kick off the computation
 Results can be persisted to memory (cache) or to disk (more later)
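A hedged sketch of that difference (the file name is hypothetical):

val lines = sc.textFile("big.log")               // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: still lazy
val n = errors.count()                           // action: the file is scanned here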
 Three methods for creation
 Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
 Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
 Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x => x + 1)
 Transformations are lazy evaluations
 Each returns a pointer to the transformed RDD
 Pair RDD (K,V) functions for MapReduce-style transformations
 map
 filter
 flatMap
 reduceByKey
 sortByKey
 join
 See the docs for the full list (a chained example follows)
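A few transformations chained together, as a hedged sketch (the input is made up):

val nums = sc.parallelize(1 to 10)
val evensDoubled = nums
  .filter(_ % 2 == 0)   // keep even numbers
  .map(_ * 2)           // double them
// Nothing has executed yet; evensDoubled is just a recipe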
 Kick off the computation
 (Transformations are lazily evaluated)
 collect()
 count()
 take()
 reduce()
 first()
 saveAsTextFile()
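Continuing the sketch above, a few actions that actually run the pipeline:

evensDoubled.count()        // 5
evensDoubled.take(3)        // Array(4, 8, 12)
evensDoubled.reduce(_ + _)  // 4 + 8 + 12 + 16 + 20 = 60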
 Each node stores in memory any partitions of a cached RDD that it computes
 Reuses them in other actions on that dataset (or datasets derived from it)
 Future actions are much faster (often by more than 10x)
 Two methods for RDD persistence: persist() and cache()
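A hedged sketch; the second action is served from memory instead of re-reading the file (the file name is hypothetical):

val logs = sc.textFile("big.log").filter(_.contains("ERROR"))
logs.cache()      // shorthand for persist() at the memory-only level
logs.count()      // first action: reads the file and fills the cache
logs.take(10)     // second action: reuses the cached partitions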
rdd1.join(rdd2)
  .groupBy(…)
  .filter(…)

[Diagram (DataBricks): how such a job is scheduled
 RDD Objects – build the operator DAG
 DAGScheduler – splits the graph into stages of tasks and submits each stage as ready (agnostic to operators)
 TaskScheduler – launches the TaskSet of each stage via the cluster manager and retries failed or straggling tasks (doesn’t know about stages)
 Worker – executes tasks in threads; its Block manager stores and serves blocks]
[Diagram: Spark cluster overview – the Driver Program (SparkContext) talks to a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor for the App, holding a Cache and running Tasks]
 MLlib
 Distributed machine learning libraries
 Spark SQL
 DataFrames
 GraphX
 ML
 SparkR
 Streaming
 Read the Fine Manual
 https://spark.apache.org/docs/latest/index.html
 Take a course
 Big Data University: https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
 edX – edx.org, search for Spark
 If you’re stuck
 Try the user lists: https://spark.apache.org/community.html
 Questions?
 Topic for the next meetup?
 Your experiences?
 Want to be a presenter?
 Some slide and text graphics were borrowed from the following sources
 Vincent Poncet, IBM France
 Jacques Roy, IBM US
 Daniel Kikuchi, IBM US
 Mokhtar Kandil, IBM US
 DataBricks
 Spark Docs
 I completely lost track of which slides I copied from which source. I apologize.