Alex Zeltov
Solutions Engineer
@azeltov
http://tiny.cc/sparkmeetup
Intro to Big Data Analytics using
Apache Spark & Zeppelin
In this workshop
• Introduction to HDP and Spark
• Spark Programming: Scala, Python, R
- Core Spark: working with RDDs
- Spark SQL: structured data access
- Spark MLlib: predictive analytics
- Spark Streaming: real-time data processing
• Conclusion and Further Reading, Q/A
Apache Spark Background
What is Spark?
• Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)
• Data Processing Engine - focused on in-memory distributed computing use cases
• API - Scala, Python, Java, and R
Spark Ecosystem
Spark Core, with Spark SQL, Spark Streaming, MLlib, and GraphX layered on top
Why Spark?
• Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)
• In-memory computation model – fast!
– Effective for iterative computations and ML
• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
Generality
• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
Runs Everywhere:
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and WASB.
Emerging Spark Patterns
• Spark as a query federation engine
– Bring data from multiple sources to join/query in Spark
• Use multiple Spark libraries together
– Common to see Core, MLlib & SQL used together
• Use Spark with various Hadoop ecosystem projects
– Spark & Hive together
– Spark & HBase together
– Spark & Solr, etc.
More Data Source APIs
What is Hadoop?
Apache Hadoop is an open-source software framework written in
Java for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS); a processing part (MapReduce); and the YARN ResourceManager for allocating resources and scheduling applications.
Access patterns enabled by YARN
YARN: Data Operating System
[Diagram: YARN (Data Operating System) running batch, interactive, and real-time applications over HDFS (Hadoop Distributed File System)]
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine-execution time
Why Spark on YARN?
• Utilize existing HDP cluster infrastructure
• Resource management
– Share Spark workloads with other workloads like Pig, Hive, etc.
• Scheduling and queues
[Diagram: the Spark Driver runs in the client; the Spark Application Master runs in a YARN container and requests YARN containers for Spark Executors, each of which runs multiple tasks]
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing data locality: not just storage but computation
[Diagram: a logical file divided into blocks 1–4; each block stored as 3 replicas distributed across the cluster]
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.4
[Diagram: the HDP 2.4 stack]
• YARN: Data Operating System (cluster resource management), on HDFS (Hadoop Distributed File System)
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, and ISV engines
• Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Deployment choice: Linux, Windows, on-premises, cloud
Hortonworks Commitment to Spark
Hortonworks is focused on making Apache Spark enterprise-ready so you can depend on it for mission-critical applications.
[Diagram: YARN (Data Operating System) hosting Spark (in-memory) alongside the other engines: Pig (script) and Hive/HCatalog (SQL) on Tez, Solr (search), HBase/Accumulo (NoSQL), Storm (stream), and other ISVs, all under shared security, governance & integration, and operations]
1. YARN enables Spark to co-exist with other engines
Spark is “YARN Ready,” so its memory- and CPU-intensive apps can work with predictable performance alongside other engines, all on the same set(s) of data.
2. Extend Spark with enterprise capabilities
Ensure Spark can be managed, secured, and governed via a single set of frameworks to ensure consistency. Ensure reliability and quality of service of Spark alongside other engines.
3. Actively collaborate within the open community
As with everything we do at Hortonworks, we work entirely within the open community across Spark and all related projects to improve this key Hadoop technology.
Interacting with Spark
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
• Zeppelin: A web-based notebook that enables interactive data
analytics.
• Jupyter: Evolved from the IPython Project
• SparkNotebook: forked from the scala-notebook
• RStudio: for SparkR; Zeppelin support coming soon
https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backend
• Multi-purpose notebook: the place for all your needs
– Data Ingestion
– Data Discovery
– Data Analytics
– Data Visualization
– Collaboration
Zeppelin – Multiple language backend
Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.
Zeppelin – Dependency Management
• Load libraries recursively from Maven repository
• Load libraries from local filesystem
%dep
// add a maven repository
z.addRepo("RepoName").url("RepoURL")
// add artifact from filesystem
z.load("/path/to.jar")
// add artifact from maven repository, excluding its transitive dependencies
z.load("groupId:artifactId:version").excludeAll()
Community Plugins
• 100+ connectors
http://spark-packages.org/
Apache Spark Basics
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs and
performing calculations upon them
• Persistence
• Long-term storage of an RDD’s state
RDD - Resilient Distributed Dataset
• Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel
• Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– The user can control the number of partitions
– More partitions => more parallelism
• Resilient
– Recovers from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs
• May be persisted in memory for efficient reuse across parallel operations (caching)
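A minimal sketch of partition control from the Spark shell; the collection and partition counts here are illustrative:
// distribute a local collection across 5 partitions
val rdd = sc.parallelize(1 to 25, 5)
println(rdd.partitions.length) // 5
// repartition() changes the parallelism of downstream operations
val wider = rdd.repartition(10)
println(wider.partitions.length) // 10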
RDD – Resilient Distributed Dataset
[Diagram: an RDD of 25 items split into 5 partitions, spread across 3 workers, each running a Spark executor; more partitions = more parallelism]
The programmer specifies the number of partitions for an RDD (a default value is used if unspecified).
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• RDDs can be persisted (cached) in memory or on disk
Example RDD Transformations
map(func)
filter(func)
distinct()
• All create a new dataset from an existing one
• The new dataset is not created until an action is performed (lazy)
• Each element in an RDD is passed to the target function and the result forms a new RDD
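A short Scala sketch of these transformations (values are illustrative); nothing is computed until the action on the last line:
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 5))
val squares = nums.map(n => n * n) // 1, 4, 4, 9, 16, 25 (lazy)
val evens = squares.filter(_ % 2 == 0) // 4, 4, 16 (lazy)
val unique = evens.distinct() // 4, 16 (lazy)
println(unique.count()) // action: triggers the computation, prints 2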
Example Action Operations
count()
reduce(func)
collect()
take()
• Either:
• Returns a value to the driver program
• Exports state to external system
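An illustrative sketch of these actions in the Scala shell:
val oneToTen = sc.parallelize(1 to 10)
println(oneToTen.count()) // 10
println(oneToTen.reduce(_ + _)) // 55
println(oneToTen.collect().mkString(", ")) // all 10 elements at the driver
println(oneToTen.take(3).mkString(", ")) // 1, 2, 3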
Example Persistence Operations
persist() -- takes options
cache() -- only one option: in-memory
• Stores RDD values:
– in memory (what doesn't fit is recalculated when necessary); replication is an option for in-memory
– on disk
– blended (memory and disk)
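A sketch of the persistence calls; the HDFS path is hypothetical, and note that an RDD can only be assigned one storage level:
import org.apache.spark.storage.StorageLevel
val logs = sc.textFile("hdfs:///tmp/mylog.csv")
logs.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)
// alternatives, chosen instead of (not after) cache():
// logs.persist(StorageLevel.MEMORY_AND_DISK) // blended: spill what doesn't fit
// logs.persist(StorageLevel.MEMORY_ONLY_2) // in memory, replicated 2x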
Spark Applications
Are a definition in code of
• RDD creation
• Actions
• Persistence
Results in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each Stage is executed as a series of Tasks
• Each Task operates in parallel on assigned partitions
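One way to inspect the DAG Spark builds is toDebugString, which prints an RDD's lineage; a small sketch with a hypothetical path:
val words = sc.textFile("hdfs:///tmp/mylog.csv").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // prints the lineage: the stages and parent RDDs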
Spark Context
What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
– are automatically created and exposed as the variables 'sc', 'sqlContext', and 'z', respectively, in both the Scala and Python environments in Zeppelin
• IPython and standalone programs must use a constructor to create a new SparkContext
Note: the Scala and Python environments share the same SparkContext, SQLContext, and ZeppelinContext instances.
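For reference, a minimal sketch of constructing a SparkContext in a standalone Scala program; the app name and master URL are illustrative:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)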
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions as can be seen
in the code:
flatMap( line => line.split(" ") )
[Annotation: "line" is the argument to the function; "line.split(" ")" is the body of the function]
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") ) // argument type inferred
flatMap((line:String) => line.split(" ")) // argument type explicit
flatMap(_.split(" ")) // no argument declared; the placeholder _ is used instead
The body of the last form uses the placeholder _, which allows exactly one use of one argument for each _ present; _ essentially means 'whatever you pass me'.
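A small sketch (hypothetical path) confirming the three styles are equivalent:
val lines = sc.textFile("hdfs:///tmp/mylog.csv")
val words1 = lines.flatMap( line => line.split(" ") )
val words2 = lines.flatMap((line: String) => line.split(" "))
val words3 = lines.flatMap(_.split(" "))
println(words1.count() == words2.count() && words2.count() == words3.count()) // true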
And Finally – the Formal ‘def’
def myFunc(line: String): Array[String] = { // "line: String" is the argument; "Array[String]" is the return type
  return line.split(",") // body of the function
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
Lab Spark RDD – Philly Crime Dataset
Spark SQL
Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files)
• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
• Same execution engine for all three
• Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API
DataFrames
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• Distributed collection of data organized into named columns
• Underneath is an RDD
• The Catalyst optimizer is used under the hood
DataFrames
[Diagram: Spark SQL builds a DataFrame (with an RDD underneath) of rows and named columns from sources such as Hive, Avro, CSV, and text]
Created from various sources:
• DataFrames from Hive:
– reading and writing Hive tables, including ORC
• DataFrames from files:
– built-in: JSON, JDBC, ORC, Parquet, HDFS
– external plug-in: CSV, HBase, Avro
• DataFrames from existing RDDs
– with the toDF() function
Data is described as a DataFrame with rows, columns, and a schema.
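A minimal sketch of the toDF() route; the case class and values are illustrative, and in the shell sqlContext.implicits._ must be imported first:
case class Person(name: String, age: Int)
import sqlContext.implicits._
val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 45)))
val df = people.toDF()
df.printSchema() // name: string, age: integer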
Working with a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select ("First Name").show()
df.select("First Name","Age").show()
df.filter(df("age")>40).show()
df.groupBy("age").count().show()
Spark SQL Examples
SQL Context and Hive Context
SQLContext
• Entry point into all functionality in Spark SQL
• All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext (use when your data resides in Hive)
• Superset of the functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
val hc = new HiveContext(sc)
DataFrame Example
Reading data from a table:
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 8|
| IAD| TPA| 19|
| IND| BWI| 8|
| IND| BWI| -4|
| IND| BWI| 34|
+------+----+--------+
DataFrame Example
df.select("Origin", "Dest", "DepDelay”).filter($"DepDelay" > 15).show(5)
Using DataFrame API to Filter Data (show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
SQL Example
Using SQL to query and filter data (again, delays of more than 15 min):
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
| IAD| TPA| 19|
| IND| BWI| 34|
| IND| JAX| 25|
| IND| LAS| 67|
| IND| MCO| 94|
+------+----+--------+
RDD vs. DataFrame
RDDs vs. DataFrames
RDD:
• Lower-level API (more control)
• Lots of existing code & users
• Compile-time type-safety
DataFrame:
• Higher-level API (faster development)
• Faster sorting, hashing, and serialization
• More opportunities for automatic optimization
• Lower memory pressure
Data Frames are Intuitive
Find the average age by department, given:
dept | name     | age
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61
[Screenshots: an RDD example and the equivalent DataFrame example; a reconstruction sketch follows below]
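The original screenshots did not survive the export; what follows is a reconstruction sketch of what they show, using the table above (not the slides' original code):
// RDD version: average age by department
val people = sc.parallelize(Seq(
  ("Bio", "H Smith", 48), ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)))
val avgByDept = people
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByDept.collect().foreach(println) // (Bio,45.5), (CS,54.0), (Phys,61.0)

// DataFrame version: the same query, far more direct
import sqlContext.implicits._
val df = people.toDF("dept", "name", "age")
df.groupBy("dept").avg("age").show()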
Spark SQL Optimizations
• Spark SQL uses an underlying optimization engine (Catalyst)
– Catalyst can perform intelligent optimization since it understands the schema
• Spark SQL does not materialize all the columns (as an RDD does), only what's needed
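A quick way to see this in Zeppelin, reusing the earlier flightsTbl; output is abbreviated and version-dependent:
// the physical plan shows that only the Origin column is read
sqlContext.table("flightsTbl").select("Origin").explain()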
Lab DataFrames – Federated Spark SQL
Hortonworks Community Connection
community.hortonworks.com
HCC DS, Analytics, and Spark Related Questions Sample
Thank you!
community.hortonworks.com