Contents
Overview
Layers and packages in Spark
Download and Installation
Spark Application Overview
Simple Spark Application
Introduction to Spark
How a Spark Application Runs on a Cluster
Spark Abstraction
Scala introduction
Getting started in Scala
Example programs with Scala and Python
Prediction with Regressions utilizing MLlib
3.
Introduction to Spark
Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of
circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine
learning, graph computation, and stream processing, which can be used together in an application.
Programming languages supported by Spark include Java, Python, Scala, and R. Application developers
and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data
at scale.
Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets,
processing of streaming data from sensors, IoT, or financial systems, and machine learning tasks.
4.
Overview
Spark is a general-purpose engine for large-scale data processing.
Spark SQL for querying structured data via SQL and Hive Query Language (HQL)
Runs on Hadoop
Provides high-level APIs for defining and executing workloads
It has been claimed to be 100 times faster than Hadoop’s MapReduce
It supports the Java, Python, R, and Scala programming languages
5.
Layers and packages in Spark
Spark SQL allows for querying structured data via SQL and Hive Query Language (HQL).
Spark has its own graph computation engine, called GraphX, which allows users to perform computations on
graphs.
Spark Streaming enables you to create analytical and interactive applications for live streaming data.
You can stream the data in, and Spark can then run its operations on the streamed data itself.
MLlib is a machine learning library built on top of Spark that supports many
machine learning algorithms.
The key difference is that it runs almost 100 times faster than MapReduce.
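As a quick illustration of the Spark SQL layer, here is a minimal sketch in Scala; the input file people.json and its fields are illustrative assumptions, not part of the original examples.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlLayerSketch").getOrCreate()
val people = spark.read.json("people.json")          // hypothetical input file
people.createOrReplaceTempView("people")             // expose the DataFrame as a SQL table
spark.sql("SELECT name FROM people WHERE age > 21").show()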
6.
Download and Installation
Download at http://spark.apache.org/downloads.html
Prior to downloading, ensure that the Java JDK and Scala are installed on your machine.
Spark requires Java to run and Scala is used to implement Spark.
Select package type as “Pre-built for Hadoop 2.7 and later” and download the compressed TAR file. Unpack
the tar file after downloading.
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.
It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries)
or Python.
Open the Spark shell by typing “./bin/spark-shell” for the Scala version or “./bin/pyspark” for the Python version.
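For example, a short interactive session in the Scala shell might look like this (README.md is just an illustrative input file):
scala> val lines = sc.textFile("README.md")        // sc is the SparkContext created by the shell
scala> lines.count()                               // number of lines in the file
scala> lines.filter(_.contains("Spark")).count()   // lines mentioning "Spark"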
7.
Download and Installation
Configure the environment according to your needs/preferences, for example through SparkConf in your
application or through options passed to the Spark shell.
Initialize a new SparkContext using your preferred language (i.e. Python, Java, Scala, R). SparkContext sets up
services and connects to an execution environment for Spark applications.
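As a minimal sketch in Scala, assuming a local run (the application name and master URL are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyFirstApp")       // name shown in the Spark UI
  .setMaster("local[*]")          // run locally on all available cores
val sc = new SparkContext(conf)   // connects to the execution environment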
8.
Spark Application Overview
Each Spark application is a self-contained computation that runs user-supplied code to compute a result, much
like a MapReduce application. However, Spark has many advantages over MapReduce.
In MapReduce, the highest-level unit of computation is a job, while in Spark, the highest-level unit of
computation is an application.
A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived
server continually satisfying requests.
Multiple tasks can run within the same executor; this, combined with in-memory data storage, enables extremely
fast task startup and results in performance that is orders of magnitude better than MapReduce.
Spark application execution involves runtime concepts such as driver, executor, task, job, and stage.
9.
Simple Spark Application
A Spark application can be developed in any of the following supported languages:
1) Java 2) Python 3) Scala
Spark provides primarily two abstractions in its applications:
RDD (Resilient Distributed Dataset) --- (Dataset in newer versions)
Two types of Shared Variables in parallel Operations
Broadcast variables: Can be stored in the cache of each system
Accumulators : Can help with aggregation functions such as addition
10.
Simple Spark Applications Continued
Accumulator: Stores a variable that can have additive and cumulative functions performed on it. Safer than
using a “global” declaration.
Parallelize: Distributes iterable data, such as a list, into a distributed dataset that is sent to the nodes of the
cluster.
Broadcast: Sends a variable to the memory of the other nodes in the cluster, using an efficient pre-built
algorithm that helps reduce communication costs. Should only be used if the same data is needed on all nodes.
Foreach (reducer): Runs a function on each element in the dataset; the output is best written to an accumulator
object for safe updates. (A minimal sketch combining these follows below.)
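A minimal Scala sketch tying these four concepts together (the numbers and names are illustrative assumptions):
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))    // distribute a local list across the cluster
val factor = sc.broadcast(10)                    // read-only value cached on every node
val total = sc.longAccumulator("total")          // safely aggregates updates from tasks
data.foreach(x => total.add(x * factor.value))   // run a function on each element
println(total.value)                             // read the accumulated result on the driver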
11.
How a Spark Application Runs on a Cluster
A Spark application runs as independent processes,
coordinated by the SparkSession object in the driver
program.
The resource or cluster manager assigns tasks to
workers, one task per partition.
A task applies its unit of work to the dataset in its
partition and outputs a new partition dataset. Because
iterative algorithms apply operations repeatedly to data,
they benefit from caching datasets across iterations.
Results are sent back to the driver application or can be
saved to disk.
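For instance, a minimal sketch of caching a dataset that an iterative computation reuses (the path and the per-iteration work are illustrative):
val points = sc.textFile("hdfs://...")
  .map(_.split(",").map(_.toDouble))
  .cache()                                           // keep parsed partitions in memory across iterations
for (i <- 1 to 10) {
  val positives = points.filter(_.sum > 0).count()   // each iteration reuses the cached data
  println(s"iteration $i: $positives")
}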
12.
Spark Abstraction
Resilient Distributed Dataset (RDD)
It is the fundamental abstraction in Apache Spark and its basic data structure. An RDD in Apache
Spark is an immutable collection of objects that is computed on the different nodes of the cluster.
Resilient: fault-tolerant, so it is able to recompute missing or damaged partitions due to node
failures.
Distributed: data resides on multiple nodes.
Dataset: represents the records of the data you work with. The user can load the dataset externally
from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
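A minimal Scala sketch of loading data into an RDD (the file names are illustrative; sc and spark are the contexts provided by the Spark shell):
val fromText = sc.textFile("data/input.txt")             // RDD[String], one element per line
val fromList = sc.parallelize(Seq("a", "b", "c"))        // RDD from an in-memory collection
val fromJson = spark.read.json("data/input.json").rdd    // JSON loaded via a DataFrame, then viewed as an RDD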
13.
contd.
DataFrame
We can think of a DataFrame as a Dataset organized into columns. DataFrames are similar to a table in a relational
database or a data frame in R/Python.
Spark Streaming
It is an extension of Spark’s core that allows real-time stream processing from several sources. Streaming and
batch sources work together to offer a unified, continuous DataFrame abstraction that can be used for interactive
and batch queries. It offers scalable, high-throughput, and fault-tolerant processing.
GraphX
It is one more example of a specialized data abstraction. It enables developers to analyze social networks and
other graphs alongside Excel-like two-dimensional data.
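A minimal DataFrame sketch in Scala, assuming the spark session provided by the shell (the CSV file and its columns are illustrative assumptions):
val df = spark.read
  .option("header", "true")        // first line holds column names
  .option("inferSchema", "true")   // infer column types
  .csv("people.csv")
df.printSchema()
df.select("name").where("age > 21").show()   // column-oriented operations, like a relational table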
14.
What is Scala?
Scala is a general purpose language that can be used to develop solutions for any software problem.
Scala combines object-oriented and functional programming in one concise, high-level language.
Completely compatible with Java and consequently runs on the JVM.
Scala offers a toolset to write scalable concurrent applications in a simple way with more confidence in their
correctness.
Scala is an excellent base for parallel, distributed, and concurrent computing, which is widely thought to be a
very big challenge in software development; its unique combination of features helps meet this challenge.
15.
Why Scala?
Scala apps are less costly to maintain and easier to evolve, because Scala is a functional and
object-oriented programming language that underpins Lightbend’s reactive platform and helps developers write
code that is more concise than other options.
Scala is used outside of its killer-app domain as well, of course, and certainly for a while there was a hype
about the language that meant that even if the problem at hand could easily be solved in Java, Scala would
still be the preference, as the language was seen as a future replacement for Java.
It reduces the amount of code developers have to write.
16.
Getting Started in Scala:
• scala
  – Runs compiled Scala code
  – Or, without arguments, as an interpreter!
• scalac - compiles
• fsc - compiles faster! (uses a background server to minimize startup time)
• Go to scala-lang.org for downloads/documentation
• Read Scala: A Scalable Language
  (see http://www.artima.com/scalazine/articles/scalable-language.html)
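As a minimal sketch, a hello-world program you might compile with scalac (or fsc) and run with scala; the file and object names are illustrative:
// HelloScala.scala
object HelloScala {
  def main(args: Array[String]): Unit = {
    println("Hello, Scala!")
  }
}
// Compile:  scalac HelloScala.scala     (or: fsc HelloScala.scala)
// Run:      scala HelloScala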
17.
Example programs with Scala and Python
WordCount: In this example we use a few transformations to build a dataset of (String,
Int) pairs called counts and save it to a file.
Python:
text_file = sc.textFile("hdfs://...")  # Read data from file
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")  # Save output to a file
Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
18.
Prediction with Regressions utilizing MLlib
Python:
# Read data into a DataFrame
df = sqlContext.createDataFrame(data, ["label", "features"])
lr = LogisticRegression(maxIter=10)
# Fit the model to the data
model = lr.fit(df)
# Predict the results using the model
model.transform(df).show()
Scala:
// Read data into a DataFrame
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val lr = new LogisticRegression().setMaxIter(10)
// Fit the model to the data
val model = lr.fit(df)
val weights = model.weights
// Predict the results using the model
model.transform(df).show()