An Introduction to Apache Spark
A big data processing tool built with Scala that runs on the JVM
Big Data
The term was first used by NASA researchers Michael Cox and
David Ellsworth in 1997.
Definition: data sets that are so large or complex that
traditional data processing applications are inadequate to
deal with them. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization,
querying, updating, information privacy, and real-time
capabilities. (Wikipedia)
Big Data Processing
The rise of Big Data required faster tools for processing
data.
Can traditional databases handle Big Data? What are the
limitations?
What are the solutions?
What is Spark?
A big data processing framework built around speed,
generality, ease of use, and accessibility.
Properties of Spark
Speed: Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Ease of Use: Write applications quickly in Java, Scala,
Python, and R.
Generality: Combines SQL, streaming, and complex analytics.
Accessibility: Spark runs on Hadoop, Mesos, standalone, or
in the cloud (choosing the cluster manager is sketched after
this list). It can access diverse data sources including
HDFS, Cassandra, HBase, and S3.
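
Pointing the same application at these environments is mostly a matter
of the master URL. Below is a minimal PySpark sketch; the host names
and ports are placeholders, not real endpoints.

from pyspark import SparkConf, SparkContext

# Pick the cluster manager by setting the master URL
conf = SparkConf().setAppName("Demo")
conf.setMaster("local[*]")               # local mode, using all cores
# conf.setMaster("spark://host:7077")    # Spark standalone cluster
# conf.setMaster("mesos://host:5050")    # Apache Mesos
# conf.setMaster("yarn")                 # Hadoop YARN
sc = SparkContext(conf=conf)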
Spark Features
❏ Works directly in memory for speed-up
❏ Supports MapReduce
❏ Lazy evaluation of big data queries for optimization:
transformations are deferred until an action needs a result,
letting Spark plan the whole job before running it (see the
sketch after this list)
❏ Operators and APIs that allow easy interaction through
Scala, R, and Python
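
A minimal PySpark sketch of lazy evaluation (the app name and data are
illustrative): the filter and map calls only record lineage, and
nothing executes until the take action asks for results.

from pyspark import SparkContext

sc = SparkContext(appName="LazyDemo")

rdd = sc.parallelize(range(1000000))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: nothing runs yet
doubled = evens.map(lambda x: x * 2)      # still nothing runs
print(doubled.take(5))                    # action: triggers the pipeline
# [0, 4, 8, 12, 16]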
Spark Nuggets
Spark Streaming - processes real-time data streams on top
of Spark's basic abstraction, Resilient Distributed Datasets
(RDDs).
Spark SQL - allows ad-hoc querying, including through a JDBC
API (see the sketch after this list).
Spark MLlib - provides optimized machine learning libraries
for regression, classification, and clustering tasks.
Spark GraphX - an extension of the RDD API for
graph-parallel computation.
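
To make the Spark SQL idea concrete, here is a small sketch using the
SparkSession entry point from Spark 2.x and later (older releases used
SQLContext); the view name and rows are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

# Build a tiny DataFrame in place and register it as a view
df = spark.createDataFrame([("Ada", 36), ("Bob", 17)], ["name", "age"])
df.createOrReplaceTempView("people")

# Ad-hoc SQL over the registered view
spark.sql("SELECT name FROM people WHERE age >= 18").show()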
RDD
A fundamental data structure in Spark: an immutable,
distributed collection of objects that supports in-memory
processing.
Two ways of creating RDDs (both sketched below):
❏ Parallelize an existing collection in your driver
program
❏ Reference an external dataset in HDFS, HBase, or any
other Hadoop InputFormat
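
A short PySpark sketch of both creation paths; the HDFS path is
illustrative, not a real cluster.

from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

# 1) Parallelize an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

# 2) Reference an external dataset such as a file in HDFS
# lines = sc.textFile("hdfs://namenode:9000/data/input.txt")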
Map/Reduce
A programming model for processing and generating large
data sets with a parallel, distributed algorithm on a
cluster; the word count program below is its canonical
example.
A Simple Word Count Program
Spark (Python):

# sc is the SparkContext (some older examples name it `spark`)
text_file = sc.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")
Hadoop MapReduce (Java), by contrast, needs far more
boilerplate; this is only the start of the driver:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Create a JobConf object and assign a job name for
        // identification purposes
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("WordCount");

        // Set the configuration object with the data types of
        // the output key and value
        conf.setOutputKeyClass(Text.class);
        // ...
    }
}
References
http://spark.apache.org/
http://www.winshuttle.com/big-data-timeline/
https://www.infoq.com/articles/apache-spark-introduction
http://kickstarthadoop.blogspot.tw/2011/04/word-count-hadoop-map-reduce-example.html