An Introduction to Apache Spark
A big data processing tool built with Scala that runs on the JVM
Big Data
The term was first used by NASA researchers Michael Cox and
David Ellsworth in 1997.
Definition: data sets that are so large or complex that
traditional data processing applications are inadequate to
deal with them. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization,
querying, updating, information privacy, and real-time
capabilities. (Wikipedia)
Big Data Processing
The rise of Big Data required faster tools for processing
data.
Can traditional databases handle Big Data? What are the
limitations?
What are the solutions?
What is Spark?
A big data processing framework built around speed,
generality, ease of use, and accessibility.
Properties of Spark
Speed: Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Ease of Use: Write applications quickly in Java, Scala,
Python, and R.
Generality: Combines SQL, streaming, and complex analytics.
Accessibility: Spark runs on Hadoop, Mesos, standalone, or
in the cloud (choosing the cluster manager is sketched after
this list). It can access diverse data sources including
HDFS, Cassandra, HBase, and S3.
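
Pointing the same application at these environments is mostly a matter
of the master URL. Below is a minimal PySpark sketch; the host names
and ports are placeholders, not real endpoints.

from pyspark import SparkConf, SparkContext

# Pick the cluster manager by setting the master URL
conf = SparkConf().setAppName("Demo")
conf.setMaster("local[*]")               # local mode, using all cores
# conf.setMaster("spark://host:7077")    # Spark standalone cluster
# conf.setMaster("mesos://host:5050")    # Apache Mesos
# conf.setMaster("yarn")                 # Hadoop YARN
sc = SparkContext(conf=conf)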
Spark Features
❏ Works directly in memory for speed-up
❏ Supports MapReduce
❏ Lazy evaluation of big data queries for optimization:
transformations are deferred until an action needs a result,
letting Spark plan the whole job before running it (see the
sketch after this list)
❏ Operators and APIs that allow easy interaction through
Scala, R, and Python
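
A minimal PySpark sketch of lazy evaluation (the app name and data are
illustrative): the filter and map calls only record lineage, and
nothing executes until the take action asks for results.

from pyspark import SparkContext

sc = SparkContext(appName="LazyDemo")

rdd = sc.parallelize(range(1000000))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: nothing runs yet
doubled = evens.map(lambda x: x * 2)      # still nothing runs
print(doubled.take(5))                    # action: triggers the pipeline
# [0, 4, 8, 12, 16]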
Spark Nuggets
Spark Streaming - processes real-time data streams on top
of Spark's basic abstraction, Resilient Distributed Datasets
(RDDs).
Spark SQL - allows ad-hoc querying, including through a JDBC
API (see the sketch after this list).
Spark MLlib - provides optimized machine learning libraries
for regression, classification, and clustering tasks.
Spark GraphX - an extension of the RDD API for
graph-parallel computation.
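
To make the Spark SQL idea concrete, here is a small sketch using the
SparkSession entry point from Spark 2.x and later (older releases used
SQLContext); the view name and rows are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

# Build a tiny DataFrame in place and register it as a view
df = spark.createDataFrame([("Ada", 36), ("Bob", 17)], ["name", "age"])
df.createOrReplaceTempView("people")

# Ad-hoc SQL over the registered view
spark.sql("SELECT name FROM people WHERE age >= 18").show()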
RDD
A fundamental data structure in Spark: an immutable,
distributed collection of objects that supports in-memory
processing.
Two ways of creating RDDs (both sketched below):
❏ Parallelize an existing collection in your driver
program
❏ Reference an external dataset in HDFS, HBase, or any
other Hadoop InputFormat
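
A short PySpark sketch of both creation paths; the HDFS path is
illustrative, not a real cluster.

from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

# 1) Parallelize an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

# 2) Reference an external dataset such as a file in HDFS
# lines = sc.textFile("hdfs://namenode:9000/data/input.txt")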
Map/Reduce
A programming model for processing and generating large
data sets with a parallel, distributed algorithm on a
cluster; the word count program below is its canonical
example.
A Simple Word Count Program
Spark (Python):

# sc is the SparkContext (some older examples name it `spark`)
text_file = sc.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")
Hadoop MapReduce (Java), by contrast, needs far more
boilerplate; this is only the start of the driver:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Create a JobConf object and assign a job name for
        // identification purposes
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("WordCount");

        // Set the configuration object with the data types of
        // the output key and value
        conf.setOutputKeyClass(Text.class);
        // ...
    }
}
References
http://spark.apache.org/
http://www.winshuttle.com/big-data-timeline/
https://www.infoq.com/articles/apache-spark-introduction
http://kickstarthadoop.blogspot.tw/2011/04/word-count-hadoop-map-reduce-example.html