Loading + Saving Data
1. So far we have
a. either converted in-memory data (collections) into RDDs
b. or read files from HDFS
2. Spark supports a wide variety of data sources
3. Data can be accessed through InputFormat & OutputFormat
a. The interfaces used by Hadoop
b. Available for many common file formats and storage systems (e.g., S3, HDFS, Cassandra, HBase, etc.)
Loading + Saving Data
Common Data Sources
File formats
● Text, JSON, SequenceFiles, Protocol buffers
● We can also configure compression
Stores
Filesystems
● Local, NFS, HDFS, Amazon S3
Databases and key/value stores
● Cassandra, HBase, Elasticsearch, and JDBC databases
Loading + Saving Data
Structured data sources through Spark SQL (aka DataFrames)
+ Efficient API for structured data sources, including JSON and Apache Hive
+ Covered later
Loading + Saving Data
Common supported file formats
+ A file could be in any format
+ If we know the format upfront, we can read and load it directly
+ Otherwise, we can inspect it with a tool like the “file” command
Loading + Saving Data
Common supported file formats
+ Very common. Plain old text files. Printable chars
+ Records are assumed to be one per line.
+ Unstructured Data
Text files
Loading + Saving Data
Common supported file formats
+ JavaScript Object Notation
+ Common text-based format
+ Semi-structured; most libraries require one record per line
JSON files
Example:
{
"name" : "John",
"age" : 31,
"knows" : ["C", "C++"]
}
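One way to load such JSON-lines data into an RDD is to read it as text and parse each line with a JSON library. A minimal sketch, assuming jackson-databind is on the classpath (it ships with Spark); the input path and the "name" field are illustrative:

import com.fasterxml.jackson.databind.ObjectMapper

// Each line is assumed to hold one complete JSON object (JSON-lines format)
val jsonLines = sc.textFile("/data/spark/people.json")  // illustrative path
val names = jsonLines.mapPartitions { lines =>
  val mapper = new ObjectMapper()  // build one parser per partition
  lines.map(line => mapper.readTree(line).get("name").asText())
}
names.take(5)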
Loading + Saving Data
+ Very common text-based format
+ Often used with spreadsheet applications.
+ Comma-Separated Values
Common supported file formats
CSV files
Loading + Saving Data
+ Compact Hadoop file format used for key/value data.
+ Keys and values can be binary data
+ Useful for bundling together many small files
Common supported file formats
Sequence files
See More at https://wiki.apache.org/hadoop/SequenceFile
Loading + Saving Data
+ A fast, space-efficient multilanguage format.
+ More compact than JSON.
Common supported file formats
Protocol buffers
See More at https://developers.google.com/protocol-buffers/
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
Loading + Saving Data
+ For data from one Spark job to be consumed by another
+ Uses Java serialization, so saved files break if you change your classes
Common supported file formats
Object Files
Loading + Saving Data
Handling Text Files - Scala
Loading Files
var input = sc.textFile("/data/ml-100k/u1.test")
Loading Directories
var input = sc.wholeTextFiles("/data/ml-100k");
var lengths = input.mapValues(x => x.length);
lengths.collect();
[(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …]
Saving Files
lengths.saveAsTextFile(outputDir)
Loading + Saving Data
1. Records are stored one per line
2. Fixed number of fields per line
3. Fields are separated by a comma (or by a tab in TSV)
4. We can use the row number to detect the header, etc.
Comma / Tab -Separated Values (CSV / TSV)
Loading + Saving Data
Data: /data/spark/temps.csv
Loading CSV - Sample Data
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, SEATLE, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-5-05
Loading + Saving Data
Loading CSV - Simple Approach
var lines = sc.textFile("/data/spark/temps.csv");
var recordsRDD = lines.map(line => line.split(","));
recordsRDD.take(10);
Array(
Array(20, " NYC", " 2014-01-01"),
Array(20, " NYC", " 2015-01-01"),
Array(21, " NYC", " 2014-01-02"),
Array(23, " BLR", " 2012-01-01"),
Array(25, " SEATLE", " 2016-01-01"),
Array(21, " CHICAGO", " 2013-01-05"),
Array(24, " NYC", " 2016-5-05")
)
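Note that line.split(",") is only safe when fields never contain commas themselves; quoted fields get broken apart, which is why the CSVParser-based examples on the next slides are preferred. A small illustration with a hypothetical record:

import au.com.bytecode.opencsv.CSVParser

val line = "20,\"NYC, USA\",2014-01-01"   // second field contains a comma inside quotes
line.split(",").length                    // Int = 4, the quoted field is split into two pieces
new CSVParser(',').parseLine(line).length // Int = 3, opencsv respects the quotes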
Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var a = sc.textFile("/data/spark/temps.csv");
var p = a.map(
line => {
val parser = new CSVParser(',')
parser.parseLine(line)
})
p.take(1)
//Array(Array(20, " NYC", " 2014-01-01"))
spark-shell --packages net.sf.opencsv:opencsv:2.3
Or
Add this to sbt: libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
Loading CSV - Example
https://gist.github.com/girisandeep/b721cf93981c338665c328441d419253
Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = {
val parser = new CSVParser(',')
for(line <- itr)
yield parser.parseLine(line)
}
//Check with simple example
val x = parseCSV(Array("1,2,3","a,b,c").iterator)
val result = linesRdd.mapPartitions(parseCSV)
result.take(1)
//Array[Array[String]] = Array(Array(20, " NYC", " 2014-01-01"))
Loading CSV - Efficient Example
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
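Going the other way, an RDD of parsed fields can be written back out as CSV by joining each record and calling saveAsTextFile. A minimal sketch (output directory is illustrative); it does not quote fields, so it assumes none of them contain commas. For full quoting, opencsv's CSVWriter could be used instead.

// result is the RDD[Array[String]] built above with mapPartitions(parseCSV)
val csvLines = result.map(fields => fields.map(_.trim).mkString(","))
csvLines.saveAsTextFile("/data/output/temps-cleaned")  // illustrative output directory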
Loading + Saving Data
Tab-Separated Files
Similar to CSV, just use a tab as the separator:
val parser = new CSVParser('\t')
Loading + Saving Data
SequenceFiles
● Popular Hadoop format
○ Good for handling many small files
○ Creates InputSplits without too much data transport
● Composed of flat files with key/value pairs
● Have sync markers
○ Allow readers to seek to a point
○ Then resynchronize with the record boundaries
○ This lets Spark read them efficiently in parallel from multiple nodes
Loading + Saving Data
Loading SequenceFiles
import org.apache.hadoop.io.{Text, IntWritable}
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable])
data.map(func)
…
data.saveAsSequenceFile(outputFile)
Loading + Saving Data
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)))
rdd.saveAsSequenceFile("pysequencefile1")
Saving SequenceFiles - Example
Loading + Saving Data
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
Loading SequenceFiles - Example
Loading + Saving Data
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
Loading SequenceFiles - Example
Loading + Saving Data
Loading SequenceFiles - Example
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()
//Array((key1,1.0), (key2,2.0), (key3,3.0))
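saveAsSequenceFile also accepts an optional compression codec as a second argument. A minimal sketch, reusing the pair RDD from the saving example and assuming Gzip is acceptable for the data (output path is illustrative):

import org.apache.hadoop.io.compress.GzipCodec

// Same pair RDD as before, written with Gzip compression
rdd.saveAsSequenceFile("pysequencefile-gz", Some(classOf[GzipCodec]))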
Loading + Saving Data
● Simple wrapper around SequenceFiles
● Values are written out using Java Serialization.
● Intended to be used for Spark jobs communicating with other
Spark jobs
● Can also be quite slow.
Object Files
Loading + Saving Data
Object Files
● Saving - saveAsObjectFile() on an RDD
● Loading - objectFile() on SparkContext
● Require almost no work to save almost arbitrary objects.
● Not available in Python; use pickle files instead
● If you change the objects, old files may not be valid
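A minimal round-trip sketch using saveAsObjectFile and objectFile (the data and output directory are illustrative):

val pairs = sc.parallelize(Seq(("user1", 4.0), ("user2", 3.5)))
// Save using Java serialization; the directory must not already exist
pairs.saveAsObjectFile("ratings-obj")   // illustrative output directory
// Load back; the element type is supplied explicitly
val loaded = sc.objectFile[(String, Double)]("ratings-obj")
loaded.collect()                        // e.g. Array((user1,4.0), (user2,3.5))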
Loading + Saving Data
Pickle File
● Python way of handling object files
● Uses Python’s pickle serialization library
● Saving - saveAsPickleFile() on an RDD
● Loading - pickleFile() on SparkContext
● Can also be quite slow, just like object files
Loading + Saving Data
● Access Hadoop-supported storage formats
● Many key/value stores provide Hadoop input formats
● Example providers: HBase, MongoDB
● Older: hadoopFile() / saveAsHadoopFile()
● Newer: newAPIHadoopDataset() / saveAsNewAPIHadoopDataset()
● Takes a Configuration object on which you set the Hadoop properties
Non-filesystem data sources - hadoopFile
Loading + Saving Data
Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file
system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is
the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - Old API
hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None,
valueConverter=None, conf=None, batchSize=0)
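For comparison, here is what the old-API call looks like from the Scala shell, reading plain text through the mapred TextInputFormat (key = byte offset, value = line). A minimal sketch reusing the temps.csv path from the earlier examples:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat  // 'old' mapred API

val kv = sc.hadoopFile("/data/spark/temps.csv",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
// Hadoop reuses Writable objects, so convert to plain types before collecting
kv.map { case (offset, text) => (offset.get(), text.toString) }.take(3)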
Loading + Saving Data
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local
file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism
is the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g.
“org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - New API
newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None,
valueConverter=None, conf=None, batchSize=0)
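The corresponding new-API call from the Scala shell is almost identical, except that the InputFormat comes from the mapreduce package. A minimal sketch, again using the temps.csv path as an example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat  // 'new' mapreduce API

val kv = sc.newAPIHadoopFile("/data/spark/temps.csv",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
kv.map { case (_, text) => text.toString }.take(3)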
Loading + Saving Data
See https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/marketdata.minbars"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# read the 1-minute bars from MongoDB into Spark RDD format
minBarRawRDD = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName,
valueClassName, None, None, config)
Hadoop Input and Output Formats - New API
Loading Data from MongoDB
Loading + Saving Data
● Developed at Google for internal RPCs
● Open sourced
● Structured data - fields & types of fields defined
● Fast to encode and decode (20-100x faster than XML)
● Take up minimal space (3-10x smaller than XML)
● Defined using a domain-specific language
● Compiler generates accessor methods in a variety of languages
● Consist of fields: optional, required, or repeated
● While parsing
○ A missing optional field => success
○ A missing required field => failure
● So, make new fields optional (remember the object-file versioning problem?)
Protocol buffers
Loading + Saving Data
Protocol buffers - Example
package tutorial;
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phone = 4;
}
message AddressBook {
repeated Person person = 1;
}
Loading + Saving Data
1. Download and install protocol buffer compiler
2. pip install protobuf
3. protoc -I=$SRC_DIR --python_out=$DST_DIR
$SRC_DIR/addressbook.proto
4. create objects
5. Convert those into protocol buffers
6. See this project
Protocol buffers - Steps
Loading + Saving Data
1. To save storage and network overhead
2. With most Hadoop output formats we can specify a compression codec
3. Compression should not require the whole file at once
4. Each worker can find the start of a record => splittable
5. You can configure HDP for LZO using Ambari:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
File Compression
Loading + Saving Data
File Compression Options
Characteristics of compression:
● Splittable
● Speed
● Effectiveness on Text
● Codec
Loading + Saving Data
File Compression Options
Format | Splittable | Speed | Effectiveness on text | Hadoop compression codec | Comments
gzip | N | Fast | High | org.apache.hadoop.io.compress.GzipCodec |
lzo | Y | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | LZO requires installation on every worker node
bzip2 | Y | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Uses pure Java for the splittable version
zlib | N | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Default compression codec for Hadoop
Snappy | N | Very fast | Low | org.apache.hadoop.io.compress.SnappyCodec | There is a pure Java port of Snappy but it is not yet available in Spark/Hadoop
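When saving text output, one of the codec classes from the table can be passed straight to saveAsTextFile. A minimal sketch using Gzip (the output path is illustrative):

import org.apache.hadoop.io.compress.GzipCodec

val data = sc.textFile("/data/spark/temps.csv")
// Each output part file is written gzip-compressed
data.saveAsTextFile("/data/output/temps-gz", classOf[GzipCodec])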
Loading + Saving Data
1. Enable LZO in Hadoop by updating the Hadoop configuration:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
2. Create data:
$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
3. Update spark-env.sh with
export SPARK_CLASSPATH=$SPARK_CLASSPATH:hadoop-lzo-0.4.20-SNAPSHOT.jar
In your code, use:
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
Ref: https://gist.github.com/zedar/c43cbc7ff7f98abee885
Handling LZO
Loading + Saving Data
Loading + Saving Data: File Systems
Loading + Saving Data
1. rdd = sc.textFile("file:///home/holden/happypandas.gz")
2. The path has to be available on all nodes.
Otherwise, load it locally and distribute using sc.parallelize
Local/“Regular” FS
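A minimal sketch of that fallback: read the file on the driver with plain Scala I/O and distribute it with sc.parallelize. The plain-text path is illustrative, and this only suits data small enough to fit in driver memory:

import scala.io.Source

// Read on the driver only, then ship the lines to the cluster
val localLines = Source.fromFile("/home/holden/happypandas.txt").getLines().toList
val rdd = sc.parallelize(localLines)
rdd.count()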
Loading + Saving Data
1. Popular option
2. Good if nodes are inside EC2
3. Use path in all input methods (textFile, hadoopFile etc)
s3n://bucket/path-within-bucket
4. Set environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
More details: https://cloudxlab.com/blog/access-s3-files-spark/
Amazon S3
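The credentials can also be set on the Hadoop configuration from code instead of via environment variables. A sketch assuming the s3n connector (the bucket and path are illustrative):

// Property names are specific to the s3n filesystem connector
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val logs = sc.textFile("s3n://my-bucket/path-within-bucket/")  // illustrative bucket and path
logs.count()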
Loading + Saving Data
1. The Hadoop Distributed File System
2. Spark and HDFS can be collocated on the same machines
3. Spark can take advantage of this data locality to avoid network overhead
4. In all i/o methods, use path: hdfs://master:port/path
5. Use a Spark build that matches your HDFS version
HDFS
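A minimal sketch of reading from and writing back to HDFS; the namenode host, port, and output path are illustrative, and u.data is the tab-separated (user, item, rating, timestamp) file from the MovieLens dataset used earlier:

val ratings = sc.textFile("hdfs://master:8020/data/ml-100k/u.data")
// Count ratings per user: the first tab-separated field is the user id
val userCounts = ratings.map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _)
userCounts.saveAsTextFile("hdfs://master:8020/user/spark/user-counts")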
Thank you!
Spark - Loading + Saving Data