@gamussa @hazelcast #oraclecode
IN-MEMORY ANALYTICS
with APACHE SPARK and
HAZELCAST
Who am I?
Solutions Architect, Developer Advocate
@gamussa in internetz
Please follow me on Twitter, I'm very interesting
What’s Apache Spark?
Lightning-Fast Cluster Computing
Run programs up to 100x faster than Hadoop MapReduce in memory,
or 10x faster on disk.
When to use Spark?
Data Science Tasks
when questions are unknown
Data Processing Tasks
when you have too much data
You’re tired of Hadoop
Spark Architecture
RDD
Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark –
a fault-tolerant collection of elements that can be
operated on in parallel
RDD Operations
Two types of operations on RDDs:
transformations and actions
Transformations are lazy
(not computed immediately).
A transformed RDD is recomputed each time
an action is run on it (by default).
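The same lazy-then-forced pattern exists in plain Java streams, which makes a convenient local analogy (this is `java.util.stream`, not the Spark API): intermediate operations only build a pipeline, and a terminal operation, like a Spark action, forces it to run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    static List<String> run() {
        List<String> evaluated = new ArrayList<>();
        // Building the pipeline does no work yet, like an RDD transformation.
        Stream<String> pipeline = Stream.of("a", "b", "c")
                .map(s -> { evaluated.add(s); return s.toUpperCase(); });
        System.out.println("evaluated before terminal op: " + evaluated); // []
        // The terminal operation plays the role of a Spark action and
        // forces the whole pipeline to run.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("evaluated after: " + evaluated); // [a, b, c]
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [A, B, C]
    }
}
```

One difference worth noting: a Java stream can be consumed only once, while an RDD can be recomputed by every action run against it.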
RDD
Transformations
RDD
Actions
RDD
Fault Tolerance
RDD
Construction
Parallelized collections
take an existing Scala or Java collection
and run functions on it in parallel
Hadoop datasets
run functions on each record of a file in the Hadoop
distributed file system (HDFS) or any other storage
system supported by Hadoop
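As a plain-Java analogy for parallelized collections (a `java.util` parallel stream standing in for the cluster; this is not the Spark API, where you would call `JavaSparkContext.parallelize(...)`):

```java
import java.util.List;

public class ParallelDemo {
    // Square every element and sum the results, with the work spread
    // across local threads (where Spark spreads it across cluster nodes).
    static int squareSum(List<Integer> data) {
        return data.parallelStream().mapToInt(n -> n * n).sum();
    }

    public static void main(String[] args) {
        System.out.println(squareSum(List.of(1, 2, 3, 4, 5))); // 55
    }
}
```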
What’s Hazelcast IMDG?
The Fastest In-memory Data Grid
Hazelcast IMDG is an operational, in-memory,
distributed computing platform that manages data
using in-memory storage and performs parallel
execution for breakthrough application speed and scale.
High-Density Caching
In-Memory Data Grid
Web Session Clustering
Microservices Infrastructure
What’s Hazelcast IMDG?
In-memory Data Grid
Apache v2 Licensed
Distributed
Caches (IMap, JCache)
Java Collections (IList, ISet, IQueue)
Messaging (Topic, RingBuffer)
Computation (ExecutorService, MapReduce)
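A minimal embedded-member sketch of the IMap API (class names from the Hazelcast 3.x core API; the map name "movie" is illustrative):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ImdgDemo {
    static String demo() {
        // Start an embedded member; in production this would join
        // an existing cluster over the network.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        try {
            // IMap is a distributed, partitioned java.util.Map.
            IMap<Integer, String> movies = hz.getMap("movie");
            movies.put(1, "The Matrix");
            return movies.get(1);
        } finally {
            hz.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```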
[Diagram: primary and backup copies of each partition distributed across the cluster]
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");
final JavaSparkContext jsc =
        new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
Demo
LIMITATIONS
DATA SHOULD NOT BE
UPDATED WHILE READING
FROM SPARK
WHY?
MAP EXPANSION
SHUFFLES THE DATA
INSIDE THE BUCKET
CURSOR DOESN’T POINT TO
CORRECT ENTRY ANYMORE,
DUPLICATE OR MISSING
ENTRIES COULD OCCUR
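A plain `java.util.HashMap` shows the same rehash-during-iteration hazard, except that HashMap's fail-fast iterator at least detects it, while Hazelcast's cursor-based iteration can silently return duplicate or missing entries. A small sketch of the HashMap case:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CursorDemo {
    static boolean iterateWhileWriting() {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < 100; i++) map.put(i, i);
        try {
            for (Integer key : map.keySet()) {
                // A write during iteration is a structural modification and
                // may trigger a rehash, moving entries between buckets
                // while the iterator is mid-traversal.
                map.put(key + 1000, key);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // HashMap fails fast; Hazelcast's cursor has no such check,
            // so it risks duplicate or missing entries instead.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("detected concurrent modification: " + iterateWhileWriting());
    }
}
```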
github.com/hazelcast/hazelcast-spark
THANKS!
Any questions?
You can find me at
@gamussa
viktor@hazelcast.com

[OracleCode SF] In-Memory Analytics with Apache Spark and Hazelcast