Flink internals

Kostas Tzoumas
Flink committer & Co-founder, data Artisans
ktzoumas@apache.org 
@kostas_tzoumas
Welcome

§ Last talk: how to program PageRank in Flink, and the Flink programming model
§ This talk: how Flink works internally
§ Again, a big bravo to the Flink community
Recap: Using Flink
DataSet and transformations

Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath); // inputPath: path to the input file
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
Available transformations 
§ map 
§ flatMap 
§ filter 
§ reduce 
§ reduceGroup 
§ join 
§ coGroup 
§ aggregate 
§ cross 
§ project 
§ distinct 
§ union 
§ iterate 
§ iterateDelta 
§ repartition 
§ … 
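To make a couple of these concrete, here is a hedged sketch (not from the original deck) of groupBy plus reduceGroup in the Java API; the (word, count) tuples are made up for illustration, and in recent API versions the group reduce function receives an Iterable:

DataSet<Tuple2<String, Integer>> counts = ...; // hypothetical input
DataSet<Tuple2<String, Integer>> totals = counts
    .groupBy(0) // group by the String field
    .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>,
                                         Tuple2<String, Integer>>() {
        @Override
        public void reduce(Iterable<Tuple2<String, Integer>> values,
                           Collector<Tuple2<String, Integer>> out) {
            // Unlike reduce, reduceGroup sees all records of a group at once.
            String key = null;
            int sum = 0;
            for (Tuple2<String, Integer> v : values) {
                key = v.f0;
                sum += v.f1;
            }
            out.collect(new Tuple2<>(key, sum));
        }
    });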
Other API elements & tools 
§ Accumulators and counters 
• Int, Long, Double counters 
• Histogram accumulator 
• Define your own 
§ Broadcast variables 
§ Plan visualization 
§ Local debugging/testing mode 
Data types and grouping

public static class Access {
    public int userId;
    public String url;
    ...
}

public static class User {
    public int userId;
    public int region;
    public Date customerSince;
    ...
}

DataSet<Tuple2<Access, User>> campaign = access.join(users)
    .where("userId").equalTo("userId");

DataSet<Tuple3<Integer, String, String>> someLog;
someLog.groupBy(0, 1).reduceGroup(...);

§ Bean-style Java classes & field names
§ Tuples and position addressing
§ Any data type with key selector function
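For the third option, a minimal sketch (assuming the Access class from above): a KeySelector extracts the grouping key from an arbitrary type:

DataSet<Access> accessLog = ...; // hypothetical input
accessLog
    .groupBy(new KeySelector<Access, Integer>() {
        @Override
        public Integer getKey(Access a) {
            return a.userId; // the extracted key drives the grouping
        }
    })
    .reduceGroup(...);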
Other API elements

§ Hadoop compatibility
• Supports all Hadoop data types, input/output formats, Hadoop mappers and reducers
§ Data streaming API
• DataStream instead of DataSet
• Similar set of operators
• Currently in alpha but moving very fast
§ Scala and Java APIs (mirrored)
§ Graph API (Spargel)
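Since the streaming API is still in alpha, details may shift, but a minimal sketch of the DataStream flavor of the earlier filter example might look like this (the socket source and port are illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read lines from a socket instead of a file; otherwise the program mirrors
// the batch version.
DataStream<String> lines = env.socketTextStream("localhost", 9999);
lines
    .filter(str -> str.contains("Apache Flink"))
    .print();
env.execute();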
Intro to internals
DataSet<String> text = env.readTextFile(inputPath);
DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(SUM, 1); // SUM = Aggregations.SUM (static import)

[Diagram: the Flink Client & Optimizer submits the program to the Job Manager, which schedules it on the Task Managers. Input lines such as “O Romeo, Romeo, wherefore art thou Romeo?” and “Nor arm, nor face, nor any other part” come out as counts: (O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1) and (nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1).]
If you want to know one thing about Flink, it is that you don’t need to know the internals of Flink.
Philosophy

§ Flink “hides” its internal workings from the user
§ This is good
• The user does not worry about how jobs are executed
• Internals can be changed without breaking user programs
§ … and bad
• The execution model is more complicated to explain compared to MapReduce or Spark RDDs
Recap: DataSet

Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath); // inputPath: path to the input file
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
Common misconception

Input → Operator X → First → Operator Y → Second

§ Programs are not executed eagerly
§ Instead, the system compiles the program to an execution plan and executes that plan
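One way to see this for yourself: instead of executing, ask the environment for the compiled plan. A small sketch, assuming the JSON plan dump that feeds the plan visualizer mentioned earlier (inputPath/outputPath are placeholders):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.readTextFile(inputPath)
   .filter(str -> str.contains("Apache Flink"))
   .writeAsText(outputPath); // declares a sink; still nothing runs

// Dumps the compiled execution plan as JSON without running the program:
System.out.println(env.getExecutionPlan());

// Only an explicit execute() triggers plan compilation, optimization,
// and execution:
// env.execute();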
DataSet<String>

§ Think of it as a PCollection<String>, or a Spark RDD[String]
§ With a major difference: it can be produced/recovered in several ways
• … like a Java collection
• … like an RDD
• … perhaps it is never fully materialized (because the program does not need it to be)
• … implicitly updated in an iteration
§ And this is transparent to the user
Example: grep

[Diagram: the log (“Romeo, Romeo, where art thou Romeo?”) is loaded once (Load Log) and scanned by three searches (Search for str1, str2, str3), producing Grep 1, Grep 2, and Grep 3.]
Staged (batch) execution

[Diagram: the grep dataflow from the previous slide, executed stage by stage.]

Stage 1: Create/cache Log
Subsequent stages: grep the log for matches
Caching in-memory, and on disk if needed
Pipelined execution

[Diagram: the same grep dataflow, with all operators deployed at once and data streaming through them.]

Stage 1: Deploy and start operators
Data transfer in-memory, and via disk if needed

Note: the Log DataSet is never “created”!
Benefits of pipelining

§ 25 node cluster
§ Grep log for 3 terms
§ Scale data size from 100GB to 1TB

[Chart: time to complete grep (sec) vs. data size (GB), from 0 to 1000 GB, for “Pipelined with Flink”; the point where cluster memory is exceeded is marked.]
Drawbacks of pipelining

§ Long pipelines may be active at the same time, leading to memory fragmentation
• FLINK-1101: Changes memory allocation from static to adaptive
§ Fault tolerance is harder to get right
• FLINK-986: Adds intermediate data sets (similar to RDDs) as first-class citizens to the Flink runtime. Will lead to fine-grained fault tolerance, among other features.
Example: Iterative processing

DataSet<Page> pages = ...
DataSet<Neighborhood> edges = ...

DataSet<Page> oldRanks = pages;
DataSet<Page> newRanks;

for (int i = 0; i < maxIterations; i++) {
    newRanks = update(oldRanks, edges);
    oldRanks = newRanks;
}
DataSet<Page> result = newRanks;

DataSet<Page> update(DataSet<Page> ranks, DataSet<Neighborhood> adjacency) {
    return ranks
        .join(adjacency)
        .where("id").equalTo("id")
        .with((page, adj, out) -> {
            for (long n : adj.neighbors)
                out.collect(new Page(n, df * page.rank / adj.neighbors.length)); // df: damping factor
        })
        .groupBy("id")
        .reduce((a, b) -> new Page(a.id, a.rank + b.rank));
}
Iterate by unrolling

[Diagram: the client submits a sequence of step jobs, one after another.]

§ A for/while loop in the client submits one job per iteration step
§ Data reuse by caching in memory and/or on disk
Iterate natively

DataSet<Page> pages = ...
DataSet<Neighborhood> edges = ...

IterativeDataSet<Page> pagesIter = pages.iterate(maxIterations);
DataSet<Page> newRanks = update(pagesIter, edges);
DataSet<Page> result = pagesIter.closeWith(newRanks);

[Diagram: the initial solution enters the iteration; in each step, the step function combines the partial solution with other datasets, and its output replaces the partial solution; closeWith yields the iteration result.]
Iterate natively with deltas

DeltaIteration<...> pagesIter = pages.iterateDelta(initialDeltas, maxIterations, 0);
DataSet<...> newRanks = update(pagesIter, edges);
DataSet<...> deltas = ...
DataSet<...> result = pagesIter.closeWith(newRanks, deltas);

[Diagram: an initial workset and an initial partial solution enter the iteration; each step replaces the workset and merges the delta set into the partial solution; closeWith yields the iteration result.]

See http://data-artisans.com/data-analysis-with-flink.html
Native, unrolling, and delta 
Dissecting Flink
The growing Flink stack

[Stack diagram, top to bottom:
• APIs: Java API (batch), Scala API (batch), Java API (streaming), Python API (upcoming), Graph API, Apache MRQL
• Common API, with the Flink Optimizer and the Flink Stream Builder
• Flink Local Runtime
• Execution environments: embedded (Java collections), local (for debugging, single node execution), remote (regular cluster execution on a standalone or YARN cluster), Apache Tez
• Data storage: files, HDFS, S3, JDBC, Kafka, RabbitMQ, Redis, Azure tables, …]
Stack without Flink Streaming

Focus on regular (batch) processing…

[The same stack diagram, with the streaming components (Flink Stream Builder and the streaming Java API) removed.]
Program lifecycle

[The stack diagram again; numbered steps (1–5) trace a program from the API down through the Common API, the Flink Optimizer, and the Flink Local Runtime to the execution environment.]

val source1 = …
val source2 = …
val maxed = source1
  .map(v => (v._1, v._2, math.max(v._1, v._2)))
val filtered = source2
  .filter(v => (v._1 > 4))
val result = maxed
  .join(filtered).where(0).equalTo(0)
  .filter(_._1 > 3)
  .groupBy(0)
  .reduceGroup {……}
Flink Optimizer

[Stack diagram as before, highlighting the Flink Optimizer layer.]

§ The optimizer is the component that selects an execution plan for a Common API program
§ Think of an AI system manipulating your program for you :-)
§ But don’t be scared – it works
• Relational databases have been doing this for decades – Flink ports the technology to API-based systems
A simple program

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = …
DataSet<Tuple2<Integer, Double>> lineitems = …

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(...)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);
Two execution plans

[Two plan diagrams for the program above.

Plan A: DataSource orders.tbl → Filter → Map is broadcast to the build side (buildHT) of a Hybrid Hash Join, while DataSource lineitem.tbl is forwarded to the probe side; the join output goes through a Combine into a sort-based GroupReduce.

Plan B: both inputs are hash-partitioned on [0] into the Hybrid Hash Join (buildHT/probe), and the join output is hash-partitioned on [0,1] before the sort-based GroupReduce.]

The best plan depends on the relative sizes of the input files.
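The user can also hint this choice instead of leaving it entirely to the optimizer. A hedged sketch using the size-hint variants of join in the Java DataSet API (joinWithTiny/joinWithHuge declare which side is expected to be small):

// Declare that lineitems is much bigger than filteredOrders, nudging the
// optimizer toward broadcasting the small side and building the hash
// table on it:
filteredOrders
    .joinWithHuge(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);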
Flink Local Runtime

[Stack diagram as before, highlighting the Flink Local Runtime layer.]

§ The local runtime, not the distributed execution engine
§ Aka: what happens inside every parallel task
Flink runtime operators

§ Sorting and hashing data
• Necessary for grouping, aggregation, reduce, join, coGroup, delta iterations
§ Flink contains tailored implementations of hybrid hashing and external sorting in Java
• They scale well with both abundant and restricted memory sizes
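To illustrate the external-sorting idea, here is a toy sketch of the merge phase of the general technique (not Flink's actual sorter): sorted runs are spilled when memory fills up, then merged with a priority queue holding one cursor per run. In-memory arrays stand in for spilled files:

import java.util.Comparator;
import java.util.PriorityQueue;

public class KWayMerge {
    public static void main(String[] args) {
        // Three sorted "runs", as an external sort would have spilled them.
        long[][] runs = { {1, 4, 9}, {2, 3, 10}, {0, 7, 8} };
        // Queue entries: {value, runIndex, offsetInRun}, ordered by value.
        PriorityQueue<long[]> pq =
            new PriorityQueue<>(Comparator.comparingLong(e -> e[0]));
        for (int r = 0; r < runs.length; r++) {
            pq.add(new long[] { runs[r][0], r, 0 });
        }
        while (!pq.isEmpty()) {
            long[] e = pq.poll();
            System.out.println(e[0]); // emit the next element in global order
            int r = (int) e[1], next = (int) e[2] + 1;
            if (next < runs[r].length) {
                pq.add(new long[] { runs[r][next], r, next }); // advance cursor
            }
        }
    }
}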
Internal data representation

How is intermediate data internally represented?

[Diagram: in the map task's JVM heap, records such as “O Romeo, Romeo, wherefore art thou Romeo?” are serialized into byte sequences; the bytes are transferred over the network; in the reduce task's JVM heap they are sorted locally, still in serialized form, into (art, 1), (O, 1), (Romeo, 1), (Romeo, 1), …]
Internal data representation

§ Two options: Java objects or raw bytes
§ Java objects
• Easier to program
• Can suffer from GC overhead
• Hard to de-stage data to disk; may suffer from OutOfMemory exceptions
§ Raw bytes
• Harder to program (custom serialization stack, more involved runtime operators)
• Solves most memory and GC problems
• Overhead from object (de)serialization
§ Flink follows the raw-bytes approach
Memory in Flink

public class WC {
    public String word;
    public int count;
}

[Diagram: the JVM heap is divided into an unmanaged heap holding user code objects, a managed heap of memory pages (empty pages drawn from a pool, used for sorting, hashing, and caching), and network buffers used for shuffling and broadcasts.]
Memory in Flink (2)

§ Internal memory management
• Flink initially allocates 70% of the free heap as byte[] segments
• Internal operators allocate() and release() these segments
§ Flink has its own serialization stack
• All accepted data types are serialized into data segments
§ Easy to reason about memory, (almost) no OutOfMemory errors, reduced pressure on the GC (smooth performance)
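A toy sketch of the managed-memory pattern described above (an illustration of the idea, not Flink's memory segment code): pages are allocated once up front, and operators borrow and return them instead of allocating on the heap:

import java.util.ArrayDeque;
import java.util.Deque;

public class PagePool {
    private static final int PAGE_SIZE = 32 * 1024;
    private final Deque<byte[]> freePages = new ArrayDeque<>();

    public PagePool(int numPages) {
        for (int i = 0; i < numPages; i++) {
            freePages.push(new byte[PAGE_SIZE]); // allocated once, reused forever
        }
    }

    public byte[] allocate() {
        if (freePages.isEmpty()) {
            // A real operator would now spill to disk rather than grow the heap.
            throw new IllegalStateException("out of managed memory");
        }
        return freePages.pop();
    }

    public void release(byte[] page) {
        freePages.push(page); // the page returns to the pool; no garbage is created
    }
}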
Operating on serialized data

Microbenchmark:
§ Sorting 1GB worth of (long, double) tuples
§ 67,108,864 elements
§ Simple quicksort
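A toy sketch of what sorting on serialized data can look like (the general technique, not Flink's sorter): each (long, double) tuple is laid out as 16 bytes, and flipping the key's sign bit makes byte-wise unsigned comparison agree with signed long order, so the sort never deserializes a record:

import java.nio.ByteBuffer;
import java.util.Arrays;

public class SerializedSort {
    static final int REC = 16; // 8-byte key + 8-byte value per record

    public static void main(String[] args) {
        long[][] tuples = { {3, 30}, {-1, 10}, {2, 20} };
        byte[] data = new byte[tuples.length * REC];
        ByteBuffer buf = ByteBuffer.wrap(data);
        for (long[] t : tuples) {
            buf.putLong(t[0] ^ Long.MIN_VALUE);                  // normalized key
            buf.putLong(Double.doubleToLongBits((double) t[1])); // value payload
        }
        // Sort record indices by comparing the key bytes directly in the buffer.
        Integer[] order = { 0, 1, 2 };
        Arrays.sort(order, (a, b) -> {
            for (int i = 0; i < 8; i++) {
                int x = data[a * REC + i] & 0xFF;
                int y = data[b * REC + i] & 0xFF;
                if (x != y) return x - y;
            }
            return 0;
        });
        for (int idx : order) {
            long key = ByteBuffer.wrap(data, idx * REC, 8).getLong() ^ Long.MIN_VALUE;
            System.out.println(key); // prints -1, 2, 3
        }
    }
}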
Flink distributed execution

[Stack diagram as before, highlighting the cluster execution environments.]

§ Pipelined
• Same engine for Flink and Flink streaming
§ Pluggable
• The local runtime can be executed on other engines
• E.g., Java collections and Apache Tez
Closing 
Summary

§ Flink decouples the API from execution
• The same program can be executed in many different ways
• Hopefully users do not need to care about this, and still get very good performance
§ Unique Flink internal features
• Pipelined execution, native iterations, the optimizer, serialized data manipulation, good disk destaging
§ Very good performance
• Known issues are actively being worked on
Stay informed 
§ flink.incubator.apache.org 
• Subscribe to the mailing lists! 
• http://flink.incubator.apache.org/community.html#mailing-lists 
§ Blogs 
• flink.incubator.apache.org/blog 
• data-artisans.com/blog 
§ Twitter 
• follow @ApacheFlink 
That’s it, time for beer 
