Streaming Distributed Data
Processing with Silk
Taro L. Saito
University of Tokyo
leo@xerial.org
March 3rd, 2014
DEIM2014

xerial.org/silk Twitter @taroleo

1
Distributed Data Processing
Streaming Distributed Data Processing with Silk



Translate this data processing program
A



g

f

B

C

into a cluster computing program
g

f

A0

B0

A1

B1

A2

B2
map
xerial.org/silk Twitter @taroleo

C

reduce
2
Streaming Distributed Data Processing
Streaming Distributed Data Processing with Silk



What is streaming?

A

f

g

B

F

C

G
D



E

Silk: A framework for building and running complex
workflows of distributed data processing
xerial.org/silk Twitter @taroleo

3
Problem Definition
Streaming Distributed Data Processing with Silk



How do we run the distributed data processing while
extending the program?
A

f

g

B

F

C

G
D

xerial.org/silk Twitter @taroleo

E

4
Silk
Streaming Distributed Data Processing with Silk



Describing Dataflows in Scala


A dataflow in Silk is a sequence of function calls




Type safe and concise syntax, easy to learn.

Silk[A] : Set of type A object

xerial.org/silk Twitter @taroleo

5
Object-Oriented Dataflow Programming
Streaming Distributed Data Processing with Silk



Reusing and overriding dataflow programs

xerial.org/silk Twitter @taroleo

6
Big Data Volumes in Human Genome Analysis
Streaming Distributed Data Processing with Silk

Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)



DNA Sequencer (Illumina, PacBio, etc.)






f: An alignment program
Output: Alignment results 750GB (sequence + alignment data)



Total storage space required: 1.2TB
Computational time required: 1 days (using hundreds of CPUs)

Input

f

Output

University of Tokyo Genome Browser (UTGB)
xerial.org/silk Twitter @taroleo

7
Varieties of Scientific Data and Analysis
Streaming Distributed Data Processing with Silk



WormTSS: http://wormtss.utgenome.org/

Integrating various data sources, hundreds of data analysis…

xerial.org/silk Twitter @taroleo

8
Produced Thousands of Data Analysis Charts
Streaming Distributed Data Processing with Silk

Using R, JFreeChart, etc.
Need a automated
pipeline to redo the entire
analysis for answering the
paper review within a
month.

xerial.org/silk Twitter @taroleo

9
Writing A Dataflow
Streaming Distributed Data Processing with Silk

a Program v1

f

A

B
val B = A.map(f)



Apply function f to the input A, then produce the output B


This step may take more than 1 hours in big data analysis

xerial.org/silk Twitter @taroleo

10
Distribution and Fault Tolerance
Streaming Distributed Data Processing with Silk



Resume only B2 = A2.map(f)
a Program v1

f

A

B

f

A0

B0

A1

B1

A2

B2
Failure!
xerial.org/silk Twitter @taroleo

Retry

11
Extending Dataflows
Streaming Distributed Data Processing with Silk

Program v2
Program v1

A




f

g

B

C

While running program v1, adding another code (program v2)
How do we reuse the already computed result (B) to generate C?

xerial.org/silk Twitter @taroleo

12
Marking to A Program
Streaming Distributed Data Processing with Silk

Program v2
Program v1

A

f

g

B

C

val B = A.map(f)
val C = B.map(g)


Storing intermediate results using variable names


variable names := program markers!!



But, we lost variable names after compilation



Extracting AST and variable names upon compile time


Using Scala Macros (Since Scala 2.10)
xerial.org/silk Twitter @taroleo

13
Scala Program (AST) to DAG Schedule (Logical Plan)
Streaming Distributed Data Processing with Silk

Program v2
Program v1

A



g

B

C

Translating a program (AST) into a set of Silk operations (DAG)





f

val B = MapOp(input:A, output:B, function:f)
val C = MapOp(input:B, output:C, function:g)

Operations in Silk can be nested


val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)

xerial.org/silk Twitter @taroleo

14
Weaving Silks
Streaming Distributed Data Processing with Silk

In-memory weaver

Cluster weaver

Result
Hadoop weaver

Silk[A]
(operation DAG)



Weave

Output

Data analysis code is independent from weavers

xerial.org/silk Twitter @taroleo

15
Cluster Weaver: Logical Plan to Physical Plan on Cluster
Streaming Distributed Data Processing with Silk



Logical plan




GroupByOp(in:people, out:g, key: {_.dept.id})

Physical plan
P1

Partition
(hashing)

S1

P1

S3

S1

P1

S1

S2

P2

P2

S2

S2

P2

S3

S2

P2

S1

S3

P3

P2

S2

S3

P3

P3
Scatter

S2

P1

I3

P2

P3

I2

P1

P1

Silk[people]

S1

P3

I1

S1

S3

S3

P3

serialization

shuffle

deserialization

xerial.org/silk Twitter @taroleo

R1

R2

R3
merge sort

16
Local machine

Local ClassBox
User program
builds workflows

Weaving Silk materializes objects

classpaths & local jar files
•
•
•
•
•

Silk[A]

Silk[A]

read file, toSilk
map, reduce, join,
groupBy
UNIX commands
etc.

SilkSingle[A]

SilkSeq[A]

weave
weave

Static optimization

A

DAG Schedule

single object
•
•

Cluster
•
•
•
•

Dispatches tasks to clients
Manages master resource table
Authorizes resource allocation
Automatic recovery by
leader election in ZK

Register ClassBox
Submit schedule

ZooKeeper
ensemble mode
(at least 3 ZK instances)

Silk Master
•
•

dispatch

•
•

Silk Client

Silk Client

Task Scheduler

Task Scheduler

Task Executor

Task Executor

Resource Monitor

Resource Monitor

Data Server

Data Server

Leader election
Collects locations of slices
and ClassBox jars
Watches active nodes
Watches available resources

Seq[A]
sequence of objects

Node Table
Slice Table
Task Status
Resource Table
(CPU, memory)
ClassBox Table

•
•
•
•
•
•
•
•

Submits tasks
Run-time optimization
Resource allocation
Monitoring resource usage
Launches Web UI
Manages assigned task status
Object serialization/deserialization
Serves slice data

xerial.org/silk Twitter @taroleo

17
Static Optimization
Streaming Distributed Data Processing with Silk



Tree transformation






map(f).map(g) => map(g・f)
(Function composition)
map(f).filter(p) => mapWithFilter(f, p) (Reduces intermediate
data)
Pushing-down selection
Retrieves only accessed fields in an object




Analyzing the byte code of functions with ASM

Rewriting logical plans using pattern matching in Scala


Easy to add optimization rules

xerial.org/silk Twitter @taroleo

18
Run-time Optimization
Streaming Distributed Data Processing with Silk



Adjusting the number of data splits


According to the available cluster resources.



Multi-core execution



Omega-based task scheduler


Sharing the cluster resource table between nodes




Each node determines how to use the resource

Monitoring actual CPU/memory resources periodically

xerial.org/silk Twitter @taroleo

19
UNIX Command Workflows in Silk
Streaming Distributed Data Processing with Silk


c”(UNIX Command)”

xerial.org/silk Twitter @taroleo

20
Buffer Management
Streaming Distributed Data Processing with Silk




Silk frequently uses distributed memory (like Spark)
LArray[A]







Immediate memory deallocation (free)




To eliminate OutOfMemoryException and GC-stall

Fast memory allocation




Allocating Off-heap (outside JVM heap)memories
sun.misc.Unsafe
Github: https://github.com/xerial/larray

Skips zero-filling

Object Serialization


Extending msgpack





Scala Pickling
Inject ser/dser codes

Off-heap objects
xerial.org/silk Twitter @taroleo

21
Summary
Streaming Distributed Data Processing with Silk



Silk


A framework for distributed data processing for all data scientists




Object-oriented data processing programming




Similar to query optimization in DBMS

Analyze Data as You Write Programs!




Reuse, override and mix-in

Optimizing data flow programs




including non-experts in distributed data processing (e.g. Biologists)

Database research now enters program optimization.

In Future


Workflow queries





Making queries against dataflow program
Monitoring intermediate results

Multi-user program execution
xerial.org/silk Twitter @taroleo

22
http://xerial.org/silk
Streaming Distributed Data Processing with Silk

xerial.org/silk Twitter @taroleo

23

Streaming Distributed Data Processing with Silk #deim2014

  • 1.
    Streaming Distributed Data Processingwith Silk Taro L. Saito University of Tokyo leo@xerial.org March 3rd, 2014 DEIM2014 xerial.org/silk Twitter @taroleo 1
  • 2.
    Distributed Data Processing StreamingDistributed Data Processing with Silk  Translate this data processing program A  g f B C into a cluster computing program g f A0 B0 A1 B1 A2 B2 map xerial.org/silk Twitter @taroleo C reduce 2
  • 3.
    Streaming Distributed DataProcessing Streaming Distributed Data Processing with Silk  What is streaming? A f g B F C G D  E Silk: A framework for building and running complex workflows of distributed data processing xerial.org/silk Twitter @taroleo 3
  • 4.
    Problem Definition Streaming DistributedData Processing with Silk  How do we run the distributed data processing while extending the program? A f g B F C G D xerial.org/silk Twitter @taroleo E 4
  • 5.
    Silk Streaming Distributed DataProcessing with Silk  Describing Dataflows in Scala  A dataflow in Silk is a sequence of function calls   Type safe and concise syntax, easy to learn. Silk[A] : Set of type A object xerial.org/silk Twitter @taroleo 5
  • 6.
    Object-Oriented Dataflow Programming StreamingDistributed Data Processing with Silk  Reusing and overriding dataflow programs xerial.org/silk Twitter @taroleo 6
  • 7.
    Big Data Volumesin Human Genome Analysis Streaming Distributed Data Processing with Silk Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)  DNA Sequencer (Illumina, PacBio, etc.)    f: An alignment program Output: Alignment results 750GB (sequence + alignment data)   Total storage space required: 1.2TB Computational time required: 1 days (using hundreds of CPUs) Input f Output University of Tokyo Genome Browser (UTGB) xerial.org/silk Twitter @taroleo 7
  • 8.
    Varieties of ScientificData and Analysis Streaming Distributed Data Processing with Silk  WormTSS: http://wormtss.utgenome.org/  Integrating various data sources, hundreds of data analysis… xerial.org/silk Twitter @taroleo 8
  • 9.
    Produced Thousands ofData Analysis Charts Streaming Distributed Data Processing with Silk Using R, JFreeChart, etc. Need a automated pipeline to redo the entire analysis for answering the paper review within a month. xerial.org/silk Twitter @taroleo 9
  • 10.
    Writing A Dataflow StreamingDistributed Data Processing with Silk a Program v1 f A B val B = A.map(f)  Apply function f to the input A, then produce the output B  This step may take more than 1 hours in big data analysis xerial.org/silk Twitter @taroleo 10
  • 11.
    Distribution and FaultTolerance Streaming Distributed Data Processing with Silk  Resume only B2 = A2.map(f) a Program v1 f A B f A0 B0 A1 B1 A2 B2 Failure! xerial.org/silk Twitter @taroleo Retry 11
  • 12.
    Extending Dataflows Streaming DistributedData Processing with Silk Program v2 Program v1 A   f g B C While running program v1, adding another code (program v2) How do we reuse the already computed result (B) to generate C? xerial.org/silk Twitter @taroleo 12
  • 13.
    Marking to AProgram Streaming Distributed Data Processing with Silk Program v2 Program v1 A f g B C val B = A.map(f) val C = B.map(g)  Storing intermediate results using variable names  variable names := program markers!!  But, we lost variable names after compilation  Extracting AST and variable names upon compile time  Using Scala Macros (Since Scala 2.10) xerial.org/silk Twitter @taroleo 13
  • 14.
    Scala Program (AST)to DAG Schedule (Logical Plan) Streaming Distributed Data Processing with Silk Program v2 Program v1 A  g B C Translating a program (AST) into a set of Silk operations (DAG)    f val B = MapOp(input:A, output:B, function:f) val C = MapOp(input:B, output:C, function:g) Operations in Silk can be nested  val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g) xerial.org/silk Twitter @taroleo 14
  • 15.
    Weaving Silks Streaming DistributedData Processing with Silk In-memory weaver Cluster weaver Result Hadoop weaver Silk[A] (operation DAG)  Weave Output Data analysis code is independent from weavers xerial.org/silk Twitter @taroleo 15
  • 16.
    Cluster Weaver: LogicalPlan to Physical Plan on Cluster Streaming Distributed Data Processing with Silk  Logical plan   GroupByOp(in:people, out:g, key: {_.dept.id}) Physical plan P1 Partition (hashing) S1 P1 S3 S1 P1 S1 S2 P2 P2 S2 S2 P2 S3 S2 P2 S1 S3 P3 P2 S2 S3 P3 P3 Scatter S2 P1 I3 P2 P3 I2 P1 P1 Silk[people] S1 P3 I1 S1 S3 S3 P3 serialization shuffle deserialization xerial.org/silk Twitter @taroleo R1 R2 R3 merge sort 16
  • 17.
    Local machine Local ClassBox Userprogram builds workflows Weaving Silk materializes objects classpaths & local jar files • • • • • Silk[A] Silk[A] read file, toSilk map, reduce, join, groupBy UNIX commands etc. SilkSingle[A] SilkSeq[A] weave weave Static optimization A DAG Schedule single object • • Cluster • • • • Dispatches tasks to clients Manages master resource table Authorizes resource allocation Automatic recovery by leader election in ZK Register ClassBox Submit schedule ZooKeeper ensemble mode (at least 3 ZK instances) Silk Master • • dispatch • • Silk Client Silk Client Task Scheduler Task Scheduler Task Executor Task Executor Resource Monitor Resource Monitor Data Server Data Server Leader election Collects locations of slices and ClassBox jars Watches active nodes Watches available resources Seq[A] sequence of objects Node Table Slice Table Task Status Resource Table (CPU, memory) ClassBox Table • • • • • • • • Submits tasks Run-time optimization Resource allocation Monitoring resource usage Launches Web UI Manages assigned task status Object serialization/deserialization Serves slice data xerial.org/silk Twitter @taroleo 17
  • 18.
    Static Optimization Streaming DistributedData Processing with Silk  Tree transformation     map(f).map(g) => map(g・f) (Function composition) map(f).filter(p) => mapWithFilter(f, p) (Reduces intermediate data) Pushing-down selection Retrieves only accessed fields in an object   Analyzing the byte code of functions with ASM Rewriting logical plans using pattern matching in Scala  Easy to add optimization rules xerial.org/silk Twitter @taroleo 18
  • 19.
    Run-time Optimization Streaming DistributedData Processing with Silk  Adjusting the number of data splits  According to the available cluster resources.  Multi-core execution  Omega-based task scheduler  Sharing the cluster resource table between nodes   Each node determines how to use the resource Monitoring actual CPU/memory resources periodically xerial.org/silk Twitter @taroleo 19
  • 20.
    UNIX Command Workflowsin Silk Streaming Distributed Data Processing with Silk  c”(UNIX Command)” xerial.org/silk Twitter @taroleo 20
  • 21.
    Buffer Management Streaming DistributedData Processing with Silk   Silk frequently uses distributed memory (like Spark) LArray[A]     Immediate memory deallocation (free)   To eliminate OutOfMemoryException and GC-stall Fast memory allocation   Allocating Off-heap (outside JVM heap)memories sun.misc.Unsafe Github: https://github.com/xerial/larray Skips zero-filling Object Serialization  Extending msgpack    Scala Pickling Inject ser/dser codes Off-heap objects xerial.org/silk Twitter @taroleo 21
  • 22.
    Summary Streaming Distributed DataProcessing with Silk  Silk  A framework for distributed data processing for all data scientists   Object-oriented data processing programming   Similar to query optimization in DBMS Analyze Data as You Write Programs!   Reuse, override and mix-in Optimizing data flow programs   including non-experts in distributed data processing (e.g. Biologists) Database research now enters program optimization. In Future  Workflow queries    Making queries against dataflow program Monitoring intermediate results Multi-user program execution xerial.org/silk Twitter @taroleo 22
  • 23.
    http://xerial.org/silk Streaming Distributed DataProcessing with Silk xerial.org/silk Twitter @taroleo 23