Tachyon: An Open Source Memory-Centric Distributed Storage System

Haoyuan Li, Tachyon Nexus 
haoyuan@tachyonnexus.com 
September 30, 2015 @ Strata and Hadoop World NYC 2015

Outline
•  Open Source
•  Introduction to Tachyon
•  New Features
•  Getting Involved


History
•  Started at UC Berkeley AMPLab
–  From summer 2012
–  Same lab produced Apache Spark and Apache Mesos
•  Open sourced
–  April 2013
–  Apache License 2.0
–  Latest Release: Version 0.7.1 (August 2015)
•  Deployed at > 100 companies

Contributors Growth
[Chart: number of contributors at each release]
•  v0.1 (Dec '12): 1
•  v0.2 (Apr '13): 3
•  v0.3 (Oct '13): 15
•  v0.4 (Feb '14): 30
•  v0.5 (Jul '14): 46
•  v0.6 (Mar '15): 70
•  v0.7 (Jul '15): 111

Contributors Growth
•  > 150 contributors (a 3x increase since the last Strata NYC)
•  > 50 organizations

Contributors Growth
One of the fastest-growing big data open source projects

Thanks to Contributors and Users!

One Tachyon Production Deployment Example
•  Baidu (dominant search engine in China, ~50 billion USD market cap)
•  Framework: SparkSQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  100+ nodes deployment
•  1PB+ managed space
•  30x Performance Improvement


Tachyon is an open source, memory-centric, distributed storage system.

Performance Trend: Memory is Fast
•  RAM throughput is increasing exponentially
•  Disk throughput is increasing slowly
•  Memory locality is key to interactive response times

Price Trend: Memory is Cheaper
[Chart: memory price over time. Source: jcmit.com]

Missing a Solution for the Storage Layer

A Use Case Example with Spark
•  Fast, in-memory data processing framework
–  Keeps one in-memory copy inside the JVM
–  Tracks the lineage of operations used to derive data
–  Upon failure, uses lineage to recompute data
[Figure: Lineage Tracking across a chain of map, filter, map, join, and reduce operations]
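
The lineage idea in the bullets above can be sketched in a few lines of Python. This is an illustration only, not Spark's or Tachyon's implementation: the `Dataset` class and all of its methods are hypothetical names invented for this example. Each dataset remembers its parent and the operation that derived it, keeps at most one in-memory copy, and rebuilds that copy from lineage after a simulated failure.

```python
from typing import Callable, List, Optional


class Dataset:
    """Hypothetical dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None,
                 name: str = "source"):
        self.source = source    # durable input, set only on the root dataset
        self.parent = parent    # dataset this one was derived from
        self.op = op            # operation applied to the parent's rows
        self.name = name
        self.cache: Optional[List[int]] = None  # the single in-memory copy

    def map(self, f: Callable[[int], int]) -> "Dataset":
        return Dataset(parent=self, op=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred: Callable[[int], bool]) -> "Dataset":
        return Dataset(parent=self, op=lambda rows: [r for r in rows if pred(r)], name="filter")

    def lineage(self) -> List[str]:
        """Names of the operations from the root down to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def collect(self) -> List[int]:
        """Return the rows, recomputing from lineage if the cached copy is missing."""
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.op(self.parent.collect())
        return self.cache

    def lose_cache(self) -> None:
        """Simulate a failure that wipes the in-memory copies along the whole chain."""
        self.cache = None
        if self.parent is not None:
            self.parent.lose_cache()


root = Dataset(source=[1, 2, 3, 4, 5, 6])
even_squares = root.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_cache()        # failure: every in-memory copy is gone
after = even_squares.collect()   # rebuilt by replaying map and filter from the source
print(even_squares.lineage(), after)
```

Only the recipe (the lineage) has to survive a failure, which is why a single in-memory copy is enough in this model.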

Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: Spark Job1 and Spark Job2 each cache blocks 1 and 3 in their own Spark memory (block manager) and share data only through HDFS / Amazon S3, which holds blocks 1 to 4. Label: storage engine & execution engine same process (slow writes).]

Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: a Spark job (blocks 1 and 3 in its Spark memory block manager) and a Hadoop MR job on YARN share data only through HDFS / Amazon S3, which holds blocks 1 to 4. Label: storage engine & execution engine same process (slow writes).]

Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks
[Figure: a Spark job and a Hadoop MR job on YARN share blocks 1, 3, and 4 held in memory by Tachyon, with HDFS / Amazon S3 holding blocks 1 to 4 on disk. Label: execution engine & storage engine same process (fast writes).]

Issue 2
Cache loss when process crashes
[Figure sequence: (1) a Spark task caches blocks 1 and 3 in its Spark memory block manager, with blocks 1 to 4 in HDFS / Amazon S3; (2) the task crashes; (3) only the copies in HDFS / Amazon S3 remain. Label: storage engine same process.]

Keep in-memory data safe, even when a job crashes.
[Figure sequence: (1) a Spark task with its Spark memory block manager runs alongside Tachyon, which holds blocks 1, 3, and 4 in memory, with blocks 1 to 4 on disk in HDFS / Amazon S3; (2) the task crashes, and Tachyon still holds blocks 1, 3, and 4 in memory. Label: storage engine same process.]

Issue 3
In-memory Data Duplication & Java Garbage Collection
[Figure: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in Spark memory (block manager); HDFS / Amazon S3 holds blocks 1 to 4. Label: storage engine same process (duplication & GC).]

No in-memory data duplication, much less GC
[Figure: Spark Job1 and Spark Job2 share a single in-memory copy of blocks 1, 3, and 4 held by Tachyon; HDFS / Amazon S3 holds blocks 1 to 4 on disk. Label: storage engine same process (no duplication & GC).]

Previously Mentioned
•  A memory-centric storage architecture
•  Push lineage down to storage layer

Tachyon Memory-Centric Architecture


1) Ecosystem:
•  Enable new workloads on any storage
•  Work with the framework of your choice

2) Tachyon runs in production environments, both in the cloud and on premises.

Use Case: Baidu
•  Framework: SparkSQL
•  Under Storage: Baidu’s File System
•  Storage Media: MEM + HDD
•  100+ nodes deployment
•  1PB+ managed space

Use Case: a SaaS Company
•  Framework: Impala
•  Under Storage: S3
•  Storage Media: MEM + SSD

Use Case: an Oil Company
•  Framework: Spark
•  Under Storage: GlusterFS
•  Storage Media: MEM only
•  Analyzing data in traditional storage

Use Case: a SaaS Company
•  Framework: Spark
•  Under Storage: S3
•  Storage Media: SSD only
•  Elastic Tachyon deployment

What if data size exceeds memory capacity?

3) Tiered Storage: Tachyon Manages More Than DRAM
[Figure: storage tiers MEM, SSD, HDD; faster toward MEM, higher capacity toward HDD]

Configurable Storage Tiers
•  MEM only
•  MEM + HDD
•  SSD only

4) Pluggable Data Management Policy
•  Evict stale data to a lower tier
•  Promote hot data to an upper tier
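
A minimal sketch of how eviction to a lower tier and promotion to an upper tier might fit together. It is an illustration under stated assumptions, not Tachyon's API: the `TieredStore` class, the per-tier capacities counted in blocks, and the choice of LRU eviction with promote-on-read are all hypothetical. Since the policy is described as pluggable, LRU here stands in for just one possible policy.

```python
from collections import OrderedDict
from typing import List, Optional


class TieredStore:
    """Hypothetical tiered block store: tier 0 is the fastest, the last tier the largest."""

    def __init__(self, names: List[str], capacities: List[int]):
        self.names = names
        self.capacities = capacities  # maximum number of blocks per tier (assumed unit)
        # One OrderedDict per tier; the least recently used block comes first.
        self.tiers = [OrderedDict() for _ in names]

    def _insert(self, level: int, block_id: str, data: bytes) -> None:
        """Place a block in a tier, pushing that tier's LRU block down if it is full."""
        tier = self.tiers[level]
        if len(tier) >= self.capacities[level]:
            victim_id, victim = tier.popitem(last=False)  # least recently used block
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim)  # evict to the lower tier
            # Otherwise the bottom tier is full and the victim leaves the store.
        tier[block_id] = data

    def write(self, block_id: str, data: bytes) -> None:
        """Write a block into the top tier, replacing any older copy."""
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def read(self, block_id: str) -> Optional[bytes]:
        """Return a block and promote it to the top tier, since it is now hot."""
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)
                return data
        return None

    def location(self, block_id: str) -> Optional[str]:
        """Name of the tier that currently holds the block, if any."""
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredStore(["MEM", "SSD", "HDD"], [2, 2, 4])
for block in ("block1", "block2", "block3"):
    store.write(block, block.encode())

stale_location = store.location("block1")  # pushed out of MEM by block3
store.read("block1")                       # hot again, so it is promoted
hot_location = store.location("block1")
print(stale_location, hot_location, store.location("block2"))
```

With a MEM tier of two blocks, writing a third block pushes the oldest one down to SSD, and reading it back promotes it while the next-oldest block takes its place on SSD.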

More Features
•  7) Remote Write Support
•  8) Easy deployment with Mesos and YARN
•  9) Initial Security Support
•  10) One Command Cluster Deployment
•  11) Metrics Reporting for Clients, Workers, and Master

12) Support for More Under Storage Systems


Memory-Centric Distributed Storage
Welcome to try, contact, and collaborate!
JIRA New Contributor Tasks

Tachyon Nexus
•  Team consists of Tachyon creators and top contributors
•  Series A ($7.5 million) from Andreessen Horowitz
•  Committed to Tachyon Open Source

Strata NYC 2015
•  Visit us at booth #P18.
•  Check out other Tachyon-related talks.
–  First-ever scalable, distributed deep learning architecture using Spark and Tachyon
•  Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.)
•  2:05pm–2:45pm Thursday, 10/01/2015
–  Faster time to insight using Spark, Tachyon, and Zeppelin
•  Nirmal Ranganathan (Rackspace Hosting)
•  2:05pm–2:45pm Thursday, 10/01/2015

•  Try Tachyon: http://tachyon-project.org
•  Develop Tachyon: https://github.com/amplab/tachyon
•  Meet Friends: http://www.meetup.com/Tachyon
•  Get News: http://goo.gl/mwB2sX
•  Tachyon Nexus: http://www.tachyonnexus.com
•  Contact us: haoyuan@tachyonnexus.com
