Store app a shared storage appliance for efficient and scalable virtualized hadoop clusters

StoreApp: A Shared Storage Appliance for Efficient
and Scalable Virtualized Hadoop Clusters
LIU Kai
Email: kiwenlau@163.com
Blog: http://kiwenlau.com/
National Institute of Informatics, Japan
2015/6/27 LIU Kai, National Institute of Informatics

Contents
 Introduction (What?)
 Motivation (Why?)
 Implementation (How?)
 Personal Ideas

Introduction – What is StoreApp?

Background
 Hadoop (version 1): for big data storage and computation
 Hadoop Distributed File System (HDFS): for storage
 Hadoop MapReduce Framework: for computation
 Master/Slave Architecture
 Storage (DataNode) and computation (TaskTracker) are co-located on each node
[Figure: Hadoop 1 master/slave architecture. The Master runs the NameNode and JobTracker; each Slave runs a DataNode and a TaskTracker. Each node is a physical machine or a virtual machine.]

Overview
 What is StoreApp?
 A Hadoop plugin
 For speeding up Hadoop running in virtual machines
 Separate storage (DataNode) from computation (TaskTracker)
[Figure: Two physical machines. In the conventional layout, every virtual machine runs a DataNode and a TaskTracker. With StoreApp, one virtual machine runs the DataNode and the other virtual machines run only a TaskTracker.]
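The separation described above can be modeled in a few lines. This is only an illustrative sketch: the class and function names (`VM`, `PhysicalMachine`, `colocated_layout`, `storeapp_layout`) are hypothetical and are not part of Hadoop or StoreApp.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    roles: tuple  # Hadoop daemons hosted by this VM


@dataclass
class PhysicalMachine:
    name: str
    vms: list = field(default_factory=list)

    def io_intensive_vms(self):
        """VMs that host a DataNode and therefore hit the local disk."""
        return [vm for vm in self.vms if "DataNode" in vm.roles]


def colocated_layout(name, n_vms):
    """Conventional layout: every VM runs DataNode + TaskTracker."""
    vms = [VM(f"{name}-vm{i}", ("DataNode", "TaskTracker")) for i in range(n_vms)]
    return PhysicalMachine(name, vms)


def storeapp_layout(name, n_compute_vms):
    """Separated layout (assumed shape): one storage VM, the rest compute only."""
    vms = [VM(f"{name}-storage", ("DataNode",))]
    vms += [VM(f"{name}-compute{i}", ("TaskTracker",)) for i in range(n_compute_vms)]
    return PhysicalMachine(name, vms)


before = colocated_layout("pm1", 3)
after = storeapp_layout("pm1", 3)
print(len(before.io_intensive_vms()))  # 3
print(len(after.io_intensive_vms()))   # 1
```

In this toy model the number of disk-bound VMs per physical machine drops from one per VM to exactly one, which is the property the later slides rely on.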

Benefit
 Improve HDFS throughput by 78.3%
 The storage VM has higher scheduling priority than the computation VMs
 Consolidating storage into one VM reduces I/O contention
 Reduce job completion time by 61%
 Most Hadoop jobs are data intensive
 Their performance is bottlenecked by slow disk access

Motivation – Why do we need StoreApp?

Challenge 1
 Can’t add or remove nodes easily
 Rebalancing data incurs significant data movement
 Cannot utilize the elasticity of virtual machines
[Figure: Several physical machines, each hosting multiple virtual machines. Every virtual machine runs both a DataNode and a TaskTracker.]

Solution 1
 Separate storage from computation
 Adding or removing a computation node needs no data movement
 Find the optimal number of computation nodes for each Hadoop job
[Figure: Several physical machines, each hosting one DataNode virtual machine and several TaskTracker virtual machines.]
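A toy comparison of the data movement caused by removing a node in each layout. This is a sketch under simplifying assumptions: block replication is ignored, orphaned blocks are re-homed round-robin, and the function names are hypothetical rather than Hadoop APIs.

```python
def remove_colocated_node(placement, node):
    """Remove `node` from a co-located cluster and re-home its blocks.

    `placement` maps node name -> set of block ids. Returns blocks moved.
    """
    orphaned = sorted(placement.pop(node))
    targets = sorted(placement)
    if orphaned and not targets:
        raise ValueError("no remaining node can hold the data")
    for i, block in enumerate(orphaned):
        placement[targets[i % len(targets)]].add(block)
    return len(orphaned)


def remove_compute_node(compute_nodes, node):
    """Remove a compute-only node; it holds no HDFS blocks, so nothing moves."""
    compute_nodes.discard(node)
    return 0


colocated = {"vm0": {"b0", "b1"}, "vm1": {"b2", "b3"}, "vm2": {"b4", "b5"}}
print(remove_colocated_node(colocated, "vm2"))  # 2

compute = {"c0", "c1", "c2"}
print(remove_compute_node(compute, "c2"))  # 0
```

With storage held in a separate node, shrinking or growing the set of computation nodes only changes the membership set.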

Challenge 2
 Co-located virtual machines often access the disk concurrently
 Their random I/O operations compete with each other
 This significantly degrades Hadoop job performance
[Figure: Several physical machines, each hosting multiple virtual machines. Every virtual machine runs both a DataNode and a TaskTracker.]

Solution 2
 Each physical machine has only one storage virtual machine
 Only the storage Virtual Machine is IO intensive
 No serious concurrent IO operations
[Figure: Several physical machines, each hosting one DataNode virtual machine and several TaskTracker virtual machines.]

Challenge 3
 Can’t schedule Virtual Machines efficiently
 IO intensive VMs can be prioritized since they consume less CPU
 However, every VM is IO intensive!
[Figure: Several physical machines, each hosting multiple virtual machines. Every virtual machine runs both a DataNode and a TaskTracker.]

Solution 3
 Only the storage Virtual Machine is IO intensive
 The storage Virtual Machine will receive a higher priority
[Figure: Several physical machines, each hosting one DataNode virtual machine and several TaskTracker virtual machines.]
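The effect of giving the storage virtual machine a higher priority can be illustrated with a toy strict-priority run queue. This is not the hypervisor's actual scheduler; the priority constants and the `dispatch_order` function are assumptions made only for this sketch.

```python
import heapq

# Lower number = dispatched first (a convention assumed for this sketch).
STORAGE, COMPUTE = 0, 1


def dispatch_order(vms):
    """Return VM names in the order a strict-priority queue would run them.

    `vms` is a list of (name, priority) pairs. The arrival sequence number
    breaks ties so that equal-priority VMs keep their arrival order.
    """
    queue = [(prio, seq, name) for seq, (name, prio) in enumerate(vms)]
    heapq.heapify(queue)
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]


order = dispatch_order(
    [("compute0", COMPUTE), ("storage", STORAGE), ("compute1", COMPUTE)]
)
print(order)  # ['storage', 'compute0', 'compute1']
```

If every VM were I/O intensive, all entries would share the same priority and the queue would degrade to arrival order, which is the situation described in Challenge 3.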

Implementation – How to design StoreApp?

Architecture
 A StoreApp manager and multiple storage nodes
 The StoreApp manager runs on the master node
 Each physical machine has one storage node

Components
 StoreApp manager
 Coordinates the operations of all data nodes
 Scheduler
 Schedules tasks according to data locations
 HDFS Proxy
 Receives all HDFS requests and forwards them to the DataNode
 Shuffler
 Receives map output and pushes it to the DataNode
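Scheduling tasks according to data locations could look roughly like the following. This is a minimal sketch, not StoreApp's scheduler: `pick_task`, the task and block ids, and the fallback rule (take the oldest pending task when no local one exists) are all assumptions for illustration.

```python
def pick_task(pending_tasks, block_locations, free_machine):
    """Choose a task for a free slot on `free_machine`.

    pending_tasks:   list of (task_id, block_id) in submission order
    block_locations: dict block_id -> set of physical machines storing it
    Returns (task_id, is_data_local), or (None, False) if nothing is pending.
    """
    for task_id, block_id in pending_tasks:
        if free_machine in block_locations.get(block_id, ()):
            return task_id, True  # input block is stored on this machine
    if pending_tasks:
        return pending_tasks[0][0], False  # fall back to a remote read
    return None, False


pending = [("t0", "b0"), ("t1", "b1")]
locations = {"b0": {"pm1"}, "b1": {"pm2"}}

print(pick_task(pending, locations, "pm2"))  # ('t1', True)
print(pick_task(pending, locations, "pm3"))  # ('t0', False)
```

Because each physical machine has one storage node, "local" here means the task runs on the same physical machine as the storage node holding its block.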

HDFS Prefetching
 Read the whole block b1 instead of only the needed records
 Unused data from block b1 is kept in memory
 Read consecutive blocks into memory to form input split s1
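A minimal sketch of the whole-block prefetching idea: the first request for any part of a block loads the entire block, and later requests for the same block are served from memory. The class `PrefetchingReader`, its `fetch_block` callback, and the in-memory `blocks` dictionary are hypothetical stand-ins, not HDFS APIs.

```python
class PrefetchingReader:
    """Serve partial reads from whole blocks cached in memory."""

    def __init__(self, fetch_block):
        # fetch_block: callable block_id -> bytes (stands in for a disk read)
        self._fetch_block = fetch_block
        self._cache = {}
        self.disk_reads = 0

    def read(self, block_id, offset, length):
        if block_id not in self._cache:
            # Read the whole block, not just the requested records.
            self._cache[block_id] = self._fetch_block(block_id)
            self.disk_reads += 1
        return self._cache[block_id][offset:offset + length]


blocks = {"b1": b"record0record1"}
reader = PrefetchingReader(blocks.__getitem__)

print(reader.read("b1", 0, 7))  # b'record0'
print(reader.read("b1", 7, 7))  # b'record1'
print(reader.disk_reads)        # 1
```

Two tasks reading different records of b1 cause a single disk access in this model. A real implementation would also need to bound the cache size, which this sketch omits.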

Automated Cluster Resizing
 Dynamically change the cluster size during job execution
 An iterative algorithm searches for the optimal cluster size
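The slides do not spell out the iterative algorithm, so the following is only one plausible shape for such a search: a simple hill climb over the number of computation nodes. The function `search_cluster_size`, its parameters, and the cost model in the usage example are assumptions, not the paper's method.

```python
def search_cluster_size(measure_time, start, min_size, max_size, step=1):
    """Hill-climb over cluster size until no neighboring size is faster.

    measure_time: callable size -> estimated job time (hypothetical probe,
                  e.g. derived from observed task progress)
    """
    size = start
    best = measure_time(size)
    while True:
        neighbors = [s for s in (size - step, size + step)
                     if min_size <= s <= max_size]
        timed = [(measure_time(s), s) for s in neighbors]
        if not timed:
            return size
        t, s = min(timed)
        if t >= best:
            return size  # local optimum: neither neighbor improves
        size, best = s, t


# Toy cost model: work is split across n nodes, but each node adds overhead.
def toy_time(n):
    return 100 / n + 2 * n


print(search_cluster_size(toy_time, start=2, min_size=1, max_size=20))  # 7
```

Since adding or removing computation nodes needs no data movement in the separated layout, each probe of a neighboring size is cheap, which is what makes this kind of search practical during job execution.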

Personal Ideas

Pros and cons
 Pros
 A simple idea that shows good results
 Clear logic in identifying and solving problems
 Cons
 Restricted to Hadoop 1
 Not open source

Future direction
 From Hadoop 1 to Hadoop 2
 Hadoop 2 is quite different from Hadoop 1
 Hadoop 2 can support more application frameworks, such as Spark
 From Virtual Machine to container
 A container is a more lightweight virtualization technology
 A container is more resource efficient than a virtual machine
 A container is easier to scale than a virtual machine

References
Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and
Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015

Thank you!
