StoreApp: A Shared Storage Appliance for Efficient
and Scalable Virtualized Hadoop Clusters
LIU Kai
Email: kiwenlau@163.com
Blog: http://kiwenlau.com/
National Institute of Informatics, Japan
2015/6/27 LIU Kai, National Institute of Informatics
Contents
• Introduction (What?)
• Motivation (Why?)
• Implementation (How?)
• Personal Ideas
Introduction – What is StoreApp?
Background
• Hadoop (version 1): for big data storage and computation
  - Hadoop Distributed File System (HDFS): for storage
  - Hadoop MapReduce framework: for computation
• Master/slave architecture
  - Storage (DataNode) and computation (TaskTracker) are co-located on each slave node
[Figure: Hadoop 1 cluster. The master runs the NameNode and JobTracker; each slave runs a DataNode and a TaskTracker. Each node is a physical machine or a virtual machine.]
Overview
• What is StoreApp?
  - A Hadoop plugin
  - Speeds up Hadoop running in virtual machines
  - Separates storage (DataNode) from computation (TaskTracker)
[Figure: two physical machines. In default Hadoop, every virtual machine runs both a DataNode and a TaskTracker; with StoreApp, one virtual machine runs the DataNode and the others run only TaskTrackers.]
Benefit
• Improves HDFS throughput by 78.3%
  - The storage VM has a higher scheduling priority than the computation VMs
  - Consolidating storage into one VM reduces I/O contention
• Reduces job completion time by 61%
  - Most Hadoop jobs are data intensive
  - Their performance is bottlenecked by slow disk access
Motivation – Why do we need StoreApp?
Challenge 1
• Nodes can’t be added or removed easily
  - Rebalancing data incurs significant data movement
  - The elasticity of virtual machines cannot be utilized
[Figure: physical machines, each hosting several virtual machines; every virtual machine runs both a DataNode and a TaskTracker.]
Solution 1
• Separate storage from computation
  - Adding or removing computation nodes needs no data movement
  - The optimal number of computation nodes can be found for each Hadoop job
[Figure: physical machines, each hosting one virtual machine that runs a DataNode and several virtual machines that run only TaskTrackers.]
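The data-movement argument can be illustrated with a small model. This is only a sketch under stated assumptions: the block placements, the VM names, and the helper `replicas_to_move` are hypothetical and do not come from the paper.

```python
def replicas_to_move(block_locations, removed_vm):
    """Count the block replicas stored on the VM that is being removed.

    Each of these replicas would have to be copied to another node
    before the VM can leave the cluster.
    """
    return sum(removed_vm in holders for holders in block_locations.values())


# Co-located layout (hypothetical): every VM stores blocks and runs tasks.
colocated = {
    "b0": {"vm1", "vm4"},
    "b1": {"vm2", "vm5"},
    "b2": {"vm3", "vm6"},
    "b3": {"vm1", "vm5"},
}

# Separated layout (hypothetical): only the storage VMs hold blocks,
# the computation VMs hold none.
separated = {
    "b0": {"storage1", "storage2"},
    "b1": {"storage2", "storage3"},
    "b2": {"storage1", "storage3"},
    "b3": {"storage1", "storage2"},
}

moved_colocated = replicas_to_move(colocated, "vm1")        # vm1 holds b0 and b3
moved_separated = replicas_to_move(separated, "compute3")   # holds no blocks

print(moved_colocated, moved_separated)
```

In this toy placement, removing a co-located VM forces two replicas to move, while removing a computation-only VM moves none.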
Challenge 2
• Co-located virtual machines often access the disk concurrently
  - Random I/O operations compete with each other
  - Hadoop job performance degrades significantly
Solution 2
• Separate storage from computation
  - Each physical machine has only one storage virtual machine
  - Only the storage virtual machine is I/O intensive
  - No serious concurrent I/O operations
Challenge 3
• Virtual machines can’t be scheduled efficiently
  - I/O-intensive VMs can be prioritized since they consume less CPU
  - However, every VM is I/O intensive!
Solution 3
• Separate storage from computation
  - Only the storage virtual machine is I/O intensive
  - The storage virtual machine receives a higher priority
Implementation – How to design StoreApp?
Architecture
• A StoreApp manager and multiple storage nodes
  - The StoreApp manager runs on the master node
  - Each physical machine has one storage node
Components
• StoreApp manager
  - Coordinates the operations of all data nodes
• Scheduler
  - Schedules tasks according to data locations
• HDFS Proxy
  - Receives all HDFS requests and forwards them to the DataNode
• Shuffler
  - Receives map output and pushes it to the DataNode
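Scheduling by data location can be sketched as follows. This is a minimal illustration, not StoreApp's scheduler: the function `pick_tasktracker`, the host names, and the slot counts are all hypothetical, and the sketch assumes that "local" means a TaskTracker VM on the physical machine whose storage node holds the block.

```python
def pick_tasktracker(block_host, free_slots, vm_host):
    """Choose a TaskTracker VM for a task that reads one block.

    block_host: physical host whose storage node holds the block
    free_slots: TaskTracker VM name -> number of free task slots
    vm_host:    TaskTracker VM name -> physical host it runs on
    """
    candidates = [vm for vm, slots in free_slots.items() if slots > 0]
    # Prefer a VM on the same physical machine as the data.
    for vm in candidates:
        if vm_host[vm] == block_host:
            return vm
    # Otherwise fall back to any VM with a free slot, or None if all are busy.
    return candidates[0] if candidates else None


vm_host = {"tt1": "hostA", "tt2": "hostA", "tt3": "hostB"}
free_slots = {"tt1": 0, "tt2": 1, "tt3": 2}

local_choice = pick_tasktracker("hostA", free_slots, vm_host)
remote_choice = pick_tasktracker("hostC", free_slots, vm_host)
print(local_choice, remote_choice)
```

Here `tt1` is skipped because it has no free slot, so the task for a block on `hostA` goes to `tt2`; a block on a host with no TaskTracker falls back to the first VM with a free slot.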
HDFS Prefetching
• Reads the whole block b1 instead of only the partial records needed
• Unused data of block b1 is kept in memory
• Consecutive blocks are read into memory to form input split s1
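The prefetching idea can be sketched with a small in-memory model. This is an illustration under assumptions, not the StoreApp code: the class `PrefetchingProxy`, its methods, and the 4-byte blocks are hypothetical, and the cache never evicts, which a real proxy would need to do.

```python
class PrefetchingProxy:
    """Toy proxy that always reads whole blocks and keeps them in memory."""

    def __init__(self, blocks, block_size):
        self.blocks = blocks          # list of bytes, stand-in for on-disk blocks
        self.block_size = block_size
        self.cache = {}               # block index -> block contents in memory
        self.disk_reads = 0           # number of whole-block disk reads

    def _block(self, index):
        # Read the whole block from "disk" only if it is not cached yet.
        if index not in self.cache:
            self.disk_reads += 1
            self.cache[index] = self.blocks[index]
        return self.cache[index]

    def read_split(self, offset, length):
        """Return the bytes of one input split, which may span several blocks."""
        end = offset + length
        out = bytearray()
        pos = offset
        while pos < end:
            index, start = divmod(pos, self.block_size)
            take = min(self.block_size - start, end - pos)
            out += self._block(index)[start:start + take]
            pos += take
        return bytes(out)


proxy = PrefetchingProxy([b"aaaa", b"bbbb", b"cccc"], block_size=4)
s0 = proxy.read_split(0, 6)   # needs b0 and the first part of b1
s1 = proxy.read_split(6, 6)   # reuses the cached rest of b1, then reads b2
print(s0, s1, proxy.disk_reads)
```

In this toy run the two splits touch four block ranges but cause only three disk reads, because the second half of b1 is served from memory.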
Automated Cluster Resizing
• Dynamically changes the cluster size during job execution
• An iterative algorithm searches for the optimal cluster size
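The slide does not reproduce the resizing algorithm, so the following is not the paper's method. It is a generic hill-climbing sketch of what an iterative search for a cluster size could look like; `search_cluster_size`, `toy_rate`, and all parameters are hypothetical, and the rate function stands in for a measurement taken while the job runs.

```python
def search_cluster_size(measure_rate, start, min_size, max_size, step=1):
    """Hill-climb over the number of computation nodes.

    measure_rate(n) is assumed to return the observed job progress rate
    with n computation nodes (higher is better).
    """
    best_size, best_rate = start, measure_rate(start)
    improved = True
    while improved:
        improved = False
        # Try growing first, then shrinking; keep the first change that helps.
        for candidate in (best_size + step, best_size - step):
            if min_size <= candidate <= max_size:
                rate = measure_rate(candidate)
                if rate > best_rate:
                    best_size, best_rate = candidate, rate
                    improved = True
                    break
    return best_size


def toy_rate(n):
    # Hypothetical curve: more nodes help at first, then overhead dominates.
    return n / (1 + 0.02 * n * n)


best = search_cluster_size(toy_rate, start=2, min_size=1, max_size=16)
print(best)  # 7 for this toy curve
```

Because computation nodes hold no data, each step of such a search only starts or stops TaskTracker VMs and moves no blocks.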
Personal Ideas
Pros and cons
• Pros
  - A simple idea that shows good results
  - Shows clear logic in locating and solving problems
• Cons
  - Restricted to Hadoop 1
  - Not open source
Future direction
• From Hadoop 1 to Hadoop 2
  - Hadoop 2 is quite different from Hadoop 1
  - Hadoop 2 supports more application frameworks, such as Spark
• From virtual machines to containers
  - Containers are a more lightweight virtualization technology
  - Containers are more resource efficient than virtual machines
  - Containers are easier to scale than virtual machines
References
Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and
Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015
Thank you!

Editor's Notes

  • #2 Today, I want to introduce a paper about Hadoop. For anyone working in cloud computing, Hadoop is an old friend, and I am quite familiar with it myself. That is why I chose this paper.
  • #3 Here are the contents of my presentation. First, I will introduce the paper briefly. Then I will talk about the motivation. After that, I will give some details of the StoreApp implementation. These parts correspond to three questions: What is StoreApp? Why do we need StoreApp? How is StoreApp designed? Finally, I will discuss some personal ideas.
  • #4 First, let’s look at the introduction.
  • #5 Since this paper focuses on Hadoop, some background on Hadoop is necessary. Hadoop version 1 and version 2 are quite different. This paper focuses on Hadoop version 1, so I will introduce version 1. Hadoop is for big data storage and computation. It consists of two components: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce framework. HDFS is for storage and MapReduce is for computation. Hadoop has a master/slave architecture: a Hadoop cluster has one master and multiple slaves. Normally, a DataNode and a TaskTracker are co-located on the same slave node. The DataNode is in charge of storage and the TaskTracker handles computation. If slave nodes run on physical machines, it is fine to run the DataNode and TaskTracker together. However, when slave nodes run in virtual machines, the situation is quite different, because multiple virtual machines run on the same physical machine. This causes some problems for Hadoop, and this paper concentrates on solving them. I will talk about this in detail.
  • #6 Simply put, StoreApp, which is proposed in the paper, is a Hadoop plugin. It speeds up Hadoop running in virtual machines. The main function of StoreApp is to separate storage from computation, which improves Hadoop performance. The picture shows the basic idea of StoreApp. Each physical machine hosts multiple virtual machines, and the DataNode and TaskTracker run inside virtual machines. In the default Hadoop setting, a DataNode and a TaskTracker run together in the same virtual machine. With StoreApp, the DataNode and the TaskTrackers run in different virtual machines. In addition, each physical machine runs only one DataNode, which is in charge of data storage.
  • #7 StoreApp brings clear benefits to Hadoop. It improves HDFS throughput by 78.3%. There are two reasons for this improvement. First, one storage VM and multiple computation VMs run on the same physical machine, and the storage VM has a higher scheduling priority than the computation VMs. Second, if all VMs run a DataNode, the VMs compete with each other for disk access, so consolidating storage into one VM reduces I/O contention. StoreApp also reduces job completion time by 61%. Most Hadoop jobs are bottlenecked by slow disk access, so improving I/O performance reduces job completion time.
  • #8 Next, let’s look at the motivation for StoreApp. This helps us understand why we need StoreApp.
  • #9 If we run the DataNode and TaskTracker together, the first challenge is that we can’t add or remove nodes easily. When we remove a node, we need to move its data to other nodes. When we add a node, we need to move data from other nodes to the new node. Because rebalancing data incurs significant data movement, it is not a good idea to change the cluster size too often. Therefore, we cannot utilize the elasticity of virtual machines.
  • #10 The solution to the first challenge is to separate storage from computation by running the DataNode and TaskTracker independently. The computation nodes then store no data, so adding or removing a computation node requires no data movement. We can therefore find the optimal number of computation nodes for each Hadoop job and change the cluster size dynamically, which helps improve performance.
  • #11 The second challenge is that co-located virtual machines often access the disk concurrently, because all virtual machines are in charge of data storage. These random I/O operations compete with each other, which significantly degrades Hadoop job performance.
  • #12 The solution to the second challenge is also to separate storage from computation. Because each physical machine has only one storage virtual machine, and only that virtual machine is I/O intensive, there are no serious concurrent I/O operations.
  • #13 The third challenge is that we can’t schedule virtual machines efficiently. I/O-intensive VMs can be prioritized because they consume less CPU, and this helps improve I/O performance. However, when the DataNode and TaskTracker run together, every VM is I/O intensive, so the virtual machines cannot be scheduled properly.
  • #14 Separating storage from computation also solves the third challenge. Since only the storage virtual machine is I/O intensive, it can receive a higher priority. Therefore, I/O performance can be improved.
  • #15 Next, I want to talk about the implementation of StoreApp.
  • #16 This picture shows the architecture of StoreApp. StoreApp consists of a StoreApp manager and multiple storage nodes. The StoreApp manager runs on the master node, and each physical machine has one storage node.
  • #17 As shown in the picture, StoreApp has four main components: the StoreApp manager, the scheduler, the HDFS Proxy and the shuffler. The StoreApp manager coordinates the operations of all data nodes. The scheduler schedules tasks according to data locations. The HDFS Proxy receives all HDFS requests and forwards them to the DataNode. The shuffler receives map output and pushes it to the DataNode.
  • #18 StoreApp uses the HDFS Proxy to create the input split for each task. Creating an input split often requires multiple data blocks. For example, creating input split s0 needs block b0 and part of block b1. StoreApp reads the whole of block b1 instead of only the needed part. The unused data of block b1 is kept in memory and used to create input split s1. This reduces disk accesses and improves Hadoop performance.
  • #19 StoreApp also implements automated cluster resizing using the algorithm shown in the picture. It dynamically changes the cluster size during job execution. The iterative algorithm searches for the optimal cluster size.
  • #20 Finally, I want to share some personal ideas about this paper.
  • #21 In my opinion, this paper has both pros and cons. On the one hand, the basic idea of StoreApp is very simple, just separating storage from computation, but it shows good results, and the paper shows clear logic in locating and solving problems. On the other hand, StoreApp is restricted to Hadoop 1, and it is not open source, which makes it hard to put into real use.
  • #22 In my view, there are some future directions for StoreApp. First, StoreApp is restricted to Hadoop 1. However, Hadoop 2 is quite different from Hadoop 1 and supports more application frameworks, such as Spark. If StoreApp supported Hadoop 2, it would be more successful. Second, containers are a more lightweight virtualization technology. They are also more resource efficient and easier to scale than virtual machines. I believe it is promising to apply the idea of StoreApp to Hadoop running in containers.
  • #24 That’s all, thank you!