Apache Hadoop, HDFS and MapReduce Overview
Nisanth Simon
Agenda

Motivation behind Hadoop
− A different approach to distributed computing
− The MapReduce paradigm in general

Hadoop Overview
− Hadoop Distributed File System
− MapReduce engine
− MapReduce framework

Walk-through of a first MR job
Data Explosion

Modern systems have to deal with far more data than in
the past. Many organizations are generating data at a rate
of terabytes per day.
Facebook – over 15 PB of data
eBay – over 5 PB of data
Telecom Industry
Hardware improvements through the years...

CPU Speeds:
− 1990 - 44 MIPS at 40 MHz
− 2000 - 3,561 MIPS at 1.2 GHz
− 2010 - 147,600 MIPS at 3.3 GHz

RAM
− 1990 – 640K conventional memory (256K extended memory recommended)
− 2000 – 64MB memory
− 2010 - 8-32GB (and more)

Disk Capacity
− 1990 – 20MB
− 2000 - 1GB
− 2010 – 1TB

Disk transfer rate (speed of reads and writes) – not much improvement in the last 7-10 years, currently around
70–80 MB/sec
How long will it take to read 1 TB of data?

1 TB (at 80 MB/sec):
− 1 disk - 3.4 hours
− 10 disks - 20 min
− 100 disks - 2 min
− 1000 disks - 12 sec
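The single-disk arithmetic: 1 TB ≈ 1,000,000 MB, and 1,000,000 MB ÷ 80 MB/sec = 12,500 sec ≈ 3.5 hours.
Each ten-fold increase in disks divides the time by ten, because the disks are read in parallel.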

Distributed Data Processing is the answer!
Distributed computing is not new

HPC and Grid computing
− Move data to computation: network bandwidth becomes a bottleneck; compute nodes sit
idle

Works well for compute-intensive jobs
− Exchanging data requires synchronization – very tricky
− Scalability is the programmer’s responsibility

Scaling will require changes to the job implementation

Hadoop’s approach
− Move computation to data: data locality conserves network bandwidth
− Shared-nothing architecture: no dependencies between tasks
− Communication between nodes is the framework’s responsibility
− Designed for scalability

Adding increased load to a system should not cause outright failure, but a
graceful decline

Increasing resources should support a proportional increase in load capacity

Without modifying the job implementation
MapReduce Paradigm

Calculate the number of occurrences of each word in this book
− Hadoop: The Definitive Guide, Third Edition

623 Pages
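Conceptually, the job maps each line of the book to (word, 1) pairs, groups the pairs by word, and reduces each group by summing. An illustrative sketch with made-up sample lines:

Input lines:            "the quick brown fox"  /  "the lazy dog"
Map emits (word, 1):    (the,1) (quick,1) (brown,1) (fox,1) (the,1) (lazy,1) (dog,1)
Shuffle groups by key:  (brown,[1]) (dog,[1]) (fox,[1]) (lazy,[1]) (quick,[1]) (the,[1,1])
Reduce sums per key:    (brown,1) (dog,1) (fox,1) (lazy,1) (quick,1) (the,2)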
Apache Hadoop

A scalable fault-tolerant distributed system for data storage and processing
(open source under the Apache license).
− Meant for heterogeneous commodity hardware

Inspired by Google technologies
− MapReduce
− Google file system

Originally built to address scalability problems of Nutch, an open source Web
search technology
− Developed by Douglass Read Cutting (Doug Cutting)

Core Hadoop has two main systems:
− Hadoop Distributed File System: self-healing high-bandwidth clustered
storage.
− MapReduce: distributed fault-tolerant resource management and scheduling
coupled with a scalable data programming abstraction.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a distributed file system designed
to run on commodity hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
HDFS Architecture – Master/Slaves
Data Replication
NameNode (Master)

Manages the file system namespace
− Maintains the file system tree and metadata for all files/directories in the tree
− Maps blocks to DataNodes, filenames, etc.
− Two persistent files (namespace image and edit log) plus additional in-memory data

Safemode – read-only state; no modifications to HDFS allowed

Single point of failure – NameNode loss renders the file system inaccessible
− Hadoop V1 has no built-in failover mechanism for the NameNode

Coordinates access to DataNodes, but data never flows through the NameNode

Centralizes and manages file system metadata in memory
− Metadata size is limited to the available RAM of the NameNode
− Bias toward a modest number of large files, not a large number of small files (where metadata can
grow too sizeable)
− NameNode will crash if it runs out of RAM
DataNode (Slave)
• Files on HDFS are chopped into blocks and stored on DataNodes
• Size of blocks is configurable (see the configuration sketch below)
• Different blocks from the same file are stored on different DataNodes if possible
• Performs block creation, deletion, and replication as instructed by the NameNode
• Serves read and write requests from clients
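As a sketch of how the block size is configured in Hadoop V1 (the property is dfs.block.size; the 128 MB value below is illustrative, the shipped default is 64 MB), the setting lives in hdfs-site.xml:

<!-- hdfs-site.xml: set the HDFS block size to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128 * 1024 * 1024 bytes -->
</property>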
HDFS User Interfaces
• HDFS Web UI for NameNode and DataNodes
  • NameNode front page is at http://localhost:50070 (default configuration of
    Hadoop in pseudo-distributed mode)
  • Distributed file system browser (read only)
  • Displays basic cluster statistics
• Hadoop shell commands
  • $HADOOP_HOME/bin/hadoop dfs -ls /user/biadmin
  • $HADOOP_HOME/bin/hadoop dfs -chown hdfs:biadmin /user/hdfs
  • $HADOOP_HOME/bin/hadoop dfsadmin -report
• Programmatic interface (see the Java sketch below)
  • HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
  • C wrapper over the Java APIs (libhdfs)
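A minimal sketch of the Java API (assuming a Configuration that picks up the cluster settings from core-site.xml; the path and content are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // handle to the configured file system
    Path file = new Path("/user/biadmin/hello.txt");   // illustrative path
    FSDataOutputStream out = fs.create(file);          // create the file and get a stream
    out.writeUTF("Hello HDFS");                        // write some content
    out.close();
    System.out.println("Exists? " + fs.exists(file));  // simple metadata query
    fs.close();
  }
}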
HDFS Commands
Download the Airline Dataset
– stat-computing.org/dataexpo/2009/1987.csv.bz2
• Create a directory in HDFS
– hadoop fs -mkdir /user/hadoop/dir1
– hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir
  hdfs://nn2.example.com/user/hadoop/dir
• Delete files and directories in HDFS
– hadoop fs -rm hdfs://nn.example.com/file
– hadoop fs -rmr /user/hadoop/dir
– hadoop fs -rmr hdfs://nn.example.com/user/hadoop/dir
• List a directory
– hadoop fs -ls /user/hadoop/file1
HDFS Commands
• Copy files to HDFS
– hadoop fs -put localfile /user/hadoop/hadoopfile
– hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
– hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
• Copy a file within HDFS
– hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
• Copy from the local file system to HDFS
– hadoop fs -copyFromLocal /opt/1987.csv /user/nis/1987.csv
• Copy from HDFS to the local file system
– hadoop fs -copyToLocal /user/nis/1987.csv /opt/res.csv
• Display the contents of a file
– hadoop fs -cat /user/nis/1987.csv
Hadoop MapReduce Engine
• Framework which enables writing applications that process multi-terabyte
datasets in parallel on large clusters (thousands of nodes) of commodity
hardware
• A clean abstraction for programmers
  • No need to deal with the internals of large-scale computing
  • Implement just the Mapper and Reducer functions – most of the time
  • Implement in the language you are comfortable with
    – Java (the "assembly language" of Hadoop)
    – With Hadoop Streaming, you can run any shell utility as mapper and reducer
      (see the sketch after this list)
    – Hadoop Pipes supports implementing the mapper and reducer in C++
• Automatic parallelization & distribution
  • Divides the job into tasks (map and reduce tasks)
  • Schedules submitted jobs
  • Schedules tasks as close to the data as possible
  • Monitors task progress
• Fault tolerance
  • Re-executes failed or slow task instances
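A minimal Hadoop Streaming invocation (a sketch – the streaming jar location varies by install, and the input/output paths are illustrative; cat acts as an identity mapper and wc as a trivial reducer):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/nis/1987.csv \
  -output /user/nis/streaming-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc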
MapReduce Architecture – Master/Slaves
• Single master (JobTracker) controls job execution on multiple slaves (TaskTrackers).
• JobTracker
  • Accepts MapReduce jobs submitted by clients
  • Pushes map and reduce tasks out to TaskTracker nodes
  • Keeps the work as physically close to the data as possible
  • Monitors task and TaskTracker status
• TaskTracker
  • Runs map and reduce tasks; reports status to the JobTracker
  • Manages storage and transmission of intermediate output
[Diagram: a JobClient submits the job to the JobTracker on the master node; TaskTrackers on the slave nodes execute its tasks]
MapReduce Family
• Job – A MapReduce job is a unit of work that the client wants to
perform. It consists of the input data, the output location, the
MapReduce program, and configuration information.
• Task – Hadoop runs the job by dividing it into tasks, of which there
are two types: map tasks and reduce tasks.
• Task Attempt – A particular instance of an attempt to execute a
task on a machine.
• Input Split – Hadoop divides the input to a MapReduce job into
fixed-size pieces called input splits, or just splits.
  • Default split size == block size of the input
  • Number of map tasks == number of splits of the job’s input
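For example (with the Hadoop V1 default block size of 64 MB, and an illustrative input): a 640 MB file is stored as 10 blocks, yields 10 input splits, and therefore launches 10 map tasks.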
MapReduce Family…
• Record – The unit of data from an input split on which the map task
runs the user-defined mapper function.
• InputFormat – Hadoop can process many different types of data
formats, from flat text files to databases. The InputFormat helps Hadoop
divide the job’s input into splits and interpret the records within a split.
  • File-based InputFormats
    • Text input formats
    • Binary input formats
  • Database InputFormat
• OutputFormat – Helps Hadoop write the job’s output to the
specified output location. There is a corresponding output format
for each InputFormat.
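As a concrete case: TextInputFormat, the default, treats each line of the input as a record, handing the mapper the line’s byte offset in the file as the key (a LongWritable) and the line’s contents as the value (a Text) – exactly the signature the word-count mapper below expects.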
How data flows in a MapReduce job
Some more members…
• Partitioner – Partitions the key space.
  • Determines the destination reduce task for intermediate map output
  • Number of partitions is equal to the number of reduce tasks
  • HashPartitioner is used by default
    • Uses key.hashCode() to compute the partition number (see the sketch below)
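The default HashPartitioner computes the partition essentially like this (the bit-mask keeps the result non-negative):

public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}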
• Combiner – Reduces the data transferred between the map and reduce
tasks.
  • Takes the outputs of multiple map functions and combines them into a single input to
    the reduce function
  • Example
    • Map task output – (BB,1), (Apple,1), (Android,1), (Apple,1), (iOS,1), (iOS,1), (RHEL,1),
      (Windows,1), (BB,1)
    • Combiner output – (BB,2), (Apple,2), (Android,1), (iOS,2), (RHEL,1), (Windows,1)
Word Count Mapper
public static class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();            // key is the byte offset; value is the line
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);               // emit (word, 1) for every token
    }
  }
}
Word Count Reducer
public static class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();              // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum)); // emit (word, total count)
  }
}
Prepare and Submit Job
public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountJob.class);
    // specify input and output dirs
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // InputFormat and OutputFormat
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setMapperClass(WordCountMapper.class);    // specify the mapper
    conf.setReducerClass(WordCountReducer.class);  // specify the reducer
    conf.setCombinerClass(WordCountReducer.class); // the reducer doubles as a combiner here
    conf.setNumReduceTasks(2);                     // number of reduce tasks
    JobClient.runJob(conf);                        // submit the job to the JobTracker
  }
}
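Packaged into a jar, the job would be launched with something like (jar name illustrative): hadoop jar wordcount.jar WordCountJob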
Complete Picture
TaskTrackers (compute nodes) and DataNodes co-locate
= high aggregate bandwidth across the cluster
Hadoop Ecosystem
Thank You
