Introducing MapReduce
Programming Model
Samuel Yee
Multi-threaded Programming
MapReduce Programming Model
 For parallel and distributed computing, programmers do not have to worry
about multi-threading, system failures, file I/O, networking, data loss, etc. Hadoop
takes care of all of these complex low-level activities.
 Focus instead on two key functions: Mapper and Reducer
 Mapper function
 Ingests records from large input files
 Input is split into many smaller blocks (default block size of 64 MB)
 Transforms inputs into key-value pairs; the framework shuffles the pairs and routes each key to a Reducer
 Reducer function
 Reduces the mapped outputs by aggregating, summing, eliminating, etc.
 Writes the results to output files
 The Mapper’s output key-value types must match the Reducer’s input key-value types
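The shuffle step between Mapper and Reducer can be sketched in plain Java (a minimal, framework-free simulation; the class name and sample pairs are hypothetical, not part of the Hadoop API):

```java
import java.util.*;

public class ShuffleSketch {
    // Group mapper output pairs by key, mimicking the shuffle/sort phase
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: one [word, 1] pair per word seen
        List<Map.Entry<String, Integer>> mapped = List.of(
                Map.entry("Hello", 1), Map.entry("World", 1), Map.entry("Hello", 1));
        System.out.println(shuffle(mapped)); // prints {Hello=[1, 1], World=[1]}
    }
}
```

Each distinct key ends up with the list of all values emitted for it — exactly the `[k1, [v1, v2, v3…]]` shape that a Reducer consumes.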
Data Processing (MapReduce)
[Diagram: Input Data is split across Map() tasks, which emit [k1, v1] pairs; the pairs are sorted by k1 and merged into [k1, [v1, v2, v3…]] lists, which Reduce() tasks turn into Output Data]
Hadoop’s Approach
[Diagram: Big Data is split into smaller data blocks]
Hadoop’s Approach
[Diagram: a Map computing process is applied to each data block in parallel; Reduce aggregates the per-block outputs into a single result]
Consider Two Input Files
 File01.txt: Hello World Bye World
 File02.txt: Hello Hadoop Goodbye Hadoop
Outputs of Mappers
 Mapper 1 (File01.txt)
 [Hello, 1]
 [World, 1]
 [Bye, 1]
 [World, 1]
 Mapper 2 (File02.txt)
 [Hello, 1]
 [Hadoop, 1]
 [Goodbye, 1]
 [Hadoop, 1]
Consolidated Result of Reducers
 [Bye, 1]
 [Goodbye, 1]
 [Hadoop, 2]
 [Hello, 2]
 [World, 2]
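The whole two-file example can be simulated in self-contained Java (a sketch of the map → shuffle → reduce flow, not the Hadoop API; the class and method names are hypothetical):

```java
import java.util.*;

public class WordCountSim {
    // Simulate the full map -> shuffle -> reduce pipeline for word counting
    static Map<String, Integer> wordCount(List<String> files) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like reducer output
        for (String file : files) {                    // each file goes to one mapper
            for (String word : file.split("\\s+")) {   // map: emit [word, 1] per word
                counts.merge(word, 1, Integer::sum);   // reduce: sum the 1s for each key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two input files from the slides
        List<String> files = List.of("Hello World Bye World",
                                     "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(files));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

The printed map matches the consolidated reducer result above.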
MapReduce Template in Java
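A minimal template along the lines of Hadoop’s classic WordCount example might look like this (a sketch following the `org.apache.hadoop.mapreduce` API; it requires the Hadoop client libraries on the classpath and is not runnable standalone):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: reads a line of text and emits a [word, 1] pair per word
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);       // emit [word, 1]
      }
    }
  }

  // Reducer: receives [word, [1, 1, ...]] and emits [word, total]
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();               // aggregate the 1s from the mappers
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);           // reducer output types
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note the generic type parameters on `Mapper` and `Reducer`: the Mapper’s output types (`Text, IntWritable`) match the Reducer’s input types, as the key-value matching rule above requires.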
Demo
 MapReduce programming using IntelliJ IDEA and Java
 Read my LinkedIn articles on how to set up a development environment for
MapReduce and Spark on Windows
 http://tinyurl.com/px9rwwk
