Introducing MapReduce
Programming Model
Samuel Yee
Multi-threaded Programming
MapReduce Programming Model
 For parallel and distributed computing, programmers do not have to worry
about multi-threading, system failures, file I/O, networking, data loss, etc. Hadoop
takes care of all of these complex low-level activities.
 Focus instead on two key functions: Mapper and Reducer
 Mapper function
 Ingests records from large input files
 Input is split into many smaller blocks (default block size of 64 MB)
 Transforms inputs into key-value pairs; the framework shuffles the pairs and routes each key to a Reducer
 Reducer function
 Reduces the mapped outputs by aggregating, summing, eliminating, etc.
 Writes the results to output files
 The Mapper’s output key-value types must match the Reducer’s input key-value types
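The shuffle step between Mapper and Reducer can be sketched in plain Java (a minimal, framework-free simulation; the class name and sample pairs are hypothetical, not part of the Hadoop API):

```java
import java.util.*;

public class ShuffleSketch {
    // Group mapper output pairs by key, mimicking the shuffle/sort phase
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: one [word, 1] pair per word seen
        List<Map.Entry<String, Integer>> mapped = List.of(
                Map.entry("Hello", 1), Map.entry("World", 1), Map.entry("Hello", 1));
        System.out.println(shuffle(mapped)); // prints {Hello=[1, 1], World=[1]}
    }
}
```

Each distinct key ends up with the list of all values emitted for it — exactly the `[k1, [v1, v2, v3…]]` shape that a Reducer consumes.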
Data Processing (MapReduce)
[Diagram: Input Data is split across Map() tasks, which emit [k1, v1] pairs; the pairs are sorted by k1 and merged into [k1, [v1, v2, v3…]] lists, which Reduce() tasks turn into Output Data]
Hadoop’s Approach
[Diagram: Big Data is split into smaller data blocks]
Hadoop’s Approach
[Diagram: a Map computing process is applied to each data block in parallel; Reduce aggregates the per-block outputs into a single result]
Consider Two Input Files
 File01.txt: Hello World Bye World
 File02.txt: Hello Hadoop Goodbye Hadoop
Outputs of Mappers
 Mapper 1 (File01.txt)
 [Hello, 1]
 [World, 1]
 [Bye, 1]
 [World, 1]
 Mapper 2 (File02.txt)
 [Hello, 1]
 [Hadoop, 1]
 [Goodbye, 1]
 [Hadoop, 1]
Consolidated Result of Reducers
 [Bye, 1]
 [Goodbye, 1]
 [Hadoop, 2]
 [Hello, 2]
 [World, 2]
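The whole two-file example can be simulated in self-contained Java (a sketch of the map → shuffle → reduce flow, not the Hadoop API; the class and method names are hypothetical):

```java
import java.util.*;

public class WordCountSim {
    // Simulate the full map -> shuffle -> reduce pipeline for word counting
    static Map<String, Integer> wordCount(List<String> files) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like reducer output
        for (String file : files) {                    // each file goes to one mapper
            for (String word : file.split("\\s+")) {   // map: emit [word, 1] per word
                counts.merge(word, 1, Integer::sum);   // reduce: sum the 1s for each key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two input files from the slides
        List<String> files = List.of("Hello World Bye World",
                                     "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(files));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

The printed map matches the consolidated reducer result above.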
MapReduce Template in Java
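A minimal template along the lines of Hadoop’s classic WordCount example might look like this (a sketch following the `org.apache.hadoop.mapreduce` API; it requires the Hadoop client libraries on the classpath and is not runnable standalone):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: reads a line of text and emits a [word, 1] pair per word
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);       // emit [word, 1]
      }
    }
  }

  // Reducer: receives [word, [1, 1, ...]] and emits [word, total]
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();               // aggregate the 1s from the mappers
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);           // reducer output types
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note the generic type parameters on `Mapper` and `Reducer`: the Mapper’s output types (`Text, IntWritable`) match the Reducer’s input types, as the key-value matching rule above requires.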
Demo
 MapReduce programming using IntelliJ IDEA and Java
 Read my LinkedIn articles on how to set up a development environment for
MapReduce and Spark on Windows
 http://tinyurl.com/px9rwwk
