Parallel Algorithms
Shashikant V. Athawale
Assistant Professor, Computer Engineering Department
AISSMS College of Engineering,
Kennedy Road, Pune, MS, India - 411001
Parallel Algorithms
Parallel: perform more than one operation at a time.
PRAM model: Parallel Random Access Machine.
[Figure: processors p0, p1, …, pn-1 connected to a shared memory]
Multiple processors connected to a shared memory.
Each processor can access any memory location in unit time.
All processors can access memory in parallel.
All processors can perform operations in parallel.
Concurrent vs. Exclusive Access
Four models:
EREW: exclusive read and exclusive write
CREW: concurrent read and exclusive write
ERCW: exclusive read and concurrent write
CRCW: concurrent read and concurrent write
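Handling write conflicts (in the concurrent-write models):
Common-write model: concurrent writes to a location are allowed only if all processors write the same value.
Arbitrary-write model: an arbitrary one of the writers succeeds.
Priority-write model: the writer with the smallest processor index succeeds.
EREW and CRCW are the most popular.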
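The three write-conflict rules can be compared side by side with a small sequential sketch. The function name `resolve_concurrent_writes`, the `(processor_index, value)` pair layout, and the model names passed as strings are assumptions made for illustration; they are not part of the PRAM definition.

```python
import random


def resolve_concurrent_writes(writes, model, rng=None):
    """Resolve one step of concurrent writes aimed at a single memory cell.

    writes: list of (processor_index, value) pairs (hypothetical layout).
    model:  "common", "arbitrary", or "priority".
    Returns the value the cell holds after the step.
    """
    if not writes:
        raise ValueError("no processor writes in this step")
    if model == "common":
        # Legal only when every writer stores the same value.
        values = {value for _, value in writes}
        if len(values) != 1:
            raise ValueError("common-write model: writers disagree")
        return values.pop()
    if model == "arbitrary":
        # Any single writer may win; a random choice stands in for "arbitrary".
        rng = rng or random.Random()
        return rng.choice(writes)[1]
    if model == "priority":
        # The writer with the smallest processor index wins.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError(f"unknown model: {model}")
```

For instance, with writers `[(2, 'a'), (0, 'b'), (1, 'c')]` the priority rule stores `'b'`, while the common rule rejects the step because the values differ.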
Synchronization and Control
Synchronization:
A very important and complicated issue.
Assume all processors are inherently tightly synchronized:
  - All processors execute the same statement at the same time.
  - No races among processors, i.e., all proceed at the same pace.
Termination control of a parallel loop:
Depends on the state of all processors.
Assumed to be testable in O(1) time.
Pointer Jumping – list ranking
Given a singly linked list L with n objects, compute, for each object in L, its distance from the end of the list.
Formally, with next as the pointer field:
d[i] = 0                 if next[i] = nil
d[i] = d[next[i]] + 1    if next[i] ≠ nil
Serial algorithm: Θ(n).
List ranking – EREW algorithm
LIST-RANK(L) (in O(lg n) time)
1. for each processor i, in parallel
2.     do if next[i] = nil
3.         then d[i] ← 0
4.         else d[i] ← 1
5. while there exists an object i such that next[i] ≠ nil
6.     do for each processor i, in parallel
7.         do if next[i] ≠ nil
8.             then d[i] ← d[i] + d[next[i]]
9.                  next[i] ← next[next[i]]
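[Figure: list ranking by pointer jumping on a six-object list whose objects, in list order, are 3, 4, 6, 1, 0, 5. The d values of these objects at each stage:]
(a) 1 1 1 1 1 0   (after initialization)
(b) 2 2 2 2 1 0   (after the first iteration)
(c) 4 4 3 2 1 0   (after the second iteration)
(d) 5 4 3 2 1 0   (after the third iteration)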
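A sequential simulation can make the pointer-jumping steps concrete. The sketch below is only an illustration: the name `list_rank` and the dictionary representation (object label mapped to successor label, `None` for nil) are assumptions. Each pass over the objects plays the role of one synchronous parallel step, so all reads use the old values and the writes take effect together afterwards.

```python
def list_rank(next_ptr):
    """Simulate LIST-RANK on a list given as {object: successor or None}."""
    nxt = dict(next_ptr)
    # Initialization: the last object gets 0, every other object gets 1.
    d = {i: 0 if nxt[i] is None else 1 for i in nxt}
    while any(p is not None for p in nxt.values()):
        # One synchronous step: read from the old d/nxt, write to fresh copies.
        new_d, new_nxt = dict(d), dict(nxt)
        for i in nxt:
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        d, nxt = new_d, new_nxt
    return d
```

Calling `list_rank({3: 4, 4: 6, 6: 1, 1: 0, 0: 5, 5: None})` on the six-object list from the figure returns the distances 5, 4, 3, 2, 1, 0 for objects 3, 4, 6, 1, 0, 5.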
List ranking – correctness of EREW algorithm
Loop invariant: for each i, the sum of the d values in the sublist headed by i is the correct distance from i to the end of the original list L.
Parallel memory accesses must be synchronized: in lines 8 and 9, all reads on the right-hand side must occur before any write on the left-hand side. Moreover, in line 8 every processor first reads d[i] and then reads d[next[i]].
It is an EREW algorithm: every read and write is exclusive. For an object i, its own processor reads d[i] first, and then the processor of its predecessor reads d[i]. Each object has at most one predecessor, so no location is read twice in the same step. All writes go to distinct locations.
List ranking – EREW algorithm running time
O(lg n):
The initialization for loop runs in O(1).
Each iteration of the while loop runs in O(1).
There are ⌈lg n⌉ iterations:
  - Each iteration transforms each list into two interleaved lists: one consisting of the objects in even positions, and the other of the objects in odd positions. Thus, each iteration doubles the number of lists but halves their lengths.
The termination test in line 5 is assumed to run in O(1).
Define work = #processors × running time. Here work = n × Θ(lg n) = Θ(n lg n), compared with Θ(n) for the serial algorithm.
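The iteration count can be checked empirically. The sketch below (the helper name `pointer_jumping_rounds` and the array layout are assumptions) runs only the pointer-jumping part of the loop on a list of n objects stored in order and counts the while-loop iterations, which come out as ⌈lg n⌉, the exact form of the lg n bound.

```python
def pointer_jumping_rounds(n):
    """Count while-loop iterations of pointer jumping on an n-object list."""
    # Object i points to i + 1; the last object points to nil (None).
    nxt = [i + 1 if i + 1 < n else None for i in range(n)]
    rounds = 0
    while any(p is not None for p in nxt):
        # The new list is built entirely from the old one: a synchronous step.
        nxt = [None if p is None else nxt[p] for p in nxt]
        rounds += 1
    return rounds


def ceil_lg(n):
    """Exact ceil(lg n) for n >= 1, computed without floating point."""
    return (n - 1).bit_length()
```

With n processors, the work is then roughly `n * pointer_jumping_rounds(n)`; for n = 1000 that is 1000 × 10 = 10,000 steps against about 1000 for the serial scan.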
Parallel prefix on a list
A prefix computation is defined as:
Input: <x1, x2, …, xn>
Binary associative operation ⊗
Output: <y1, y2, …, yn>
Such that:
  - y1 = x1
  - yk = yk-1 ⊗ xk for k = 2, 3, …, n, i.e., yk = x1 ⊗ x2 ⊗ … ⊗ xk.
Suppose <x1, x2, …, xn> are stored in order in a list.
Define the notation: [i,j] = xi ⊗ xi+1 ⊗ … ⊗ xj
Prefix computation LIST-PREFIX(L)
for each processor i, in parallel
    do y[i] ← x[i]
while there exists an object i such that next[i] ≠ nil
    do for each processor i, in parallel
        do if next[i] ≠ nil
            then y[next[i]] ← y[i] ⊗ y[next[i]]
                 next[i] ← next[next[i]]
[Figure: LIST-PREFIX on a six-object list holding x1, …, x6. The y values of the six objects, in list order, at each stage:]
(a) [1,1] [2,2] [3,3] [4,4] [5,5] [6,6]
(b) [1,1] [1,2] [2,3] [3,4] [4,5] [5,6]
(c) [1,1] [1,2] [1,3] [1,4] [2,5] [3,6]
(d) [1,1] [1,2] [1,3] [1,4] [1,5] [1,6]
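The same pointer-jumping pattern can be simulated for prefix computation. This is a sketch under assumptions: the name `list_prefix`, the dictionary layout, and the `op` parameter (standing in for ⊗) are chosen for illustration. The operand order y[i] ⊗ y[next[i]] is kept as in the pseudocode, so the sketch also works for associative operations that are not commutative, such as string concatenation.

```python
from operator import add


def list_prefix(x, next_ptr, op=add):
    """Simulate LIST-PREFIX; x and next_ptr are dicts keyed by object."""
    nxt = dict(next_ptr)
    y = dict(x)  # y[i] <- x[i]
    while any(p is not None for p in nxt.values()):
        # One synchronous step: read old values, write into fresh copies.
        new_y, new_nxt = dict(y), dict(nxt)
        for i in nxt:
            if nxt[i] is not None:
                # Object i updates its successor's y, not its own.
                new_y[nxt[i]] = op(y[i], y[nxt[i]])
                new_nxt[i] = nxt[nxt[i]]
        y, nxt = new_y, new_nxt
    return y
```

With `next_ptr = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: None}` and x holding the letters 'a' to 'f', the result maps object k to the first k letters, which corresponds to the intervals [1,k] in stage (d) of the figure.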
Find root – CREW algorithm
Given a forest of binary trees, each node i has a pointer parent[i] (nil at a root).
Find the identity of the root of the tree containing each node.
Assume that each node is associated with a processor.
Assume that each node i has a field root[i].
Find-roots – CREW algorithm
FIND-ROOTS(F)
1. for each processor i, in parallel
2.     do if parent[i] = nil
3.         then root[i] ← i
4. while there exists a node i such that parent[i] ≠ nil
5.     do for each processor i, in parallel
6.         do if parent[i] ≠ nil
7.             then root[i] ← root[parent[i]]
8.                  parent[i] ← parent[parent[i]]
Find root – CREW algorithm
Running time: O(lg d), where d is the height of the maximum-depth tree in the forest.
All the writes are exclusive.
But the read in line 7 is concurrent, since several nodes may have the same node as parent.
See figure 30.5.
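FIND-ROOTS can be simulated in the same style. In this sketch the function name `find_roots`, the dictionary layout (node mapped to parent, `None` at a root), and the use of `None` for a not-yet-known root are assumptions. The read of `root[par[i]]` is the concurrent read: several nodes i may share the same parent.

```python
def find_roots(parent):
    """Simulate FIND-ROOTS on a forest given as {node: parent or None}."""
    par = dict(parent)
    # Roots know themselves; every other node starts with no root recorded.
    root = {i: (i if par[i] is None else None) for i in par}
    while any(p is not None for p in par.values()):
        # One synchronous step: read old values, write into fresh copies.
        new_root, new_par = dict(root), dict(par)
        for i in par:
            if par[i] is not None:
                new_root[i] = root[par[i]]   # concurrent read of the parent
                new_par[i] = par[par[i]]     # pointer jumping
        root, par = new_root, new_par
    return root
```

For the forest `{0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 3, 6: None, 7: 6}`, nodes 0 to 5 report root 0 and nodes 6 and 7 report root 6, after two iterations for a tree of depth 3.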
Find roots – CREW vs. EREW
How fast can n nodes in a forest determine their roots using only exclusive reads?
Ω(lg n)
Argument: with exclusive reads, a given piece of information can be copied to only one other memory location in each step, so the number of locations containing a given piece of information at most doubles at each step. Consider a forest with one tree of n nodes: the root identity is stored in one place initially. After the first step, it is stored in at most two places; after the second step, it is stored in at most four places, …, so lg n steps are needed for it to be stored in n places.
So CREW: O(lg d) and EREW: Ω(lg n).
If d = 2^o(lg n), so that lg d = o(lg n), CREW asymptotically outperforms any EREW algorithm.
If d = Θ(lg n), then CREW runs in O(lg lg n), and EREW is much slower.
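The doubling argument can be written out as a few lines of code. The helper name `erew_broadcast_steps` is an assumption; the sketch simply counts how many steps pass before one stored value can occupy n locations when every existing copy can be read by at most one processor per step.

```python
def erew_broadcast_steps(n):
    """Steps before a value held in one cell can be present in n cells,
    if each existing copy is read by at most one processor per step."""
    copies, steps = 1, 0
    while copies < n:
        # Best case: every copy is read once and written to one new cell.
        copies = min(2 * copies, n)
        steps += 1
    return steps
```

For n = 1024 nodes this gives 10 steps. By contrast, if the single tree is balanced, d is about lg n = 10, and the CREW algorithm needs on the order of lg d, i.e. about 4 iterations.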
Find maximum – CRCW algorithm
Given n elements A[0..n-1], find the maximum.
  - Suppose there are n² processors; each processor (i,j) compares A[i] and A[j], for 0 ≤ i, j ≤ n-1.
FAST-MAX(A)
n ← length[A]
for i ← 0 to n-1, in parallel
    do m[i] ← true
for i ← 0 to n-1 and j ← 0 to n-1, in parallel
    do if A[i] < A[j]
        then m[i] ← false
for i ← 0 to n-1, in parallel
    do if m[i] = true
        then max ← A[i]
return max
The running time is O(1).
Note: there may be multiple maximum values, so their processors will write to max concurrently (all with the same value). Work = n² × O(1) = O(n²).
Example with A = <5, 6, 9, 2, 9>; entry (i, j) is T when A[i] < A[j]:

A[i] \ A[j] |  5  6  9  2  9 |  m
     5      |  F  T  T  F  T |  F
     6      |  F  F  T  F  T |  F
     9      |  F  F  F  F  F |  T
     2      |  T  T  T  F  T |  F
     9      |  F  F  F  F  F |  T

max = 9
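FAST-MAX can be simulated sequentially by letting a doubly nested loop play the role of the n² processors. The name `fast_max` and the returned `(maximum, m)` pair are assumptions made for illustration. The sketch also checks that all surviving processors would write the same value, which is what the common-write model requires.

```python
def fast_max(a):
    """Simulate FAST-MAX; returns (maximum, m) for a non-empty sequence a."""
    n = len(a)
    if n == 0:
        raise ValueError("fast_max needs at least one element")
    m = [True] * n
    # One virtual processor per pair (i, j); every write to m[i] stores False,
    # so concurrent writes to the same m[i] agree.
    for i in range(n):
        for j in range(n):
            if a[i] < a[j]:
                m[i] = False
    # Every i with m[i] still True holds a maximum value; all write the same.
    winners = {a[i] for i in range(n) if m[i]}
    assert len(winners) == 1
    (maximum,) = winners
    return maximum, m
```

For the example array, `fast_max([5, 6, 9, 2, 9])` returns 9 together with the m column `[False, False, True, False, True]`.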
Find maximum – CRCW vs. EREW
Finding the maximum with EREW takes Ω(lg n).
Argument: consider how many elements “think” that they might be the maximum.
Initially, n.
After the first step, at least n/2.
After the second step, at least n/4, …; each step at most halves the count.
Moreover, CREW also takes Ω(lg n).
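The halving in the argument is matched by a simple pairwise tournament, sketched below. This tournament is an illustration chosen here, not an algorithm given in the slides, and the name `erew_max` is an assumption. In each round every pair of adjacent candidates is handled by its own processor, so all reads and writes touch disjoint cells, and the number of candidates is halved.

```python
def erew_max(a):
    """Pairwise tournament; returns (maximum, rounds) for a non-empty a."""
    values = list(a)
    if not values:
        raise ValueError("erew_max needs at least one element")
    rounds = 0
    while len(values) > 1:
        # Pair (2k, 2k+1) belongs to one processor: exclusive reads and writes.
        survivors = [max(values[k], values[k + 1])
                     for k in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            survivors.append(values[-1])  # unpaired candidate advances
        values = survivors
        rounds += 1
    return values[0], rounds
```

On `[5, 6, 9, 2, 9]` the candidate count goes 5, 3, 2, 1, so the maximum 9 is found after 3 rounds.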
Simulating CRCW with EREW
Theorem:
A p-processor CRCW algorithm can be no more than O(lg p) times faster than the best p-processor EREW algorithm for the same problem.
Proof: each step of the CRCW algorithm can be simulated by O(lg p) steps of an EREW computation.
Consider a concurrent write:
  - CRCW processor pi writes data xi to location li (li may be the same for several pi).
  - The corresponding EREW processor pi writes the pair (li, xi) to location A[i]; the A[i] are distinct, so the writes are exclusive.
  - Sort all pairs (li, xi) by li, which brings equal locations together, in O(lg p) time.
  - Each EREW processor pi compares A[i] = (lj, xj) with A[i-1] = (lk, xk). If lj ≠ lk or i = 0, then pi writes xj to lj (an exclusive write).
See figure 30.7.
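The simulation steps can be sketched sequentially as follows. The function name `simulate_concurrent_write` and the dictionary used for shared memory are assumptions, and Python's built-in sort only stands in for the O(lg p) parallel sort. Because that built-in sort is stable, the first pair in each run of equal locations comes from the lowest-numbered processor, so this particular sketch happens to behave like the priority-write model.

```python
def simulate_concurrent_write(requests, memory):
    """Simulate one CRCW write step using only exclusive writes.

    requests[i] = (location, value) issued by CRCW processor i.
    memory is a dict standing in for shared memory; it is updated in place.
    """
    # Step 1: processor i stores its request in its own cell A[i].
    a = list(requests)
    # Step 2: sort by location so equal locations become adjacent
    # (sequential stand-in for an O(lg p) parallel sort).
    a.sort(key=lambda pair: pair[0])
    # Step 3: processor i compares A[i] with A[i-1]; only the first
    # processor of each run of equal locations performs the write.
    for i, (loc, val) in enumerate(a):
        if i == 0 or a[i - 1][0] != loc:
            memory[loc] = val
    return memory
```

For requests `[(3, 'a'), (1, 'b'), (3, 'c'), (1, 'd'), (2, 'e')]` the resulting memory is `{1: 'b', 2: 'e', 3: 'a'}`: exactly one write per location.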
CRCW vs. EREW
CRCW:
Some say: it is easier to program and faster.
Others say: the hardware for CRCW is slower than for EREW, and one cannot really find the maximum in O(1).
Still others say: the PRAM, whether EREW or CRCW, is the wrong model. Processors must be connected by a network and can communicate with each other only via that network, so the network should be part of the model.
 Thank You