As @Alexey correctly points out the easiest way to do a parallel sort is definitely using a fork/join framework and merge sort. That's very easy to do and looks something like (pseudo code):
def mergesort(a, i0, i1):
if i0 == i1:
return
im = i0 + (i1 - i0) / 2
fork mergesort(a, i0, im)
fork mergesort(a, im, i1)
join
merge(a, i0, im, i1) # serial merge
If we analyse this we see that we have (easy to show by master theorem):
Work: T_1(n) = 2T_1(n / 2) + O(n) = O(n lg n)
Span: T_inf(n) = 1 T_inf(n / 2) + O(n) = O(n)
where work means the total amount of work done and span describes how long it'd take if we had infinitely many threads available (basically the depth of the tree).
The parallelism an algorithm has is basically Work / Span which in this case gives us O(lg n) - that's practically irrelevant although if we use a good serial sort algorithm for small enough leaf sizes this still works quite well.
We can do better though by parallelizing the merge. This can be done without an auxiliary array, but I'll leave that as an exercise for the reader (means: not easy and I'd have to look up how to actually do that).
Parallel merge: Assume that we have an auxiliary array aux with two sorted arrays in [i0, i1) and [j0, j1) and we want to put the merged sub arrays into array a between k0, k1. We do this recursively again:
- Compute im = i0 + (i1 - i0) / 2 - aux[im] is the median of the left half
- Find jm so that aux[im] is inserted directly before aux[jm]
- insert aux[im] into the correct position in a, which we know if we have im and jm.
- merge the two subarrays.
Confused? Well, the following illustration (I'm in CS not arts..) should help I hope:
In code this looks something like
def merge(a, aux, i0, i1, j0, j1, k0, k1):
if i0 == i1:
copy aux[j0, j1] to a[k0, k1]
return
if j0 == j1:
copy aux[i0, i1] to a[k0, k1]
return
im = im = i0 + (i1 - i0) / 2
jm = find(aux, j0, j1, aux[im])
km = k0 + (im - i0) + 1 + (jm - j0 )
a[km] = aux[im]
fork merge(a, aux, i0, im, j0, jm, k0, km)
fork merge(a, aux, im + 1, i1, jm, j1, km + 1, k1)
join
It's important to note that find has to be done using a simple serial binary search in O(lg n) since we know the right side is sorted.
Using such a parallel merge gives us the same work, but reduces the span to O(lg^3 n), which translates to a parallelism of O(n / lg^2 n) - which is a great improvement.
Nota bene: As for any parallel algorithm in practice you will want to use a simple serial version if the problem sizes get too small (quicksort or something) - which leaf size is the best has to be evaluated for each architecture separately through experimentation.