I'm comparing the performance of two Haskell programs running the same computation.

The first one is sequential:

main :: IO ()
main = putStr $ unlines . map (show . solve) $ [100..107]
  where solve x = pow x (10^7) 982451653

The second one uses Control.Parallel.Strategies:

import Control.Parallel.Strategies

main :: IO ()
main = putStr $ unlines . parMap rdeepseq (show . solve) $ [100..107]
  where solve x = pow x (10^7) 982451653

In both cases, pow is the modular exponentiation naively implemented as:

pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m

The sequential program runs in about 3 seconds using, as expected, 100% CPU.

$ stack ghc seq.hs -- -O2
$ \time -f "%e s - %P" ./seq > /dev/null
2.96 s - 100%

The parallel program also runs in about 3 seconds using 100% CPU when limited to a single core.

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.14 s - 99%

But when I ran it on 4 cores, I did not observe the performance gain I expected:

$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
3.31 s - 235%

Even more surprising, the sequential program uses more than 100% CPU when run on several cores:

$ stack ghc seq.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./seq +RTS -N4 > /dev/null
3.26 s - 232%

How can those results be explained?


EDIT - As advised by @RobertK and @Yuras, I replaced rdeepseq with rpar, and it did fix the initial issue. However, the performance is still much less than I expected:

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.12 s - 99%
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
1.91 s - 368%

The execution time is barely halved, even though the 4 cores are busy more than 90% of the time on average.

Also, some parts of the ThreadScope graph look very sequential: [ThreadScope timeline screenshot]

2 Answers

First of all, rdeepseq seems to be buggy. Try running ./par +RTS -N4 -s and you'll see that no sparks are created; that is why you don't see any speedup on 4 cores. Use rnf x `pseq` return x instead.
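For illustration, that expression can be wrapped up as a reusable strategy. This is a minimal sketch; the name rnfStrategy is mine, and it relies on Strategy a = a -> Eval a as defined in Control.Parallel.Strategies:

import Control.DeepSeq (NFData, rnf)
import Control.Parallel (pseq)
import Control.Parallel.Strategies (Strategy)

-- Evaluate the value to normal form before returning it,
-- as a drop-in replacement for rdeepseq.
rnfStrategy :: NFData a => Strategy a
rnfStrategy x = rnf x `pseq` return x

It can then be passed to parMap exactly where rdeepseq was used.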

Also note the GC statistics in the +RTS -s output: GC actually takes most of the CPU time. With -N4 you have 4 parallel GC threads running, and they take even more time. That is why the sequential program uses much more CPU on 4 cores: basically you have 3 GC threads idling in a spin lock waiting for synchronization. They do nothing useful, but they eat CPU in a busy loop. Try limiting the number of parallel GC threads with the -qn1 option.
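For example, a possible invocation (a sketch; the -s summary prints GC time and spark counts, and the exact figures will vary by machine and GHC version):

$ ./par +RTS -N4 -qn1 -s > /dev/null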

Regarding the performance gain: you should not expect perfect scaling. Also, I think you have 1 fizzled spark -- it was evaluated in parallel, but its result was not used.

Added: Comparing with the Python implementation you linked in the comments, I see that you are using a completely different algorithm in Haskell. A more comparable approach is the following (it requires the BangPatterns extension):

{-# LANGUAGE BangPatterns #-}

pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
  go !r 0 = r  -- strict accumulator: no chain of deferred multiplications
  go r b' = go ((r * a) `mod` m) (pred b')

Your original algorithm uses the stack to build the result, so it is bound by GC, not by the actual computation. That is why you don't see a big speedup. With the new one I see a 3x speedup (I had to increase the amount of work to see the speedup, because the new algorithm finishes too quickly on the original workload).
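For completeness, a full parallel program around this version might look like the sketch below; the 10^8 exponent is my own placeholder, chosen only to make each task long enough to measure:

{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies

pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
  go !r 0 = r
  go r b' = go ((r * a) `mod` m) (pred b')

main :: IO ()
main = putStr $ unlines (parMap rdeepseq (show . solve) [100..107])
  where solve x = pow x (10^8) 982451653  -- bigger exponent than the question's 10^7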

7 Comments

Thanks, replacing rdeepseq with rpar fixed it. However, the performance gain is still much less than I expected: 3.1 s with -N1 and 1.9 s with -N4. About the sequential program, I don't see why the GC would use an extra 132% when there is nothing to collect.
I believe the GC threads are asleep until GC can initiate; they are not stuck in a busy loop.
Thanks for the edit. I actually chose to run a small number of long computations so the program can achieve close to perfect scaling. For instance, this Python code has a scaling factor of 3.6 over 4 cores.
@RobertK they are in a busy loop while GC is in progress, but there is not enough work for all GC threads. See here: github.com/ghc/ghc/blob/…
rdeepseq is not buggy. That issue was actually caused by a buggy implementation of rparWith that allowed the optimizer to mess it up. It's fixed in the master branch. I don't know if there's been a release.

I do not believe your parallel example is actually parallel. parMap accepts a strategy, and your strategy simply tells it to perform a deepseq. You need to combine this strategy with one that defines the parallel behaviour, e.g. rpar. You are telling Haskell 'perform this map, using this strategy', and right now your strategy does not define any parallel behaviour.
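A sketch of that combination, reusing the naive pow from the question (dot comes from Control.Parallel.Strategies and composes two strategies; rpar `dot` rdeepseq sparks each element and evaluates it to normal form inside the spark):

import Control.Parallel.Strategies

-- the question's naive modular exponentiation
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m

main :: IO ()
main = putStr $ unlines (parMap (rpar `dot` rdeepseq) (show . solve) [100..107])
  where solve x = pow x (10^7) 982451653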

Also make sure that you compile your program with the -rtsopts flag (I do not know if stack does this for you, but GHC requires it to enable runtime options).
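That is, something like the following (assuming the stack-based invocation from the question):

$ stack ghc par.hs -- -O2 -threaded -rtsopts
$ ./par +RTS -N4 -s > /dev/null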

7 Comments

Thanks, replacing rdeepseq with rpar fixed it. However, the performance gain is still much less than I expected: 3.1 s with -N1 and 1.9 s with -N4. I'll edit my question to include relevant information.
Haskell is a lazy language. You are telling it 'evaluate this in parallel', and it only evaluates as much as it needs to. It might return an expression which is not fully evaluated, leaving work for the 'main' thread. stackoverflow.com/questions/6872898/… You need to combine the rpar strategy with rdeepseq, so that Haskell interprets it as 'in parallel, evaluate this to normal form'. Consider this function when combining strategies: hackage.haskell.org/package/parallel-3.2.2.0/docs/…
Simon Marlow has an excellent book on parallel and concurrent functional programming web.archive.org/web/20171207155221/http://… It explains all these behaviours very nicely. Once you get this working, you could look into granularity control, making sure the parallel tasks are big enough. If the tasks are small, the work of creating the spark might dominate the total work done. @Vincent
Thanks for all this information! I used rpar `dot` rdeepseq as a strategy and added force (from Control.DeepSeq) before and after parMap to make sure Haskell knows all those values need to be computed. However, I did not notice any improvement and the ThreadScope graph looks very similar.
I would suggest looking into granularity control, moving forward. If your list contains 10 000 elements and the computations are fairly trivial, you do not want to create 10 000 sparks. Perhaps you wish to spawn 100 sparks which each compute 100 sequential mappings? This requires some trial and error to find a nice granularity. Also, you can try turning off the garbage collector (by allocating a large amount of space for it) while trying out your parallel strategies. @Vincent
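As a hypothetical sketch of that chunking idea, the library's parListChunk strategy creates one spark per fixed-size chunk instead of one per element (the chunk size of 100 and the dummy computation are placeholders):

import Control.Parallel.Strategies

main :: IO ()
main = print (sum results)
  where
    -- one spark per 100-element chunk, each chunk forced to normal form
    results = map expensive [1 .. 10000] `using` parListChunk 100 rdeepseq
    expensive x = x * x :: Int  -- stand-in for a real per-element computation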
