16

Is it somehow possible, using Scala's parallel collections to parallelize an Iterator without evaluating it completely beforehand?

Here I am talking about parallelizing the functional transformations on an Iterator, namely map and flatMap. I think this requires evaluating some elements of the Iterator in advance, and then computing more once some are consumed via next.

All I could find would require the iterator to be converted to an Iterable or, at best, a Stream. The Stream then gets completely evaluated when I call .par on it.

I also welcome implementation proposals if this is not readily available. Implementations should support parallel map and flatMap.

2
  • The answer is probably no, but can you say a little more about what you want from this? In particular, when should the computation start running: after you create the iterator, or once you call something that forces evaluation? Commented Jun 18, 2013 at 20:35
  • @RexKerr Seems like a design choice; but having it start on first request makes the first request somehow special. I'm currently trying to implement something like this, and I chose to start running right away and store the next n results. Once one is consumed, I compute a replacement. Commented Jun 18, 2013 at 20:38
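The look-ahead design described in the last comment can be sketched roughly like this (the name `parMap` and the `ahead` parameter are mine, not from any library):

```scala
import scala.collection.mutable
import scala.concurrent._
import scala.concurrent.duration.Duration
import ExecutionContext.Implicits.global

// Illustrative sketch: start `ahead` computations immediately; each next()
// awaits the oldest future and submits a replacement from the source.
def parMap[A, B](source: Iterator[A], ahead: Int)(f: A => B): Iterator[B] =
  new Iterator[B] {
    private val inFlight = mutable.Queue.empty[Future[B]]
    private def refill(): Unit =
      while (inFlight.size < ahead && source.hasNext)
        inFlight.enqueue(Future(f(source.next())))
    refill() // computation starts right away, as described above
    def hasNext: Boolean = inFlight.nonEmpty
    def next(): B = {
      val b = Await.result(inFlight.dequeue(), Duration.Inf)
      refill()
      b
    }
  }
```

Order is preserved, at the cost of head-of-line blocking: a slow element delays everything queued behind it.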

4 Answers

6

I realize that this is an old question, but does the ParIterator implementation in the iterata library do what you were looking for?

scala> import com.timgroup.iterata.ParIterator.Implicits._
scala> val it = (1 to 100000).toIterator.par().map(n => (n + 1, Thread.currentThread.getId))
scala> it.map(_._2).toSet.size
res2: Int = 8 // addition was distributed over 8 threads

6 Comments

It addresses the problem. It could be a bit more efficient, though, since you get a lot of blocking if there is large variation in the runtimes of the operations within one chunk.
How might it be made more efficient @ziggystar?
ParIterator splits the Iterator into chunks. So if you have small chunks (e.g. size 2) and one element takes 1s and the other takes 10s, then you have bad parallelization. A different implementation could feed the workers new elements from the iterator once a worker becomes free.
@ziggystar ParIterator in iterata defers that consideration to the standard library parallel collections. So within a single chunk, is that how the Scala parallel collections behave?
I'm not sure you got my point. Even if you do the best thing possible within a chunk, chunking creates barriers across which there is no parallelization. This means you cannot get the maximum CPU utilization possible. Another disadvantage is a higher memory requirement. To parallelize the chunks Scala requires them to be forced, which results in the whole chunk being in memory at the same time (assuming the iterator creates the objects). In theory you only have to have the elements in memory that are currently processed. Large chunks -> good par/bad mem and small chunks -> bad par/good mem.
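The chunk-free alternative described in these comments (workers pulling single elements as they become free) could look roughly like this; it gives up result ordering, and the bounded queue caps memory. All names here are illustrative:

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import java.util.concurrent.atomic.AtomicInteger

// Sketch: a fixed pool pulls elements one at a time, so there is no chunk
// barrier; a bounded queue keeps only a few results in memory at once.
def unorderedParMap[A, B](source: Iterator[A], workers: Int)(f: A => B): Iterator[B] = {
  val results = new LinkedBlockingQueue[Option[B]](workers * 2) // None = end marker
  val active  = new AtomicInteger(workers)
  val lock    = new Object
  def pull(): Option[A] =
    lock.synchronized(if (source.hasNext) Some(source.next()) else None)
  val pool = Executors.newFixedThreadPool(workers)
  (1 to workers).foreach { _ =>
    pool.execute { () =>
      Iterator.continually(pull()).takeWhile(_.isDefined)
        .foreach(a => results.put(Some(f(a.get))))
      if (active.decrementAndGet() == 0) { results.put(None); pool.shutdown() }
    }
  }
  new Iterator[B] {
    private var cur: Option[B] = results.take()
    def hasNext: Boolean = cur.isDefined
    def next(): B = { val b = cur.get; cur = results.take(); b }
  }
}
```

A worker that finishes a 1s element immediately pulls the next one, even while another worker is still busy with a 10s element, which is exactly the barrier-free behavior discussed above.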
4

Your best bet with the standard library is probably not using parallel collections but concurrent.Future.traverse:

import concurrent._
import ExecutionContext.Implicits.global
Future.traverse(Iterator(1,2,3))(i => Future{ i*i })

though I think this will execute the whole thing starting as soon as it can.
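To make the eager behavior concrete (using the same squares example as above, and assuming the global execution context):

```scala
import scala.concurrent._
import scala.concurrent.duration.Duration
import ExecutionContext.Implicits.global

// All three futures are scheduled up front; traverse does not wait for
// one element to be consumed before starting the next.
val fut: Future[Iterator[Int]] = Future.traverse(Iterator(1, 2, 3))(i => Future(i * i))
val squares: List[Int] = Await.result(fut, Duration.Inf).toList // List(1, 4, 9)
```

Result order matches the input order even though the futures may complete in any order.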

Comments

2

From the ML, Traversing iterator elements in parallel:

https://groups.google.com/d/msg/scala-user/q2NVdE6MAGE/KnutOq3iT3IJ

I moved off Future.traverse for a similar reason. For my use case, keeping N jobs working, I wound up with code to throttle feeding the execution context from the job queue.

My first attempt involved blocking the feeder thread, but that risked also blocking tasks which wanted to spawn tasks on the execution context. What do you know, blocking is evil.

4 Comments

Can you comment on why you use (NUM_CPUs + 1)^2 as size for the blocking queue?
Also I found out the hard way that 1. I'm no good at concurrent programming 2. flatMap is more difficult.
@ziggystar By "you" you mean "Juha" on the ML. I don't think it's a magic number: big enough so the consumer doesn't get ahead of the original iterator (which might do i/o, maybe) plus the mapping function (CPU-bound, he says, but long or short running?). I see that the future feeding the queue will block without calling blocking; maybe the +1 is left over from the "desired parallelism". My solution had the end of the pipeline check for more work, i.e., the last thing a job would do is check if enough jobs are in process, and if not, feed the beast. I agree it's hard, simplicity is key.
This seems to work OK, and the API is much easier than Future.traverse's. I combine it with iterator.grouped so that it chunks elements together, which I assume reduces overhead.
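The "check for more work at the end of the pipeline" idea described in the comment above might be sketched like this (all names are mine; each completed job pulls the next element itself, so at most `parallelism` jobs are ever in flight and no pool thread blocks):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent._
import ExecutionContext.Implicits.global

// Sketch: start `parallelism` chains; each chain runs a job and, on
// completion, feeds itself the next element. When every chain has
// exhausted the source, the returned future completes.
def runThrottled[A](source: Iterator[A], parallelism: Int)(job: A => Unit): Future[Unit] = {
  val lock = new Object
  def pull(): Option[A] =
    lock.synchronized(if (source.hasNext) Some(source.next()) else None)
  val done         = Promise[Unit]()
  val activeChains = new AtomicInteger(parallelism)
  def feed(): Unit = pull() match {
    case Some(a) => Future(job(a)).onComplete(_ => feed())
    case None    => if (activeChains.decrementAndGet() == 0) done.trySuccess(())
  }
  (1 to parallelism).foreach(_ => feed())
  done.future
}
```

Because the next element is requested only after a job finishes, the feeder never gets ahead of the workers, which avoids the blocking problem described above.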
0

It's a bit hard to follow exactly what you're after, but perhaps it's something like this:

val f = (x: Int) => x + 1
val s = (0 to 9).toStream map f splitAt(6) match { 
  case (left, right) => left.par; right 
}

This will evaluate f on the first 6 elements in parallel and then return a stream over the rest.

1 Comment

This doesn't seem to run in parallel - don't you need to move the map f to after the par ?
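A corrected version of the sketch, per that comment: split first, then map, so that `par` actually applies to the mapping. This assumes Scala 2.12, where `Stream` and `.par` are in the standard library (in 2.13+ you would use `LazyList` and the scala-parallel-collections module):

```scala
val f = (x: Int) => x + 1

// Split first, then map: the first 6 elements go through the parallel
// collection, the rest stay a lazily-mapped stream.
val (left, right) = (0 to 9).toStream.splitAt(6)
val firstSix = left.par.map(f).toList // mapped in parallel: List(1, 2, 3, 4, 5, 6)
val rest     = right.map(f)           // still a lazy Stream over the remainder
```

Note that `splitAt(6)` forces the first 6 elements of the stream, so only the tail remains lazy.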
