
I have an iterable of tuples, and I'd like to build an ndarray from it. Say that the shape would be (12345, 67890). What would be an efficient and elegant way to do so?

Here are a few options, and why I ruled them out:

  1. np.array(my_tuples) starts allocating the array before it knows the size, which requires inefficient relocations according to NumPy's documentation.

  2. Create an array with uninitialized content using np.ndarray((12345, 67890)) and then loop over the iterable to populate it (sketched below this list). It works and it's efficient, but a bit inelegant because it requires multiple statements.

  3. Use np.fromiter, which appears to be geared towards 1-dimensional arrays only.
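For reference, a minimal sketch of option 2, assuming the iterable is named my_tuples and an integer dtype (both are placeholders):

import numpy as np

# Preallocate an uninitialized array of the target shape, then fill it row by row.
arr = np.empty((12345, 67890), dtype=int)   # np.empty is the usual spelling for an uninitialized array
for i, row in enumerate(my_tuples):         # my_tuples: the iterable of tuples
    arr[i] = row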

Does anyone have a better solution?

(I've seen this question, but I'm not seeing any promising answers there.)

  • So you have something that repeatedly generates tuples or lists of len 67890? Commented Jun 8, 2020 at 16:14
  • Yes, 12345 of them. Commented Jun 8, 2020 at 16:15
  • There really isn't any trick. np.array(list(your_generator)) is straightforward, and probably as efficient as any. np.stack([np.array(row) for row in generator]) might time better (or not), or np.concatenate([np.atleast_2d(row) for row in generator]), etc. Commented Jun 8, 2020 at 16:18
  • We normally use np.zeros or np.empty to create an 'uninitialized' array, not np.ndarray. Commented Jun 8, 2020 at 16:46
  • What's the advantage of that, assuming you're going to fill the array with data? Commented Jun 8, 2020 at 16:47

3 Answers


I suspect you'll find this not elegant enough, but fast it is:

from timeit import timeit
import itertools as it
import numpy as np

def x():
    for i in range(3000):
        yield list(range(i, i+4000))

timeit(lambda: np.fromiter(it.chain.from_iterable(x()), int, 12000000).reshape(3000, 4000), number=10)
# 5.048861996969208

Compare that to, for example:

timeit(lambda:np.concatenate(list(x()),0),number=10)
# 12.466914481949061

Btw. if you do not know the total number of elements in advance, no big deal:

timeit(lambda:np.fromiter(it.chain.from_iterable(x()),int).reshape(3000,-1),number=10)
# 5.331893905065954

4 Comments

Yup, that's the solution that Udi and I came up with earlier today. But... doesn't that look insane to you? I'm sure that I'm misunderstanding something about NumPy, because it can't be that such a basic action as creating an array would need to be written in such a complicated way to be efficient.
@RamRachum I suppose making this robust and fast is not impossible, but it's tricky enough: for example, how do you sniff out the nesting depth and dimension sizes without consuming the iterable? Probably not a high priority for the devs. If you need to do this often, why not write a little convenience function around this or a similar snippet (see the sketch after these comments)?
Why sniff for the length when you can ask the user directly? I'm okay with entering the length; the whole thing could still be much shorter, though.
@RamRachum of course, you can make that contract with yourself and probably trust yourself to stick to it. All I'm saying is that a proper API function must check everything, it can't just segfault, if the user messed up the length, for example.
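
Following up on the comment above, a rough sketch of such a convenience wrapper (the name array2d_from_iter and its signature are made up for illustration, not part of NumPy):

import itertools as it
import numpy as np

def array2d_from_iter(rows, shape, dtype=int):
    # Build a 2-D array from an iterable of row sequences whose total size is known in advance.
    n_rows, n_cols = shape
    flat = it.chain.from_iterable(rows)
    # Supplying the count lets fromiter allocate the buffer once; the reshape is then a view.
    return np.fromiter(flat, dtype, n_rows * n_cols).reshape(n_rows, n_cols)

# Hypothetical usage:
# arr = array2d_from_iter(my_tuples, (12345, 67890))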

Define a generator:

def foo(m,n):
    for i in range(m):
        yield list(range(i,i+n))

timing several alternatives:

In [93]: timeit np.array(list(foo(3000,4000)))                                  
1.74 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [94]: timeit list(foo(3000,4000))                                            
663 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [95]: timeit np.stack([np.array(row) for row in foo(3000,4000)])             
1.32 s ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [96]: timeit np.concatenate([np.array(row, ndmin=2) for row in foo(3000,4000)])
1.33 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [97]: %%timeit  
    ...: arr = np.empty((3000,4000),int) 
    ...: for i,row in enumerate(foo(3000,4000)): 
    ...:     arr[i] = row 
    ...:                                                                        
1.29 s ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and with a flat generator:

def foo1(m,n):
    for i in range(m):
        for j in range(n):
            yield i+j

In [104]: timeit np.fromiter(foo1(3000,4000),int).reshape(3000,4000)
1.54 s ± 5.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3 Comments

Nice benchmarking. Is there a good reason that there isn't an elegant solution that is also efficient? Why wouldn't NumPy provide a function that does your last snippet in one call?
"Ours not to reason why, ours but to do and die." [Alfred Lord Tennyson]
I'm more into the reasoning why part and less into the dying part, but thanks.

Use fromiter() with .reshape(). Reshaping a freshly built 1-D array returns a view, so it does not require extra memory or processing.
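
For instance, a quick way to check that the reshape is a view rather than a copy:

import numpy as np

a = np.fromiter(range(12), dtype=int, count=12)
b = a.reshape(3, 4)
print(b.base is a)   # True: b shares a's data, so no copying happened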

4 Comments

Interesting, but still inelegant in my opinion. Having to calculate the 1-D length and feed it in is ugly.
You've made 3 mistakes here: 1. itertools.tee will have to run through your entire iterable, which is undesirable. 2. I'm assuming I already know the size of the iterable; the issue is that it's an iterable of iterables/sequences, and I'll need to multiply the number of items by the length of each item to feed it to the count argument. Doable, but inelegant for what is a basic action in NumPy.
You are right regarding tee (I deleted the comment). The count argument is optional.
It's optional, but I believe that the array creation will be less efficient without it, because it might have to reallocate the array as it grows.
