
I have an iterable of tuples, and I'd like to build an ndarray from it. Say that the shape would be (12345, 67890). What would be an efficient and elegant way to do so?

Here are a few options, and why I ruled them out:

  1. np.array(my_tuples) starts allocating the array before it knows the size, which requires inefficient relocations according to NumPy's documentation.

  2. Create an array with uninitialized content using np.ndarray((12345, 67890)) and then loop over the iterable to populate it (sketched below this list). It works and it's efficient, but a bit inelegant because it requires multiple statements.

  3. Use np.fromiter, which appears to be geared towards 1-dimensional arrays only.
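For reference, a minimal sketch of option 2, assuming the iterable is named my_tuples and an integer dtype (both are placeholders):

import numpy as np

# Preallocate an uninitialized array of the target shape, then fill it row by row.
arr = np.empty((12345, 67890), dtype=int)   # np.empty is the usual spelling for an uninitialized array
for i, row in enumerate(my_tuples):         # my_tuples: the iterable of tuples
    arr[i] = row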

Does anyone have a better solution?

(I've seen this question, but I'm not seeing any promising answers there.)

  • So you have something that repeatedly generates tuples or lists of len 67890? Commented Jun 8, 2020 at 16:14
  • Yes, 12345 of them. Commented Jun 8, 2020 at 16:15
  • There really isn't any trick. np.array(list(your_generator)) is straightforward, and probably as efficient as any. np.stack([np.array(row) for row in generator]) might time better (or not), or np.concatenate([np.atleast_2d(row) for row in generator]), etc. Commented Jun 8, 2020 at 16:18
  • We normally use np.zeros or np.empty to create an 'uninitialized' array, not np.ndarray. Commented Jun 8, 2020 at 16:46
  • What's the advantage of that, assuming you're going to fill the array with data? Commented Jun 8, 2020 at 16:47

3 Answers


I suspect you'll find this not elegant enough, but fast it is:

from timeit import timeit
import itertools as it
import numpy as np

def x():
    for i in range(3000):
        yield list(range(i, i+4000))

timeit(lambda: np.fromiter(it.chain.from_iterable(x()), int, 12000000).reshape(3000, 4000), number=10)
# 5.048861996969208

Compare that to, for example:

timeit(lambda:np.concatenate(list(x()),0),number=10)
# 12.466914481949061

Btw. if you do not know the total number of elements in advance, no big deal:

timeit(lambda:np.fromiter(it.chain.from_iterable(x()),int).reshape(3000,-1),number=10)
# 5.331893905065954

4 Comments

Yup, that's the solution that Udi and I came up with earlier today. But... doesn't that look insane to you? I'm sure that I'm misunderstanding something about NumPy, because it can't be that such a basic action as creating an array would need to be written in such a complicated way to be efficient.
@RamRachum I suppose making this robust and fast is not impossible, but it's tricky enough: for example, how do you sniff out the nesting depth and dimension sizes without consuming the iterable? Probably not a high priority for the devs. If you need to do this often, why not write a little convenience function around this or a similar snippet (see the sketch after these comments)?
Why sniff for the length when you can ask the user directly? I'm okay with entering the length; the whole thing could still be much shorter, though.
@RamRachum of course, you can make that contract with yourself and probably trust yourself to stick to it. All I'm saying is that a proper API function must check everything, it can't just segfault, if the user messed up the length, for example.
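
Following up on the comment above, a rough sketch of such a convenience wrapper (the name array2d_from_iter and its signature are made up for illustration, not part of NumPy):

import itertools as it
import numpy as np

def array2d_from_iter(rows, shape, dtype=int):
    # Build a 2-D array from an iterable of row sequences whose total size is known in advance.
    n_rows, n_cols = shape
    flat = it.chain.from_iterable(rows)
    # Supplying the count lets fromiter allocate the buffer once; the reshape is then a view.
    return np.fromiter(flat, dtype, n_rows * n_cols).reshape(n_rows, n_cols)

# Hypothetical usage:
# arr = array2d_from_iter(my_tuples, (12345, 67890))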

Define a generator:

def foo(m,n):
    for i in range(m):
        yield list(range(i,i+n))

timing several alternatives:

In [93]: timeit np.array(list(foo(3000,4000)))                                  
1.74 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [94]: timeit list(foo(3000,4000))                                            
663 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [95]: timeit np.stack([np.array(row) for row in foo(3000,4000)])             
1.32 s ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [96]: timeit np.concatenate([np.array(row, ndmin=2) for row in foo(3000,4000)])
1.33 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [97]: %%timeit  
    ...: arr = np.empty((3000,4000),int) 
    ...: for i,row in enumerate(foo(3000,4000)): 
    ...:     arr[i] = row 
    ...:                                                                        
1.29 s ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and with a flat generator:

def foo1(m,n):
    for i in range(m):
        for j in range(n):
            yield i+j

In [104]: timeit np.fromiter(foo1(3000,4000),int).reshape(3000,4000)
1.54 s ± 5.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3 Comments

Nice benchmarking. Is there a good reason that there isn't an elegant solution that is also efficient? Why wouldn't NumPy provide a function that does your last snippet in one call?
"Ours not to reason why, ours but to do and die." [Alfred Lord Tennyson]
I'm more into the reasoning why part and less into the dying part, but thanks.

Use fromiter() with .reshape(). Reshaping a freshly built 1-D array returns a view, so it does not require extra memory or processing.
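
For instance, a quick way to check that the reshape is a view rather than a copy:

import numpy as np

a = np.fromiter(range(12), dtype=int, count=12)
b = a.reshape(3, 4)
print(b.base is a)   # True: b shares a's data, so no copying happened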

4 Comments

Interesting, but still inelegant in my opinion. Having to calculate the 1-D length and feed it in is ugly.
You've made 3 mistakes here: 1. itertools.tee will have to run through your entire iterable, which is undesirable. 2. I'm assuming I already know the size of the iterable; the issue is that it's an iterable of iterables/sequences, and I'll need to multiply the number of items by the length of each item to feed it to the count argument. Doable, but inelegant for what is a basic action in NumPy.
You are right regarding tee (I deleted the comment). The count argument is optional.
It's optional, but I believe that the array creation will be less efficient without it, because it might have to reallocate the array as it grows.
