How can I use numpy/scipy to flatten a nested list with sublists of different sizes? Speed is very important and the lists are large.

 lst = [[1, 2, 3, 4],[2, 3],[1, 2, 3, 4, 5],[4, 1, 2]]

Is anything faster than this?

 vec = sp.array(list(chain(*lst)))

6 Answers

How about np.fromiter:

In [49]: %timeit np.hstack(lst*1000)
10 loops, best of 3: 25.2 ms per loop

In [50]: %timeit np.array(list(chain.from_iterable(lst*1000)))
1000 loops, best of 3: 1.81 ms per loop

In [52]: %timeit np.fromiter(chain.from_iterable(lst*1000), dtype='int')
1000 loops, best of 3: 1 ms per loop
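
If the total length is cheap to compute, np.fromiter also accepts a count argument so it can preallocate the output buffer instead of growing it while the iterator is consumed. A small sketch using the question's lst (the preallocation refinement is my addition, not part of the answer above):

```python
import itertools
import numpy as np

lst = [[1, 2, 3, 4], [2, 3], [1, 2, 3, 4, 5], [4, 1, 2]]

# Passing count lets fromiter allocate the result array once
# up front rather than resizing as the iterator is consumed.
n = sum(map(len, lst))
vec = np.fromiter(itertools.chain.from_iterable(lst), dtype=np.int64, count=n)
```

Whether the count hint helps measurably depends on the input size, so it's worth timing on your own data.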

You can try numpy.hstack:

>>> lst = [[1, 2, 3, 4],[2, 3],[1, 2, 3, 4, 5],[4, 1, 2]]
>>> np.hstack(lst)
array([1, 2, 3, 4, 2, 3, 1, 2, 3, 4, 5, 4, 1, 2])
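
A closely related option (my addition, not part of the answer above) is np.concatenate, which for 1-D inputs produces the same result as hstack with slightly less wrapper overhead:

```python
import numpy as np

lst = [[1, 2, 3, 4], [2, 3], [1, 2, 3, 4, 5], [4, 1, 2]]

# For 1-D sublists, concatenate and hstack give identical output.
flat = np.concatenate(lst)
```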

The fastest way to create a numpy array from an iterator is to use numpy.fromiter:

>>> %timeit numpy.fromiter(itertools.chain.from_iterable(lst), numpy.int64)
100000 loops, best of 3: 3.76 us per loop
>>> %timeit numpy.array(list(itertools.chain.from_iterable(lst)))
100000 loops, best of 3: 14.5 us per loop
>>> %timeit numpy.hstack(lst)
10000 loops, best of 3: 57.7 us per loop

As you can see, this is faster than converting to a list, and much faster than hstack.

How about trying:

np.hstack(lst)

Use chain.from_iterable:

vec = sp.array(list(chain.from_iterable(lst)))

This avoids unpacking with *, which is quite expensive when the iterable has many sublists.
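
The difference can be sketched as follows: chain(*lst) unpacks every sublist into an argument tuple up front, while chain.from_iterable walks the outer list lazily. Both produce the same elements:

```python
import itertools

lst = [[1, 2], [3], [4, 5]]

# chain(*lst) materializes an argument tuple of len(lst) entries first;
# chain.from_iterable consumes the outer list one sublist at a time.
a = list(itertools.chain(*lst))
b = list(itertools.chain.from_iterable(lst))
assert a == b == [1, 2, 3, 4, 5]
```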

Another option might be to sum the lists:

vec = sp.array(sum(lst, []))

Note, however, that this causes quadratic reallocation. Something like this performs much better:

def sum_lists(lst):
    if len(lst) < 2:
        return sum(lst, [])
    else:
        half_length = len(lst) // 2
        return sum_lists(lst[:half_length]) + sum_lists(lst[half_length:])

On my machine I get:

>>> L = [[random.randint(0, 500) for _ in range(x)] for x in range(10, 510)]
>>> timeit.timeit('sum(L, [])', 'from __main__ import L', number=1000)
168.3029818534851
>>> timeit.timeit('sum_lists(L)', 'from __main__ import L,sum_lists', number=1000)
10.248489141464233
>>> 168.3029818534851 / 10.248489141464233
16.422223757114615

As you can see, that's a 16x speed-up. chain.from_iterable is even faster:

>>> timeit.timeit('list(itertools.chain.from_iterable(L))', 'import itertools; from __main__ import L', number=1000)
1.905594825744629
>>> 10.248489141464233 / 1.905594825744629
5.378105042586658

Another 5x speed-up.


I was looking for a pure-Python solution, not knowing numpy. I believe Abhijit/unutbu/senderle's solutions are the way to go in your case.

Use a recursive function to flatten the list:

>>> flatten = lambda x: [y for l in x for y in flatten(l)] if type(x) is list else [x]
>>> flatten(lst)
[1, 2, 3, 4, 2, 3, 1, 2, 3, 4, 5, 4, 1, 2]
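
Unlike the chain-based answers, this also handles arbitrarily deep nesting. If you need that plus a numpy result, one possible combination (my sketch, with a hypothetical helper name) is a generator fed into np.fromiter:

```python
import numpy as np

def flatten_gen(x):
    # Recursively yield the leaves of an arbitrarily nested list.
    if isinstance(x, list):
        for item in x:
            yield from flatten_gen(item)
    else:
        yield x

nested = [[1, [2, 3]], [4], [[5, 6], 7]]
vec = np.fromiter(flatten_gen(nested), dtype=np.int64)
```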
