sorting data in python with huge data

Question

and nothing else.

[1, 28]
[2, 14]
[3, 5]

Can you sort your list by the second element after the first?
– Hoog, Commented Jan 28, 2022 at 15:15

jmd_dk · Accepted Answer · 2022-01-28 15:30:24Z

2

This solution is really simple, but does not exploit the fact that your data is sorted according to the first column.

import collections

data = [
    (1, 50),
    (1, 95),
    (1, 28),
    (2, 104),
    (2, 14),
    (3, 5),
    (3, 28),
]

mins = collections.defaultdict(lambda: float('inf'))
for a, b in data:
    if mins[a] > b:
        mins[a] = b
data_reduced = list(mins.items())
print(data_reduced)

It should be plenty fast!

The slightly advanced collections.defaultdict(lambda: float('inf')) expression results in a special kind of dictionary, which returns float('inf') (infinity) if you look up an element that is not in the dictionary. With this, we can do the mins[a] > b test without worrying about whether mins[a] fails because a might not already be in the dictionary.
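To see that default-value behaviour in isolation, here is a minimal sketch. The keys 'missing' and 'a' and the values 50 and 95 are made up for illustration and are not part of the answer above.

```python
import collections

# A dictionary whose missing keys default to positive infinity.
mins = collections.defaultdict(lambda: float('inf'))

# Looking up a missing key returns the default and also stores it.
first_lookup = mins['missing']

# Any finite number is smaller than infinity, so the first value
# seen for a key always replaces the default.
if mins['a'] > 50:
    mins['a'] = 50
if mins['a'] > 95:
    mins['a'] = 95  # not reached, because 50 > 95 is False

print(first_lookup)  # inf
print(dict(mins))    # {'missing': inf, 'a': 50}
```

Note that a plain lookup such as mins['missing'] inserts the key, so only look up keys you actually intend to keep in the result.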

answered Jan 28, 2022 at 15:30

jmd_dk


3 Comments

Eli Harold Over a year ago

decently fast, 1 sec for 2,000,000 tuples.

jmd_dk Over a year ago

@EliHarold Thanks for the testing. What if we upgrade the mins = ... line to mins = collections.defaultdict(lambda _=float('inf'): _)?

Eli Harold Over a year ago

sorry I already deleted my test code, kinda not in the mood to rewrite it xD

DeepSpace · Accepted Answer · 2022-01-28 15:35:50Z

2

Given that the outer list is already sorted by the first element (which is the premise of the question), I'd use itertools.groupby:

from itertools import groupby

for _, group in groupby(data, lambda t: t[0]):
    print(min(group, key=lambda g: g[1]))

This outputs

[1, 28]
[2, 14]
[3, 5]
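For a version that can be run on its own and that collects the results in a list instead of printing them, something along these lines should work. The function name min_per_group and the use of operator.itemgetter in place of the lambdas are my own choices, and the sample data is the list-of-lists input assumed elsewhere in this thread.

```python
from itertools import groupby
from operator import itemgetter

# Sample input, already sorted by the first element (as the question assumes).
data = [[1, 50], [1, 95], [1, 28], [2, 104], [2, 14], [3, 5], [3, 28]]

def min_per_group(rows):
    """For each run of equal first elements, return the row with the smallest second element."""
    return [
        min(group, key=itemgetter(1))
        for _, group in groupby(rows, key=itemgetter(0))
    ]

result = min_per_group(data)
print(result)  # [[1, 28], [2, 14], [3, 5]]
```

Keep in mind that groupby only groups consecutive rows, so on unsorted input the same first element can show up in several groups.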

edited Jan 28, 2022 at 15:35

answered Jan 28, 2022 at 15:20

DeepSpace


1 Comment

Eli Harold Over a year ago

This is the fastest solution by far. well under 1 sec for 2,000,000 tuples. others are 1sec and the np.array solution is 3 sec.

Eli Harold · Accepted Answer · 2022-01-28 15:17:34Z

1

This runs in seconds for 2,000,000 tuples including creating and reducing the list:

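from random import randint

# Build 2,000,000 random (key, value) tuples and sort them.
l = []
output = []
for i in range(2000000):
    l.append((randint(1, 5), randint(1, 50)))
l = sorted(l)

# Collect every value under its key, then take the minimum per key.
d = {}
for tup in l:
    try:
        d[tup[0]].append(tup[1])
    except KeyError:  # first time this key is seen
        d[tup[0]] = [tup[1]]
for k, v in d.items():
    output.append((k, min(v)))
print(output)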

Output:

[(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]

Solution without setup given sorted l list:

d = {}
output = []
for tup in l:
    try:
        d[tup[0]].append(tup[1])
    except KeyError:  # first time this key is seen
        d[tup[0]] = [tup[1]]
for k, v in d.items():
    output.append((k, min(v)))
print(output)

answered Jan 28, 2022 at 15:17

Eli Harold


eshirvana · Accepted Answer · 2022-01-28 17:44:41Z

1

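Here is one way. Sorting is O(n log n) and finding the minimums afterwards is O(n). Because the inner lists sort by first element and then by second, the first row seen for each first element is the one with the smallest second element:

li.sort()
res = []
for i in li:
    # Keep only the first row of each group.
    if not res or i[0] != res[-1][0]:
        res.append(i)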

print(res)

output:

[[1, 28], [2, 14], [3, 5]]

Another method, which should be much faster because it doesn't need sorting. This should be O(n):

res = {}
for a, b in li:
    res[a] = min(res.get(a, b), b)
print([*res.items()])
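As a rough check that the dictionary approach really does not depend on the input order, here is a small sketch. The function name min_by_key and the shuffled sample input are made up for illustration.

```python
def min_by_key(pairs):
    """Single pass: keep the smallest second element seen for each first element."""
    res = {}
    for a, b in pairs:
        # res.get(a, b) falls back to b itself the first time a is seen.
        res[a] = min(res.get(a, b), b)
    return list(res.items())

# Deliberately unsorted input.
unsorted_pairs = [(3, 28), (1, 50), (2, 14), (1, 28), (3, 5), (2, 104), (1, 95)]

result = min_by_key(unsorted_pairs)
print(result)  # [(3, 5), (1, 28), (2, 14)]
```

The keys come out in the order they were first seen, so sort the result if you need it ordered by the first element.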

edited Jan 28, 2022 at 17:44

answered Jan 28, 2022 at 15:17

eshirvana


Lukasz Wiecek · Accepted Answer · 2022-01-29 17:10:21Z

I would go with something like this (a modification of @jmd_dk's answer). There is no need for a dictionary here, since the elements are sorted on the first index. That gets rid of the memory footprint associated with the dictionary, which could be a big plus if your data set is very large.

data = [
    (1, 50),
    (1, 95),
    (1, 28),
    (2, 104),
    (2, 14),
    (3, 5),
    (3, 28),
]

last_a = None
minimum = None
for a, b in data:
    # A change in the first index means it's time to start a new group and look for its minimum.
    if a != last_a:
        # Emit the finished group before starting the next one.
        if last_a is not None:
            print([last_a, minimum])
        last_a = a
        minimum = b  # the first element of a group seeds its minimum
    else:
        minimum = min(minimum, b)
# Emit the final group (skipped if data is empty).
if last_a is not None:
    print([last_a, minimum])

Output:

[1, 28]
[2, 14]
[3, 5]

Mario · Accepted Answer · 2022-01-28 15:24:50Z

0

A short answer with numpy and comprehensions would be:

import numpy as np

a = np.array(a)
data = [(i, j) for i, j in {k: v for k, v in a[np.argsort((-a[:,1]))].tolist()}.items()]

Assuming the input is:

a = [
[1, 50],
[1, 95],
[1, 28],
[2, 104],
[2, 14],
[3, 5],
[3, 28]
]

Output would be:

[(2, 14), (1, 28), (3, 5)]
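If the first column is already sorted, as in the question, a different NumPy route is to find where each group starts and reduce over those segments. This is not the method used above but a swapped-in sketch built on np.unique with return_index and np.minimum.reduceat; the variable names keys, starts and mins are my own.

```python
import numpy as np

# Same sample input, sorted by the first column.
a = np.array([[1, 50], [1, 95], [1, 28], [2, 104], [2, 14], [3, 5], [3, 28]])

# For sorted keys, return_index gives the position where each group starts.
keys, starts = np.unique(a[:, 0], return_index=True)

# reduceat takes the minimum over each segment between consecutive start positions.
mins = np.minimum.reduceat(a[:, 1], starts)

result = list(zip(keys.tolist(), mins.tolist()))
print(result)  # [(1, 28), (2, 14), (3, 5)]
```

This relies on equal first elements being contiguous; on unsorted input the segments would not line up with the groups.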

answered Jan 28, 2022 at 15:24

Mario


4 Comments

Eli Harold Over a year ago

This is pretty slow.

Eli Harold Over a year ago

3x slower than all other solutions

Mario Over a year ago

@EliHarold well, it's the best I came up with

Eli Harold Over a year ago

that's no problem, just giving info for OP and future readers.
