I have a dataframe like this:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'col1': [np.asarray([1,4,3,2]), np.asarray([9,10,7,5]), np.asarray([100,120,10,22])],
    'col2': [np.asarray([0,1,4,5]), np.asarray([100,101,102,103]), np.asarray([10,11,12,13])]
})

df1
                 col1                  col2
0        [1, 4, 3, 2]          [0, 1, 4, 5]
1       [9, 10, 7, 5]  [100, 101, 102, 103]
2  [100, 120, 10, 22]      [10, 11, 12, 13]

I want to reorder the values of each array in col2 according to the sort order of the corresponding array in col1.

Here's my solution:

sort_idx = df1['col1'].apply(np.argsort).values
for rowidxval, (index, row) in enumerate(df1.iterrows()):
    df1['col1'][index] = df1['col1'][index][sort_idx[rowidxval]]
    df1['col2'][index] = df1['col2'][index][sort_idx[rowidxval]]

Is there an elegant, Pythonic way of doing this instead of brute-force sorting the dataframe row by row? What if I want to re-sort more than one column based on the values in column 1?

3 Answers

5

Storing lists in columns is not recommended (mixed and mutable dtypes force an object column, which cannot be vectorised and becomes a performance bottleneck), but a list comprehension makes this about as fast as it can be:

df1['col2'] = [np.array(y)[np.argsort(x)] for x, y in zip(df1.col1, df1.col2)]
df1

                 col1                  col2
0        [1, 4, 3, 2]          [0, 5, 4, 1]
1       [9, 10, 7, 5]  [103, 102, 100, 101]
2  [100, 120, 10, 22]      [12, 13, 10, 11]

If both columns already hold NumPy arrays, the solution simplifies:

df1['col2'] = [y[x.argsort()] for x, y in zip(df1.col1, df1.col2)]
df1

                 col1                  col2
0        [1, 4, 3, 2]          [0, 5, 4, 1]
1       [9, 10, 7, 5]  [103, 102, 100, 101]
2  [100, 120, 10, 22]      [12, 13, 10, 11]
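
The question also asks about re-sorting more than one column. The same list-comprehension pattern appears to extend to that case. What follows is a minimal sketch, assuming every listed column holds NumPy arrays of equal length per row; `col3` is a hypothetical extra column added only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'col1': [np.asarray([1, 4, 3, 2]), np.asarray([9, 10, 7, 5])],
    'col2': [np.asarray([0, 1, 4, 5]), np.asarray([100, 101, 102, 103])],
    # Hypothetical third column, not part of the original question.
    'col3': [np.asarray([7, 8, 9, 6]), np.asarray([1, 2, 3, 4])],
})

# Compute the per-row sort order once, from col1.
order = [np.argsort(x) for x in df1['col1']]

# Apply the same per-row order to every column, col1 included.
for col in ['col1', 'col2', 'col3']:
    df1[col] = [arr[idx] for arr, idx in zip(df1[col], order)]
```

Computing `order` up front matters here: if `col1` were reordered first and the order recomputed afterwards, the remaining columns would no longer be permuted.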

For more information on performance-related concerns, see the section on "Mixed dtypes" in "For loops with pandas - When should I care?".


3

Using a nested list comprehension with sorted and zip:

[[z for _,z in sorted(zip(x,y))] for x, y in zip(df1.col1, df1.col2)]
Out[250]: [[0, 5, 4, 1], [103, 102, 100, 101], [12, 13, 10, 11]]

# To store the result back in the dataframe:
# df1.col2 = [[z for _, z in sorted(zip(x, y))] for x, y in zip(df1.col1, df1.col2)]
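
One caveat with this approach: `sorted(zip(x, y))` compares whole `(x, y)` tuples, so when `x` contains duplicates the ties are broken by the `y` values. An argsort-based reordering does not do that. The small example below uses made-up data (not from the question) to show the difference. Passing a `key` to `sorted` is one possible way to match the stable argsort result.

```python
import numpy as np

# Made-up data with a tie in x (the two 2s).
x = [2, 1, 2]
y = [9, 5, 3]

# Tuples are compared element by element, so the tie in x is broken by y.
by_pairs = [z for _, z in sorted(zip(x, y))]

# Sorting on x only; sorted() is stable, so tied items keep their original order.
by_key = [z for _, z in sorted(zip(x, y), key=lambda pair: pair[0])]

# Stable argsort on x also keeps tied items in their original order.
by_argsort = np.asarray(y)[np.argsort(x, kind='stable')].tolist()

print(by_pairs)    # [5, 3, 9]
print(by_key)      # [5, 9, 3]
print(by_argsort)  # [5, 9, 3]
```

For the data in the question there are no ties within any `col1` array, so both approaches give the same result.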


0

Try to avoid storing NumPy arrays within a series. Such a data structure does not support vectorised computations. Since all your arrays have the same size in this case, you can easily split them into multiple columns:

# STEP 1: split NumPy arrays into separate columns
col1 = pd.DataFrame(df1.pop('col1').values.tolist()).add_prefix('col1_')
col2 = pd.DataFrame(df1.pop('col2').values.tolist()).add_prefix('col2_')
df1 = df1.join(pd.concat([col1, col2], axis=1))

# STEP 2: calculate indices for NumPy assignment
x_idx = np.arange(df1.shape[0])[:, None]
y_idx = df1.iloc[:, :4].values.argsort(1)

# STEP 3: assign via iloc
df1.iloc[:, 4:] = df1.iloc[:, 4:].values[x_idx, y_idx]

print(df1)

#    col1_0  col1_1  col1_2  col1_3  col2_0  col2_1  col2_2  col2_3
# 0       1       4       3       2       0       5       4       1
# 1       9      10       7       5     103     102     100     101
# 2     100     120      10      22      12      13      10      11
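
The same row-wise reordering can also be written with `np.take_along_axis`, which avoids building the row index array by hand. This is a sketch on plain 2-D arrays, assuming the data has already been stacked into one array per original column. The names `col1_sorted` and `col2_sorted` are illustrative only.

```python
import numpy as np

# The question's data, stacked into one 2-D array per original column.
col1 = np.array([[1, 4, 3, 2], [9, 10, 7, 5], [100, 120, 10, 22]])
col2 = np.array([[0, 1, 4, 5], [100, 101, 102, 103], [10, 11, 12, 13]])

# Per-row sort order of col1.
order = np.argsort(col1, axis=1)

# Apply that order along axis 1 to each array.
col1_sorted = np.take_along_axis(col1, order, axis=1)
col2_sorted = np.take_along_axis(col2, order, axis=1)

print(col2_sorted)
```

Any number of additional arrays of the same shape can be reordered by reusing `order` in the same way.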
