I have two dataframes that share an ID column. The first dataframe is split up and sent to the owners of the data for updates. Once returned, the pieces are put back together into a single dataframe. That updated dataframe now contains new entries that have no ID yet, and it is in a different order from the original. df1 is the old dataframe and df2 is the new one. I want to sort df2 based on the ID column in df1 and leave the new entries at the bottom. The IDs are randomly generated and have no inherent order, which is by design.

Is there any good way of doing that? I looked at this post, which makes use of indexing. I could make my ID column the index, but as some new entries would not have an ID yet, that would not work.

I have made a mock-up of the situation here (df is the old dataframe, df2 the new one):

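import pandas as pd

# Old dataframe: every row already has an ID
df = pd.DataFrame(columns=['Name', 'DataOwner', 'UniqueID'], data=[['P1', 1, 123], ['P2', 2, 321], ['P3', 3, 456]])
# New dataframe: rows without an ID yet (None) show up as NaN
df2 = pd.DataFrame(columns=['Name', 'DataOwner', 'UniqueID'], data=[['P1', 1, 123], ['P4', 1, None], ['P2', 2, 321], ['P5', 2, None], ['P3', 3, 456], ['P6', 3, None]])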

Which results in these two dataframes:

  Name  DataOwner  UniqueID
0   P1          1       123
1   P2          2       321
2   P3          3       456
  Name  DataOwner  UniqueID
0   P1          1     123.0
1   P4          1       NaN
2   P2          2     321.0
3   P5          2       NaN
4   P3          3     456.0
5   P6          3       NaN

The project names are descriptive text and cannot be used for sorting. The DataOwner column is not sorted; it is only there to illustrate that the data is returned per data owner and put together in one big dataframe before I need to sort it by ID, with new entries at the bottom.

The result I want is then:

  Name  DataOwner  UniqueID
0   P1          1       123
1   P2          2       321
2   P3          3       456
  Name  DataOwner  UniqueID
0   P1          1     123.0
2   P2          2     321.0
4   P3          3     456.0
1   P4          1       NaN
3   P5          2       NaN
5   P6          3       NaN

The order of the new entries does not matter; they just need to be at the bottom.

2 Answers

One option using a custom key in sort_values:

key = pd.Series({k:v for v,k in enumerate(df['UniqueID'].unique())})

out = df2.sort_values(by='UniqueID', key=key.reindex, na_position='last')

Output:

  Name  DataOwner  UniqueID
0   P1          1     123.0
2   P2          2     321.0
4   P3          3     456.0
1   P4          1       NaN
3   P5          2       NaN
5   P6          3       NaN
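
To see why passing key.reindex works, it can help to look at what the key produces before sorting. The following is a minimal sketch on the mock-up data from the question; the variable name positions is only for illustration, and the missing IDs are written as an explicit NaN here.

```python
import pandas as pd

nan = float("nan")  # explicit stand-in for "no ID yet"
cols = ["Name", "DataOwner", "UniqueID"]
df = pd.DataFrame(columns=cols,
                  data=[["P1", 1, 123], ["P2", 2, 321], ["P3", 3, 456]])
df2 = pd.DataFrame(columns=cols,
                   data=[["P1", 1, 123], ["P4", 1, nan], ["P2", 2, 321],
                         ["P5", 2, nan], ["P3", 3, 456], ["P6", 3, nan]])

# Lookup table: each ID in df mapped to its row position in df
key = pd.Series({k: v for v, k in enumerate(df["UniqueID"].unique())})

# What sort_values sorts by: the position of each df2 ID in df,
# or NaN when the ID is missing or unknown
positions = key.reindex(df2["UniqueID"])

# Sorting by those positions; rows with a NaN position go last
out = df2.sort_values(by="UniqueID", key=key.reindex, na_position="last")
```

Rows whose ID does not appear in df get a NaN position, and na_position='last' pushes them to the bottom, so the result follows the row order of df rather than the numeric order of the IDs.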

2 Comments

This does work, but not if I use the column UniqueID, which is the only column I can be absolutely certain is unique and unchanged between the two datasets. All the other columns are open to changes when being updated. Thanks even so. Edit: My bad, this works on the UniqueID column as well. Thank you :)
An important addition I found later: if I need to be able to use the same indexing in both dataframes afterwards, I have to reset the index: final_sorted_w_reset_index = out.reset_index(drop=True). This ensures that my newly sorted dataframe has the same index on the same rows as the one I sorted by. Otherwise, df[column][idx] and out[column][idx] would access different rows, because the index labels were reordered along with the rows but not reset.
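
A small sketch of the index issue described in this comment, again on the mock-up data and with the missing IDs written as an explicit NaN. The label 1 is just an arbitrary row picked for illustration.

```python
import pandas as pd

nan = float("nan")  # explicit stand-in for "no ID yet"
cols = ["Name", "DataOwner", "UniqueID"]
df = pd.DataFrame(columns=cols,
                  data=[["P1", 1, 123], ["P2", 2, 321], ["P3", 3, 456]])
df2 = pd.DataFrame(columns=cols,
                   data=[["P1", 1, 123], ["P4", 1, nan], ["P2", 2, 321],
                         ["P5", 2, nan], ["P3", 3, 456], ["P6", 3, nan]])

key = pd.Series({k: v for v, k in enumerate(df["UniqueID"].unique())})
out = df2.sort_values(by="UniqueID", key=key.reindex, na_position="last")

# After sorting, the rows keep their old df2 labels, so label 1 in `out`
# is still the row that sat at position 1 in df2, not the second row of df
name_in_df = df.loc[1, "Name"]
name_in_out = out.loc[1, "Name"]

# Resetting the index relabels the sorted rows 0..n-1, so labels line up with df
final_sorted_w_reset_index = out.reset_index(drop=True)
name_after_reset = final_sorted_w_reset_index.loc[1, "Name"]
```

With drop=True the old labels are discarded instead of being kept as an extra column.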

CategoricalDtype is my go-to when I need a custom sort order. This assumes that the index on df2 is unique:

UniqueIDType = pd.CategoricalDtype(df["UniqueID"], ordered=True)
index = df2["UniqueID"].astype(UniqueIDType).sort_values().index

df2.reindex(index)

1 Comment

This does work, thank you very much. Would you be able to explain a little bit what is going on exactly?
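
As far as I can tell, the CategoricalDtype approach breaks down into the steps below. This is only a sketch on the mock-up data from the question, with the missing IDs written as an explicit NaN; the intermediate names (as_cat, sorted_index, result) are made up for illustration.

```python
import pandas as pd

nan = float("nan")  # explicit stand-in for "no ID yet"
cols = ["Name", "DataOwner", "UniqueID"]
df = pd.DataFrame(columns=cols,
                  data=[["P1", 1, 123], ["P2", 2, 321], ["P3", 3, 456]])
df2 = pd.DataFrame(columns=cols,
                   data=[["P1", 1, 123], ["P4", 1, nan], ["P2", 2, 321],
                         ["P5", 2, nan], ["P3", 3, 456], ["P6", 3, nan]])

# 1. An ordered categorical type whose categories are the IDs of df,
#    in the order they appear in df
UniqueIDType = pd.CategoricalDtype(df["UniqueID"], ordered=True)

# 2. Cast the df2 IDs to that type; each known ID gets the code of its
#    category (its position in df), and missing IDs stay missing (code -1)
as_cat = df2["UniqueID"].astype(UniqueIDType)

# 3. Sorting an ordered categorical follows the category order, with
#    missing values placed last; keep only the resulting row labels
sorted_index = as_cat.sort_values().index

# 4. Pull the rows of df2 in that label order; this lookup by label is
#    why the index on df2 needs to be unique
result = df2.reindex(sorted_index)
```

The sort therefore never compares the ID values themselves, only their positions in df, which is why it works even though the IDs have no meaningful order of their own.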
