
I am just trying to order a dask dataframe by a specific column.

CODE 1 - If I call it, it indeed shows up as a Dask dataframe (ddf)

my_ddf

OUTPUT 1

npartitions=1   
headers .....

CODE 2

my_ddf.sort_values('id', ascending=False)

OUTPUT 2

AttributeError                            Traceback (most recent call last)
<ipython-input-374-35ce4bd06557> in <module>
----> 1 my_ddf.sort_values('id', ascending=False) #.head(20)
      2 # df.sort_values(columns, ascending=True)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   3619             return self[key]
   3620         else:
-> 3621             raise AttributeError("'DataFrame' object has no attribute %r" % key)
   3622 
   3623     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute 'sort_values'

Tried Solutions

  • This is an example from the official dask documentation: df.sort_values(columns, ascending=False).head(n)
  • pandas only - DataFrame object has no attribute 'sort_values'
  • pandas only - 'DataFrame' object has no attribute 'sort'
  • DASK answer - https://stackoverflow.com/a/40378896/10270590
    • I don't want to set it as the index, because I want to keep my current index values.
    • The following answer is a bit strange, and I am not sure it would work when I have more partitions (currently I have 1 because of a previous group-by on the data), how to avoid the arbitrary big number "1000000000", or how to make the order increasing from top to bottom in the dask dataframe: my_ddf.nlargest(1000000000, 'id').compute() (see the sketch after this list).
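
A minimal sketch of that last idea, avoiding the hard-coded "1000000000" by computing the actual row count first (my_ddf and the id column are taken from the question; this only removes the magic number, it does not make nlargest scale any better):

n_rows = len(my_ddf)                            # computes the total number of rows
top = my_ddf.nlargest(n_rows, 'id').compute()   # pandas DataFrame, descending by 'id'
increasing = top.iloc[::-1]                     # flip if you want increasing top-to-bottom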

3 Answers


AFAIK, sorting across partitions is not implemented (yet?). If the dataset is small enough to fit in memory, you can do ddf = ddf.compute() and then sort the resulting pandas dataframe.
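
For example, a minimal sketch assuming the data fits in memory (my_ddf and the id column are from the question):

pdf = my_ddf.compute()                          # materialize into one pandas DataFrame
pdf = pdf.sort_values('id', ascending=False)    # the regular pandas sort now works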


2 Comments

Currently it does fit, but only because I am building out the pipeline with a smaller dataset. So it is crucial to keep everything in dask: when I put 100x the load on it, it should still function.
I see. When I had a problem like that, I had to incorporate additional logic that dask was not aware of, e.g. that certain values could only appear in certain partitions, so a full data shuffle was not needed. I ended up using delayed for that task.
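
A rough illustration of the delayed approach mentioned above, as a sketch only: it assumes a full shuffle is unnecessary because the values you care about never span partitions (my_ddf and id are from the question):

import dask
import dask.dataframe as dd

def sort_partition(pdf):
    # Plain pandas sort, applied to each partition independently.
    return pdf.sort_values('id', ascending=False)

# One delayed pandas DataFrame per partition, sorted without any shuffle.
parts = [dask.delayed(sort_partition)(p) for p in my_ddf.to_delayed()]
sorted_ddf = dd.from_delayed(parts)  # meta is inferred from the first partition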

Dask indexes are not global anyway (by default). If you want to retain the original within-partition index, you can do something like:

df = df.reset_index()                           # the old index becomes a column named "index"
df = df.rename(columns={"index": "old_index"})  # keep it under a clearer name
df = df.set_index("colA")
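
If you need the original ordering back later, you can presumably call set_index("old_index") again within each partition.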



Try setting the index to id and then sorting within each partition through map_partitions, like below:

df = df.set_index("id")  # shuffles the data so the index is sorted across partitions
df = df.map_partitions(lambda pdf: pdf.sort_values("id", ascending=False)).reset_index()
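
Note that with a single partition (as in the question) this simply yields the frame in descending id order; with several partitions, set_index leaves the index ascending across partitions while the lambda sorts descending within each one, so the per-partition and cross-partition orders will disagree.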

