I am working with this csv https://drive.google.com/file/d/1o3Nna6CTdCRvRhszA01xB9chawhngGV7/view?usp=sharing

I am trying to sort by the 'Taxes' column, but when I use

import pandas as pd

df = pd.read_csv('statesFedTaxes.csv')
df.Taxes.values.sort_values()

I get

AttributeError: 'numpy.ndarray' object has no attribute 'sort_values'

This is baffling to me and I cannot find a similar problem online. How can I sort the data by the "Taxes" column?

EDIT: I should explain that my real problem is that when I use

df.sort_values('Taxes')

I get this output:

    State   Taxes
48  Washington  100,609,767
24  Minnesota   102,642,589
25  Mississippi 11,273,202
13  Idaho   11,343,181
30  New Hampshire   12,208,656
54  International   12,611,648
22  Massachusetts   120,035,203
40  Rhode Island    14,325,645
31  New Jersey  140,258,435

Therefore, I assume the commas are getting in the way of my chart sorting properly. How do I get over this?

  • From the DataFrame.values docs: "We recommend using DataFrame.to_numpy() instead." (That name should help you understand the error: you are calling a pandas method on a NumPy array.) Commented Nov 21, 2020 at 23:24

3 Answers

3
import pandas as pd
df = pd.DataFrame({"Taxes": ["1,000", "100", "100,000"]})

Your dataframe looks fine when we print it, but sorting leaves it in string order rather than numeric order.

>>> df.sort_values(by="Taxes")
     Taxes
0    1,000
1      100
2  100,000

But the dtype is wrong: these are strings (stored as object), not numbers, so they are compared character by character. When you call .values you just get an array of more strings, not numbers.

>>> df.dtypes
Taxes    object

So turn them into numbers by stripping the commas and casting to int:

>>> df['Taxes'] = df['Taxes'].str.replace(",", "").astype(int)

>>> df.sort_values(by="Taxes")
    Taxes
1     100
0    1000
2  100000

Now it's fine.
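
If you would rather keep the comma-formatted strings in the column, one possible sketch is to convert only for the purpose of sorting. This assumes pandas 1.1 or later, where sort_values accepts a key argument, and the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data, not the real CSV.
df = pd.DataFrame({"Taxes": ["1,000", "100", "100,000"]})

# key receives the column as a Series and must return a Series of the
# same length. Rows are ordered by the converted integers, while the
# stored strings are left untouched.
sorted_df = df.sort_values(
    by="Taxes",
    key=lambda s: s.str.replace(",", "", regex=False).astype(int),
)
print(sorted_df)
```

The trade-off is that the column stays as strings, so any later arithmetic on it would still need the conversion shown above.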

Another option is to define the thousands separator explicitly when reading the file, which fixes the dtype problem at load time.

df = pd.read_csv('statesFedTaxes.csv', thousands=",")
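
To check the effect without the original file, here is a small sketch that reads an in-memory stand-in. The layout (a State column and a quoted Taxes column) is an assumption about statesFedTaxes.csv; the three rows are taken from the output in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for statesFedTaxes.csv; the real file layout may differ.
csv_text = '''State,Taxes
Washington,"100,609,767"
Mississippi,"11,273,202"
Massachusetts,"120,035,203"
'''

# thousands="," tells the parser to treat commas inside numbers as
# digit-group separators, so Taxes is read as an integer column.
df = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(df.dtypes)

sorted_df = df.sort_values("Taxes")
print(sorted_df)
```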

2

You basically have the order inverted: sort the column values first, then extract them to an array:

df.sort_values("Taxes")["Taxes"].values

3 Comments

That's good for sorting the Taxes column, but I want to sort the whole dataframe by the Taxes column. If I do df['Taxes'] = df.sort_values('Taxes')['Taxes'].values, I end up with an inaccurate 'State' column.
Let's do this step by step. sort_values("Taxes") sorts the entire dataframe by taxes. ["Taxes"] then extracts one column; if you want both columns, omit this bracket. .values converts the content to an array: 1D if it's one column, otherwise 2D.
Unfortunately, I am unable to add comments to other answers so here is what might be helpful for the problem mentioned in your edit: pass the argument thousands=',' to pd.read_csv so these numbers will be interpreted correctly. It would then look like df = pd.read_csv('statesFedTaxes.csv', thousands=",")
1

df.Taxes is a Series object, and df.Taxes.values is an ndarray object. In this case, you're not calling sort_values on the data frame df; you're trying to call it on the NumPy array underlying the Taxes column, which has no such method.

df.sort_values('Taxes') will give you df sorted on that column.
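
A quick way to see the difference between the three objects is sketched below. The frame is a made-up example with numeric taxes, not the asker's data.

```python
import numpy as np
import pandas as pd

# Made-up example frame with an already-numeric Taxes column.
df = pd.DataFrame({"State": ["A", "B", "C"], "Taxes": [300, 100, 200]})

print(type(df["Taxes"]))         # pandas Series: has sort_values()
print(type(df["Taxes"].values))  # NumPy ndarray: has no sort_values()

sorted_series = df["Taxes"].sort_values()   # sorts one column, keeps index labels
sorted_array = np.sort(df["Taxes"].values)  # returns a sorted copy of the raw array
sorted_frame = df.sort_values("Taxes")      # sorts whole rows, so State stays aligned
print(sorted_frame)
```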

2 Comments

I realize this should have been my original question: when I do, the numbers are sorted by putting 100,000 on top, then 102,000, then 11,000, then 26,000, etc. In other words, the commas mess up the sorting. How do I overcome this?
This means that the Taxes column holds strings, not ints. You'll need to remove the commas from the strings and then convert them to integers.
