This is a basic question about sorting arrays in NumPy and pandas:

I noticed that when I used pandas to sort and select specific columns of a data frame, it took almost twice as long as when I changed the code to use NumPy arrays.

What is the reason for this change in speed?

Thanks, Leon

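E.g. pandas:

import pandas as pd

j = pd.DataFrame(df)         # df columns["date","I",...]
# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
j = j.sort_values("date", ascending=False)
x = [[DATES[int(k[1]) - 1]] for k in j["date"].tolist()]
y = j["I"].tolist()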

E.g. NumPy:

import numpy as np

j = np.array(df)             # df column["date"] == j[:,0]
# sorted() is Python's built-in sort applied row by row, not a NumPy routine
j = np.array(sorted(j, key=lambda a_entry: a_entry[0]))
x = [[DATES[int(k[1]) - 1]] for k in j[:,0].tolist()]
y = j[:,4].tolist()          # df column["I"] == j[:,4]

1 Answer

https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ explains it quite nicely. pandas has a lot of overhead compared to NumPy.

Quote from that site: "Why is Pandas so much slower than NumPy? The short answer is that Pandas is doing a lot of stuff when you index into a Series, and it’s doing that stuff in Python."
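If you want to check the indexing overhead on your own machine, here is a rough sketch that times element-by-element access on a NumPy array and on a pandas Series holding the same values. The sample data, the helper name sum_by_index, and the repeat count are my own assumptions for illustration and are not taken from the question or the linked post. The actual timings will vary with the library versions and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data; the size is arbitrary.
values = np.arange(1_000, dtype=np.float64)
series = pd.Series(values)  # same data, default integer index


def sum_by_index(container, n):
    """Sum the first n elements, fetching them one at a time in a Python loop."""
    total = 0.0
    for i in range(n):
        total += container[i]  # each lookup goes through the container's indexing code
    return total


n = len(values)
numpy_seconds = timeit.timeit(lambda: sum_by_index(values, n), number=20)
pandas_seconds = timeit.timeit(lambda: sum_by_index(series, n), number=20)

print(f"numpy  element access: {numpy_seconds:.4f} s")
print(f"pandas element access: {pandas_seconds:.4f} s")
```

Both loops return the same sum, so any difference in the printed times comes from the per-element lookup rather than from the arithmetic. This only measures repeated scalar indexing; it says nothing about vectorized operations such as sorting a whole column at once.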
