Processing data using pandas in a memory efficient manner using Python

Question

I have to read multiple csv files and group them by "event_name". I also might have some duplicates, so I need to drop them. paths contains all the paths of the csv files, and my code is as follows:

data = []
for path in paths:
    csv_file = pd.read_csv(path)
    data.append(csv_file)

events = pd.concat(data)
events = events.drop_duplicates()

event_names = events.groupby('event_name')

ev2 = []

for name, group in event_names:
    a, b = group.shape
    ev2.append([name, a])

This code is going to tell me how many unique event_name unique there are, and how many entries per event_name. It works wonderfully, except that the csv files are too large and I am having memory problems. Is there a way to do the same using less memory?

I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names, I don't need the DataFrame events any longer. However, I am still having those memory issues. My question more specifically is: can I read the csv files in a more memory-efficient way? or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all csv files at once, instead of doing chunk by chunk.

Bill Huang · Accepted Answer · 2020-09-30 21:30:42Z

1

Just keep a hash value of each row to reduce the data size.

csv_file = pd.read_csv(path)

# compute hash (gives an `uint64` value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)

# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])

If you cannot risk hash collision (which would be astronomically unlikely), just use another hash key and check if the final counting results are identical. The way to change a hash key is as follows.

# compute hash using a different hash key
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow')

Reference: pandas official docs page

edited Sep 30, 2020 at 21:30

answered Sep 30, 2020 at 21:21

Bill Huang

4,6772 gold badges15 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Processing data using pandas in a memory efficient manner using Python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related