
I wrote the following script, but I have an issue with memory consumption: pandas allocates more than 30 GB of RAM, while the data files together add up to roughly 18 GB.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import time


mean_wo = pd.DataFrame()
mean_w = pd.DataFrame()
std_w = pd.DataFrame()
std_wo = pd.DataFrame()

start_time = time.time()  # record the start time

data_files = ['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5']



for data_file in data_files:
    print(data_file)
    df = pd.read_hdf(data_file)
    grouped = df.groupby('day')
    # per-day mean and standard deviation of both significance columns
    mean_wo_tmp = grouped['Significance_without_muons'].agg([np.mean])
    mean_w_tmp = grouped['Significance_with_muons'].agg([np.mean])
    std_wo_tmp = grouped['Significance_without_muons'].agg([np.std])
    std_w_tmp = grouped['Significance_with_muons'].agg([np.std])
    # append this file's aggregates to the running results
    mean_wo = pd.concat([mean_wo, mean_wo_tmp])
    mean_w = pd.concat([mean_w, mean_w_tmp])
    std_w = pd.concat([std_w, std_w_tmp])
    std_wo = pd.concat([std_wo, std_wo_tmp])
    print(mean_wo.info())
    print(mean_w.info())
    # drop the per-file objects before reading the next file
    del df, grouped, mean_wo_tmp, mean_w_tmp, std_w_tmp, std_wo_tmp

std_wo = std_wo.reset_index()
std_w = std_w.reset_index()
mean_wo = mean_wo.reset_index()
mean_w = mean_w.reset_index()

# parse the 'day' field as a date
std_wo['day'] = pd.to_datetime(std_wo['day'], format='%Y-%m-%d')
std_w['day'] = pd.to_datetime(std_w['day'], format='%Y-%m-%d')
mean_w['day'] = pd.to_datetime(mean_w['day'], format='%Y-%m-%d')
mean_wo['day'] = pd.to_datetime(mean_wo['day'], format='%Y-%m-%d')

So, does anyone have an idea how to decrease the memory consumption?

Cheers,

1 Answer


I'd do something like this:
Solution

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons']

def agg(data_file):
    return pd.read_hdf(data_file).groupby('day')[cols].agg(['mean', 'std'])

big_df = pd.concat([agg(fn) for fn in data_files], axis=1, keys=data_files)

mean_wo_tmp = big_df.xs(('Significance_without_muons', 'mean'), axis=1, level=[1, 2])
mean_w_tmp = big_df.xs(('Significance_with_muons', 'mean'), axis=1, level=[1, 2])
std_wo_tmp = big_df.xs(('Significance_without_muons', 'std'), axis=1, level=[1, 2])
std_w_tmp = big_df.xs(('Significance_with_muons', 'std'), axis=1, level=[1, 2])

del big_df
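
If memory is still tight, one further idea, not part of the solution above: assuming the HDF5 files were written in table format, pd.read_hdf can load just the columns the groupby needs, so the rest of each file never has to sit in RAM. A sketch only; agg_lean is an illustrative name:

# Sketch: read only the grouping key and the two significance columns.
# Requires the files to have been saved with format='table';
# fixed-format stores do not support column selection on read.
def agg_lean(data_file):
    df = pd.read_hdf(data_file, columns=['day'] + cols)
    return df.groupby('day')[cols].agg(['mean', 'std'])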

Setup

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons']

np.random.seed([3,1415])
data_df = pd.DataFrame(np.random.rand(1000, 2), columns=cols)
data_df['day'] = np.random.choice(list('ABCDEFG'), 1000)

for fn in data_files:
    data_df.to_hdf(fn, 'day', append=False)

Run the Above Solution
Then

mean_wo_tmp

(Screenshot in the original answer: mean_wo_tmp, indexed by day, with one column of per-day means for each input file.)
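
If, as in the original script, you want a single day-indexed column rather than one column per file, the per-file columns can be collapsed. This is a sketch assuming the files cover non-overlapping day ranges; mean_wo_flat is just an illustrative name:

# Collapse the one-column-per-file layout into a single day-indexed Series.
# Assumes each day appears in only one input file (disjoint year ranges).
mean_wo_flat = mean_wo_tmp.stack().droplevel(1).sort_index()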


1 Comment

Thanks a lot piRSquared! I'll try your method. Right now I added a gc.collect() at the end of the for loop and managed to run it within a threshold of 25 GB. I'll let you know if your way is better :) Thanks again!
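
(For reference, the change described in the comment would look roughly like this in the question's loop; a sketch, not the commenter's exact code.)

import gc

for data_file in data_files:
    df = pd.read_hdf(data_file)
    grouped = df.groupby('day')
    # ... the aggregation and concat steps from the question go here ...
    del df, grouped
    gc.collect()  # force a garbage-collection pass before the next file is loaded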
