
I wrote the following script, but I have an issue with memory consumption: pandas allocates more than 30 GB of RAM, while the data files together add up to roughly 18 GB.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import time


mean_wo = pd.DataFrame()
mean_w = pd.DataFrame()
std_w = pd.DataFrame()
std_wo = pd.DataFrame()

start_time = time.time()  # record the start time

data_files = ['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5']



for data_file in data_files:
    print(data_file)
    df = pd.read_hdf(data_file)
    grouped = df.groupby('day')
    # per-day mean and standard deviation of both significance columns
    mean_wo_tmp = grouped['Significance_without_muons'].agg([np.mean])
    mean_w_tmp = grouped['Significance_with_muons'].agg([np.mean])
    std_wo_tmp = grouped['Significance_without_muons'].agg([np.std])
    std_w_tmp = grouped['Significance_with_muons'].agg([np.std])
    # append this file's aggregates to the running results
    mean_wo = pd.concat([mean_wo, mean_wo_tmp])
    mean_w = pd.concat([mean_w, mean_w_tmp])
    std_w = pd.concat([std_w, std_w_tmp])
    std_wo = pd.concat([std_wo, std_wo_tmp])
    print(mean_wo.info())
    print(mean_w.info())
    # drop the per-file objects before reading the next file
    del df, grouped, mean_wo_tmp, mean_w_tmp, std_w_tmp, std_wo_tmp

std_wo = std_wo.reset_index()
std_w = std_w.reset_index()
mean_wo = mean_wo.reset_index()
mean_w = mean_w.reset_index()

# parse the 'day' field as a date
std_wo['day'] = pd.to_datetime(std_wo['day'], format='%Y-%m-%d')
std_w['day'] = pd.to_datetime(std_w['day'], format='%Y-%m-%d')
mean_w['day'] = pd.to_datetime(mean_w['day'], format='%Y-%m-%d')
mean_wo['day'] = pd.to_datetime(mean_wo['day'], format='%Y-%m-%d')

So, does anyone have an idea how to decrease the memory consumption?

Cheers,

1 Answer


I'd do something like this:
Solution

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons']

def agg(data_file):
    return pd.read_hdf(data_file).groupby('day')[cols].agg(['mean', 'std'])

big_df = pd.concat([agg(fn) for fn in data_files], axis=1, keys=data_files)

mean_wo_tmp = big_df.xs(('Significance_without_muons', 'mean'), axis=1, level=[1, 2])
mean_w_tmp = big_df.xs(('Significance_with_muons', 'mean'), axis=1, level=[1, 2])
std_wo_tmp = big_df.xs(('Significance_without_muons', 'std'), axis=1, level=[1, 2])
std_w_tmp = big_df.xs(('Significance_with_muons', 'std'), axis=1, level=[1, 2])

del big_df
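
If memory is still tight, one further idea, not part of the solution above: assuming the HDF5 files were written in table format, pd.read_hdf can load just the columns the groupby needs, so the rest of each file never has to sit in RAM. A sketch only; agg_lean is an illustrative name:

# Sketch: read only the grouping key and the two significance columns.
# Requires the files to have been saved with format='table';
# fixed-format stores do not support column selection on read.
def agg_lean(data_file):
    df = pd.read_hdf(data_file, columns=['day'] + cols)
    return df.groupby('day')[cols].agg(['mean', 'std'])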

Setup

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons']

np.random.seed([3,1415])
data_df = pd.DataFrame(np.random.rand(1000, 2), columns=cols)
data_df['day'] = np.random.choice(list('ABCDEFG'), 1000)

for fn in data_files:
    data_df.to_hdf(fn, 'day', append=False)

Run the Above Solution
Then

mean_wo_tmp

(Screenshot in the original answer: mean_wo_tmp, indexed by day, with one column of per-day means for each input file.)
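
If, as in the original script, you want a single day-indexed column rather than one column per file, the per-file columns can be collapsed. This is a sketch assuming the files cover non-overlapping day ranges; mean_wo_flat is just an illustrative name:

# Collapse the one-column-per-file layout into a single day-indexed Series.
# Assumes each day appears in only one input file (disjoint year ranges).
mean_wo_flat = mean_wo_tmp.stack().droplevel(1).sort_index()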


1 Comment

Thanks a lot piRSquared! I'll try your method. Right now I added a gc.collect() at the end of the for loop and managed to run it within a threshold of 25 GB. I'll let you know if your way is better :) Thanks again!
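
(For reference, the change described in the comment would look roughly like this in the question's loop; a sketch, not the commenter's exact code.)

import gc

for data_file in data_files:
    df = pd.read_hdf(data_file)
    grouped = df.groupby('day')
    # ... the aggregation and concat steps from the question go here ...
    del df, grouped
    gc.collect()  # force a garbage-collection pass before the next file is loaded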
