I need to read in data that is stored in many files of the same format but varying length, i.e. identical columns but a varying number of rows. Furthermore, I need each column of the data to end up in one array (preferably a numpy array, but a list is also acceptable).
For now, I read in every file in a loop with numpy.loadtxt() and then concatenate the resulting arrays. Say the data consists of 3 columns and is stored in the two files "foo" and "bar":
import numpy as np

filenames = ["foo", "bar"]

col1_all = None  # the data will end up in these 3 arrays
col2_all = None
col3_all = None

for f in filenames:
    col1, col2, col3 = np.loadtxt(f, unpack=True)
    if col1.shape[0] > 0:  # I can't guarantee a file won't be empty
        if col1_all is None:
            # no data read in yet, just keep the arrays
            col1_all = col1[:]
            col2_all = col2[:]
            col3_all = col3[:]
        else:
            col1_all = np.concatenate((col1_all, col1))
            col2_all = np.concatenate((col2_all, col2))
            col3_all = np.concatenate((col3_all, col3))
My question is: Is there a better/faster way to do this? I need this to be as quick as possible, as I need to read in hundreds of files.
I could imagine, for example, that first finding out how many rows there are in total, "allocating" an array big enough to hold all the data, and then copying the read-in data into that array might perform better, since it avoids the repeated concatenations. I don't know the total number of rows in advance, so the counting would also have to be done in Python.
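Roughly, I imagine something like this two-pass version (a sketch only; it assumes the default np.loadtxt conventions of whitespace-delimited columns and "#" comments, so the hand-rolled line count matches what loadtxt actually reads):

import numpy as np

filenames = ["foo", "bar"]

# first pass: count the total number of data rows across all files
total_rows = 0
for f in filenames:
    with open(f) as fh:
        total_rows += sum(1 for line in fh
                          if line.strip() and not line.lstrip().startswith("#"))

# pre-allocate one array per column, then fill them in a second pass
col1_all = np.empty(total_rows)
col2_all = np.empty(total_rows)
col3_all = np.empty(total_rows)

pos = 0
for f in filenames:
    data = np.loadtxt(f, ndmin=2)  # ndmin=2 keeps a single-row file 2-D
    if data.size == 0:             # skip empty files
        continue
    n = data.shape[0]
    col1_all[pos:pos + n] = data[:, 0]
    col2_all[pos:pos + n] = data[:, 1]
    col3_all[pos:pos + n] = data[:, 2]
    pos += n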
Another idea would be to first read in all the data, store each read-in array separately, and concatenate them all at the end. (Or, since that essentially gives me the total number of rows, allocate an array that fits all the data and then copy the data into it.)
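Something along these lines is what I have in mind (again just a sketch; it assumes at least one file contains data, otherwise the final concatenate would fail):

import numpy as np

filenames = ["foo", "bar"]

chunks = []                          # one (rows, 3) block per non-empty file
for f in filenames:
    data = np.loadtxt(f, ndmin=2)    # ndmin=2 keeps a single-row file 2-D
    if data.size > 0:
        chunks.append(data)

# a single concatenation at the end instead of one per file
all_data = np.concatenate(chunks)    # shape (total_rows, 3)
col1_all, col2_all, col3_all = all_data.T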
Does anyone have experience on what works best?