
I have more than 500 PyTables stores, each containing about 300MB of data. I would like to merge these files into one big store, using pandas append as in the code below.

from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    for filename in file_list:
        # open each source store read-only and append its 'data' table
        store = HDFStore(filename, mode='r')
        merged_store.append('data', store['data'])
        store.close()

The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to increase by 1GB for each appended store.

I have indicated the total number of expected rows, which according to the documentation should improve performance. Having read Improve pandas (PyTables?) HDF5 table write performance, I was expecting long write times, but almost 10 minutes for every 300MB seems too slow, and I cannot understand why the size increases so much.
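For reference, this is roughly how I pass the row hint on the first append (total_rows here is just a placeholder for the precomputed total count):

# sketch: pass the precomputed total row count on the first append
# (total_rows is a placeholder computed beforehand from the source stores)
merged_store.append('data', store['data'], expectedrows=total_rows)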

Am I missing something?

For additional information, here is a description of one of the 500 PyTables stores.

/data/table (Table(272734,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
  "id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
  "datetaken": Int64Col(shape=(), dflt=0, pos=3),
  "owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
  "machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
  "title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
  "country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
  "place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
  "url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
  "url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
  "ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
  "tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
  byteorder := 'little'
  chunkshape := (232,)

1 Answer


This is basically the same as the answer here, which I gave recently.

Bottom line is this: you need to turn off indexing while building the store, i.e. store.append('df', df, index=False), and then create the index at the end (see the sketch below).

Furthermore, turn off compression while merging the tables as well.

Indexing is a fairly expensive operation and, if I recall correctly, only uses a single processor.

Finally, make sure that you create the merged store with mode='w', since all of the subsequent operations are appends and you want to start with a clean new file.
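A rough sketch of the whole pattern (the file names and the 'data' key are placeholders, adapt as needed):

from pandas import HDFStore

# start from a clean, new file; everything after this is an append
merged_store = HDFStore('merged.h5', mode='w')  # no complevel/complib -> no compression while merging

for filename in file_list:
    store = HDFStore(filename, mode='r')
    # index=False skips index creation on every append (the expensive part)
    merged_store.append('data', store['data'], index=False)
    store.close()

# build the index once, at the very end
merged_store.create_table_index('data', optlevel=9, kind='full')
merged_store.close()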

I also would NOT specify the on-disk chunkshape upfront. Rather, after you have created the final index, perform the compression using ptrepack and specify chunkshape=auto, which will compute it for you. I don't think this will affect write performance, but it will optimize query performance.

You might also try increasing the chunksize parameter to append (this is the writing chunksize).
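For example (the chunk size and file names below are illustrative only):

# a larger writing chunksize on each append (the value is arbitrary)
merged_store.append('data', store['data'], index=False, chunksize=1000000)

# after indexing, repack with compression and let PyTables compute the on-disk chunkshape, e.g.:
#   ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc merged.h5 merged_packed.h5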

Obviously, make sure that each of the tables being appended has exactly the same structure (append will raise if this is not the case).

I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837


Comments

Thank you, removing the indexing is speeding up the process substantially. However, I still get very large file sizes: for every 300MB table that I append to the merged store, the size increases by 1GB, which will eventually fill up my disk. This should not be due to compression, as I did not apply compression to the 300MB files.
Are the tables EXACTLY the same? When you concatenate them, the strings will all take on the string sizes of the resulting table (which could be bigger, depending on what the minimum is).
I have parsed the data so that the size of long strings in every table is restricted to fixed values (to avoid errors in the append operation). The tables are not exactly the same, and their size also varies slightly, but I have kept the same min_itemsize parameters for the merged table, and so I would not expect strings to change size...
Are you sure that min_itemsize is specified for EVERY field? (e.g. do a ptdump -av comparison of an individual file and the merged one.) This can make a huge difference: say a field is NOT specified with a min_itemsize in the individual file but IS in the merged file; you can easily double the size (e.g. say it's 10 in the individual file but 100 in the merged). Compression should help a lot with this, though, so compare the compressed files as well. Seeing that your dataset is pretty large, I would certainly compress it using blosc, as it will make queries FASTER as well.
Because all the tables were generated with the same min_itemsize parameters (and contained strings of fixed maximum length), I assumed that the fields of the merged table would automatically be created with the right sizes without explicitly passing the min_itemsize parameter to the append function. I have now done this and everything seems to be ok. I will follow your suggestion about the compression, thank you again for your help!
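A minimal sketch of passing min_itemsize explicitly on every append (the column widths below are the ones from the table description above):

# spell out the string-column widths so the merged table is created with
# exactly the intended itemsizes (values taken from the ptdump above)
min_itemsize = {'id': 11, 'owner': 15, 'machine_tags': 100, 'title': 200,
                'country': 3, 'place_id': 18, 'url_s': 80, 'url_o': 80,
                'ownername': 50, 'tags': 505}
merged_store.append('data', store['data'], index=False, min_itemsize=min_itemsize)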
