You should just try it.
In [22]: df = DataFrame(np.random.randn(5,2),columns=['A','B'])
In [23]: store = pd.HDFStore('test.h5',mode='w')
In [24]: store.append('df_only_indexables',df)
In [25]: store.append('df_with_data_columns',df,data_columns=True)
In [26]: store.append('df_no_index',df,data_columns=True,index=False)
In [27]: store
Out[27]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/df_with_data_columns frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
In [28]: store.close()
you automatically get the index of the stored frame as a queryable column. By default NO other columns can be queried.
If you specify data_columns=True or data_columns=list_of_columns, then these are stored separately and can then be subsequently queried.
If you specify index=False then a PyTables index is not automatically created for the queryable column (eg. the index and/or data_columns).
To see the actual indexes being created (the PyTables indexes), see the output below. colindexes defines which columns have an actual PyTables index created. (I have truncated it somewhat).
/df_no_index/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
/df_no_index/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
So if you want to query a column, make it a data_column. If you don't then they will be stored in blocks by dtype (faster / less space).
You normally always want to index a column for retrieval, BUT, if you are creating and then appending multiple files to a single store, you usually turn off the index creation and do it at the end (as this is pretty expensive to create as you go).
See the cookbook for a menagerie of questions.
HDFStore.select_columnfunction. Found this only after figuring out that the column needed to be set indata_columns. This issue on github delves further into this: github.com/pandas-dev/pandas/issues/21188