Indexing and Data Columns in Pandas/PyTables

Question

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I'm really confused about the concept of data columns in Pandas HDF5 IO, and googling turns up little to no information about it. Since I'm diving into Pandas for a large project that involves HDF5 storage, I'd like to be clear about such concepts.

The docs say:

You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns

This is confusing:

"other than the indexable columns, which you can always query": What are 'indexable' columns? Aren't all columns 'indexable'? What does this term mean?
"For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query": How is that different from normal querying on a PyTables table, with or without any indexes of data_columns?
What is the fundamental difference between a non-indexed column, an indexed column, and a data_column?

I was having the same issue when using the HDFStore.select_column function, and found this only after figuring out that the column needed to be set in data_columns. This issue on GitHub delves further into it: github.com/pandas-dev/pandas/issues/21188
– Nikhil VJ, Commented May 24, 2018 at 16:37

Jeff · Accepted Answer · 2014-09-07 22:21:34Z

You should just try it.

In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])

In [23]: store = pd.HDFStore('test.h5',mode='w')

In [24]: store.append('df_only_indexables',df)

In [25]: store.append('df_with_data_columns',df,data_columns=True)

In [26]: store.append('df_no_index',df,data_columns=True,index=False)

In [27]: store
Out[27]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
/df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

In [28]: store.close()

You automatically get the index of the stored frame as a queryable column (this is the "indexable" listed under indexers above). By default, NO other columns can be queried.
If you specify data_columns=True or data_columns=list_of_columns, then those columns are stored separately and can subsequently be queried.
If you specify index=False, then a PyTables index is not automatically created for the queryable columns (i.e. the index and/or the data_columns).
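A minimal sketch of what "queryable" means in practice, assuming PyTables is installed. The file name and the fixed sample values are made up for illustration; the exception type caught is what I'd expect from recent pandas versions and may differ in yours.

```python
import os
import tempfile

import pandas as pd

# Small fixed frame so the query results are easy to predict.
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -3.0, 4.0],
                   "B": [1.0, 2.0, 3.0, 4.0, 5.0]})
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")  # hypothetical file name

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)

    # The frame's index is always queryable.
    by_index = store.select("df_only_indexables", where="index >= 2")

    # 'A' is not a data column in this table, so the query is rejected.
    try:
        store.select("df_only_indexables", where="A > 0")
        plain_column_queryable = True
    except (ValueError, NameError):
        plain_column_queryable = False

    # With data_columns=True, the same filter runs against the on-disk table.
    by_column = store.select("df_with_data_columns", where="A > 0")

print(len(by_index), plain_column_queryable, len(by_column))
```

The same where string works on both tables only when the column it names was declared as a data column at append time.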

To see the actual indexes being created (the PyTables indexes), see the output below. colindexes defines which columns have an actual PyTables index created. (I have truncated it somewhat).

/df_no_index/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
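If you want to check the layout yourself rather than read the dump, something like the following should work. It is only a sketch: it recreates the three tables in a temporary file (the file name is arbitrary) and uses get_storer(...).table to reach the underlying PyTables Table, then asks each column whether it has a PyTables index.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")  # hypothetical file name

layout = {}
with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)
    store.append("df_no_index", df, data_columns=True, index=False)

    for key in ["df_only_indexables", "df_with_data_columns", "df_no_index"]:
        table = store.get_storer(key).table  # underlying PyTables Table
        names = list(table.colnames)
        layout[key] = {
            # Physical columns in the on-disk table.
            "columns": names,
            # Columns that currently have a PyTables index.
            "indexed": sorted(n for n in names if getattr(table.cols, n).is_indexed),
        }
        print(key, layout[key])
```

Printing `table` itself inside the loop gives the longer description shown above.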

So if you want to query on a column, make it a data_column. Columns that are not data_columns are stored together in blocks by dtype (the values_block_0 field above), which is faster and takes less space.

You normally want an index on any column you use for retrieval. However, if you are creating a store and then appending multiple files to it, you usually turn off index creation (index=False) during the appends and build the index once at the end, because maintaining it as you go is pretty expensive.
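A rough sketch of that append-then-index pattern, assuming PyTables is installed. The key "df", the chunk sizes, and the file name are placeholders; create_table_index is the HDFStore method for building the index afterwards, and the optlevel/kind values here are just one reasonable choice.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for the frames you would read from multiple input files.
chunks = [pd.DataFrame(np.random.randn(1000, 2), columns=["A", "B"])
          for _ in range(3)]
path = os.path.join(tempfile.mkdtemp(), "deferred_index.h5")  # hypothetical file name

with pd.HDFStore(path, mode="w") as store:
    # Append everything without building an index on each call.
    for chunk in chunks:
        store.append("df", chunk, data_columns=["A"], index=False)

    indexed_before = store.get_storer("df").table.cols.A.is_indexed

    # Build the PyTables index once, after all the data is in place.
    store.create_table_index("df", columns=["A"], optlevel=9, kind="full")

    indexed_after = store.get_storer("df").table.cols.A.is_indexed
    nrows = store.get_storer("df").nrows

print(indexed_before, indexed_after, nrows)
```

Queries on A work at every stage; the index only changes whether they need a full scan of the table.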

See the cookbook for a menagerie of questions.

edited Sep 7, 2014 at 22:21

answered Sep 7, 2014 at 21:38

Jeff


8 Comments

Michael Over a year ago

What is the use of setting index=True? It seems to me that I can set data_columns=True, index=False, and I can still query the table using its columns

Jeff Over a year ago

Sure you can, but without an index you don't get its benefit: every query causes a linear scan. index=False is useful when appending, e.g. appending multiple times and then constructing the index is more efficient than appending with an index (for large amounts of data).

Michael Over a year ago

How do I construct an index after I finish appending? I couldn't find an example in the pandas documentation. Thank you for your help.

Jeff Over a year ago

see this post: stackoverflow.com/questions/17893370/…

Jeff Over a year ago

If you would like to add this to the cookbook, that would be great (do a pull request).
