2

I need to do a lot of successive queries on time series data in specific time spans from a HDF5 database (the data is stord in seconds, not always "continuous", I only know the start and end time). Therefore, I wonder wether there is a faster solution than my current code, which was inspired by this answer:

import pandas as pd
from pandas import HDFStore

store = HDFStore(pathToStore)
dates = pd.date_range(start=start_date,end=end_date, freq='S')
index = store.select_column('XAU','index')
ts    = store.select('XAU', where=index[index.isin(dates)].index)

Any comments and suggestions are highly appreciated, thx!

1 Answer 1

9

Let's test it !

Generating 1M rows DF:

In [129]: df = pd.DataFrame({'val':np.random.rand(10**6)}, index=pd.date_range('1980-01-01', freq='19S', periods=10**6))

In [130]: df.shape
Out[130]: (1000000, 1)

In [131]: df.head()
Out[131]:
                          val
1980-01-01 00:00:00  0.388980
1980-01-01 00:00:19  0.916917
1980-01-01 00:00:38  0.894360
1980-01-01 00:00:57  0.235797
1980-01-01 00:01:16  0.577791

Let's shuffle it:

In [132]: df = df.sample(frac=1)

In [133]: df.head()
Out[133]:
                          val
1980-07-04 12:10:11  0.898648
1980-07-08 20:37:39  0.563325
1980-03-10 00:06:12  0.449458
1980-08-07 02:01:42  0.511847
1980-02-28 21:09:43  0.757327

Storing generated DF into HDF5 file (NOTE: per default only index is indexed, so if you are going to search also by other columns, use data_columns parameter):

In [134]: store = pd.HDFStore('d:/temp/test_time_ser.h5')

In [135]: store.append('XAU', df, format='t')

In [136]: store.close()

In [140]: store = pd.HDFStore('d:/temp/test_time_ser.h5')

Let's test select(where="<query>") method:

In [141]: store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'").head()
Out[141]:
                          val
1980-04-13 07:22:05  0.391409
1980-04-25 14:23:07  0.400838
1980-04-10 12:32:08  0.136346
1980-04-09 18:58:35  0.944389
1980-04-13 22:34:05  0.115643

Measuring performance:

In [142]: %timeit store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'")
1 loop, best of 3: 755 ms per loop

Let's compare it with your current approach:

In [144]: dates = pd.date_range(start='1980-04-04',end='1980-05-01', freq='S')

In [145]: index = store.select_column('XAU','index')

In [146]: store.select('XAU', where=index[index.isin(dates)].index).head()
Out[146]:
                          val
1980-04-13 07:22:05  0.391409
1980-04-25 14:23:07  0.400838
1980-04-10 12:32:08  0.136346
1980-04-09 18:58:35  0.944389
1980-04-13 22:34:05  0.115643

In [147]: %timeit store.select('XAU', where=index[index.isin(dates)].index)
1 loop, best of 3: 8.13 s per loop

UPDATE: let's do the same test, but this time assuming that the index (time series) is sorted:

In [156]: df = pd.DataFrame({'val':np.random.rand(10**6)}, index=pd.date_range('1980-01-01', freq='19S', periods=10**6))

In [157]: df.shape
Out[157]: (1000000, 1)

In [164]: store.close()

In [165]: store = pd.HDFStore('d:/temp/test_time_ser2.h5')

In [166]: store.append('XAU', df, format='t')

In [167]: %timeit store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'")
1 loop, best of 3: 253 ms per loop

In [168]: %timeit store.select('XAU', where=index[index.isin(dates)].index)
1 loop, best of 3: 8.13 s per loop
Sign up to request clarification or add additional context in comments.

7 Comments

This is fantastic, thank you. Could you be so kind and write a sentence on why the first option is faster?
@Tim, IMO isin() has a huge overhead compared to "range scan" (in Oracle Cost Base Optimizer terminology). We need to comare all values from the list with a series (vector). When we do a range scan ("index >= '1980-04-04' and index<= '1980-05-01'") we need to compare only two values (range boundaries)
Is your index sorted in the HDF file or not? It might affect our test. I assumed that it's not sorted
Nice, almost same, I appreciate your effort!
@Tim, i was testing against the same HDF5 file - now i've fixed it. The "range scan" approach got ~3 times faster, the .isin() - remained the same...
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.