I'm dealing with some large data sets - observations as a function of time - which are not continuous in time (i.e., there is a lot of missing data, where the complete record is absent). To make things fun, there are a lot of data sets, all with missing records, all at random places...
I somehow need to get the data "synchronised" in time, with missing data flagged as such instead of being absent entirely. I've managed to get this partially working, but I'm still having some problems.
Example:
import numpy as np
# The date range (in the format that I'm dealing with), which I define
# myself for the period in which I'm interested
dc = np.arange(2010010100, 2010010106)
# Observation dates (d1) and values (v1)
d1 = np.array([2010010100, 2010010104, 2010010105]) # date
v1 = np.array([10, 11, 12])  # values
# Another data set with (partially) other times
d2 = np.array([2010010100, 2010010102, 2010010104]) # date
v2 = np.array([13, 14, 15])  # values
# For now use -1 as fill_value
v1_filled = -1 * np.ones_like(dc)
v2_filled = -1 * np.ones_like(dc)
# searchsorted gives, for each observation date, its index in dc;
# the observed values are then written at exactly those positions
v1_filled[dc.searchsorted(d1)] = v1
v2_filled[dc.searchsorted(d2)] = v2
This gives me the desired result:
v1_filled = [10 -1 -1 -1 11 12]
v2_filled = [13 -1 14 -1 15 -1]
but only if every value in d1 and d2 also occurs in dc. If a value is not in dc, the code either raises an IndexError (for a value above the range) or silently writes to the wrong position (for a value below it), because searchsorted is documented to behave as:
If there is no suitable index, return either 0 or N (where N is the length of a).
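For instance, with dc as defined above, a date below the range maps to index 0 and a date above it to len(dc):

dc.searchsorted(0)           # -> 0
dc.searchsorted(2010010199)  # -> 6, i.e. len(dc)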
So for example, if I change d2 and v2 to:
d2 = np.array([2010010100, 2010010102, 2010010104, 0]) # date
v2 = np.array([13, 14, 15, 9999]) # values
The result is:
[9999 -1 14 -1 15 -1]
In this case, because the date 0 is not in dc, the code should discard that value instead of inserting it at the start (or end). Any idea how to easily achieve that?
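One approach I can think of (a sketch only, not tested against the real data; np.isin requires NumPy >= 1.13, older versions have the equivalent np.in1d) is to mask out the dates that don't occur in dc before calling searchsorted:

# Keep only the observations whose date actually occurs in dc
mask = np.isin(d2, dc)
v2_filled = -1 * np.ones_like(dc)
v2_filled[dc.searchsorted(d2[mask])] = v2[mask]
# -> [13 -1 14 -1 15 -1]; the (0, 9999) pair is discarded

But I'm not sure whether that is the idiomatic way, or whether there is something cleaner.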