I'm dealing with some large data sets - observations as a function of time - which are not continuous in time (i.e., there is a lot of missing data, where the complete record is absent). To make things fun, there are a lot of data sets, all with missing records, all at random places...

I somehow need to get the data "synchronised" in time, with missing data flagged as missing data, instead of being completely absent. I've managed to get this partially working, but I'm still having some problems.

Example:

import numpy as np

# The date range (in the format that I'm dealing with), which I define
# myself for the period in which I'm interested
dc = np.arange(2010010100, 2010010106)

# Observation dates (d1) and values (v1)
d1  = np.array([2010010100, 2010010104, 2010010105]) # date
v1  = np.array([10,         11,         12        ]) # values

# Another data set with (partially) other times
d2  = np.array([2010010100, 2010010102, 2010010104]) # date
v2  = np.array([13,         14,         15        ]) # values

# For now set -1 as fill_value
v1_filled = -1 * np.ones_like(dc)
v2_filled = -1 * np.ones_like(dc)

v1_filled[dc.searchsorted(d1)] = v1
v2_filled[dc.searchsorted(d2)] = v2

This gives me the desired result:

v1_filled = [10 -1 -1 -1 11 12]
v2_filled = [13 -1 14 -1 15 -1]

but only if every value in d1 and d2 is also in dc. If a value in d1 or d2 is not in dc, the code fails, because searchsorted then behaves as documented:

If there is no suitable index, return either 0 or N (where N is the length of a).
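To make that behaviour concrete, here is a small sketch using the same dc as above. The probe values are made up for illustration; they are just one date before and one date after the range covered by dc.

```python
import numpy as np

dc = np.arange(2010010100, 2010010106)

# Hypothetical probe dates: one before and one after the range of dc
probe = np.array([0, 2010010200])
idx = dc.searchsorted(probe)
print(idx)  # [0 6]

# Index 0 silently collides with the first real date, while index 6
# equals len(dc), so using it to index a length-6 array raises IndexError
```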

So for example, if I change d2 and v2 to:

d2  = np.array([2010010100, 2010010102, 2010010104, 0]) # date
v2  = np.array([13,         14,         15,         9999]) # values

The result is

[9999   -1   14   -1   15   -1]

In this case, because d2=0 is not in dc, it should discard that value, instead of inserting it at the start (or end). Any idea how to easily achieve that?

  • This kind of task is exactly what pandas is great for. Commented Jul 29, 2016 at 12:30
  • Yes, I was afraid of that... I have a bit of a love-hate relationship with pandas; it seems very useful, but I also find it a bit difficult to get started with. Commented Jul 29, 2016 at 20:22

1 Answer

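If you do d2 = np.intersect1d(dc, d2) before calling dc.searchsorted(d2), it will remove all elements in d2 that are not in dc. Note that v2 then has to be filtered in the same way, so that it stays aligned with the remaining dates, and that intersect1d returns its result sorted and without duplicates.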
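One possible way to filter the dates and the values together is a boolean mask. The sketch below swaps in np.isin (available from NumPy 1.13) for intersect1d, which is a substitution rather than what the answer literally proposes; the name keep is only illustrative.

```python
import numpy as np

dc = np.arange(2010010100, 2010010106)
d2 = np.array([2010010100, 2010010102, 2010010104, 0])  # date
v2 = np.array([13,         14,         15,         9999])  # values

# True where the observation date occurs in dc
keep = np.isin(d2, dc)

# Start from the fill value, then insert only the matching observations;
# the same mask is applied to d2 and v2 so they stay aligned
v2_filled = np.full(dc.shape, -1)
v2_filled[dc.searchsorted(d2[keep])] = v2[keep]
print(v2_filled)  # [13 -1 14 -1 15 -1]
```

The mask keeps the original order of d2, so it also works when the observation dates are not sorted.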

1 Comment

I ended up using a slightly different approach (compressing the masked arrays first to remove the masked values, since not all statistics routines work well with masked arrays), but intersect1d() was indeed the missing step...
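The comment does not show its code, so the following is only a guess at what such an approach might look like: missing records are represented as masked entries instead of -1, and compressed() strips them again before passing the data to a routine that cannot handle masked arrays. The names v1_masked and keep are assumptions for illustration.

```python
import numpy as np

dc = np.arange(2010010100, 2010010106)
d1 = np.array([2010010100, 2010010104, 2010010105])  # date
v1 = np.array([10,         11,         12        ])  # values

# Start with every time step masked (i.e. flagged as missing)
v1_masked = np.ma.masked_all(dc.shape, dtype=v1.dtype)

# Assigning values unmasks only the time steps that have an observation
keep = np.isin(d1, dc)
v1_masked[dc.searchsorted(d1[keep])] = v1[keep]

# compressed() returns a plain ndarray holding only the unmasked values
print(v1_masked.compressed())  # [10 11 12]
```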
