Searching a sequence in a NumPy array

Question

Let's say I have the following array:

 array([2, 0, 0, 1, 0, 1, 0, 0])

How do I get the indices at which the sequence of values [0,0] occurs? For this case, the expected output would be [1,2,6,7].

Edit :

1) Please note that [0,0] is just a sequence. It could be [0,0,0] or [4,6,8,9] or [5,2,0], just anything.

2) If my array were modified to : array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0]), the expected result with the same sequence of [0,0] would be [1,2,3,4,8,9].

I am looking for some NumPy shortcut.

If I understand your question properly, you want a generic method that would accommodate any sequence, with [0, 0] just being an example?
– Reti43, Commented Apr 9, 2016 at 23:19

ndrwnaguib · Accepted Answer · 2019-02-01 06:28:08Z

Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.

Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array, so that each row indexes one sliding window of elements. Next, compare each row against the input sequence; broadcasting makes this a vectorized operation. The rows that are True in every column are the perfect matches, and their row numbers are the starting indices of the matches. Finally, extend each starting index into a range as long as the sequence to get the desired output. The implementation would be:

import numpy as np

def search_sequence_numpy(arr,seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------    
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------    
    Output : 1D Array of indices in the input array that satisfy the 
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any():
        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
    else:
        return []         # No match found
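To make the intermediate steps concrete, here is a small sketch that traces them on the array from the question. The variable names (idx, windows, starts, covered) are illustrative only and are not part of the function above.

```python
import numpy as np

arr = np.array([2, 0, 0, 1, 0, 1, 0, 0])
seq = np.array([0, 0])

# Each row holds the indices of one sliding window of length seq.size.
idx = np.arange(arr.size - seq.size + 1)[:, None] + np.arange(seq.size)
windows = arr[idx]                      # shape (7, 2)

# True where a whole window equals the sequence, i.e. at each starting index.
M = (windows == seq).all(axis=1)
starts = np.flatnonzero(M)

# Spread each start over the next seq.size positions.
spread = np.convolve(M.astype(int), np.ones(seq.size, dtype=int))
covered = np.flatnonzero(spread > 0)

print(starts)    # [1 6]
print(covered)   # [1 2 6 7]
```

The convolution step is what turns the two starting indices into the four indices the question asks for.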

Approach #2: With OpenCV (cv2), we have a built-in function for template matching: cv2.matchTemplate. It gives us the starting indices of the matches. The rest of the steps are the same as in the previous approach. Here's the implementation with cv2:

import cv2
import numpy as np

def search_sequence_cv2(arr,seq):
    """ Find sequence in an array using cv2.
    """

    # Run a template match with input sequence as the template across
    # the entire length of the input array and get scores.
    S = cv2.matchTemplate(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)

    # With floating-point inputs, the matching scores might not be exactly
    # zero, but they would be very small compared to the others.
    # So, use a very small threshold on the scores to decide on matches.
    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.

    # Find the matching indices
    idx = np.where(S.ravel() < thresh)[0]

    # Get the range of those indices as final output
    if len(idx)>0:
        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
    else:
        return []         # No match found
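For readers without OpenCV, the score that TM_SQDIFF produces can be sketched in plain NumPy as the sum of squared differences between the sequence and each window. This is only an illustration of the scoring idea, not OpenCV's implementation, and sqdiff_scores is a hypothetical helper name. The last lines show why the uint8 cast above needs care: values outside 0..255 wrap around and could produce false matches.

```python
import numpy as np

def sqdiff_scores(arr, seq):
    """Sum of squared differences between seq and every window of arr."""
    arr = np.asarray(arr, dtype=np.float64)
    seq = np.asarray(seq, dtype=np.float64)
    idx = np.arange(arr.size - seq.size + 1)[:, None] + np.arange(seq.size)
    return ((arr[idx] - seq) ** 2).sum(axis=1)

arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
seq = np.array([0, 0])
scores = sqdiff_scores(arr, seq)

# A score of (nearly) zero marks the start of a match.
match_starts = np.flatnonzero(scores < 1e-5)
print(match_starts)   # [1 2 3 8]

# Casting to uint8 wraps values outside 0..255.
wrapped = np.array([256, 257, 3]).astype('uint8')
print(wrapped)        # [0 1 3]
```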

Sample run

In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])

In [513]: seq = np.array([0,0])

In [514]: search_sequence_numpy(arr,seq)
Out[514]: array([1, 2, 3, 4, 8, 9])

In [515]: search_sequence_cv2(arr,seq)
Out[515]: array([1, 2, 3, 4, 8, 9])

Runtime test

In [477]: arr = np.random.randint(0,9,(100000))
     ...: seq = np.array([3,6,8,4])
     ...: 

In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True

In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop

In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop

Seems like the Pure NumPy based one is the safest and fastest!

Hans · Accepted Answer · 2023-11-18 02:52:02Z

I have been using Divakar's solution for quite a while, and it works perfectly. Thank you very much! However, a couple of days ago I needed something faster for a certain project. Using strides (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html) saves a lot of memory, since it creates a view of the array (a "fake copy") rather than a real copy, and numexpr (https://github.com/pydata/numexpr) is about twice as fast as NumPy here. Even without numexpr, it is pretty fast:

import numexpr
import numpy as np

def rolling_window(a, window):
    """
    Generate a rolling window view of a 1-dimensional NumPy array.

    Parameters:
    a (numpy.ndarray): The input array.
    window (int): The size of the rolling window.

    Returns:
    numpy.ndarray: A view of the input array with shape (N - window + 1, window), where N is the size of the input array.

    Example:
    >>> a = np.array([1, 2, 3, 4, 5])
    >>> windowed = rolling_window(a, 3)
    >>> print(windowed)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5]])
    """

    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def circular_rolling_window(a, window):
    """
    Generate a circular rolling window view of a 1-dimensional NumPy array.

    Parameters:
    a (numpy.ndarray): The input array.
    window (int): The size of the circular rolling window.

    Returns:
    numpy.ndarray: A view of the input array with shape (N, window), where N is the size of the input array, and the window wraps around at the boundaries.

    Example:
    >>> a = np.array([1, 2, 3, 4, 5])
    >>> circular_windowed = circular_rolling_window(a, 3)
    >>> print(circular_windowed)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 1],
           [5, 1, 2]])
    """

    pseudocircular = np.pad(a, pad_width=(0, window - 1), mode="wrap")
    return rolling_window(pseudocircular, window)


def find_sequence_in_array(sequence, array, numexpr_enabled=True):
    """
    Find occurrences of a sequence in a 1-dimensional NumPy array using a rolling window approach.

    Parameters:
    sequence (numpy.ndarray): The sequence to search for.
    array (numpy.ndarray): The input array to search within.
    numexpr_enabled (bool, optional): Whether to use NumExpr for efficient computation (default is True).

    Returns:
    numpy.ndarray: An array of indices where the sequence is found in the input array.

    Example:
    >>> arr = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
    >>> seq = np.array([3, 4, 5])
    >>> indices = find_sequence_in_array(seq, arr)
    >>> print(indices)
    [2 7]
    """

    a3 = circular_rolling_window(array, len(sequence))
    if numexpr_enabled:
        isseq = numexpr.evaluate(
            "a3==sequence", global_dict={}, local_dict={"a3": a3, "sequence": sequence}
        )
        su1 = numexpr.evaluate(
            "sum(isseq,1)", global_dict={}, local_dict={"isseq": isseq.astype(np.int8)}
        )
        wherelen = numexpr.evaluate(
            "(su1==l)", global_dict={}, local_dict={"su1": su1, "l": len(sequence)}
        )
    else:
        isseq = a3 == sequence
        su1 = np.sum(isseq, axis=1)
        wherelen = su1 == len(sequence)

    resu = np.nonzero(wherelen)
    return resu[0]

Runtime test

seq = np.array([3, 6, 8, 4])
arr = np.random.randint(0, 9, (100000,))
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=True)
1.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=False)
2.2 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a4 = search_sequence_numpy(arr=arr, seq=seq)
4.96 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
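Because circular_rolling_window wraps around the end of the array, a match can also be reported when the sequence spans the last and first elements. If that is not wanted, a non-wrapping variant can be sketched with numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), which returns a read-only view without hand-written stride arithmetic. The function name find_sequence_starts is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_sequence_starts(array, sequence):
    """Return starting indices of non-wrapping matches of sequence in array."""
    array = np.asarray(array)
    sequence = np.asarray(sequence)
    if sequence.size == 0 or sequence.size > array.size:
        return np.empty(0, dtype=np.intp)
    # Read-only view of shape (array.size - sequence.size + 1, sequence.size).
    windows = sliding_window_view(array, sequence.size)
    return np.flatnonzero((windows == sequence).all(axis=1))

arr = np.array([5, 1, 2, 3, 4, 5, 1, 2, 3, 4])
seq = np.array([4, 5])
result = find_sequence_starts(arr, seq)
print(result)   # [4]
```

In this example the circular version would also report index 9, since the final 4 is followed (after wrapping) by the leading 5.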

EDIT:

Here is a NumPy one-liner that, in the timing below, is about twice as fast as the numexpr version above:

from functools import reduce
import numpy as np

def np_search_sequence(a, seq, distance=1):
    return np.where(reduce(lambda a,b:a & b, ((np.concatenate([(a == s)[i * distance:], np.zeros(i * distance, dtype=np.uint8)],dtype=np.uint8)) for i,s in enumerate(seq))))[0]

Runtime test

seq = np.array([3, 6, 8, 4])
arr = np.random.randint(0, 9, (100000,))
%timeit np_search_sequence(a=arr, seq=seq, distance=1)
604 µs ± 7.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Input Parameters:

a: NumPy array in which the search is performed.

seq: Sequence to search for within the array.

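distance: Step between consecutive elements of the sequence in the array (default is 1, i.e. adjacent elements). The code uses it as an exact offset, not a minimum. You can use distance 2 when looking for two-byte-per-character strings (e.g. UTF-16-encoded ASCII text) in memory dumps.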

Using reduce and lambda function:

The reduce function is employed with a lambda function to iteratively perform a bitwise AND (&) operation on the binary arrays.

Sequence Processing:

For each element s at position i in the given sequence (seq), the code does the following: it builds a boolean mask of where s occurs in the array and drops the first i * distance entries ((a == s)[i * distance:]), so that a hit for s lines up with the candidate starting index. It then appends i * distance zeros (np.zeros(i * distance, dtype=np.uint8)) so that every mask has the same length as the original array.

Final Result:

Obtains the indices where the boolean array is True using np.where. Returns these indices as a NumPy array.
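The same steps can be written as an explicit loop, which may be easier to follow than the one-liner. This is a sketch of the idea rather than a drop-in replacement; search_sequence_shifted is a hypothetical name, and like the one-liner it returns only the starting indices of the matches.

```python
import numpy as np

def search_sequence_shifted(a, seq, distance=1):
    """Loop form of the one-liner: AND together shifted equality masks."""
    a = np.asarray(a)
    mask = np.ones(a.size, dtype=bool)
    for i, s in enumerate(seq):
        shift = i * distance
        n = max(a.size - shift, 0)
        shifted = np.zeros(a.size, dtype=bool)
        # shifted[k] is True when a[k + shift] == s; the tail stays False.
        shifted[:n] = (a == s)[shift:shift + n]
        mask &= shifted
    return np.flatnonzero(mask)

arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
adjacent = search_sequence_shifted(arr, [0, 0])
print(adjacent)   # [1 2 3 8]

# Every other element: 72, 105, 33 separated by zero bytes.
data = np.array([72, 0, 105, 0, 33, 0])
spaced = search_sequence_shifted(data, [72, 105, 33], distance=2)
print(spaced)     # [0]
```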

Jozsef Meszaros · Accepted Answer · 2019-03-13 16:21:45Z

I find that the most succinct, intuitive and general way to do this is using regular expressions.

import re
import numpy as np

# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)

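# yourarray is the 1D integer array to search.
# Remove spaces and linebreaks that would come through when printing your vector.
# Note: string positions map to array indices only if every value is a single digit.
yourarray_string = re.sub(r'\s','',np.array_str( yourarray ))[1:-1]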

# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)

r = re.compile(r"[0]{1,}") 
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]
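A related string-search idea that avoids printing the array is to search its raw bytes instead. The sketch below assumes every value fits in 0..255, so that one array element is exactly one byte and byte offsets equal array indices; find_in_bytes is a hypothetical helper name.

```python
import numpy as np

def find_in_bytes(arr, seq):
    """Return starting indices of seq in arr by searching their raw bytes."""
    # Assumption: all values are in 0..255, so one element is one byte.
    haystack = np.ascontiguousarray(arr, dtype=np.uint8).tobytes()
    needle = np.asarray(seq, dtype=np.uint8).tobytes()
    starts = []
    pos = haystack.find(needle)
    while pos != -1:
        starts.append(pos)
        # Advance by one so that overlapping matches are kept.
        pos = haystack.find(needle, pos + 1)
    return np.array(starts, dtype=np.intp)

arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
byte_starts = find_in_bytes(arr, [0, 0])
print(byte_starts)   # [1 2 3 8]
```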


1 Comment

Maksym Ganenko Over a year ago

Converting number sequence into string? Replacing nicely packed fixed length (in bytes) array of numbers with variable length strings?
