I have been using Divakar's solution for quite a while, and it is working perfectly. Thank you very much! However, a couple of days ago, I needed something faster for a certain project. Using strides https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html saves a lot of memory since it creates a "fake copy", and numexpr https://github.com/pydata/numexpr is about twice as fast as numpy, but even without numexpr it is pretty fast
import numexpr
import numpy as np
def rolling_window(a, window):
"""
Generate a rolling window view of a 1-dimensional NumPy array.
Parameters:
a (numpy.ndarray): The input array.
window (int): The size of the rolling window.
Returns:
numpy.ndarray: A view of the input array with shape (N - window + 1, window), where N is the size of the input array.
Example:
>>> a = np.array([1, 2, 3, 4, 5])
>>> windowed = rolling_window(a, 3)
>>> print(windowed)
array([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
"""
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def circular_rolling_window(a, window):
"""
Generate a circular rolling window view of a 1-dimensional NumPy array.
Parameters:
a (numpy.ndarray): The input array.
window (int): The size of the circular rolling window.
Returns:
numpy.ndarray: A view of the input array with shape (N, window), where N is the size of the input array, and the window wraps around at the boundaries.
Example:
>>> a = np.array([1, 2, 3, 4, 5])
>>> circular_windowed = circular_rolling_window(a, 3)
>>> print(circular_windowed)
array([[1, 2, 3],
[2, 3, 4],
[3, 4, 5],
[4, 5, 1],
[5, 1, 2]])
"""
pseudocircular = np.pad(a, pad_width=(0, window - 1), mode="wrap")
return rolling_window(pseudocircular, window)
def find_sequence_in_array(sequence, array, numexpr_enabled=True):
"""
Find occurrences of a sequence in a 1-dimensional NumPy array using a rolling window approach.
Parameters:
sequence (numpy.ndarray): The sequence to search for.
array (numpy.ndarray): The input array to search within.
numexpr_enabled (bool, optional): Whether to use NumExpr for efficient computation (default is True).
Returns:
numpy.ndarray: An array of indices where the sequence is found in the input array.
Example:
>>> arr = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
>>> seq = np.array([3, 4, 5])
>>> indices = find_sequence_in_array(seq, arr)
>>> print(indices)
[2 7]
"""
a3 = circular_rolling_window(array, len(sequence))
if numexpr_enabled:
isseq = numexpr.evaluate(
"a3==sequence", global_dict={}, local_dict={"a3": a3, "sequence": sequence}
)
su1 = numexpr.evaluate(
"sum(isseq,1)", global_dict={}, local_dict={"isseq": isseq.astype(np.int8)}
)
wherelen = numexpr.evaluate(
"(su1==l)", global_dict={}, local_dict={"su1": su1, "l": len(sequence)}
)
else:
isseq = a3 == sequence
su1 = np.sum(isseq, axis=1)
wherelen = su1 == len(sequence)
resu = np.nonzero(wherelen)
return resu[0]
seq = np.array([3, 6, 8, 4])
arr = np.random.randint(0, 9, (100000,))
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=True)
1.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit a3 = find_sequence_in_array(sequence=seq, array=arr, numexpr_enabled=False)
2.2 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a4 = search_sequence_numpy(arr=arr, seq=seq)
4.96 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT:
Here is a NumPy one-liner that is much faster than the others
from functools import reduce
import numpy as np
def np_search_sequence(a, seq, distance=1):
return np.where(reduce(lambda a,b:a & b, ((np.concatenate([(a == s)[i * distance:], np.zeros(i * distance, dtype=np.uint8)],dtype=np.uint8)) for i,s in enumerate(seq))))[0]
seq = np.array([3, 6, 8, 4])
arr = np.random.randint(0, 9, (100000,))
%timeit np_search_sequence(a=arr, seq=seq, distance=1)
604 µs ± 7.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Input Parameters:
a: NumPy array in which the search is performed.
seq: Sequence to search for within the array.
distance: Minimum distance between consecutive elements of the sequence in the array (default is 1)
You can use distance 2 when looking for utf-8 strings in e.g. memory dumps
Using reduce and lambda function:
The reduce function is employed with a lambda function to iteratively perform a bitwise AND (&) operation on the binary arrays.
Sequence Processing:
For each element s in the given sequence (seq), the code does the following:
Creates a binary array indicating the presence of the current element at the desired distance
in the array ((a == s)[i * distance:]).
Appends a binary array of zeros (np.zeros(i * distance, dtype=np.uint8))
to ensure alignment with the original array.
Final Result:
Obtains the indices where the boolean array is True using np.where.
Returns these indices as a NumPy array.
array([2, 0, 0,0, 1, 0, 1, 0, 0])?