I have an array and need to find the runs of consecutive values in it, then split it into an array of arrays, one per run.

My idea was to find the outliers and group the values accordingly. I have tried the following:

data = [ 347,  348,  349,  350,  351,  352,  353,  354,  355,  356, 2987,
   2988, 2989, 2990, 2991, 2992, 2993, 2994, 2995, 2996, 4992, 4993,
   4994, 4995, 5007, 5008, 5009, 5010, 5011, 5012, 5013, 5014, 5015,
   5016, 5987, 5988, 5989, 5990, 5991, 5992, 5993, 5994, 5995, 5996,
   6036, 6037, 6038, 6039, 6040, 6041, 6042, 6043, 6044, 6045, 6046,
   6047]

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m], data[s>m]

def sort_sequences(data):
    d = tuple()
    sorted_data = tuple()

    seq = reject_outliers(data)
    sorted_data =  + (seq[1])
    print(sorted_data)
    main_part = seq[0]
    another_part = seq[1]
    if len(another_part) != 0:
        sort_sequences(main_part)
    return sorted_data

Now if I apply this:

`data_sorted = sort_sequences(actual)`

I get:

[347 348 349 350 351 352 353 354 355 356]

which is not what I am looking for

2 Answers

2

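What about taking the differences between neighbouring elements (this assumes `data` is a NumPy array, not a plain list):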

>>> ddata = data[1:] - data[:-1] # faster than np.diff(data)
>>> ddata
array([   1,    1,    1,    1,    1,    1,    1,    1,    1, 2631,    1,
          1,    1,    1,    1,    1,    1,    1,    1, 1996,    1,    1,
          1,   12,    1,    1,    1,    1,    1,    1,    1,    1,    1,
        971,    1,    1,    1,    1,    1,    1,    1,    1,    1,   40,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1])

and finally splitting wherever the difference is greater than 1:

>>> sequences = np.split(data, np.argwhere(ddata>1).flatten() + 1)
>>> sequences[0]
array([347, 348, 349, 350, 351, 352, 353, 354, 355, 356])
>>> sequences[2]
array([4992, 4993, 4994, 4995])
>>> sequences[-1]
array([6036, 6037, 6038, 6039, 6040, 6041, 6042, 6043, 6044, 6045, 6046,
       6047])
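
For reuse, the two steps above can be wrapped in a small helper. This is only a sketch: the name `split_consecutive` and the `step` parameter are made up for illustration, the shortened `data` is a sample rather than the full array from the question, and it uses `np.flatnonzero` in place of `np.argwhere(...).flatten()`. It treats any difference other than `step` as a break, so it assumes "sequence" means strictly consecutive values.

```python
import numpy as np

def split_consecutive(values, step=1):
    """Split a 1-D array into runs whose neighbours differ by exactly `step`.

    Hypothetical helper, not part of NumPy.
    """
    values = np.asarray(values)
    # A new run starts one position after every difference that is not `step`.
    breaks = np.flatnonzero(np.diff(values) != step) + 1
    return np.split(values, breaks)

# Shortened sample in the spirit of the question's data.
data = [347, 348, 349, 2987, 2988, 4992, 4993, 4994, 6036]
runs = split_consecutive(data)
for run in runs:
    print(run.tolist())
```

Because `np.asarray` is applied first, the helper accepts a plain Python list as well as an array.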


Execution times of `data[1:] - data[:-1]` vs. `np.diff(data)`, tested via repl.it:

setup_="""
import numpy as np

MULT = 100
data = np.array(MULT*[ 347,  348,  349,  350,  351,  352,  353,  354,  355,  356, 2987,
  2988, 2989, 2990, 2991, 2992, 2993, 2994, 2995, 2996, 4992, 4993,
  4994, 4995, 5007, 5008, 5009, 5010, 5011, 5012, 5013, 5014, 5015,
  5016, 5987, 5988, 5989, 5990, 5991, 5992, 5993, 5994, 5995, 5996,
  6036, 6037, 6038, 6039, 6040, 6041, 6042, 6043, 6044, 6045, 6046,
  6047])
"""
import timeit

then (each figure is the total time in seconds for `timeit`'s default of 1,000,000 runs):

Python 3.6.1 (default, Dec 2015, 13:05:11)
[GCC 4.8.2] on linux
>>> timeit.Timer('data[1:] - data[:-1]', setup=setup_).timeit()
35.17186516011134    
>>> timeit.Timer('np.diff(data)', setup=setup_).timeit()
63.88404295803048
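
If you want to repeat the comparison on your own machine, something like the sketch below should do. It first checks that both expressions give the same result, then times them with a smaller `number` so it finishes quickly. The sample array and the value of `number` are arbitrary choices, and the timings will differ between machines and NumPy versions, so no expected figures are shown.

```python
import timeit
import numpy as np

# Small sample array; values are illustrative only.
data = np.array([347, 348, 349, 350, 2987, 2988, 2989, 4992, 4993])

# Both expressions compute the difference between neighbouring elements.
by_slicing = data[1:] - data[:-1]
by_diff = np.diff(data)
same = np.array_equal(by_slicing, by_diff)
print("identical results:", same)

# Time each variant; `number` is kept small on purpose.
number = 10_000
t_slice = timeit.timeit(lambda: data[1:] - data[:-1], number=number)
t_diff = timeit.timeit(lambda: np.diff(data), number=number)
print(f"slicing: {t_slice / number * 1e6:.2f} us per call")
print(f"np.diff: {t_diff / number * 1e6:.2f} us per call")
```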

1

You can write less code by using `np.diff`:

diffs = np.diff(data)
sequences = np.split(data, np.argwhere(diffs>1).flatten() + 1)
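
If small gaps should not start a new sequence, the `> 1` threshold can be made a parameter. A rough sketch, assuming ascending input; `split_on_gaps` and `max_gap` are hypothetical names, and the result is converted to a list of lists since the question asks for an array of arrays:

```python
import numpy as np

def split_on_gaps(values, max_gap=1):
    """Split ascending values wherever the jump to the next one exceeds `max_gap`.

    Hypothetical helper, not part of NumPy.
    """
    values = np.asarray(values)
    diffs = np.diff(values)
    # Break one position after each jump larger than the allowed gap.
    breaks = np.flatnonzero(diffs > max_gap) + 1
    return [chunk.tolist() for chunk in np.split(values, breaks)]

# Shortened sample taken from the middle of the question's data.
data = [4992, 4993, 4994, 4995, 5007, 5008, 5009, 5987, 5988]
strict = split_on_gaps(data)             # same rule as diffs > 1
loose = split_on_gaps(data, max_gap=12)  # tolerates the jump of 12
print(strict)
print(loose)
```

With `max_gap=12` the jump from 4995 to 5007 no longer splits the data, so those two runs come back as one group.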
