Numpy Finding Matching number with Array

Question

Any help is greatly appreciated!! I have been trying to solve this for the last few days....

I have two arrays:

    import pandas as pd

    OldDataSet = {
        'id': [20, 30, 40, 50, 60, 70],
        'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}

    NewDataSet = {
        'id': [3000, 4000, 5000, 6000, 7000, 8000],
        'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}

    df1 = pd.DataFrame(OldDataSet)
    df2 = pd.DataFrame(NewDataSet)

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    # Column 0 is OdoLength and column 1 is id.
    OldDataSetArray = df1[['OdoLength', 'id']].to_numpy()
    NewDataSetArray = df2[['OdoLength', 'id']].to_numpy()

The result that I am trying to get is:

Array 1 and Array 2 matched by closest difference, using only the numbers still left over in Array 2:

    Old id  Old OdoLength  New id  New OdoLength
    20      26.12          3000    25.03
    30      43.12          4000    42.12
    40      46.81          6000    46
    50      56.23          7000    110.05
    60      111.07         8000    165.41
    70      166.38         0       0

Starting with Array 1, ID 20, find the nearest value in Array 2, which in this case is the first number, ID 3000 (26.12 - 25.03 = 1.09). So ID 20 gets matched to ID 3000.

Where it gets tricky is that if a value in Array 2 is not the closest, it is skipped. For example, ID 40 (value 46.81) is compared to 45.74 and 46, and the smallest difference is 0.81, from 46 (ID 6000). So ID 40 is matched to ID 6000, and ID 5000 in Array 2 is now skipped for any future comparisons.

When Array 1 ID 50 comes up, it is compared only to the next available numbers in Array 2, starting with 110.05, so Array 1 ID 50 is matched to Array 2 ID 7000.
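The rule described above can be sketched as a single forward pass. This is a minimal sketch, assuming both arrays are sorted in ascending order (as in the sample data), so that taking the nearest value among the remaining Array 2 entries gives the same result as scanning forward until the difference starts to grow. The function name `match_next_available` and the use of -1 for "no match" are illustrative choices, not part of the original post.

```python
import numpy as np

def match_next_available(old_vals, new_vals):
    """For each old value, return the index of its match in new_vals, or -1."""
    old_vals = np.asarray(old_vals, dtype=float)
    new_vals = np.asarray(new_vals, dtype=float)
    matches = np.full(len(old_vals), -1, dtype=int)
    start = 0  # first Array 2 position that is still available
    for i, value in enumerate(old_vals):
        if start >= len(new_vals):
            break  # Array 2 is used up; remaining old values stay unmatched
        # nearest value among the entries that have not been used or skipped
        j = start + int(np.argmin(np.abs(new_vals[start:] - value)))
        matches[i] = j
        start = j + 1  # everything up to and including j is no longer available
    return matches

old_ids = np.array([20, 30, 40, 50, 60, 70])
old_odo = [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]
new_ids = np.array([3000, 4000, 5000, 6000, 7000, 8000])
new_odo = [25.03, 42.12, 45.74, 46, 110.05, 165.41]

idx = match_next_available(old_odo, new_odo)
# use 0 as the id for rows without a match, as in the expected result
matched_ids = np.where(idx >= 0, new_ids[idx], 0)
print(np.column_stack([old_ids, matched_ids]))
```

For the sample data this pairs 20, 30, 40, 50, 60 with 3000, 4000, 6000, 7000, 8000 and leaves 70 unmatched, which agrees with the expected result listed above.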

UPDATE

Here's the code that I have tried, and it works. It is not the greatest, so if someone has another suggestion, please let me know.

    import operator

    import pandas as pd

    OldDataSet = {
        'id': [20, 30, 40, 50, 60, 70],
        'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}

    NewDataSet = {
        'id': [3000, 4000, 5000, 6000, 7000, 8000],
        'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}

    df1 = pd.DataFrame(OldDataSet)
    df2 = pd.DataFrame(NewDataSet)

    # as_matrix() was removed in pandas 1.0. Select the columns explicitly so
    # that column 0 is OdoLength and column 1 is id, as the indexing below expects.
    OldDataSetArray = df1[['OdoLength', 'id']].to_numpy()
    NewDataSetArray = df2[['OdoLength', 'id']].to_numpy()

    newPos = 1
    CurrentNumber = 0
    OldArrayLen = len(OldDataSetArray) - 1
    NewArrayLen = len(NewDataSetArray) - 1
    numberResults = []

    for oldPos in range(len(OldDataSetArray)):
        PreviousNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[oldPos, 0])

        while newPos <= len(NewDataSetArray) - 1:
            CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])

            # if it is the last row for the inner array, then match the next
            # available in Array 1 to that last record
            if newPos == NewArrayLen and oldPos < newPos and oldPos + 1 <= OldArrayLen:
                numberResults.append([OldDataSetArray[oldPos + 1, 1], NewDataSetArray[newPos, 1],
                                      OldDataSetArray[oldPos + 1, 0], NewDataSetArray[newPos, 0]])

            if PreviousNumber < CurrentNumber:
                # the previous candidate was closer, so record it and move on
                numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1],
                                      OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
                newPos += 1
                break
            elif PreviousNumber > CurrentNumber:
                # the current candidate is closer, so keep scanning
                PreviousNumber = CurrentNumber
                newPos += 1

    # sort by array one values
    numberResults = sorted(numberResults, key=operator.itemgetter(0))
    numberResultsDf = pd.DataFrame(numberResults)

The output looks ragged (a different number of elements per row, because some rows won't have matches). Could you confirm?
– Divakar, Commented Nov 8, 2017 at 18:05
@Divakar, yes, some will not have matches, because some values will drop off from array 2.
– yanci, Commented Nov 8, 2017 at 18:15

Nils Werner · Accepted Answer · 2017-11-08 18:57:51Z

2

You can use NumPy broadcasting to build a distance matrix:

    import numpy

    a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])
    b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])

    numpy.abs(a[:, None] - b[None, :])
    # array([[   1.09,   16.  ,   19.62,   19.88,   83.93,  139.29],
    #        [  18.09,    1.  ,    2.62,    2.88,   66.93,  122.29],
    #        [  21.78,    4.69,    1.07,    0.81,   63.24,  118.6 ],
    #        [  31.2 ,   14.11,   10.49,   10.23,   53.82,  109.18],
    #        [  86.04,   68.95,   65.33,   65.07,    1.02,   54.34],
    #        [ 141.35,  124.26,  120.64,  120.38,   56.33,    0.97]])

From that matrix you can then find the closest elements using argmin, either row-wise or column-wise (depending on whether you want to search in a or b).

    numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
    # array([0, 1, 3, 3, 4, 5])
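To carry the ids along, the argmin result can be used to index an array of ids. The following is a small sketch using the question's data; the names `old_ids`, `new_ids` and `pairs` are illustrative, not from the answer above. Note that this is a plain nearest lookup, so one element of b can be picked more than once and others not at all.

```python
import numpy as np

old_ids = np.array([20, 30, 40, 50, 60, 70])
new_ids = np.array([3000, 4000, 5000, 6000, 7000, 8000])
a = np.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])
b = np.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])

# position in b of the value closest to each element of a
nearest = np.abs(a[:, None] - b[None, :]).argmin(axis=1)

# pair each old id with the id found at that position
pairs = np.column_stack([old_ids, new_ids[nearest]])
print(pairs)
```

Here id 6000 is paired with both 40 and 50, and id 5000 is never used.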

edited Nov 8, 2017 at 18:57

answered Nov 8, 2017 at 18:11

Nils Werner


6 Comments

yanci Over a year ago

How can I get the IDs associated with the result set?

yanci Over a year ago

Nils Werner, thank you for your response. However, the result set pairs the positions as 0 & 0, 1 & 1, 2 & 3, 3 & 3, 4 & 4, 5 & 5, so position 3 is used twice. Could you look at the expected results that I posted and give me your thoughts? Let me know if you see a way out of the for loop. Thanks again for your help!

Nils Werner Over a year ago

Well, 46 is the closest value to both 46.81 and 56.23. Totally expected and valid result, no?

yanci Over a year ago

The final match is based on nearest and next available. So 46.81 is closest to 46 (the difference is 0.81), meaning 45.74 in array 2 was skipped over and can't be used for any matches. Since 46.81 is matched to 46, the next available number in array 2 is 110.05. Array 1 number 56.23 is then compared with the left-over array 2 numbers 110.05 and 165.41 to find the nearest, which is 110.05, so that is matched to 56.23. It is hard to explain; I hope this makes sense.

Nils Werner Over a year ago

OK, please reword your question, as this is absolutely not clear from it.


B. M. · Accepted Answer · 2017-11-08 20:18:30Z

0

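Compute all the differences, and use `np.argmin` to look up the closest.

    import numpy as np

    a, b = np.random.rand(2, 10)

    # all_differences[i, j] is |a[i] - b[j]|
    all_differences = np.abs(np.subtract.outer(a, b))

    # for each element of a, the position of the closest element of b
    ia = all_differences.argmin(axis=1)

    for i in range(10):
        print(i, a[i], ia[i], b[ia[i]])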



    0 0.231603891949 8 0.21177584152
    1 0.27810475456 7 0.302647382888
    2 0.582133214953 2 0.548920922033
    3 0.892858042793 1 0.872622982632
    4 0.67293347218 6 0.677971552011
    5 0.985227546492 1 0.872622982632
    6 0.82431697833 5 0.83765895237
    7 0.426992114791 4 0.451084369838
    8 0.181147161752 8 0.21177584152
    9 0.631139744522 3 0.653554586691

EDIT

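With dataframes and indexes:

    import numpy as np
    import pandas as pd

    va, vb = np.random.rand(2, 10)
    na, nb = np.random.randint(0, 100, (2, 10))

    dfa = pd.DataFrame({'id': na, 'odo': va})
    dfb = pd.DataFrame({'id': nb, 'odo': vb})

    # pass plain NumPy arrays: recent pandas versions do not support
    # ufunc.outer on Series
    all_differences = np.abs(np.subtract.outer(dfa.odo.to_numpy(), dfb.odo.to_numpy()))

    ia = all_differences.argmin(axis=1)

    # pick the matched rows of dfb and join them to dfa by row position
    dfc = dfa.merge(dfb.loc[ia].reset_index(drop=True),
                    left_index=True, right_index=True)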

Input :

In [337]: dfa

Out[337]: 
   id       odo
0  72  0.426457
1  12  0.315997
2  96  0.623164
3   9  0.821498
4  72  0.071237
5   5  0.730634
6  45  0.963051
7  14  0.603289
8   5  0.401737
9  63  0.976644

In [338]: dfb
Out[338]: 
   id       odo
0  95  0.333215
1   7  0.023957
2  61  0.021944
3  57  0.660894
4  22  0.666716
5   6  0.234920
6  83  0.642148
7  64  0.509589
8  98  0.660273
9  19  0.658639

Output :

In [339]: dfc
Out[339]: 
   id_x     odo_x  id_y     odo_y
0    72  0.426457    64  0.509589
1    12  0.315997    95  0.333215
2    96  0.623164    83  0.642148
3     9  0.821498    22  0.666716
4    72  0.071237     7  0.023957
5     5  0.730634    22  0.666716
6    45  0.963051    22  0.666716
7    14  0.603289    83  0.642148
8     5  0.401737    95  0.333215
9    63  0.976644    22  0.666716
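When both frames are already sorted by the value column, as in the question, the same plain nearest lookup can also be written with `pd.merge_asof` and `direction='nearest'`, which avoids building the full difference matrix. This is a swapped-in alternative, not the method used above, and like the code above it does not apply the "skip and next available" rule from the question. The column names `new_id` and `new_odo` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [20, 30, 40, 50, 60, 70],
    'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]})
df2 = pd.DataFrame({
    'id': [3000, 4000, 5000, 6000, 7000, 8000],
    'OdoLength': [25.03, 42.12, 45.74, 46.0, 110.05, 165.41]})

# Keep a copy of the right-hand value, because the shared key column
# in the result holds the left-hand values only.
right = df2.rename(columns={'id': 'new_id'})
right['new_odo'] = right['OdoLength']

# Both frames must already be sorted by the key column.
nearest = pd.merge_asof(df1, right, on='OdoLength', direction='nearest')
print(nearest)
```

As with argmin, a row of the second frame can be matched more than once: here id 6000 is the nearest for both 46.81 and 56.23.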

edited Nov 8, 2017 at 20:18

answered Nov 8, 2017 at 18:19

B. M.


2 Comments

yanci Over a year ago

Will this nearest lookup take into account skipping some values from array 2? For example, the final result set skips id 5000.

B. M. Over a year ago

Here the result is row numbers. In principle, it's not a problem to look up the id from the row number.
