
Any help is greatly appreciated!! I have been trying to solve this for the last few days....

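I have two arrays:

    import pandas as pd

    OldDataSet = {
        'id': [20, 30, 40, 50, 60, 70],
        'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}

    NewDataSet = {
        'id': [3000, 4000, 5000, 6000, 7000, 8000],
        'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}

    df1 = pd.DataFrame(OldDataSet)
    df2 = pd.DataFrame(NewDataSet)

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    # Columns are selected explicitly: column 0 is OdoLength, column 1 is id.
    OldDataSetArray = df1[['OdoLength', 'id']].to_numpy()
    NewDataSetArray = df2[['OdoLength', 'id']].to_numpy()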

The result that I am trying to get is:

Array 1 and Array 2 matched by closest difference, using only the numbers left over in Array 2 (columns: Array 1 id, Array 1 OdoLength, Array 2 id, Array 2 OdoLength):

20  26.12   3000    25.03   
30  43.12   4000    42.12   
40  46.81   6000    46  
50  56.23   7000    110.05  
60  111.07  8000    165.41  
70  166.38  0   0   

Starting at Array 1, ID 20, find the nearest value, which in this case is the first number in Array 2, ID 3000 (26.12 - 25.03 = 1.09), so ID 20 gets matched to ID 3000. Where it gets tricky is that if a value in Array 2 is not the closest, it is skipped. For example, ID 40 (value 46.81) is compared to 45.74 and 46, and the smallest difference is 0.81, from 46 (ID 6000). So ID 40 --> ID 6000, and ID 5000 in Array 2 is now skipped for any future comparisons. When comparing Array 1 ID 50, it is compared to the next available number in Array 2, 110.05, so Array 1 ID 50 is matched to Array 2 ID 7000. ID 70 has nothing left to match in Array 2, so it gets 0 and 0.
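The rule described above can be sketched in plain Python. This is only a minimal sketch under two assumptions that the question does not state explicitly: both arrays are sorted ascending by OdoLength, and an unmatched Array 1 row is filled with 0 and 0 as in the expected result. The function name `match_nearest_available` is hypothetical.

```python
def match_nearest_available(old_rows, new_rows):
    """Match each (id, value) in old_rows to the nearest still-available
    (id, value) in new_rows. Both lists are assumed sorted by value."""
    results = []
    j = 0  # index of the next available row in new_rows
    for old_id, old_val in old_rows:
        if j >= len(new_rows):
            # Nothing left in Array 2: fill with zeros (assumption from the expected output)
            results.append((old_id, old_val, 0, 0))
            continue
        # Skip forward while the following candidate is strictly closer
        while (j + 1 < len(new_rows)
               and abs(old_val - new_rows[j + 1][1]) < abs(old_val - new_rows[j][1])):
            j += 1
        new_id, new_val = new_rows[j]
        results.append((old_id, old_val, new_id, new_val))
        j += 1  # the matched row and everything before it are no longer available
    return results


old_rows = list(zip([20, 30, 40, 50, 60, 70],
                    [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]))
new_rows = list(zip([3000, 4000, 5000, 6000, 7000, 8000],
                    [25.03, 42.12, 45.74, 46, 110.05, 165.41]))

matches = match_nearest_available(old_rows, new_rows)
for row in matches:
    print(*row)
```

With the sample data this prints the six rows of the expected result, including `70 166.38 0 0` for the unmatched row. A single forward pointer is enough here only because of the sorted-input assumption.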

UPDATE

Here's the code that I have tried, and it works. It is not the greatest, so if someone has another suggestion, please let me know.

    import operator

    import pandas as pd

    OldDataSet = {
        'id': [20, 30, 40, 50, 60, 70],
        'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}

    NewDataSet = {
        'id': [3000, 4000, 5000, 6000, 7000, 8000],
        'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}

    df1 = pd.DataFrame(OldDataSet)
    df2 = pd.DataFrame(NewDataSet)

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    # Columns are selected explicitly so that column 0 is OdoLength and
    # column 1 is id, which is what the indexing below relies on.
    OldDataSetArray = df1[['OdoLength', 'id']].to_numpy()
    NewDataSetArray = df2[['OdoLength', 'id']].to_numpy()

    newPos = 1
    CurrentNumber = 0
    OldArrayLen = len(OldDataSetArray) - 1
    NewArrayLen = len(NewDataSetArray) - 1
    numberResults = []

    for oldPos in range(len(OldDataSetArray)):
        PreviousNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[oldPos, 0])

        while newPos <= len(NewDataSetArray) - 1:
            CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])

            # if it is the last row for the inner array, then match the next available
            # in Array 1 to that last record
            if newPos == NewArrayLen and oldPos < newPos and oldPos + 1 <= OldArrayLen:
                numberResults.append([OldDataSetArray[oldPos + 1, 1], NewDataSetArray[newPos, 1],
                                      OldDataSetArray[oldPos + 1, 0], NewDataSetArray[newPos, 0]])

            # note: equal distances match neither branch, so newPos would not advance
            if PreviousNumber < CurrentNumber:
                numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1],
                                      OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
                newPos += 1
                break
            elif PreviousNumber > CurrentNumber:
                PreviousNumber = CurrentNumber
                newPos += 1

    # sort by array one id
    numberResults = sorted(numberResults, key=operator.itemgetter(0))
    numberResultsDf = pd.DataFrame(numberResults)
  • What did you try? Commented Nov 8, 2017 at 18:02
  • The output looks ragged (different no. of elems per row because some won't have matches). Could you confirm? Commented Nov 8, 2017 at 18:05
  • @kabanus I added what I have tried Commented Nov 8, 2017 at 18:15
  • @Divakar, yes, some will not have matches, because some will drop out of array 2. Commented Nov 8, 2017 at 18:15
  • anyone have any suggestions? Commented Nov 9, 2017 at 18:02

2 Answers


You can use NumPy broadcasting to build a distance matrix:

    import numpy

    a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])
    b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])

    # rows follow a, columns follow b
    numpy.abs(a[:, None] - b[None, :])
    # array([[   1.09,   16.  ,   19.62,   19.88,   83.93,  139.29],
    #        [  18.09,    1.  ,    2.62,    2.88,   66.93,  122.29],
    #        [  21.78,    4.69,    1.07,    0.81,   63.24,  118.6 ],
    #        [  31.2 ,   14.11,   10.49,   10.23,   53.82,  109.18],
    #        [  86.04,   68.95,   65.33,   65.07,    1.02,   54.34],
    #        [ 141.35,  124.26,  120.64,  120.38,   56.33,    0.97]])

From that matrix you can then find the closest elements using argmin, either row-wise or column-wise (depending on whether you want to search in a or b).

    numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
    # array([0, 1, 3, 3, 4, 5])
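To get the ids that belong to those positions, the argmin result can be used to index a parallel array of ids. The following is a minimal sketch; the names `ids_a`, `ids_b` and `pairs` are assumptions introduced for illustration, holding the `id` columns from the question in the same order as `a` and `b`.

```python
import numpy as np

ids_a = np.array([20, 30, 40, 50, 60, 70])
a = np.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])

ids_b = np.array([3000, 4000, 5000, 6000, 7000, 8000])
b = np.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])

# Position in b of the closest value for every element of a
nearest = np.abs(a[:, None] - b[None, :]).argmin(axis=1)

# Fancy indexing maps the positions back to the ids in b
pairs = list(zip(ids_a.tolist(), ids_b[nearest].tolist()))
print(pairs)
```

Note that this is a plain nearest-neighbour lookup: id 6000 is returned for both 40 and 50, so it does not apply any "next available" rule.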

6 Comments

How can I get the IDs associated with the result set?
Nils Werner, thank you for your response. However, the final result set has the lowest values at 0 & 0, 1 & 1, 2 & 3, 3 & 3, 4 & 4, 5 & 5, so position 3 is used twice. Can you look at the expected results that I posted and give me your thoughts? Let me know if you see a way out of the for loop. Thanks again for your help!
Well, 45.74 and 46 are both closest to 46.81. Totally expected and valid result, no?
The final match is based on nearest and next available. So 46.81 is closest to 46 (the difference is 0.81), meaning 45.74 in array 2 was skipped over and can't be used for any matches. Since 46.81 is matched to 46, the next available number from array 2 is 110.05. Array 1 number 56.23 is then compared with the leftover array 2 numbers 110.05 and 165.41 to find the nearest; 110.05 is the closest, so it is matched to 56.23. It is hard to explain, I hope this makes sense.
OK, please reword your question, as this is absolutely not clear from it.

Compute all the differences, and use `np.argmin` to look up the closest.

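    import numpy as np

    a, b = np.random.rand(2, 10)

    # pairwise absolute differences: rows follow a, columns follow b
    all_differences = np.abs(np.subtract.outer(a, b))

    # for each element of a, the position of the closest element of b
    ia = all_differences.argmin(axis=1)

    for i in range(10):
        print(i, a[i], ia[i], b[ia[i]])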



    0 0.231603891949 8 0.21177584152
    1 0.27810475456 7 0.302647382888
    2 0.582133214953 2 0.548920922033
    3 0.892858042793 1 0.872622982632
    4 0.67293347218 6 0.677971552011
    5 0.985227546492 1 0.872622982632
    6 0.82431697833 5 0.83765895237
    7 0.426992114791 4 0.451084369838
    8 0.181147161752 8 0.21177584152
    9 0.631139744522 3 0.653554586691

EDIT

With DataFrames and indexes:

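    import numpy as np
    import pandas as pd

    va, vb = np.random.rand(2, 10)
    na, nb = np.random.randint(0, 100, (2, 10))

    dfa = pd.DataFrame({'id': na, 'odo': va})
    dfb = pd.DataFrame({'id': nb, 'odo': vb})

    # plain NumPy arrays are passed so this does not rely on ufunc.outer support for Series
    all_differences = np.abs(np.subtract.outer(dfa.odo.to_numpy(), dfb.odo.to_numpy()))

    # for each row of dfa, the row number of the closest odo in dfb
    ia = all_differences.argmin(axis=1)

    dfc = dfa.merge(dfb.loc[ia].reset_index(drop=True),
                    left_index=True, right_index=True)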

Input :

In [337]: dfa

Out[337]: 
   id       odo
0  72  0.426457
1  12  0.315997
2  96  0.623164
3   9  0.821498
4  72  0.071237
5   5  0.730634
6  45  0.963051
7  14  0.603289
8   5  0.401737
9  63  0.976644

In [338]: dfb
Out[338]: 
   id       odo
0  95  0.333215
1   7  0.023957
2  61  0.021944
3  57  0.660894
4  22  0.666716
5   6  0.234920
6  83  0.642148
7  64  0.509589
8  98  0.660273
9  19  0.658639

Output :

In [339]: dfc
Out[339]: 
   id_x     odo_x  id_y     odo_y
0    72  0.426457    64  0.509589
1    12  0.315997    95  0.333215
2    96  0.623164    83  0.642148
3     9  0.821498    22  0.666716
4    72  0.071237     7  0.023957
5     5  0.730634    22  0.666716
6    45  0.963051    22  0.666716
7    14  0.603289    83  0.642148
8     5  0.401737    95  0.333215
9    63  0.976644    22  0.666716
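As a possible alternative to the explicit difference matrix, pandas' `merge_asof` with `direction="nearest"` performs the same kind of nearest-value lookup on sorted frames. The sketch below applies it to the data from the question; the frame names `old` and `new` and the column names `id_new` and `odo_new` are assumptions introduced here, and like the code above it does not implement any "next available" rule.

```python
import pandas as pd

old = pd.DataFrame({"id": [20, 30, 40, 50, 60, 70],
                    "OdoLength": [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]})
new = pd.DataFrame({"id": [3000, 4000, 5000, 6000, 7000, 8000],
                    "OdoLength": [25.03, 42.12, 45.74, 46.0, 110.05, 165.41]})

# Keep a copy of the matched value, since the join key of the right frame is dropped
new_renamed = new.rename(columns={"id": "id_new"}).assign(odo_new=new["OdoLength"])

# merge_asof requires both frames to be sorted by the key column
nearest = pd.merge_asof(old.sort_values("OdoLength"),
                        new_renamed.sort_values("OdoLength"),
                        on="OdoLength",
                        direction="nearest")
print(nearest)
```

Here too id 6000 is matched twice (to ids 40 and 50), so this only covers the plain nearest lookup.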

2 Comments

Will nearest take into account skipping some entries from array 2? For example, the final result set skips id 5000.
Here it's line numbers. In principle it's not a problem to find the index from the line number.
