
I have a set of N points on a graph defined by (x,y) coordinates, as well as a table with their pairwise distance. I'm looking to generate a table with their relative "closeness ranking", e.g. if closeness[5][9] == 4, then node 9 is the fourth closest item relative to node 5.

The obvious way of doing this is to generate a list of indices for every i (1→n) and sort them by comparing d[i][j] < d[i][k], then transform the table using the fact that sorted[5][4] == 9 implies closeness[5][9] == 4.

This would require O(n² log n) time. I feel like there could be a more efficient way. Any ideas?
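For concreteness, the baseline approach described above can be sketched like this (a minimal illustration; the names `d`, `closeness_ranking`, and the sample points are mine, not from the original setup):

```python
# Baseline O(n^2 log n) approach: argsort each row of the pairwise
# distance table, then invert the permutation so that closeness[i][j]
# is j's rank among i's neighbours (0 = the point itself).
def closeness_ranking(d):
    n = len(d)
    closeness = [[0] * n for _ in range(n)]
    for i in range(n):
        # sorted_idx[r] == j means node j is the r-th closest to node i
        sorted_idx = sorted(range(n), key=lambda j: d[i][j])
        for rank, j in enumerate(sorted_idx):
            closeness[i][j] = rank  # sorted[i][rank] == j  =>  closeness[i][j] == rank
    return closeness

# Three points on a line at x = 0, 1, 3:
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
print(closeness_ranking(d))  # → [[0, 1, 2], [1, 0, 2], [2, 1, 0]]
```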

4 Comments
  • Just looking through the table is O(n²), so O(n² log n) doesn't seem that bad. Are the distances Euclidean? Commented Dec 12, 2012 at 14:58
  • This might be significantly improved if you say you are not interested in all the distance orders, but just the first few closest points to each point. Commented Dec 13, 2012 at 9:41
  • @VaughnCato The additional log n does hurt performance quite a bit. The distances are Euclidean, but suggestions for the general case are welcome too. Commented Dec 14, 2012 at 15:17
  • @BorisStrandjev I can think of a "roughly" O(n²) solution for a tiny, fixed number of points, but I still need a significant portion. Think of 'n' in the range 1k-20k, and I'd want to be able to handle the ~10% top nodes. Stricter limits can be worked with if there's a very elegant solution. Commented Dec 14, 2012 at 15:20

1 Answer


Okay, I'm going to try and take a stab at this.

For the background knowledge: this problem is closely related to k-nearest-neighbor search. I'm not sure how you generated your pairwise distances, but a k-d tree is pretty good at solving this type of problem.

Now, even if you use a k-d tree, it will only help a bit (you query just what you need, instead of sorting ALL the points): building the tree takes O(N log N) time, then each query for one of the K nearest points takes O(log N) time. In the end, you are looking at O(N log N) + O(NK log N).
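As a sketch of what that route looks like in practice (assuming the Euclidean case and that SciPy is acceptable; the point set and K here are made up for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
K = 2  # how many nearest neighbours are actually needed per point

tree = KDTree(pts)                     # build: O(N log N)
dists, idx = tree.query(pts, k=K + 1)  # query k+1 because the nearest
idx = idx[:, 1:]                       # hit for each point is itself

print(idx)  # row i lists the K nearest neighbours of point i, closest first
```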

Okay, now for the actual heuristic part. This will depend on your data; you might want to check whether the points are close together or far apart. But you can try a divide-and-conquer approach where you divide the plane into bins. When you need to find the closest points, find out which bin the point you are working with belongs to, then work only with the neighboring bins, exploring more neighboring bins as you need more points.
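A minimal sketch of that binning idea, with square cells of a hypothetical `cell_size` (all names here are mine). The key correctness detail is the stopping test: keep expanding rings of cells until the k-th best candidate found so far is within `r * cell_size`, since the rings scanned by then are guaranteed to cover every point that close. It also assumes k < number of points, or the loop never terminates.

```python
import math
from collections import defaultdict

def build_grid(points, cell_size):
    """Hash each point index into its square grid cell."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid

def k_nearest(points, grid, cell_size, q, k):
    """Indices of the k nearest points to point q (assumes k < len(points))."""
    qx, qy = points[q]
    cx, cy = int(qx // cell_size), int(qy // cell_size)
    best, r = [], 0
    while True:
        # scan the ring of cells at Chebyshev radius r around q's cell
        for gx in range(cx - r, cx + r + 1):
            for gy in range(cy - r, cy + r + 1):
                if max(abs(gx - cx), abs(gy - cy)) != r:
                    continue
                for i in grid.get((gx, gy), ()):
                    if i != q:
                        best.append((math.hypot(points[i][0] - qx,
                                                points[i][1] - qy), i))
        best = sorted(best)[:k]
        # rings scanned so far cover every point within distance r*cell_size,
        # so once the k-th candidate is inside that radius we are done
        if len(best) == k and best[-1][0] <= r * cell_size:
            return [i for _, i in best]
        r += 1

pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 2.0)]
g = build_grid(pts, 1.0)
print(k_nearest(pts, g, 1.0, 0, 2))  # → [1, 3]
```

Performance hinges on the cell size matching the point density, which is exactly where clustered vs. scattered distributions behave differently.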

Hopefully this helps, good luck.


2 Comments

Thanks for the input. The k-d tree is unlikely to provide much improvement, as I do need a reasonable number for K and in terms of practical efficiency I think sorting will have it beat, especially since the different sorts are independent and thus easily parallelized.
On the heuristic part: I did have a type of binning/pigeon-holing solution in mind, especially since a categorical/clustered sort would be sufficient for my application. However, my tests have trouble getting consistent and accurate results on certain point-distributions (e.g. highly clustered vs randomly scattered). Any ideas how to refine the idea?
