
Say I have 50 million features, each of which is read from disk.

At the beginning of my program, I go over each feature and, depending on some conditions, apply modifications to some of them.

At this point in my program, I read a feature from disk, process it, and write it back, because I don't have enough RAM to hold all 50 million features at once.

Now say I want to sort these 50 million features. Is there an optimal algorithm for this, given that I can't load them all at the same time?

Something like a partial sorting algorithm?


2 Answers


In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such an algorithm is merge sort, in its external variant.

The idea of the external version is that you split the data into chunks small enough to sort in memory (say, 100 thousand elements each), sort each chunk independently with a standard in-memory algorithm such as quicksort, and write each sorted chunk back to disk. Then you merge the sorted chunks: to merge two 100k chunks into one sorted 200k chunk, you only need to stream elements from the front of each chunk through small read buffers, since both chunks are already sorted. You keep merging until the final pass combines the last two chunks into a single file containing all the elements in the right order.
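For concreteness, here is a minimal sketch of the two phases in Python. It assumes each feature is one line of a text file with a numeric sort key, and the file names and CHUNK_SIZE are placeholders. It also uses a single k-way merge via heapq.merge instead of the pairwise merging described above; that's the same idea, just with fewer passes over the disk.

```python
import heapq
import os
import tempfile
from itertools import islice

# Tuning knob (made up): pick it so one chunk fits comfortably in RAM.
CHUNK_SIZE = 100_000

def sort_runs(in_path):
    """Phase 1: read fixed-size chunks, sort each in memory, spill each to a temp file."""
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = list(islice(f, CHUNK_SIZE))
            if not chunk:
                break
            # Normalize so the file's last line can't fuse with the next run's output.
            chunk = [line if line.endswith("\n") else line + "\n" for line in chunk]
            chunk.sort(key=float)  # any in-memory sort works here
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    return run_paths

def merge_runs(run_paths, out_path):
    """Phase 2: k-way merge of the sorted runs. heapq.merge streams lazily,
    so only one line per run is held in memory at any moment."""
    files = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files, key=float))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

merge_runs(sort_runs("features.txt"), "features_sorted.txt")
```

Peak memory stays around one chunk (phase 1) plus one buffered line per run (phase 2), regardless of the total input size.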


If you are on Unix, use sort ;)

It may seem too simple, but the command-line tool is already built to handle this case (it spills sorted runs to temporary files when the input doesn't fit in memory), so you won't have to reimplement it.
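In case it helps, here's a minimal sketch of driving GNU sort from a script. The file names, buffer size, and temp directory are assumptions for illustration; check sort(1) on your system for the exact options it supports.

```python
import subprocess

# GNU sort performs an external merge sort internally:
# -n sorts numerically, -S caps the in-memory buffer,
# -T says where to spill temporary runs, -o names the output file.
subprocess.run(
    ["sort", "-n", "-S", "1G", "-T", "/tmp",
     "-o", "features_sorted.txt", "features.txt"],
    check=True,
)
```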

