In step 3, instead of going thru the entire original array, you can chop the original array into pieces, say each piece contains 10 * N element (tweak with this 10, it could be 5, could be 50...). Sort each piece (we are sorting some smaller arrays), then only take the first N elements of the soted piece, and go thru them.
How is this an improvement (he asks curiously ;-) I don't see how creating a sorting several subsets of the data can be cheaper in time or space than a single pass through everything.