[OT] A measure of 'sortedness'?

by BrowserUk (Patriarch)
on Mar 18, 2015 at 21:45 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

This is both off-topic, and about as airy-fairy a question as I've ever asked.

You have two (equal length, though that may be irrelevant) sorted buffers of data, call them numbers for simplicity, and you have to merge them. In-place; though a small (say 10%) extra buffer could be used if it would help.

For reasons nothing to do with the merge, before performing it, you have to make (individually) a sequential pass over both buffers. Why is irrelevant to this question.

Is there anything that you could do during that pass that would somehow inform you in a way that would make the subsequent merge more efficient?

For example: although the passes over the two buffers only need to be done individually for their own requirement, is there anything that could be learned by running those passes in lock-step?

I.e. instead of:

    for i ( 0 .. n ) f( a[i] );
    for i ( 0 .. n ) f'( b[i] );

we might do:

    for i ( 0 .. n ) {
        f( a[i] );          // the required stuff for a[]
        f'( b[i] );         // the required stuff for b[]
        g( a[i], b[i] );    // and whilst we have them, do something to, or accumulate
                            // some statistic, that might help the subsequent merge.
    }

Alternatively, there would be negligible cost and might be some benefit in doing:

    f( a[0] ); f'( b[0] );
    for i ( 1 .. n ) {
        f( a[i] );
        f'( b[i] );
        g( a[i], a[i-1] );
        g'( b[i], b[i-1] );
    }
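
For concreteness, a minimal Perl sketch of the kind of cheap statistics such a lock-step pass might accumulate (the arrays and the choice of statistics here are purely illustrative):

use strict;
use warnings;

my @a = ( 1, 3, 5, 7,  9 );    # stand-ins for the two sorted buffers
my @b = ( 2, 4, 6, 8, 10 );

my ( $a_max, $b_min, $crossings ) = ( $a[0], $b[0], 0 );
for my $i ( 0 .. $#a ) {
    # f( $a[$i] ); f_prime( $b[$i] );    # the passes that have to happen anyway
    $a_max = $a[$i];                     # buffers are sorted, so the last seen is the max
    $b_min = $b[$i] if $b[$i] < $b_min;  # trivially $b[0]; kept for symmetry
    ++$crossings if $a[$i] > $b[$i];     # crude measure of how interleaved the two runs are
}
print "concatenation would suffice\n" if $a_max <= $b_min;
print "$crossings of ", scalar @a, " lock-step positions have a[i] > b[i]\n";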

I know it is a nebulous question, so please don't bother telling me that; but if just reading it triggers some idea, no matter how incomplete, please do share it.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re: [OT] A measure of 'sortedness'?
by SuicideJunkie (Vicar) on Mar 18, 2015 at 22:35 UTC

    Does the sequential pass of the buffers have to be done first?

    Given that the merge sort runs through each list fully in the process of merging, could you not make the results of the scan a side effect of the merge?

    I suppose you might be able to determine an upper limit during your sequential scans, to see how much buffer space would be required to do the merge without swapping.

    On the other hand, even if you have to do some swapping in order to keep the buffer to a fixed size, you're still going to get a linear worst case (starting with list A before list B, if all the Bs sort to the front, then you'll have to swap each A into the vacated B positions before moving them once more to their final position).

    Perhaps the more important question is in the nature of the items. How much work does it take to determine their value/sort order, and how much work does it take to move them?

    If they are complex objects with an expensive sort function, you could make the initial pass pre-calculate that. If the values of sequential items are correlated, you could possibly save time and/or brainpower doing it up front.
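
    For illustration, a minimal sketch of that key-caching idea (the record layout and extract_key() are hypothetical):

    use strict;
    use warnings;

    # Hypothetical records with an expensive-to-derive sort key.
    my @a = map { { payload => $_ } } ( 3, 8, 9 );
    sub extract_key { $_[0]{payload} }          # stand-in for the costly derivation

    # During the pass you have to make anyway, cache each key once ...
    my @a_keys = map { extract_key($_) } @a;

    # ... so the merge compares $a_keys[$i] with $b_keys[$j] (cheap scalar
    # compares) instead of re-deriving keys from the records every time.
    print "@a_keys\n";    # 3 8 9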

    If the moves are particularly expensive, then the pre-calculations getting a measure of sortedness could provide an improved time to completion estimate by guessing how many items need to be swapped twice to do it in-place.

    PseudoEdit:
    sauoq's idea sounds like a way to optimize down the number of random disk accesses by doing bunches of objects at a time. I got the impression your data was all in memory, however.

      Does the sequential pass of the buffers have to be done first?

      Yes.

      Perhaps the more important question is in the nature of the items.

      I think I mentioned "call them numbers". The point is that they are fixed length records; so even if there are multiple numbers, or a string, the comparison function will be very simple, so not much mileage there.

      I got the impression your data was all in memory however.

      Indeed. Whilst the data originates on disk and is bigger than memory; the two buffers being merged here are both fully in memory, but combined are close to the limits of memory, hence not enough space to perform the n-way merge.


        Whilst the data originates on disk and is bigger than memory; the two buffers being merged here are both fully in memory, but combined are close to the limits of memory, hence not enough space to perform the n-way merge.

        If the overall pair of data sets are bigger than memory, thus requiring doing it in chunks that fit in memory, why not smaller chunks?

Re: [OT] A measure of 'sortedness'?
by sauoq (Abbot) on Mar 18, 2015 at 22:08 UTC
    if just reading it triggers some idea; no matter how incomplete, please do.

    Depending on the actual size of your buffer in comparison to, say, the number of indexes you have (I don't know what you mean by 10%, exactly), you could essentially do the merge while passing through each by recording the points where you need to switch and grab from the other buffer. I'm not sure how much of a help it actually is.

    I'm also not sure how well I explained it.

    -sauoq
    "My two cents aren't worth a dime.";
      Depending on the actual size of your buffer in comparison to, say, the number of indexes you have (I don't know what you mean by 10%, exactly)

      As an indication: the two buffers might be 2GB each.

      The 10% was to indicate that I don't have enough physical memory to allocate another 4GB in which to do a normal n-way merge.

      When merging in place, a typical requirement is that there are two segments, 1 in each of the buffers, that need to be interchanged:

                    ab     cd                          ef         gh
      [..............xxxxxxx....................][......yyyyyyyyyyy........................]

      Above, the two buffers are in 'mid merge'. All the values f..g are > a and < d; and all the values b..c are > e and < h.

      Typically, this would be done by exchanging all the xs with ys:

                    af      d                          eb     c   gh
      [..............yyyyyyy....................][......xxxxxxxyyyy........................]

      Then rotating the whole section d..c forward 1 at a time:

                    af       d                          eb     c  gh
      [..............yyyyyyyy...................][.......xxxxxxxyyy........................]

                    af        d                          eb     c gh
      [..............yyyyyyyyy..................][........xxxxxxxyy........................]

                    af         d                          eb     cgh
      [..............yyyyyyyyyy.................][.........xxxxxxxy........................]

                    af         gd                          eb     ch
      [..............yyyyyyyyyyy................][..........xxxxxxx........................]

      Which means all the records between the original position of d and the final position of c have to move forward by 1, 4 times.

      Which doesn't look so bad when the difference is just 4 and the length is 20 or so, but if the difference is a few hundred or more, and the length (say) 1/2GB, it is very costly: a few hundred x 1/2GB of record moves.

      If instead, you can copy the 4 (or few hundred) to an out-of-line buffer, move the 20+ (or 1/2GB) forward by the length of that buffer just once, then put the ool buffer back, it saves lots of time.

      And if the out-of-line buffer (say 2MB) is smaller than the residual for the move (which will be a very rare event), then you do the move in two (or more) chunks. You've still divided the total number of record swaps by 2 million.
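
      A minimal Perl sketch of that out-of-line move, with fixed-width records flattened into a string (offsets and sizes made up for illustration):

      use strict;
      use warnings;

      # Move a short run past a long run using a small scratch buffer, so the
      # long run is shifted exactly once instead of once per rotation step.
      sub rotate_with_scratch {
          my ( $buf_ref, $start, $short_len, $long_len ) = @_;     # byte offsets/lengths
          my $scratch = substr( $$buf_ref, $start, $short_len );   # copy short run out of line
          substr( $$buf_ref, $start, $long_len ) =                 # slide long run down in one move
              substr( $$buf_ref, $start + $short_len, $long_len );
          substr( $$buf_ref, $start + $long_len, $short_len ) = $scratch;   # put short run back
      }

      my $buf = '....' . 'yyyy' . 'XXXXXXXXXXXXXXXXXXXX' . '....';
      rotate_with_scratch( \$buf, 4, 4, 20 );
      print "$buf\n";    # ....XXXXXXXXXXXXXXXXXXXXyyyy....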



        I know you said "call them numbers", but if they really are bytes of disk I/O data, the one-byte value range restriction does leave open the possibility of a very small index -- you could mark the first occurrence of each value in each of the buffers during the sequential scans.

        Other metadata could be computed as well, if they will be revisited frequently, like run-length of a particular value.

        During the merge operation, you might be able to speed up merge point decisions into what's left of the ethereal 10% side buffer based on knowing where to go to get the next chunk of data that needs resequencing.

        In fact, a run-length table in lieu of actual data might buy you the space you need for a one-pass optimization. Forget saving index offsets; save the value and the number of occurrences. (WARNING: the corner case where lots of runs shorter than sizeof(integer) cause the working buffer to actually be larger than the source buffers should be considered.) Then just overwrite the buffer with the resulting data.
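
        For what it's worth, a tiny sketch of that run-length scan (illustrative values only; the corner case above still applies):

        use strict;
        use warnings;

        # Collapse a buffer into (value, run_length) pairs during the required scan;
        # merging two such run lists is then a walk over runs rather than records.
        sub rle_scan {
            my ($buf) = @_;
            my @runs;
            for my $v (@$buf) {
                if ( @runs && $runs[-1][0] == $v ) { $runs[-1][1]++ }
                else                               { push @runs, [ $v, 1 ] }
            }
            return \@runs;
        }

        my $runs = rle_scan( [ 3, 3, 3, 7, 7, 9 ] );
        print "value $_->[0] x $_->[1]\n" for @$runs;    # 3 x 3, 7 x 2, 9 x 1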

        This would, of course, be an application-specific solution and not a general case proof for all airless merges, but sometimes that's the call the Engineer has to make.

        Okay, I'm done spamming you for the moment and heading home. This one sounds basic but fun; I'll be following this thread closely.

        The more I've thought about this, the more I doubt the first pass is really going to buy much at all that'll make your merge more efficient.

        In the long run, you still have to do the same comparisons and the same moves. In terms of performance, it doesn't matter much which loop they're in. My original thought above, if it's feasible at all, only really adds overhead in almost every case.

        The only little chance of saving anything, I think, is if you have a chunk of data at the beginning or end of the combined buffer that doesn't have to move. You might be able to identify that on your first pass and shorten your second pass to the data in between.
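
        A minimal sketch of that "shorten the second pass" idea (toy data, assuming ascending order with @a logically in front of @b):

        use strict;
        use warnings;

        my @a = ( 1, 2, 3, 10, 20 );     # stand-ins for the two sorted buffers
        my @b = ( 5, 15, 30, 40, 50 );

        # Front of @a already <= everything in @b: it never has to move.
        my $fixed_prefix = 0;
        $fixed_prefix++ while $fixed_prefix < @a && $a[$fixed_prefix] <= $b[0];

        # Back of @b already >= everything in @a: it never has to move either.
        my $fixed_suffix = 0;
        $fixed_suffix++ while $fixed_suffix < @b && $b[ $#b - $fixed_suffix ] >= $a[-1];

        print "merge can skip the first $fixed_prefix of \@a and the last $fixed_suffix of \@b\n";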

        That's all I got.

        -sauoq
        "My two cents aren't worth a dime.";
Re: [OT] A measure of 'sortedness'?
by marinersk (Priest) on Mar 18, 2015 at 22:45 UTC

    Re-Update: I should have known BrowserUk would not have tossed out an easy problem, so of course the buffers are not half-filled. Double-buffer shuffle with little to no air in the combined buffer space. Now I really will be up all night, as this just became a truly fun problem.

    Update: Shout-out to ++SuicideJunkie; seems like some of the same thoughts, only yours were more coherently assembled.

    I'll probably be up all night thinking about this, so thanks in advance for giving me something to do when insomnia kicks in tonight.

    The in-place requirement for the merge is interesting, and highlights some clarifying questions which might, in turn, inspire some pre-optimization. Please bear with me, I'm writing this off the cuff.

    Inventory:

    1. Two input buffers;
    2. Both buffers pre-sorted;
    3. Merge-in-place required;
      • Which place?
        • Input buffers start only half-filled?
        • Result needs to be split between the two input buffers?
      • Okay, stop interrupting yourself, marinersk. Get on with it.

    Okay, I'm going to proceed on the assumption that the merge is to buffer1, which is currently only 50% or less consumed, making room for the whole merged buffer to fit into buffer1 without hassle.

    <ramble>
    Okay, small side buffer is allowed. For the moment, let's pretend we're using one, and see later if we can move that into the spare space in the buffer itself.

    <lightbulb>
    Hey, borrow an idea from WWII Pacific Theater combat airplane design -- fuel tank bladders. As data is migrated from buffer2 to buffer1, more and more free space opens up in buffer2 which could be used to house a potentially growing amount of useful clue data.
    </lightbulb>

    So the first thing that popped into my head on the first read: Is there any value in knowing a small number of key values in advance? If the buffers were unsorted, I'd go for "Where is the lowest" and "Where is the highest", but they're already sorted.

    So how about the median? Is there anything about our merge operation that might be expedited by knowing key points like the median value offset? So our search and placement algorithm knows at what point to optimize to the second half of the buffer?

    Hmm. Gut says no, you're likely to be incrementing through your buffer indices on the fly.

    So here's a question: Do you have the privilege of choosing in which order the buffers are pre-scanned? If so, scanning buffer2 first would permit you to collect potentially useful information like lowest value, and whilst scanning buffer1 you could make note of where that first insert will take place. Can that be used to optimize the first merge pass? At the very least, it should optimize edge cases like data collusion/dispersion.

    My mind now wants to wander into advanced sort/merge logic, like tossing values into the spare space in the buffers akin to the way a binary sort reads the data, later to be assembled in order in a quasi-linear fashion, or something similarly crazy. That would take some careful thinking, and so I am going to take that thought as an indication that my ramble timer just expired.
    </ramble>

    Back to you, BrowserUk.

Re: [OT] A measure of 'sortedness'?
by marinersk (Priest) on Mar 18, 2015 at 23:00 UTC

    Side thought: This issue has likely been addressed by business-grade disk optimization routines which, in the 80s, had to consider the possibility of a full or nearly-full disk and still function.

    Might a bit of historical research on defrag utilities and algorithms possibly turn up some inspirational insight?

      Side thought: This issue has likely been addressed by business-grade disk optimization routines which, in the 80s, had to consider the possibility of a full or nearly-full disk and still function.

      That's why I love throwing out these questions to wider audiences.

      Many years ago whilst contracting at IBM, I was taken through the operational steps of what was one of, if not the, first in situ defrag utilities -- called 'vfrag' from memory -- by its author, in the company refectory. No paper or whiteboards, just knives and forks and (many) salt and pepper pots (much to the annoyance of the other diners).

      And you are right, it did have similar problems. The initial, slow, but very reliable, version required just one block or maybe cluster of free-space, and shuffled everything through that one space.

      The second, much faster version first consolidated the whitespace by shuffling the in-use (discontiguous) clusters up, starting from the front of the disc and working down, gradually accumulating the free-space clusters together and thus creating a bigger and bigger buffer into which to move the downstream chunks, until all the free space was accumulated at the end of the disk.

      Once done, the process of defragging the individual files became much simpler.

      Unfortunately, I'm not sure it helps here as I have no free space within the buffers. Though I can use my small out-of-line buffer to create some, as described elsewhere.


Re: [OT] A measure of 'sortedness'?
by GotToBTru (Prior) on Mar 19, 2015 at 19:14 UTC

    Given:

    @b1 = (1,1,1,4,4); @b2 = (1,2,3,4,6);

    and desiring:

    @b1 = (1,1,1,1,2); @b2 = (3,4,4,4,6);

    Why not use a sort algorithm that sorts the virtual array consisting of @b1 + @b2, using a function that maps the indices?

    sub ind {
        my $i = shift;
        if ( $i <= $#b1 ) { return \$b1[$i] }
        else              { return \$b2[ $i - $#b1 - 1 ] }
    }

    This doesn't make any use of those free passes thru each buffer. But it also doesn't need an external buffer.
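
    A runnable sketch of how that mapping might be exercised (the selection sort here is purely to show the virtual indexing at work; it ignores the fact that the runs are already sorted):

    use strict;
    use warnings;

    my @b1 = ( 1, 1, 1, 4, 4 );
    my @b2 = ( 1, 2, 3, 4, 6 );
    sub ind {
        my $i = shift;
        return $i <= $#b1 ? \$b1[$i] : \$b2[ $i - $#b1 - 1 ];
    }

    my $n = @b1 + @b2;
    for my $i ( 0 .. $n - 2 ) {
        my $min = $i;
        for my $j ( $i + 1 .. $n - 1 ) {
            $min = $j if ${ ind($j) } < ${ ind($min) };
        }
        ( ${ ind($i) }, ${ ind($min) } ) = ( ${ ind($min) }, ${ ind($i) } ) if $min != $i;
    }
    print "@b1 | @b2\n";    # 1 1 1 1 2 | 3 4 4 4 6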

    Dum Spiro Spero
      Why not use a sort algorithm?

      The best sorts are O(N log N), but they are not adaptive to partially sorted data.

      I tried it using the CRT qsort(). The first pass sorts 200 equal partitions of the array, which is actually quicker than sorting the entire array in one go.

      The second pass sorts the buffers together in pairs, moving left to right and back again in a shaker sort process until done:

      C:\test\C>mergeParts 200000000 200
      qsort took 59.047691648 secs for 200 partitions of 200000000 element array
      qsort took 7121.598593905 secs to merge 200 partitions of 200000000 element array

      The killer here is the shaker sort required to bring all the bits together, which is N*(N-1)/2 pairwise merges:

      C:\test>BiDiBubbleSort -N=4   #### 4*3/2 = 6 pairs
      [ D<>C B A ]: comp&swap( 0, 1 ) D < C
      [ C D<>B A ]: comp&swap( 1, 2 ) D < B
      [ C B D<>A ]: comp&swap( 2, 3 ) D < A
      [ C B<>A D ]: comp&swap( 1, 2 ) B < A
      [ C<>A B D ]: comp&swap( 0, 1 ) C < A
      [ A C<>B D ]: comp&swap( 1, 2 ) C < B
      A B C D

      C:\test>BiDiBubbleSort -N=5   #### 5*4/2 = 10 pairs.
      [ E<>D C B A ]: comp&swap( 0, 1 ) E < D
      [ D E<>C B A ]: comp&swap( 1, 2 ) E < C
      [ D C E<>B A ]: comp&swap( 2, 3 ) E < B
      [ D C B E<>A ]: comp&swap( 3, 4 ) E < A
      [ D C B<>A E ]: comp&swap( 2, 3 ) B < A
      [ D C<>A B E ]: comp&swap( 1, 2 ) C < A
      [ D<>A C B E ]: comp&swap( 0, 1 ) D < A
      [ A D<>C B E ]: comp&swap( 1, 2 ) D < C
      [ A C D<>B E ]: comp&swap( 2, 3 ) D < B
      [ A C<>B D E ]: comp&swap( 1, 2 ) C < B
      A B C D E

      For 200 partitions, that's 200*199/2 = 19,900 O(N log N) sorts required to merge the partitions.

      The only sorts that are adaptive to partially sorted data are the O(N²) ones like insertion sort and shell sort; which makes them look great for the purpose when the number of elements is 30 or 50 items, but means that even with their adaptive properties, they aren't great for merging 2 runs of a quarter of a billion elements ( 2 * 2GB / 8-byte ). Better than qsort(), but still not good when run 20,000 times.

      An N-way merge is O(N), but requires an unlimited output buffer, i.e. disk; and that is 3000 times slower than (cached) memory. If I can merge in place whilst avoiding moving every item 3000 times, I should save time. An O(N) in-place merge is possible, but the algorithms I've found described are complex, messy, badly described and a bitch to program.

      I have "discovered" another algorithm; still fairly complex, but two-stage, mutually tail-recursive, with both parts being relatively clean in the small.

      The first part deals with two equal-length inputs and requires a (partial) linear scan of both buffers to split the two into 2 parts each. Of those 4 parts, 2 (1 in each of the original partitions) require either no further processing, or a simple, in-place, equal-length swap. (Which also, by the by, completes the partial scans.)

      The other two parts are now recursed upon. Classic divide and conquer strategy.
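
      One possible reading of that first stage, as a hedged Perl sketch (my own interpretation, not the actual code): a partial linear scan finds the crossover, then an equal-length in-place swap leaves each buffer holding the two sorted runs that get recursed on:

      use strict;
      use warnings;

      sub split_and_swap {
          my ( $A, $B ) = @_;       # two sorted array refs of equal length n
          my $n = @$A;
          my $i = 0;                # partial scan for the crossover point
          ++$i while $i < $n && $A->[$i] <= $B->[ $n - 1 - $i ];
          for my $k ( 0 .. $n - 1 - $i ) {      # swap A[i..n-1] with B[0..n-1-i], in place
              ( $A->[ $i + $k ], $B->[$k] ) = ( $B->[$k], $A->[ $i + $k ] );
          }
          return $i;    # A now holds the n smallest values as two sorted runs meeting at $i,
                        # B the n largest as two sorted runs meeting at $n - $i; recurse on each.
      }

      my @A = ( 2, 4, 6, 8 );
      my @B = ( 1, 3, 5, 7 );
      my $i = split_and_swap( \@A, \@B );
      print "@A | @B (crossover $i)\n";    # 2 4 1 3 | 6 8 5 7 (crossover 2)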

      Aside: The qsort's greatest weakness -- the O(N²) reverse-sorted input -- can be trivially dealt with by using a single initial pass over the data:

      for( p1 = start, p2 = end; p1 < p2; ++p1, --p2 )
          if( a[p1] < a[p2] ) swap( p1, p2 );

      If the buffer was reverse sorted, it is now sorted. If you count the swaps and swaps == int(items/2), you're done. O(N) for qsort's nightmare. If the input was already sorted, no swaps and you're also done. Also O(N). Anywhere in between, and you've done no harm. Worst case now becomes: biggest, third biggest, fifth biggest; ...; sixth biggest, fourth biggest, second biggest, which isn't helped by the pre-sort; but neither is it harmed; and it is also far, far less likely than reverse sorted. This is a similar approach to pre-shuffling to avoid the problems, except that it benefits hugely in two common cases. The "discovery" of the pre-scan aided (avoided) the processing.

      And that's where this thread arose: as I'm doing a sometimes-partial/sometimes-full scan in the first stage, can I gather any information during it that can beneficially inform the second stage? So far, the answer seems to be no; but it triggered a couple of ideas that were worth investigating.


Re: [OT] A measure of 'sortedness'?
by Jenda (Abbot) on Mar 27, 2015 at 21:08 UTC

    Not sure what you need to do with the merged data, but assuming the individual elements are sufficiently large, what about using and keeping the 10% buffer as an "array of references" instead? It's true that when looping through the merged data, you'd have to do an array access and a dereference instead of one array access, but it still might work out.

    What I mean is that having

    @A = qw(a d f i j k l m o q); @B = qw(c e g h p r s t v w);
    you'd build
    @ref = ( \$A[0], \$B[0], \$A[1], \$B[1], \$A[2], \$B[2], \$B[3], \$A[3], \$A[4], \$A[5],
             \$A[6], \$A[7], \$A[8], \$B[4], \$A[9], \$B[5], \$B[6], \$B[7], \$B[8], \$B[9] );
    by a simple, O(n), merge and then the n-th element of the result would be ${$ref[$n]}.
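
    A hedged sketch of that simple O(n) merge building @ref (repeating the sample @A/@B so it runs standalone):

    use strict;
    use warnings;

    my @A = qw(a d f i j k l m o q);
    my @B = qw(c e g h p r s t v w);

    my @ref;
    my ( $i, $j ) = ( 0, 0 );
    while ( $i < @A && $j < @B ) {
        if ( $A[$i] le $B[$j] ) { push @ref, \$A[$i]; ++$i }
        else                    { push @ref, \$B[$j]; ++$j }
    }
    push @ref, \$A[$_] for $i .. $#A;    # append whichever side has leftovers
    push @ref, \$B[$_] for $j .. $#B;

    print ${ $ref[$_] } for 0 .. $#ref;  # acdefghijklmopqrstvw
    print "\n";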

    You'd need the buffer to be able to hold 2*n references.

    When merging with the next two buffers with their own reference array, you'd need another 4*n*size-of-reference buffer to do the merge and then you'd drop the two old reference arrays.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: [OT] A measure of 'sortedness'?
by Anonymous Monk on Mar 19, 2015 at 19:30 UTC

    Pardon the YX kind of question, but... why isn't bucket sort (on the whole dataset) applicable in your situation?

      why isn't bucket sort (on the whole dataset) applicable in your situation?

      1. How do you divide a (minimum) input range of 2^64, with a typical 100GB dataset representing just 0.00000007275957614183% of that range; into buckets?
      2. Because it is O(N²) worst case.

        But no indication of what provokes that worst case.

      3. Because it requires O(N+K) space worst case.

        (With no indication that I can see for what K is.)

      4. Because I haven't seen anything about a bucket sort that makes it a candidate for the problem?

        (A radix sort has possibilities; but I haven't gotten to pursuing them.)



        BrowserUk:

        This has been an interesting thread. I've been busy at work, so haven't really had a chance to think it over terribly much. I was kicking around the idea of radix sorting, so since you mentioned that it may have possibilities, I thought I'd pass along my experience with them. Simply put, I really liked the result. It was very fast.

        I just dug it out (hard to believe it was that long ago) and anonymized it. I may have over-anonymized it, so let me know if I broke it, so I can make suitable repairs. To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies. Then I wrote the list in the indicated order.

        I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key. If you used that knowledge while creating the file (for example), you could build a file that was *nearly* sorted, and then use a sliding window to sort the file.

        Since you're thinking of huge buffers and swaps, I thought you could use the information like so: Read a block "A" and do your preprocessing pass. As part of that process, you could compute the block and location of the best guess for the record. You could then use that guess to tell you which block may be best to pair "A" against to get records into their final locations in as few swaps as possible.
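
        A small sketch of that "best guess" arithmetic (all the numbers here are made up; it just assumes keys roughly uniform over the unsigned 64-bit range):

        use strict;
        use warnings;
        use POSIX qw(floor);

        sub guessed_final_index {
            my ( $key, $total_records ) = @_;
            return floor( ( $key / 2**64 ) * $total_records );
        }

        my $total     = 12_500_000_000;    # ~100GB of 8-byte keys
        my $per_block = 250_000_000;       # records per in-memory block, say
        my $guess     = guessed_final_index( 2**63, $total );
        printf "key 2**63 lands near index %.0f, i.e. block %d\n",
               $guess, floor( $guess / $per_block );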

        I don't know how involved you are in creating the original file, but if you have control over it, you may be able to possibly take advantage there as well.

        Maybe it'll trigger a useful thought...

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: [OT] A measure of 'sortedness'?
by Anonymous Monk on Mar 20, 2015 at 17:07 UTC

    Q: How to sort lots of data.
    A: Using mergesort.

    Let's say we have a dataset amounting to D, and M is the size that can be comfortably sorted in-memory. Now the first step suggests itself: divide the data up into n = D / M chunks, and sort each one separately. This much is already on the blueprints, I'd imagine.

    The second step is to merge the n stripes. Merge pass is a streaming operation, it does not require any storage per se. Though for efficient handling, fill buffers are practical.

    N-way merge may be accomplished via network of two-way merges. E.g. 4-way merge might be defined as follows:

    merge2(merge2(,), merge2(,))
    Those merge junctions assemble like drainage pipes. An n-way merge requires n-1 junctions, and 2n-1 fill buffers for I/O. The second pass will read and write all of the data records for the second time.

    As for the implementation: AVX2 intrinsics should prove handy with vectors of 64bit integers. Fill buffers in megabytes might be needed with mechanical storage.
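
    To make the junction picture concrete, a toy Perl sketch of composable two-way merges (nothing like the buffered/AVX2 implementation being suggested, just the shape of the network):

    use strict;
    use warnings;

    # A source is a closure returning the next value, or undef when exhausted.
    sub from_array {
        my @data = @_;
        return sub { @data ? shift @data : undef };
    }

    # A merge junction: two sources in, one merged source out.
    sub merge2 {
        my ( $x, $y ) = @_;
        my ( $a, $b ) = ( $x->(), $y->() );
        return sub {
            return undef unless defined($a) || defined($b);
            if ( !defined($b) || ( defined($a) && $a <= $b ) ) {
                my $out = $a; $a = $x->(); return $out;
            }
            my $out = $b; $b = $y->(); return $out;
        };
    }

    # Junctions assemble like drainage pipes: a 4-way merge from 3 junctions.
    my $merged = merge2( merge2( from_array( 1, 5, 9 ), from_array( 2, 6 ) ),
                         merge2( from_array( 3, 7 ),    from_array( 4, 8 ) ) );
    while ( defined( my $v = $merged->() ) ) { print "$v " }
    print "\n";    # 1 2 3 4 5 6 7 8 9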

      Q: How to sort lots of data. A: Using mergesort.

      Ooh! Why didn't I think of that. Erm. Let me re-phrase: what makes you think I haven't already considered that?

      That is the traditional way, isn't it. (Surprised it's taken you 3 days to look that up.) So what's wrong with the traditional way?

      To find out, generate yourself a 100GB file of random 64-bit ints (about 15 minutes work with a one-liner) and then use your local system sort utility to sort it. Then come back to me when it's finished, and we'll discuss it further.

      See you in 3 or 4 days!



        If you sort the chunks while generating, then apply the n-way merge as described, the full procedure results in a 100GB write, plus another 100GB of reads and 100GB of writes. In total, 300GB of streaming (external memory access).

        Are you searching for an algorithm that does better, or do you claim to have found one?

