in reply to [OT] A measure of 'sortedness'?

If just reading it triggers some idea, no matter how incomplete, please do.

Depending on the actual size of your buffer compared to, say, the number of indexes you have (I don't know exactly what you mean by 10%), you could essentially do the merge while passing through each buffer, by recording the points where you need to switch and grab from the other buffer. I'm not sure how much help it actually is.

I'm also not sure how well I explained it.

-sauoq
"My two cents aren't worth a dime.";

Replies are listed 'Best First'.
Re^2: [OT] A measure of 'sortedness'?
by BrowserUk (Pope) on Mar 18, 2015 at 22:48 UTC
Depending on the actual size of your buffer in comparison to, say, the number of indexes you have (I don't know what you mean by 10%, exactly)

As an indication: the two buffers might be 2GB each.

The 10% was to indicate that I don't have enough physical memory to allocate another 4GB in which to do a normal n-way merge.

When merging in place, a typical requirement is that there are two segments, 1 in each of the buffers, that need to be interchanged:

```
ab     cd                          ef         gh
[..............xxxxxxx....................][......yyyyyyyyyyy........................]
```

Above, the two buffers are in mid-merge. All the values f..g are > a and < d; and all the values b..c are > e and < h.

Typically, this would be done by exchanging all the xs with ys:

```
af      d                          eb     c   gh
[..............yyyyyyy....................][......xxxxxxxyyyy........................]
```

Then rotating the whole section d..c forward one position at a time:

```
af       d                          eb     c  gh
[..............yyyyyyyy....................][......xxxxxxxyyy........................]

af        d                          eb     c gh
[..............yyyyyyyyy...................][.......xxxxxxxyy........................]

af         d                          eb     cgh
[..............yyyyyyyyyy..................][........xxxxxxxy........................]

af         gd                          eb     ch
[..............yyyyyyyyyyy.................][.........xxxxxxx........................]
```

Which means all the records between the original position of d and the final position of c have to move forward by one, four times.

That doesn't look so bad when the difference is just 4 and the length is 20 or so; but if the difference is a few hundred or more, and the length is (say) 1/2GB, it is very costly: a few hundred passes over 1/2GB of records.

If instead you can copy the 4 (or few hundred) records to an out-of-line buffer, move the 20+ (or 1/2GB) records forward by the length of that buffer just once, then put the out-of-line buffer back, it saves a lot of time.

And if the out-of-line buffer (say 2MB) is smaller than the residual for the move (which will be a very rare event), then you do the move in two (or more) chunks. You've still divided the total number of record swaps by 2 million.
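The difference can be sketched in a few lines. (Python here purely for illustration; the segment bounds and sizes are invented, not taken from the post.)

```python
# Sketch: rotate a segment forward by k positions with a small out-of-line
# scratch buffer, instead of k one-step rotations. All names/sizes invented.

def rotate_one_at_a_time(buf, lo, hi, k):
    """Rotate buf[lo:hi] left by k, one step per pass (the costly way)."""
    for _ in range(k):
        first = buf[lo]
        buf[lo:hi - 1] = buf[lo + 1:hi]   # every record moves once per pass
        buf[hi - 1] = first

def rotate_with_scratch(buf, lo, hi, k):
    """Rotate buf[lo:hi] left by k via a k-sized out-of-line buffer:
    each record moves exactly once."""
    scratch = buf[lo:lo + k]          # copy k leading records out of line
    buf[lo:hi - k] = buf[lo + k:hi]   # one block move of the remainder
    buf[hi - k:hi] = scratch          # put the out-of-line records back

data1 = list(range(20))
data2 = list(range(20))
rotate_one_at_a_time(data1, 3, 15, 4)
rotate_with_scratch(data2, 3, 15, 4)
assert data1 == data2                 # same result, far fewer record moves
```

The scratch version touches each record in the rotated span once instead of k times, which is where the "divide the swaps by the scratch-buffer size" saving comes from.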

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

I know you said "call them numbers", but if they really are bytes of disk I/O data, the one-byte value range restriction does leave open the possibility of a very small index -- you could mark the first occurrence of each value in each of the buffers during the sequential scans.

Other metadata could be computed as well, such as the run length of a particular value, if it will be revisited frequently.

During the merge operation, you might be able to speed up merge-point decisions into what's left of the ethereal 10% side buffer, based on knowing where to go to get the next chunk of data that needs resequencing.

In fact, a run-length table in lieu of actual data might buy you the space you need for a one-pass optimization. Forget saving the index offset; save the value and the number of occurrences. (WARNING: the corner case where lots of runs shorter than sizeof(integer) make the working buffer actually larger than the source buffers should be considered.) Then just overwrite the buffer with the resulting data.
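As an illustration of the run-length idea (a sketch in Python; the one-byte value range is the assumption that makes it viable, and all names here are mine):

```python
# Sketch: merge two sorted byte-valued buffers via run-length tables.
# With one-byte values the merged table has at most 256 entries, so the
# merged order is fully determined by per-value counts.

def run_lengths(buf):
    """Collapse a sorted buffer into (value, count) runs."""
    runs = []
    for v in buf:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def merge_runs(a, b):
    """Merge two sorted buffers using only their run tables, then expand
    the table back over the output (i.e. overwrite with the result)."""
    counts = {}
    for value, count in run_lengths(a) + run_lengths(b):
        counts[value] = counts.get(value, 0) + count
    out = []
    for value in sorted(counts):
        out.extend([value] * counts[value])
    return out

assert merge_runs([1, 1, 3, 7], [2, 3, 3, 9]) == [1, 1, 2, 3, 3, 3, 7, 9]
```

A real implementation would need the guard mentioned in the warning above: lots of length-1 runs make the table bigger than the data it summarizes.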

This would, of course, be an application-specific solution and not a general case proof for all airless merges, but sometimes that's the call the Engineer has to make.

Okay, I'm done spamming you for the moment and heading home. This one sounds basic but fun; I'll be following this thread closely.

I think this has triggered a workable idea.

Given a 2GB buffer of (minimum) 8-byte records = 268,435,456 records; using 1-bit per record requires 32MB. Well within my 10%.

So, what can I record with 1-bit?

My first thought is that I can record whether this record is contiguous with the preceding one. I.e., is it equal to it, or exactly one greater? In effect, contiguous runs of 1-bits become chunks of the buffer(s) that can be translocated to their final position en masse.

And (I think) by processing down those bitmaps in parallel (though not in lock-step), and knowing the first value in each of the two buffers, it gives me a straightforward map of how things have to be moved into their final positions.

Does that convert the original O(N²) process into an O(2N) (or 3N?) process?
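A minimal sketch of the bitmap idea (Python for illustration; records are simplified to plain integers, and the chunk-relocation step itself is not shown):

```python
# Sketch: one bit per record saying "contiguous with my predecessor"
# (equal, or exactly one greater). Runs of 1-bits delimit chunks that can
# be translocated en masse. Records simplified to plain integers here.

def contiguity_bitmap(buf):
    """bits[i] is True iff buf[i] equals buf[i-1] or is exactly one greater.
    For 2**28 records this is 2**28 bits = 32MB, as computed above."""
    return [i > 0 and buf[i] - buf[i - 1] in (0, 1) for i in range(len(buf))]

def chunks(buf):
    """Split the buffer into maximal contiguous runs using the bitmap."""
    if not buf:
        return []
    bits, out, start = contiguity_bitmap(buf), [], 0
    for i in range(1, len(buf)):
        if not bits[i]:               # run breaks here; start a new chunk
            out.append(buf[start:i])
            start = i
    out.append(buf[start:])
    return out

assert chunks([3, 4, 4, 5, 9, 10, 20]) == [[3, 4, 4, 5], [9, 10], [20]]
```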


In the long run, you still have to do the same comparisons and the same moves. In terms of performance, it doesn't matter much which loop they're in. My original thought above, if it's feasible at all, only really adds overhead in almost every case.

The only little chance of saving anything, I think, is if you have a chunk of data at the beginning or end of the combined buffer that doesn't have to move. You might be able to identify that on your first pass and shorten your second pass to the data in between.

That's all I got.

-sauoq
"My two cents aren't worth a dime.";
The only little chance of saving anything, I think, is if you have a chunk of data at the beginning or end of the combined buffer that doesn't have to move.

That's exactly what the pre-scan is doing. "Topping and tailing" the two buffers for bits that either don't need to move, or only need to be block-moved rather than merged:

```
                  |---|
---|   |                 ___
---      |           ___---
---          |     ___---
---             |__---
---|               |
---   |               |
```
Don't read too much into the ASCII art, but there are obviously two chunks, the left and right ends of the left buffer, that can either be left where they are or just block-moved into place, respectively.

There are 5 other possibilities:

1. One requires no further processing: all of the left buffer is less than all of the right buffer. (Full scan required.)

2. One requires a single block swap: all of the left buffer is greater than all of the right buffer. (Full scan required.)

3. Four (including the above) that either leave in place or block-move two smaller chunks, and need to recurse on the other two. (Partial scan if left in place; full if moved.)
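A minimal sketch of the topping-and-tailing pre-scan (Python for illustration; names are mine). The fixed prefix of the left buffer and fixed suffix of the right buffer never need to move, and the degenerate cases above fall out when a prefix or suffix covers a whole buffer:

```python
# Sketch: the pre-scan "tops and tails" the buffers by finding the prefix
# of the left buffer that sorts before everything in the right buffer, and
# the suffix of the right buffer that sorts after everything in the left.
# Only what's in between needs a real merge. Names are illustrative.
import bisect

def top_and_tail(left, right):
    """Return (len of fixed prefix of left, len of fixed suffix of right)."""
    prefix = bisect.bisect_left(left, right[0])
    suffix = len(right) - bisect.bisect_right(right, left[-1])
    return prefix, suffix

left = [1, 2, 3, 8, 9]
right = [5, 6, 7, 30, 40]
assert top_and_tail(left, right) == (3, 2)   # [1,2,3] and [30,40] stay put
# Degenerate case: everything in left precedes everything in right,
# so nothing at all needs merging.
assert top_and_tail([1, 2], [5, 6]) == (2, 2)
```

(A binary search is shown for brevity; the post's sequential pre-scan discovers the same bounds while doing its other work.)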

What I was looking for was something discoverable or doable during that first scan, that abetted the next stage.


I'm coming to the same conclusion. Inevitably, all my creativity winds up solving the "how to conserve space" and to a lesser degree the "how to minimize data moves" issues -- and none of my thinking in those areas is yielding a first-pass-scan benefit.

This is maddeningly entertaining, because I can't count the number of times in my career when I could have optimized the inner algorithm if only I'd had the benefit of a pre-processing linear scan, but it was too costly to be worth implementing.

Here, I have the exact opposite problem -- I am going to get a pre-processing linear scan, and I can't make use of it to optimize the inner process.

Somewhere, some deity of software development is having a huge guffaw at my expense.

:-)