PerlMonks  

[OT] Stats problem

by BrowserUk (Pope)
on Feb 26, 2015 at 09:47 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Any group of 4 bytes in memory can contain any of 2^32 values.

4GB of memory contains 2^29 (half a billion) 8-byte aligned, 4-byte fields.

If a runaway loop writing random bytes overruns its buffer and scribbles over one of the 32-bit fields, what are the odds that those 4 bytes, viewed as an unsigned 32-bit integer, would match that field's offset into the 4GB of RAM?

Illustratively:

    memory:(hex)  [00000000........][00000008........][00000010........][00000018........] ... [12345678........][12345680........] ...
    offset:(dec)   0                 8                 16                24                 ...  305419896

The premise is that if free blocks are marked with their offset into the 4GB block, and a buffer overrun occurs, then any block that has been overwritten is easily detectable, with a high degree of certainty - a far higher degree than if the typical fixed, known bit-pattern were written there.

Corruption could write any of the possible values anywhere; and if a fixed bit-pattern is used as the marker, there is, assuming totally random corruption, a 1 in 2^32 chance of a specific field being mistaken for the fixed bit-pattern. But with potentially 2^29 fields, those odds shorten markedly. I think to 1:(2^32/2^29) = 1 in 8 if the whole 4GB were corrupted.

But the odds against an exact offset value being written at that exact offset have to be many times higher. But how high?
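
For concreteness, the arithmetic as a quick Perl one-off - a sketch assuming uniformly random corruption of every field, which is the best case for that 1-in-8 figure:

    # Expected number of the 2^29 fields that, by pure chance, end up
    # holding their own offset if all 4GB were overwritten with
    # uniformly random bytes.
    my $fields  = 2**29;     # 8-byte aligned, 4-byte fields in 4GB
    my $p_match = 2**-32;    # chance one random U32 equals a given value
    printf "expected false matches: %g (1 in %g)\n",
        $fields * $p_match, 1 / ( $fields * $p_match );
    # prints: expected false matches: 0.125 (1 in 8)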


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re: [OT] Stats problem
by salva (Abbot) on Feb 26, 2015 at 11:40 UTC
    If I understand your question correctly (and I am not completely sure I do), what you suggest is equivalent to the following problems:

    What are the odds of picking the ace of spades from a shuffled deck of cards?

    Now, pick a card, insert it back in the deck and shuffle. What are the odds of picking the same card again?

    Finally, pick cards from the deck until you get a face card, insert them all back and shuffle. If you pick a card now, what are the odds of it being that face card again?

      Now, pick a card...

      I wish I had thought of that analogy.

      Recouched in terms of that analogy the problem becomes:

      Assume the 'standard ordering' of a pack of cards (minus jokers) is A..K of Spades, A..K of Diamonds, A..K of Clubs, A..K of Hearts.

      Shuffle the pack fairly. Now go through the pack and check how many (if any) of the cards end up in their 'standard position'. Repeat many times.

      How often does a card end up in its standard position? What are the odds of a card ending up in its standard position?

      It's not quite right, because each card can only end up in one position; whereas with memory corruption, a value may (and usually will) be repeated.

      Also, all the possible values (cards) will appear; but in memory, it takes 4 bytes to hold the offset of a single byte. But then again, we are only interested in offsets % 8... and only half of the values in memory... I'm lost again.


        What are the odds of a card ending up in its standard position?

        That's easy enough - it's 1 in 52.
        A card can be in any one of 52 positions, only one of which is its "standard position" - and there's no bias that favours one position over another.

        UPDATE: What follows doesn't scale correctly ... it ain't right, IOW - probably the linkage that salva was getting at.

        The chance that none of the cards are in their standard position is simply (51/52) ** 52, which is around 0.3643 or approximately 1 in 3.
        Therefore the chance that 1 or more cards are in their standard position is 1 - ((51/52) ** 52) which is around 0.6357 or approximately 2 in 3.
        (You could turn this into a rather boring game of patience/solitaire.)
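
        A quick numeric check in Perl (a minimal sketch; the simulated shuffles converge on the derangement probability, about 1/e, which is why the independence shortcut above doesn't scale exactly):

            use List::Util qw(shuffle);

            # Independence approximation used above:
            printf "(51/52)**52 approximation: %.4f\n", (51/52)**52;   # ~0.3643

            # Simulated fair shuffles; the exact answer is the derangement
            # probability, which tends to 1/e =~ 0.3679:
            my ( $trials, $none ) = ( 100_000, 0 );
            for ( 1 .. $trials ) {
                my @deck = shuffle 0 .. 51;
                $none++ unless grep { $deck[$_] == $_ } 0 .. 51;
            }
            printf "simulated P(no match):    %.4f\n", $none / $trials;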

        As for what you're actually trying to calculate, I'm still trying to get my head around it.
        (Please stick to cards in future ;-)

        Cheers,
        Rob

        Perhaps a better analogy is this: Imagine six six-sided dice, and six boxes labeled 1 to 6 on the table in front of you*; i.e. one die per box into which it will be rolled. Assuming the die rolls are independent of one another, the probability of each die showing the label of its box is 1/6 - no matter what that label is! The probability of one die showing its box's label and also a second die independently showing its box's label is (1/6)*(1/6); N dice, (1/6)^N.

        However, this all changes if the die rolls are not truly random or not truly independent of one another!

        * That's just for simplicity; it's too easy to get hung up on those numbers - the number of dice, boxes, and sides, or even whether they are labeled with numbers or hieroglyphs does not matter!

        If I'm remembering my probabilities correctly, the probability of event A happening and event B happening independently is P(A)*P(B), so the probability of an event A occurring independently N times is P(A)^N.
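
        (A minimal sketch to make that concrete - the trial count is arbitrary, and at odds of (1/6)**6 the estimate is noisy:)

            # Roll six dice into their six labeled boxes, many times over.
            my ( $trials, $hits ) = ( 1_000_000, 0 );
            ROLL: for ( 1 .. $trials ) {
                for my $box ( 1 .. 6 ) {
                    my $roll = 1 + int rand 6;    # this die's face
                    next ROLL if $roll != $box;
                }
                $hits++;    # every die matched its box's label
            }
            printf "simulated: %.2e  exact (1/6)**6: %.2e\n",
                $hits / $trials, ( 1 / 6 )**6;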

Re: [OT] Stats problem
by RichardK (Parson) on Feb 26, 2015 at 11:14 UTC

    AFAICT the odds are exactly the same. You seem to be asking what are the odds of a given 4 bytes being set to a single specific value. It doesn't make any difference if that value is 0xDEADBEEF or its memory address -- the odds are still 1 in 2^32.

      Hm. Not convinced.

      If you write random bytes to the whole 4GB, then inspect them as 4-byte aligned U32s, then only 1/4 of the possible values or less will appear.

      Only inspect every other 4-byte aligned U32 and only 1/8th or less of the possible values will appear.

      And if you write the same value to every single slot, it could only match the offset at one position.

      So 7/8ths or more of the possible values will not appear anywhere; and of the rest, the chances that a value will appear at an offset that matches the value have to be slim. Bordering on impossible.
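
      (A scaled-down sanity check of the 1/4 claim - the sizes are made up, but keep the same 1:4 ratio of draws to value space:)

          # Draw 2**16 random values from a 2**18 space, as 2**30 aligned
          # U32s are drawn from a 2**32 space.  At most 1/4 of the space
          # can appear; uniform draws actually cover ~1-exp(-1/4) =~ 0.2212.
          my ( $draws, $space ) = ( 2**16, 2**18 );
          my %seen;
          $seen{ int rand $space } = 1 for 1 .. $draws;
          printf "fraction of possible values appearing: %.4f\n",
              keys(%seen) / $space;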


        If you're not convinced, then why not build yourself a small Monte Carlo simulation and try it out?
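
        Something along these lines, perhaps - a scaled-down sketch (the field count and value space are made up, but keep the real 2^29:2^32 ratio, so the expected hit rate is the same 1 in 8):

            # 2**12 fields whose offsets (ix*8) live in a 2**15 value
            # space, just as 2**29 fields' offsets live in a 2**32 space.
            my ( $fields, $space, $passes ) = ( 2**12, 2**15, 2_000 );
            my $hits = 0;
            for ( 1 .. $passes ) {
                for my $ix ( 0 .. $fields - 1 ) {
                    $hits++ if int( rand $space ) == $ix * 8;
                }
            }
            printf "mean offset-matches per pass: %.3f (expect %.3f)\n",
                $hits / $passes, $fields / $space;
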
Re: [OT] Stats problem
by roboticus (Chancellor) on Feb 26, 2015 at 13:12 UTC

    BrowserUk:

    I'm not sure I know exactly what you're asking, and unfortunately, statistics is one of those things where intuition can often lead you astray.

    Let's do a little hand waving: Assume our initial state is that each field contains the ones complement of its address, and our addresses and values are a uniform distribution.

    Case 1: select a random address (byte-aligned) and write a random byte. Probability of any field containing its address is obviously[1] 0. Consider that with the initial setup, you need to alter four consecutive bytes starting at a field boundary for there to be any chance of having a field value matching its address.

    Case 2: select four addresses at random, and write a random byte to each one. Probability of any field containing its address: not zero, but much[5] less than 1/2^32. You'd have to (1) select four adjacent addresses, (2) all four addresses must be in the same field, and (3) write the correct values. I'm not going to compute this because I'm not sure that's the question you're asking[3].

    Case 3: select n addresses at random, and write a random byte to each one, where n is larger than four. The probability will be larger than case 2, but still pretty darned small.

    Case 4:[6] select a random address, and write n random bytes. For n less than four, the probability is obviously[4] 0. Where n is four through six, it's much[5] more likely than for case 2, as n-1 addresses aren't random, so you won't miss overwriting a complete field as frequently (you'd miss 25%, 50% or 75% of the time, respectively). Where n is greater than 6, you'd never miss overwriting a field, so the probability increases more.

    So if we knew a bit more about the behaviour of the overwriting process, it might be easier to figure out what the probabilities would look like. My hand-wavy opinion is that for cases where you're writing sequential strings of bytes (significantly longer than four in a row), the probability approaches 1/2^32; while selecting bytes at random and overwriting them would be more like the Bloom filter math, and much less likely until the number of bytes being written becomes large enough (in which case, I'd expect your program to have failed spectacularly well in advance of hitting the tragic case.)
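
    (A rough check of the case-4 miss rates - a toy sketch assuming contiguous 4-byte fields, which is how the 25/50/75% figures read; it suggests those misses belong to n=6, 5 and 4 respectively:)

        # A run of n random bytes starting at a random offset: how often
        # does it cover at least one complete 4-byte field?
        my $trials = 100_000;
        for my $n ( 4 .. 7 ) {
            my $covered = 0;
            for ( 1 .. $trials ) {
                my $s = int rand 4;    # start offset within a field
                # [s, s+n) covers a whole field iff a multiple of 4
                # lies in [s, s+n-4]
                $covered++ if grep { $_ % 4 == 0 } $s .. $s + $n - 4;
            }
            printf "n=%d: covers a field %.0f%% of the time\n",
                $n, 100 * $covered / $trials;
        }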

    Oh, well, off to shower & work. I may think about it a bit more on my drive to work.

    Notes:

    1) Every time[2] I work with statistics and say "X is obviously Y", I'm wrong.

    2) Almost. If it were that reliable, I could just insert the word "not" after obviously, and be brilliant.

    3) Not to mention that it'd take me a while to convince myself whether what I computed would be correct or not.

    4) There's that word again. Feel free to insert a "not" here if I'm wrong.

    5) The value of much left as an exercise to the reader.

    6) Yeah, this should really be split out into more cases, but it's a pain and I've gotta get to work.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: [OT] Stats problem
by SuicideJunkie (Vicar) on Feb 26, 2015 at 14:49 UTC

    I think the other monks have adequately covered the case of writing any possible value anywhere.

    However, when things are going wrong, what values are more or less likely to be written to a cell? I would expect the likelihood of writing a number to the cell to be inversely proportional to the "magicness" of the number.

    That is to say, writing zeros is pretty common. One, or 0xFFFFFFFF? Still very common values. Byte sequences from strings and constants already present in your code have good odds. Common values of local variables are likely to be written too. That includes the address of the cell, because that's what you're using to write to the cell in the first place (more so in C code than Perl, however).

    That leaves high-valued numbers which are somewhat distant from any constants in your program as the least likely to appear as corrupt data. Each byte of the value should be above 0x80 to stay away from common strings. 0xDEADBEEF matches all those conditions while being fairly obvious as a sentinel value to human (and particularly English-speaking) readers. Grepping the dictionary for words that contain only A-FSIO (S->5, I->1, O->0) will get you some nice alternatives if you want to mix it up.
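
    For instance, a minimal sketch of that grep (the word-list path is an assumption):

        # Find 8-letter words spellable in hex after S->5, I->1, O->0,
        # i.e. candidates for a 32-bit sentinel.
        open my $dict, '<', '/usr/share/dict/words' or die "no word list: $!";
        while ( my $word = <$dict> ) {
            chomp $word;
            next unless length $word == 8;
            ( my $hex = uc $word ) =~ tr/SIO/510/;
            print "$word => 0x$hex\n" if $hex =~ /\A[A-F015]+\z/;
        }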

      See Re^2: [OT] Stats problem


Re: [OT] Stats problem
by Anonymous Monk on Feb 26, 2015 at 11:33 UTC

    I would think that if the bytes being written are truly random, and you're viewing each of the fields independently, the probability of corrupt data being mistaken for a "marker value" should always be 1 in 2**32, no matter what marker value you have chosen for that particular field. But since I suspect the data isn't truly random, and it also seems like you're asking about some combination of events, more specific definitions of these are probably necessary (e.g. what is "total random corruption" and "if the whole 4GB were corrupted"?).

    Is the only purpose of saving the offset with the value to detect errors? If "Corruption could write any of the possible values anywhere", wouldn't that mean that a bad write to a 4-byte value instead of one of the 4-byte offsets could not be detected? Are FEC codes or CRCs not an option?

Re: [OT] Stats problem
by salva (Abbot) on Feb 26, 2015 at 15:20 UTC
    In any case, the probability of detecting a buffer overrun using a fixed magic value is already 1 - 2**-32.

    Isn't that enough for you?

      In any case, the probability of detecting a buffer overrun using a fixed magic value is already 1 - 2**-32. Isn't that enough for you?

      The problem is that in real life the data written by overruns is rarely ever random.

      The problem I saw yesterday was that the overrun occurred at both the source and destination of a memcpy().

      Running it under the debugger with a debug malloc() in force, what happened is that, as both source and destination fields were equal-sized and equally aligned, the debug telltale signature that followed both chunks was identical; so the overrun copied the telltale following the source buffer exactly on top of the telltale following the destination buffer.

      So the heap check didn't detect the overrun!

      But data following the telltale following the source also got copied over the data following the telltale at the destination, which showed up later when an entirely unrelated variable contained an impossibly high value and crashed the program. And I spent a long time looking in the wrong place.

      Hence, I reasoned that if the telltales were all different the heap check would have found the error earlier and pointed me at the real culprit.

      But, how could you make the telltales all different and still know what to look for? And that's when the offset idea hit me.

      Even for 64-bit addresses, it is fast to derive a 32-bit offset from the nearest preceding 4GB boundary.

      And the only way the same problem could occur is if the source and destination pointers were exactly 4GB apart; which is considerably less likely than the 16/32/64-byte granularities often used by standard mallocs.
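
      That derivation is just masking off the low 32 bits - a one-off sketch, with a made-up example address (64-bit Perl assumed):

          my $addr   = 0x0000_7FFC_1234_5678;   # hypothetical 64-bit address
          my $offset = $addr & 0xFFFF_FFFF;     # i.e. $addr % 2**32: offset from
                                                # the preceding 4GB boundary
          printf "offset: 0x%08X\n", $offset;   # prints: offset: 0x12345678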



        BrowserUk:

        For that particular use case, you might try something I did when I had a similar problem: put a matching marker just *before* the address you return in your malloc routine. (In mine, I had a counter inside of malloc, so the marker just happened to be the number of times malloc was called, plus the starting value.) I bracketed the requested memory area with the marker, and filled the block with junk. I added the 'before' marker because I was tracking down a buffer overrun as well. (Another thing I did was to put the caller's address and the requested allocation size in the preamble as a debugging aid. Fit nicely in a 'paragraph' in 32-bit mode.)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        If you don't mind the debug mode speed hit, you could use an alternate memcpy() and friends.

        The idea being, the memory functions will check the values they're copying, and assert that they're not writing the telltale value, since it should never be used as a value in your program.

        I see, what you propose makes sense.

        You can also mix both the offset and the magic marker using xor:

        mem[ix] = ix ^ magic_value;
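
        And the check on the way back is symmetric - a toy sketch with made-up values:

            my $magic = 0xDEADBEEF;
            my @mem   = map { $_ ^ $magic } 0 .. 9;   # ten marked free slots
            $mem[3]   = 0x12345678;                   # simulated corruption
            # a slot is intact iff unmixing it recovers its own index
            ( $mem[$_] ^ $magic ) == $_ or print "slot $_ corrupted\n"
                for 0 .. 9;
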
        As the 4GB offset repeats periodically, why not some unpredictable, non-periodic function that can be easily created and checked?
        • random value based on 64 bit address as seed (last 32 bits)
        • MD5 hash (last 32 bits)
        • XOR of upper and lower 32 bits of address

        I'm sure you can come up with better, faster functions.
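
        (A minimal sketch of the last option, assuming a 64-bit Perl:)

            # Telltale = XOR of the address's upper and lower 32-bit halves:
            # cheap, and not periodic in the low 32 bits alone.
            sub telltale {
                my $addr = shift;
                return ( $addr >> 32 ) ^ ( $addr & 0xFFFF_FFFF );
            }
            printf "0x%08X\n", telltale( 0x0000_7FFC_1234_5678 );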

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: [OT] Stats problem
by trippledubs (Deacon) on Feb 26, 2015 at 17:14 UTC
    I thought 4GB of memory = 4 * 2^30 = 2^2 * 2^30 = 2^32.
      Got it: 2^32 / 2^3 = 2^29. OK, now I can read the rest.
