http://www.perlmonks.org?node_id=1056280

saintmike has asked for the wisdom of the Perl Monks concerning the following question:

Seems that starting in perl 5.17, keys() to obtain the keys in a hash is now truly randomized, which means that even within the same process, calling keys() twice on the same hash will result in a different key order.

In previous versions, this used to be just per-process, so you couldn't rely on the sort order from one process invocation to the next, but calling keys() twice on the same hash within the same process always returned consistent results.

This seems like a major challenge for testing applications. If my hash serializer produces different results every time, does my CPAN module have to take on the daunting task of generating every possible permutation of the keys() result used underneath, leading to an explosion of application-layer results to test against?

So far, in the test suite for https://github.com/mschilli/php-httpbuildquery-perl I've worked around this issue by calling keys() once to determine the order, then testing against the result the application generates on its second call. With perl 5.17, this is no longer possible.

What do people do to test against unpredictable core functions? Shouldn't the core provide some kind of API to figure out what the order is/was for testing?

Replies are listed 'Best First'.
Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by tobyink (Canon) on Sep 30, 2013 at 06:50 UTC

    Sort the keys in tests, or use tied hashes that preserve key order. Or test a round-trip (i.e. use your module to convert Perl data structure -> serialized -> Perl data structure, and use is_deeply to compare the input and output data structures).
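    A minimal sketch of the round-trip idea. The serialize/deserialize pair below is a hypothetical stand-in for the module under test, not its real API:

    ```perl
    use strict;
    use warnings;
    use Test::More tests => 1;

    # Hypothetical serializer/deserializer standing in for the real module.
    sub serialize   { my ($h) = @_; join '&', map { "$_=$h->{$_}" } keys %$h }
    sub deserialize { my ($s) = @_; return { map { split /=/, $_, 2 } split /&/, $s } }

    my $input = { a => 'b', c => 'd' };

    # Round-trip: data structure -> string -> data structure.
    # is_deeply compares hashes by content, so key order never matters.
    is_deeply( deserialize( serialize($input) ), $input, 'round-trip preserves data' );
    ```

    Because the comparison happens on the deserialized structure, the test passes regardless of which order keys() happened to use.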

    use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name
Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by hdb (Monsignor) on Sep 30, 2013 at 06:34 UTC

    Sorting your keys in your tests would be one way, but it would not detect any bugs where you implicitly rely on sorted keys...

      but it would not detect any bugs where you implicitly rely on sorted keys...
      I hope saintmike isn't relying on implicit key ordering, especially for tests?

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        I did not want to suggest he is but others might be reading this as well... Just jumped to my mind, that there could be code that worked with ordered keys but not otherwise.

Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by kcott (Archbishop) on Sep 30, 2013 at 10:54 UTC
Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by BrowserUk (Patriarch) on Sep 30, 2013 at 10:50 UTC

    The problem with this new key randomisation code is not the non-determinacy it introduces -- that has been there for a long time, albeit in a lesser form.

    The problem is that it is a pointless prophylactic that doesn't even come close to solving the "problem" that was used to justify its addition to the code-base.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Proof?

        Where's the proof it does?

        How can an undocumented problem be countered?


      Actually, this change is a removal from the code base. It's a simplification of the existing mechanism. Rather than perturbing the hash when an attack is detected, the salt is always applied. To make that simplification safe, the salt needs to be different for each hash.

        In demerphq's own words:

        bulk88> I disagree there is a measurable cost to the current implementation of REHASH. If the rehash code is so rare, maybe it should be removed from hv_common and placed in its own func, but looking at the asm, the overhead is so tiny I dont think the rehash code even matters compared to the rest of the design of hv_common. I don't see any performance gains by removing it.

        demerphq> Yes, I am unable to show any actual performance gains either.

        So, not for performance (gain) reasoning.

        Also in his own words:

        demerphq So I think that the current rehash mechanism is about as secure as the random hash seed proposal.

        And:

        demerphq> Personally I dont think its worth the effort of doing much more than thinking about this until someone demonstrates at least a laboratory grade attack. IMO in a real world environment with multi-host web sites, web servers being restarted, load balancers, and etc, that simple hash randomization is sufficient. Seems like any attack would require large numbers of fetch/response cycles and in general would just not be effective in a real production environment. I would assume that the administrators would notice the weird request pattern before an attacker could discover enough information to cause damage. Same argument for non-web services IMO.

        And it doesn't make anything more secure.

        And Dave_the_m said:

        Indeed, based on the thread starting at

        Message-ID: <2003101002...@perlsupport.com>

        it looks like the primary motivation for moving to rehash was to restore the binary compatibility within the 5.8.x branch inadvertently broken by 5.8.1.

        I'm not particularly keen on having hashes always randomised - it makes debugging harder, and reproducing a reported issue nigh-on impossible; but if Yves can show a measurable performance gain without the rehash checks, then I'll approve, as long as the hash seed can still be initialised with the env var PERL_HASH_SEED=0 - otherwise debugging becomes impossible.

        Which mirrors the OPs objections.

        So: significant consequences.

        So, why? What did Perl gain?


Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by marinersk (Priest) on Sep 30, 2013 at 10:21 UTC
    sort keys %hash

    or

    # Generation Code
    my $newkey = &generateRandomHashKey();

    # Store keys as generated, in original order
    push @Keylist, $newkey;
    $hash{$newkey} = &generateData();

    # ...

    # Testing Code
    foreach my $testkey (@Keylist) {
        # Perform test against $hash{$testkey};
    }
Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by ikegami (Patriarch) on Oct 01, 2013 at 13:47 UTC

    which means that even within the same process, calling keys() twice on the same hash will result in a different key order.

    That's not what it means at all. For a given hash, multiple calls to keys (and values) are still guaranteed to return the same order if there has been no change to the hash, and the order has always been subject to change after hash modifications.

    Difference one: The order is more likely to change on hash modification.

    Difference two: In a given interpreter, if you built two hashes using identical insert and delete steps, you used to get the same key orderings. This is not always the case now.
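    A small sketch of both points (hypothetical data; on any given run the two separate hashes may still happen to agree):

    ```perl
    use strict;
    use warnings;

    my %h = ( a => 1, b => 2, c => 3, d => 4 );

    # Point one: without modification, repeated keys() calls on the SAME
    # hash return the same order -- this is still guaranteed.
    my @first  = keys %h;
    my @second = keys %h;
    print "stable within one hash: ", ( "@first" eq "@second" ? "yes" : "no" ), "\n";

    # Point two: two hashes built with identical steps may now iterate
    # differently, because each hash can get its own perturbation.
    my %x = ( a => 1, b => 2, c => 3, d => 4 );
    my %y = ( a => 1, b => 2, c => 3, d => 4 );
    print "x order: @{[ keys %x ]}\n";
    print "y order: @{[ keys %y ]}\n";   # may or may not match %x's order
    ```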

        You are mistaken. That file never calls keys twice on the same hash. It calls keys on two different hashes (containing the same data). That code has been buggy since 5.8.1; the bug is just more likely to occur now.
Re: Truly randomized keys() in perl 5.17 - a challenge for testing?
by saintmike (Vicar) on Sep 30, 2013 at 16:35 UTC
    Let me clarify: The module referenced in the original posting has a function hash_serialize() that takes a reference to a hash like { a => "b", c => "d" } and turns it into something like "a=b&c=d" or "c=d&a=b", depending on what keys() returns underneath:

    https://github.com/mschilli/php-httpbuildquery-perl/blob/master/HTTPBuildQuery.pm#L63

    Now how am I supposed to test that the outcome is what I'd expect? With two hash elements, you could argue that I could generate all permutations of possible result strings and check whether one of them matches the one I got by running the function, but with a thousand entries this becomes unwieldy.

    The general problem is this: I have an unpredictable function (keys()) and its result gets processed, and the processed result needs to get checked against an expected outcome.

    Unless keys() can be switched to a test mode to produce predictable results (doesn't need to be sorted, just give me something predictable, like before perl 5.17), what are the options?

    By the way, some people have suggested using "sort keys" in my algorithm every time, but adding extra complexity (sort) at runtime just to make sure I can test perfectly fine code is just plain wrong.

      Now how am I supposed to test that the outcome is what I'd expect?

      Instead of hardcoding a serialized string and testing for an exact match, deserialize the results and test with cmp_deeply or something else which treats a hash as a set.

      Unless keys() can be switched to a test mode to produce predictable results (doesn't need to be sorted, just give me something predictable, like before perl 5.17), what are the options?
      To test hash_serialize():

      I would deserialize it to memory and compare with the original hash (using something based on "sort keys", or cmp_deeply).

      However, this violates some ideas of unit testing (while it's OK for integration testing).
      If that worries you, test with a 1-element hash and a 2-element hash (yes, both permutations), and stop there. Do the rest of the testing with deserialization.

      To test other code, which uses hash_serialize():

      mock (fake) hash_serialize().
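      A minimal sketch of that mocking approach. The package, function names, and URL here are hypothetical stand-ins, not the module's real API:

      ```perl
      use strict;
      use warnings;

      package MyApp;
      # Hypothetical production code: a serializer and a caller that uses it.
      sub hash_serialize { my ($h) = @_; join '&', map { "$_=$h->{$_}" } keys %$h }
      sub build_url      { my ($h) = @_; 'http://example.com/?' . hash_serialize($h) }

      package main;
      {
          no warnings 'redefine';
          # Replace the serializer with a deterministic fake for this scope.
          local *MyApp::hash_serialize = sub { 'a=b&c=d' };
          my $url = MyApp::build_url( { a => 'b', c => 'd' } );
          print "$url\n";   # http://example.com/?a=b&c=d -- always this order
      }
      ```

      Because local restores the real hash_serialize when the block exits, only the code around the serializer is exercised against a fixed, predictable string.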

      There is probably no general solution to the problem; each application requires its own approach to testing. For example, in your serialization example, you could sort the outcome:

      $result = join '&', sort split '&', $result;

      just for your testing. This seems simple enough that no new bugs are introduced, and it will not require a sort in your production code.
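      As a sketch, that normalization can be wrapped in a test helper (the strings below are illustrative):

      ```perl
      use strict;
      use warnings;
      use Test::More tests => 1;

      # Canonicalize a query string by sorting its '&'-separated pairs,
      # so key order no longer affects the comparison.
      sub normalize { join '&', sort split /&/, $_[0] }

      my $got      = 'c=d&a=b';   # whatever order the serializer produced
      my $expected = 'a=b&c=d';

      is( normalize($got), normalize($expected), 'same pairs, order ignored' );
      ```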

      While de-serialization could be a solution as well, one has to be very careful as it adds additional complexity. For example, if your serialization function returns "a=b&c=d&a=b" because of some bug, it could easily be "fixed" by a de-serialization procedure:

      my %hash = map { split '=', $_ } split '&', $result;

      Clearly, there are ways around this, like also testing the length of the string. But one has to spend extra time on each application, and additional complexity cannot be avoided.
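      A hedged sketch of that length check (variable names are illustrative): the duplicate pair vanishes on deserialization, but the string length gives it away.

      ```perl
      use strict;
      use warnings;

      # A buggy result that repeats a pair -- deserializing silently collapses it:
      my $buggy = 'a=b&c=d&a=b';
      my %hash  = map { split '=', $_, 2 } split '&', $buggy;
      # %hash is now ( a => 'b', c => 'd' ), indistinguishable from correct output.

      # The length check: re-serializing the deserialized data yields the
      # length a duplicate-free string must have, whatever the key order.
      my $expected_len = length join '&', map { "$_=$hash{$_}" } keys %hash;
      print "suspicious\n" if length($buggy) != $expected_len;
      ```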

      Unless keys() can be switched to a test mode to produce predictable results (doesn't need to be sorted, just give me something predictable, like before perl 5.17), what are the options?

      It hasn't been predictable since 5.8.1. It just happened to be the same most of the time.

        I meant "reproducible", not "predictable" here.

      How many keys are there? I'm sure it's not always the same, but what's the most you're likely to see?


      Dave

        At this point it's more of an academic exercise: I'm looking for a general solution to the problem of having an unpredictable function deeply embedded in my call hierarchy, and writing unit tests to confirm the result.

        Do other languages have this problem, except for very few functions like random()?

      Another possibility for your particular case is to sort the keys. The performance penalty is small, and generating a differently ordered URL each time is a bigger problem: it can cause caching issues and other trouble.
        You don't know that the performance penalty is small; who says the function isn't going to run a billion times a second?

        I find the mere thought of dealing with this by adding useless production code to fix a test problem very disturbing.

      Unless keys() can be switched to a test mode to produce predictable results ...
      Several hours before you wrote that, someone else told you what the relevant environment variables are.
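      For reference, on perl 5.18 and later the variables documented in perlrun can make hash order reproducible for a debugging or test run (a workaround, not a substitute for order-independent tests):

      ```shell
      # With a fixed seed and key perturbation disabled, repeated runs
      # iterate hashes in the same order (perl 5.18+, see perlrun).
      PERL_HASH_SEED=0 PERL_PERTURB_KEYS=0 perl -E 'my %h = (a..h); say join ",", keys %h'
      PERL_HASH_SEED=0 PERL_PERTURB_KEYS=0 perl -E 'my %h = (a..h); say join ",", keys %h'
      # Both invocations print the same key order.
      ```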