http://www.perlmonks.org?node_id=755329

Nkuvu has asked for the wisdom of the Perl Monks concerning the following question:

OK, I'm stumped. I have a script that looks in two Excel files. The first file lists a bunch of changes to variable ranges. The second file contains the actual ranges. The script is just to verify that the range in the "range_file" has been updated properly according to the values given in the "change_file".

For all but seven of the variables, the comparison is working just fine. With these seven, though, a difference is reported -- except that the values are identical. Copy and pasted between the two spreadsheets, even. My best guess is that the inequality is due to floating point representation, even though I have about three or four dozen other floating point variables that match just fine. But I'm really not sure that that's the problem.

Note that the script is working fine for what I want to do, and this isn't a critical issue -- I can easily overlook the false positive difference report on a handful of variables. But it's more of a curiosity on "why is it doing this?" than trying to get my script to report these as the same.

The output below is cut and paste, except with variable names sanitized. The values listed are the exact values being compared. The perl snippet after that is a direct copy and paste. It's been far too long since I've had to convert to floating point binary representation. But if the "change_file" and "range_file" both have the same value in the cell, shouldn't the binary representation be the same? None of the values are the results of any sort of calculations, just straight assignment from numeric values in cells to hash entries.

variable_a has different maximum value: Change list: 101.945484 (Intermediate Variables) Range file : 101.945484 (Intermediate Variables) variable_b has different minimum value: Change list: -80.5498336616321 (Intermediate Variables) Range file : -80.5498336616321 (Intermediate Variables) variable_c has different maximum value: Change list: 45142.388774415 (Intermediate Variables) Range file : 45142.388774415 (Intermediate Variables) variable_d has different minimum value: Change list: -159.905028384 (Intermediate Variables) Range file : -159.905028384 (Intermediate Variables) variable_e has different minimum value: Change list: -159.905028384 (Intermediate Variables) Range file : -159.905028384 (Intermediate Variables) variable_f has different maximum value: Change list: 181.019658002611 (Intermediate Variables) Range file : 181.019658002611 (Intermediate Variables) variable_g has different maximum value: Change list: 1.57615706403821 (Intermediate Variables) Range file : 1.57615706403821 (Intermediate Variables)

and the perl snippet that produces this output:
# Testing. Script reporting differences in variables that have # no visual difference. Try assigning to temporary intermediates. # (no difference, obviously, or I wouldn't be writing this node) my $change_min = $changes{$variable}{'min'}; my $change_max = $changes{$variable}{'max'}; my $range_min = $range_file{$variable}{'min'}; my $range_max = $range_file{$variable}{'max'}; if ($change_min != $range_min) { print "$variable has different minimum value:\n"; print " Change list: $changes{$variable}{min} ($changes{$variable +}{sheet})\n"; print " Range file : $range_file{$variable}{min} ($range_file{$va +riable}{sheet})\n"; } if ($change_max != $range_max) { print "$variable has different maximum value:\n"; print " Change list: $changes{$variable}{max} ($changes{$variable +}{sheet})\n"; print " Range file : $range_file{$variable}{max} ($range_file{$va +riable}{sheet})\n"; }

If any more information is needed, please let me know.

Update: Thanks to all for the ideas on how to narrow down the problem. As it turns out, the fault lies with Excel (I know, big surprise). I updated the cell contents for the values that differed -- merely by editing the cell and pressing enter, not actually changing the value in any way. Once re-saved, my script no longer reported any differences. But I'm still glad I asked the question, picked up a few ideas on how to compare floats.

Replies are listed 'Best First'.
Re: False positive on inequality comparison
by kennethk (Abbot) on Apr 03, 2009 at 19:29 UTC

    The flow of the code does seem straight forward enough to suggest that it might by a representation issue. Rather than printing the values, you might consider printing the differences - if they are not literally identical, this should reveal what's getting lost (though I thought there was no truncation in an unformatted print statement). If this does prove to be the issue, you can avoid the false positives by setting a tolerance, e.g.:

    my $tolerance = 1.0e-16; # Change this value based on observation if ($tolerance*abs($range_min + $change_min) < abs($range_min - $chang +e_min)) { print "$variable has different minimum value:\n"; print " Change list: $changes{$variable}{min} ($changes{$variable +}{sheet})\n"; print " Range file : $range_file{$variable}{min} ($range_file{$va +riable}{sheet})\n"; my $difference = $range_min - $change_min print " Difference : $difference\n"; }

      I appreciate the idea, but... still confused. I've changed the code to the following:

      my $tolerance = 1.0e-16; # Change this value based on observation if ($tolerance*abs($range_min + $change_min) < abs($range_min - $chang +e_min)) { print "$variable has different minimum value:\n"; print " Change list: $changes{$variable}{min} ($changes{$variable +}{sheet})\n"; print " Range file : $range_file{$variable}{min} ($range_file{$va +riable}{sheet})\n"; my $difference = $range_min - $change_min; printf " Difference : %0.24f\n", $difference; } if ($tolerance*abs($range_max + $change_max) < abs($range_max - $chang +e_max)) { print "$variable has different maximum value:\n"; print " Change list: $changes{$variable}{max} ($changes{$variable +}{sheet})\n"; print " Range file : $range_file{$variable}{max} ($range_file{$va +riable}{sheet})\n"; my $difference = $range_min - $change_min; printf " Difference : %0.24f\n", $difference; }

      Only difference to what you posted is the printf rather than print. And that was because the difference being shown with print was zero. This change in the code eliminated differences for four of the seven variables, but three remain:

      variable_c has different maximum value: Change list: 45142.388774415 (Intermediate Variables) Range file : 45142.388774415 (Intermediate Variables) Difference : -0.000000000000014210854715 variable_f has different maximum value: Change list: 181.019658002611 (Intermediate Variables) Range file : 181.019658002611 (Intermediate Variables) Difference : 0.000000000000000000000000 variable_g has different maximum value: Change list: 1.57615706403821 (Intermediate Variables) Range file : 1.57615706403821 (Intermediate Variables) Difference : 0.000000000000000000000000

      Variable C makes some sort of sense. I mean, I'm still under the impression that if I assign 45142.388774415 directly to two different variables, they should both have the same floating point representation. But they don't, obviously.

      But what the hork is up with the other two? It seems to me that if the tolerance is 1e-16, there should be some difference shown in 24 places in the floating point number.

        Can I suggest that you dump each value to a file along with some natural identifier (line number or cell ID), using:

        print unpack 'b64', pack 'd', $float;; 1011110110001011101100101110110101100111111000110111101000000011

        Do this immediately after reading it into your script to ensure the minimum possibility of 'internal influences'. Then, when you encounter unexpected/unexplainable differences, dump both values in the same form, along with their original identifiers.

        I can't offer an explanation for how or why what you are seeing could happen, but I have seen some weirdness in Perl's internal handling of floats in the past. Dumping bit patterns is surest way I know of inspecting what is actually there.

        That said, you can still be fooled, as IEEE allow for the possibility of denormalised representations. That is, the bit patterns can be different but represent (exactly) the same values. Kind of like binary equivalent of 10e-6 vs. 1e-5.

        Anyway, if you can track what differences are being perceived, then you may be able to backtrack to where they arise. And if you can do that, an explanation of why may be obvious or forthcoming.

        Good luck.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: False positive on inequality comparison
by toolic (Bishop) on Apr 03, 2009 at 20:08 UTC

      I'm familiar with potential problems with floating point inequality when the values are the results of some computation. But if I assign 1.2345 to $var_a and then assign 1.2345 to $var_b, then perform the equality check, I would expect them to be equal. Direct assignments should not cause different floating point representations, as far as I know.

      Once that value is used in any sort of arithmetic, all bets are off.

      As an additional test, I wrote a short script (which I should have done earlier, but it still doesn't really explain anything):

      #!/usr/bin/perl use strict; use warnings; my %values; while (my $line = <DATA>) { if ($line =~ /(\w)_(r|ch)ange\s(\S+)/) { $values{$1}{$2."ange"} = $3; } } for my $ key(sort keys %values) { if ($values{$key}{'change'} != $values{$key}{'range'}) { print "Not "; } print "Equal:\nChange: $values{$key}{change}\nRange : $values{$key +}{range}\n\n"; } __DATA__ var_a_change 101.945484 var_a_range 101.945484 var_b_change -80.5498336616321 var_b_range -80.5498336616321 var_c_change 45142.388774415 var_c_range 45142.388774415 var_d_change -159.905028384 var_d_range -159.905028384 var_e_change -159.905028384 var_e_range -159.905028384 var_f_change 181.019658002611 var_f_range 181.019658002611 var_g_change 1.57615706403821 var_g_range 1.57615706403821

      Every single comparison there passes the equality check. And the only thing this script is doing differently than my actual script is reading the values from Excel. That may be key, I don't know.

        I believe you've hit the nail on the thumb right here.

        And the only thing this script is doing differently than my actual script is reading the values from Excel.

        Depending on how the values were added to Excel (or calculated there), the representations may be different even though they look the same.

        This is another instance of a more subtle problem. You can't truly trust any data that comes from outside your program. The real world always seems to be messier than it needs to be.<shrug/>

        G. Wade
Re: False positive on inequality comparison
by tilly (Archbishop) on Apr 04, 2009 at 05:01 UTC
    A quick and dirty but surprisingly useful trick to use when dealing with floating point issues is to use string comparison operators rather than numeric ones. Try it and see if it works.

      That's something I would never have thought of. But it worked perfectly. String comparison found no difference -- even when a numeric comparison did find a difference. Thanks for the idea.

        Using string comparison on floats is dangerous, unless you don't care if the floats are actually different!

        As this shows in the last line, Perl's default stringification for floats will only show you 6 decimal places of precision, anything beyond that accuracy and you are just pretending the are the same.

        $first = unpack 'd', pack 'b64', '1001110001010011100010111001000010000101000001010110011000000010';; $second = unpack 'd', pack 'b64', '1010110001010011100010111001000010000101000001010110011000000010';; printf "%f %f :> %17.17f\n", $first, $second, $first - $second;; 181.019658 181.019658 :> 0.00000000000011369

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: False positive on inequality comparison
by andk (Novice) on Apr 05, 2009 at 08:51 UTC
    Given that you hit a bug it would be cool if you could narrow it down, eg. try the same program on a different architecture or shorten it to a oneliner. Floating point bugs are often architecture specific and so the number of developers to discover them is smaller than for other bugs. Once you get it down to a oneliner it can become part of a test in either the Excel module you're using or perl itself and thus be extinguished. Don't give up on it!