Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

May I be bitten by floating point arithmetic in the following restricted case?

by rubasov (Friar)
on Feb 23, 2010 at 17:40 UTC ( #824902=perlquestion: print w/ replies, xml ) Need Help??
rubasov has asked for the wisdom of the Perl Monks concerning the following question:

Today I've written a little program that classifies and counts files based on the integer part of log2($file_size). So I would get these classes:
0 <= SIZE < 1 1 <= SIZE < 2 2 <= SIZE < 4 4 <= SIZE < 8 and so on...
For that I would like to write this:
$size_class = $size ? int( log($size) / log(2) ) : -1; # red light on, possible error caused by floating point arithmetic
Now let's forget about the special case of $size == 0. My input can only be positive integers (because these are file sizes), so I suppose the error coming from floating point arithmetic can bite me only in the case when $size is a power of 2 (in the expression $size = 2**$x the exponent is a natural number). Is this true?

But let's take it further. I've written a little test to check whether I will be bitten for all the powers of 2 that fits in my machine's floating point represantation:

bash $ diff <( perl -E 'say int( log(2**$_) / log(2) ) for 1 .. 2**10' + ) <( seq 1 $(( 2**10 )) )
The output is this:
1024c1024 < inf --- > 1024
So there is not a single power of 2 (under 2**1024) that would be misclassified at least on my machine.

So my question is: Is there a theoretical explanation (for this restricted case of floating point aritmetic) why it can't be problematic or is it just by accident?

Please shed some light on this issue.

ps: Sorry, I know this is probably not Perl specific, but my question arose because there is no log2() function in Perl. (For example with a properly setup lookup table I could be able to do this classification with integer arithmetic only, but that would be a pain instead of this little snippet.)

Comment on May I be bitten by floating point arithmetic in the following restricted case?
Select or Download Code
Re: May I be bitten by floating point arithmetic in the following restricted case?
by ikegami (Pope) on Feb 23, 2010 at 18:30 UTC

    As you know, a tiny error log($size) / log(2) could lead to a big error in the result. Perl Differences in the underlying C libraries and hardware can result in those tiny differences on other systems even if they don't exist on your own.

    I suppose the error coming from floating point arithmetic can bite me only in the case when $size is a power of 2

    You also need to check one less than powers of two for errors rounding up.

    249 = 562949953421312

    $ perl -E'say int( log(562949953421309) / log(2) )' 48 $ perl -E'say int( log(562949953421310) / log(2) )' 49 XXX $ perl -E'say int( log(562949953421311) / log(2) )' 49 XXX $ perl -E'say int( log(562949953421312) / log(2) )' 49 $ perl -E'say int( log(562949953421313) / log(2) )' 49

    (Problem found by accident)

Re: May I be bitten by floating point arithmetic in the following restricted case?
by kennethk (Monsignor) on Feb 23, 2010 at 18:46 UTC

    There are two limiting cases for failure here. Assuming LIMIT is a power of two, either SIZE = LIMIT could be binned too low or SIZE = LIMIT-1 could be binned too high. For the latter to happen, 1/LIMIT would have to be smaller than your number of significant figures (you can show this with a Taylor series). Obviously where that falls depends on your machine and build, but on mine that starts failing at 2**48 and will always precede the transition to floating point representation for $size.

    for (1 .. 2**10) { my $size = 2**$_; print "Fail $_ high\n" if (log((2**$_ - 1)) / log(2) ) == $_; }

    Regarding what I perceive as the main question, note that you are getting fortunate. If you run the code printf("$_: %.16e\n", log(2**$_)/log(2)) for (1 .. 2**10); (assuming native doubles) you will see that the result of your division is not exactly correct - your last digit is high in a large fraction of the offerings. This is a function of the logarithm as implemented. If instead of the above, I explore the powers of 3, all inexact cases are low, not high. This implies to me that the internal representation of log(2) is ever so slightly lower than the true value.

    The better question is if you should care about this inaccuracy. If you are just gathering file statistics, inaccuracy in the absolute position of the boundary should not significantly skew your results assuming a smooth file size p.d.f. By the time your algorithm fails, you are nearly to a point where you can no longer identify file sizes with integers. However, if it is mission critical to be literally correct, you could use a hash to build a look-up table.

Re: May I be bitten by floating point arithmetic in the following restricted case?
by BrowserUk (Pope) on Feb 23, 2010 at 19:03 UTC

    Take a look at POSIX::frexp(). It seems to do the right thing all the way to the 53-bit limit of doubles. It gets a little flakey thereafter:

    printf "-1:%2d %30f:%d +1:%2d\n", (frexp( $_ - 1 ))[1], $_, (frexp( $_ ))[1], (frexp( $_ +1 ))[1] for map 2**$_, 1 .. 64;; -1: 1 2.000000:2 +1: 2 -1: 2 4.000000:3 +1: 3 -1: 3 8.000000:4 +1: 4 -1: 4 16.000000:5 +1: 5 -1: 5 32.000000:6 +1: 6 -1: 6 64.000000:7 +1: 7 -1: 7 128.000000:8 +1: 8 -1: 8 256.000000:9 +1: 9 -1: 9 512.000000:10 +1:10 -1:10 1024.000000:11 +1:11 -1:11 2048.000000:12 +1:12 -1:12 4096.000000:13 +1:13 -1:13 8192.000000:14 +1:14 -1:14 16384.000000:15 +1:15 -1:15 32768.000000:16 +1:16 -1:16 65536.000000:17 +1:17 -1:17 131072.000000:18 +1:18 -1:18 262144.000000:19 +1:19 -1:19 524288.000000:20 +1:20 -1:20 1048576.000000:21 +1:21 -1:21 2097152.000000:22 +1:22 -1:22 4194304.000000:23 +1:23 -1:23 8388608.000000:24 +1:24 -1:24 16777216.000000:25 +1:25 -1:25 33554432.000000:26 +1:26 -1:26 67108864.000000:27 +1:27 -1:27 134217728.000000:28 +1:28 -1:28 268435456.000000:29 +1:29 -1:29 536870912.000000:30 +1:30 -1:30 1073741824.000000:31 +1:31 -1:31 2147483648.000000:32 +1:32 -1:32 4294967296.000000:33 +1:33 -1:33 8589934592.000000:34 +1:34 -1:34 17179869184.000000:35 +1:35 -1:35 34359738368.000000:36 +1:36 -1:36 68719476736.000000:37 +1:37 -1:37 137438953472.000000:38 +1:38 -1:38 274877906944.000000:39 +1:39 -1:39 549755813888.000000:40 +1:40 -1:40 1099511627776.000000:41 +1:41 -1:41 2199023255552.000000:42 +1:42 -1:42 4398046511104.000000:43 +1:43 -1:43 8796093022208.000000:44 +1:44 -1:44 17592186044416.000000:45 +1:45 -1:45 35184372088832.000000:46 +1:46 -1:46 70368744177664.000000:47 +1:47 -1:47 140737488355328.000000:48 +1:48 -1:48 281474976710656.000000:49 +1:49 -1:49 562949953421312.000000:50 +1:50 -1:50 1125899906842624.000000:51 +1:51 -1:51 2251799813685248.000000:52 +1:52 -1:52 4503599627370496.000000:53 +1:53 -1:53 9007199254740992.000000:54 +1:54 -1:55 18014398509481984.000000:55 +1:55 -1:56 36028797018963968.000000:56 +1:56 -1:57 72057594037927936.000000:57 +1:57 -1:58 144115188075855870.000000:58 +1:58 -1:59 288230376151711740.000000:59 +1:59 -1:60 576460752303423490.000000:60 +1:60 -1:61 1152921504606847000.000000:61 +1:61 -1:62 2305843009213694000.000000:62 +1:62 -1:63 4611686018427387900.000000:63 +1:63 -1:64 9223372036854775800.000000:64 +1:64 -1:65 18446744073709552000.000000:65 +1:65

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: May I be bitten by floating point arithmetic in the following restricted case?
by rubasov (Friar) on Feb 24, 2010 at 00:34 UTC

    Thank you all for your responses. I'm still digesting some of your notes, but I do see now that the scope of my little investigation was too narrow, that's why I wasn't able to find failing cases. (And as I haven't found those I was wondering how that can be...)

    Fortunately my program would be used for statistical purposes, and nowhere near to the petabyte (2**50) file size range, so it'll avoid the real problems.

    It is still interesting what would I do if I should care about the inaccuracy. Probably something like this sub I've written since my OP:

    use bigint; { my @lookup = (1); # 2, 4, 8, ... sub int_log2 { my $n = shift; return if $n < 1; my $exp = 0; # this still could be optimized by starting at # $exp = min( $#lookup, $some_educated_guess_using_floating_point_ +approximation ) while ( $lookup[ $exp++ ] <= $n ) { $lookup[$exp] = 2 * $lookup[ $exp - 1 ] if not exists $lookup[$e +xp]; } return $exp - 2; } } for ( 1 .. 2**10 ) { print int_log2(2**$_-1), ' ', int_log2(2**$_), ' ', int_log2(2**$_+1 +), "\n"; }

    This completes under 100sec on my several year old laptop.

Re: May I be bitten by floating point arithmetic in the following restricted case?
by NiJo (Friar) on Feb 24, 2010 at 18:24 UTC
    There is another facet of your problem: Risk. What would happen in your application and the work flow it is used in if your calculation is 10 % off? A skewed file size statistic won't kill anyone. You get a gentle slap from your software testers and/or users and you adopt the documentation to the skewed implementation.

    This is not to say that your question is always unimportant...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://824902]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2014-08-30 10:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (292 votes), past polls