http://www.perlmonks.org?node_id=380259

BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to implement the following recursive formula in Perl:
''''''''''''''''''''''''''''''''''''n
P(r1,r2,...,rn) = Σ(rn-i+1 - rn-i)P(r1,r2,...,rn-i,rn-i+2,...,rn)
'''''''''''''''''''''''''''''''''''i=1

(if this doesn't render well, it is the second formula on this page),
where r0=0 and the recursive call to P supplies all of the original arguments except the (n‑i+1)th argument.
I've looked at Math::Sequence and Math::Series, but the examples given there are only for simple cases like
x(n+1) = 1 / (xn + 1)
and I'm a bit lost on how to implement this formula.

As sample data you can use:
r1 0,11 0,43 0,93 0,91 0,52
r2 0,07 0,31 0,78 0,12 0,18
r3 0,19 0,37 0,82 0,15 0,32

(which should give 5 P(r1,r2,...,rn)-values)


Update: I've tried to work out a litlle example of the algorithm myself (but correct me If I'm doing something wrong).
P(r1,r2,r3)= (r3 - r2)P(r1,r2) + (r2 - r1)P(r1,r3) + (r1 - r0)P(r2,r3)

Now the iteration comes into play, because to calculate the P-values in the rigth hand side of the formula, we use our original formula again.
P(r1,r2) = (r2 - r1)P(r1) + (r1 - r0)P(r2)
P(r1,r3) = (r3 - r1)P(r1) + (r1 - r0)P(r3)
P(r2,r3) = (r3 - r2)P(r2) + (r2 - r0)P(r3)

Update: the r-values are scores, and P is a probability:
so P(r1) is the chance of having score r1,
P(r1,r2) is the chance of both having score r1 and score r2,
and so on.

Replies are listed 'Best First'.
Re: recursive formula.
by herveus (Prior) on Aug 05, 2004 at 14:53 UTC
    Howdy!

    #!/usr/bin/perl use strict; use warnings; sub P { my @r = @_; my $n = $#r; return $r[1] if $n == 1; # inferred end to recursion my $sum = 0; for my $i (1..$n) { my @r_prime = @r; splice(@r_prime, $n-$i+1, 1); $sum += ($r[$n-$i+1] - $r[$n-$i]) * P(@r_prime); } return $sum; } while (<DATA>) { chomp; my @r = split(/,/, $_); unshift @r, 0; # sets r-naught print "P($_) = ", P(@r), "\n"; } __DATA__ 0.11, 0.07, 0.19 0.43, 0.31, 0.37 0.93, 0.78, 0.82 0.91, 0.12, 0.15 0.52, 0.18, 0.32
    gave the following output:
    P(0.11, 0.07, 0.19) = 0.001595 P(0.43, 0.31, 0.37) = 0.046225 P(0.93, 0.78, 0.82) = 0.548235 P(0.91, 0.12, 0.15) = 0.439894 P(0.52, 0.18, 0.32) = 0.010192

    The expansion looks similar to computing a determinant, but not quite...this is a straight-up translation of the formula into Perl. No trickery; no premature optimization.

    Update:

    I elected to prepend a zero to the array of r values to simplify life. That also means that the indices in the formula now correspond to the array indices in the Perl code, making the Perl read more nearly like the formula. I elected to construct the array for the recursion explicitly, by directly striking the omitted r value.

    This is meant to make it easier to audit the code to verify that it says what it is meant to say...

    yours,
    Michael
      Your results look sound to me.

      The above formula is used in extreme value statistics, where outliers are important. You wouldn't catch this when you just take the mean of the values:

      M() = mean

      M(0.11, 0.07, 0.19) = 0.12 M(0.43, 0.31, 0.37) = 0.37 M(0.93, 0.78, 0.82) = 0.84 M(0.91, 0.12, 0.15) = 0.39 M(0.52, 0.18, 0.32) = 0.34
      Now, with our formula: all low scores, also give a low P-value:
      P(0.11, 0.07, 0.19) = 0.001595and
      P(0.52, 0.18, 0.32) = 0.010192
      all high scores give a high P-value:
      P(0.93, 0.78, 0.82) = 0.548235
      both as expected. But now: low scores with a high outlier are also important for us,
      and indeed it also gives a high P-value:
      P(0.91, 0.12, 0.15) = 0.439894
      Only for P(0.43, 0.31, 0.37) = 0.046225 I would intitutivly have tougth of a lower value,
      but it's probably correct.

      Thanks for the effort you put into it.
Re: recursive formula.
by rsteinke (Scribe) on Aug 05, 2004 at 14:35 UTC

    Should be something like

    sub p { my @args = @_; # for clarity, more efficient not to copy # initializing this handles the zero argument case my $result = 0; $result += ($args[$_] - ($_ ? $args[$_-1] : 0)) * p(@args[0..($_-1)],@args[($_+1)..(@args-1)]) foreach(0..(@args-1)); return $result; }
    note: untested

    Ron Steinke rsteinke@w-link.net
Re: recursive formula.
by periapt (Hermit) on Aug 05, 2004 at 15:56 UTC
    This seems to work. Loading the data array for the next recursion seemed, to me, to be the tricky part. I did this by hand and the numbers match. Of course, even with higher math, I still can't add very well ;o)
    use strict; use warnings; use diagnostics; #main() { my @data = ([0.11,0.07,0.19], [0.43,0.31,0.37], [0.95,0.78,0.82], [0.91,0.12,0.15], [0.52,0.18,0.32]); foreach (@data){ my $pval = P($_); print "value = $pval\n"; } exit; } #end main() sub P{ my $data = shift; my $n = @{$data}; my $rslt1 = 0; my $rslt2 = 0; my $rslt = 0; return $$data[0] if($n == 1); #r(0) = 0 ==> r(1)-r(0) = r(1) #assume P(r(0)) = 1 unshift @{$data}, 0; for(my $i=1;$i <= $n; $i++){ my @nextdata = (); for (1..$n){ next if($_ == ($n-$i+1)); push @nextdata, $$data[$_]; } # split up rslt1 & 2 for clarity $rslt1 = ($$data[$n-$i+1] - $$data[$n-$i]); $rslt2 = P(\@nextdata); $rslt += $rslt1 * $rslt2; } return $rslt; } __DATA__ value = 0.001595 value = 0.046225 value = 0.549005 value = 0.439894 value = 0.010192

    PJ
    unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)
Re: recursive formula.
by QM (Parson) on Aug 05, 2004 at 17:30 UTC
    It seems you have some solutions, so I'm taking a different tack. Two points:

    1) You'd probably benefit more from showing us a little code with your question, before seeing the solutions.

    2) You may want to look at Memoize, like so:

    use Memoize; # make sure memoize happens before p is called BEGIN { memoize('p'); }
    Or, cache it yourself:
    { # closure for sub p my %seen; sub p { my $arg_string = join '^', @_; return $seen{$arg_string} if exists $seen{$arg_string}; # compute $result here as usual # ... return $seen{$arg_string} = $result; } # end sub p } # end closure for sub p
    I leave it to you to work out how to handle the empty parameter list :)

    Update: See BrowserUk's Re^2: recursive formula. below for reasons why this isn't a good idea.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Memoize won't help for this as for any given set of data, the function will never be called twice with the same set of values. (Apart from the trivial case where the function is called with a single value. And in that case, the single value is return as a special case to end the recursion.)

      For large datasets, Memoize would actually slow this function down by building a cache of values that would only ever be hit by chance. That chance, is if there exists two (or more) identical values in dataset.

      Given the parameters are lists of floating point values with a continuous domain, "identical values" is an ethereal thing.

      Also, the cache-key will be a function of both the number of values in the list, and their ordering. The size of the cache can quickly grow very large without ever providing payback.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        Memoize won't help for this as for any given set of data, the function will never be called twice with the same set of values.
        Ah, yes, I missed that.

        I initially thought that the function might be called multiple times with only slight changes to the data tables. [After all, why write a program if it's only going to be used once?] If it's only called once on 1 table of data, you are correct, there's little to gain. And if it's called with more than one data table, but they are largely independent, there's still little gain. It's only if the tables all have lots of overlap that it makes a difference. We could also debate whether recursion this shallow has much to gain, regardless of the undlerlying function or data. We'll have to let the OP comment on his intended use.

        All of this reminds me of TheDamian's Tachyonic Variables talk. I won't divulge the secret, but in any case the actual module he developed isn't available yet.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      I liked the following line from the Memoize perldoc:
      "This module exports exactly one function, memoize. The rest of the functions in this package are None of Your Business."

      I will look into the empty parameter list stuff.
Re: recursive formula.
by trammell (Priest) on Aug 05, 2004 at 14:41 UTC

    What is P()? One? Or is there some special recipe telling what to do when P() has one argument?

    A complete example calculation would help immensely.

      the r-values are scores, and P is a probability:
      so P(r1) is the chance of having score r1,
      P(r1,r2) is the chance of both having score r1 and score r2,
      and so on.

        Ah, so P() is 1 (the probablity of having anything happen).

        Ok, you'd have to ammend my code with

        return 1 if !@_:
        at the top or something

        Ron Steinke rsteinke@w-link.net
        So is it the case that:
        P(r1) = (r1 - r0) P() = r1
        Correct?
Re: recursive formula.
by BrowserUk (Patriarch) on Aug 05, 2004 at 15:41 UTC

    Update: THIS IS A WRONG IMPLEMENTATION. PLease don't++ it!! Thanks, buk.

    I'm getting different results from other people, so this is probably wrong, but then maybe not, so...

    #! perl -slw use strict; use List::Util qw[ reduce ]; $a = $a; ## Disable the dumbest warning in perl! my @samples = ( ## r1 r2 r3 [ qw[ 0.11 0.07 0.19 ] ], [ qw[ 0.43 0.31 0.37 ] ], [ qw[ 0.93 0.78 0.82 ] ], [ qw[ 0.91 0.12 0.15 ] ], [ qw[ 0.52 0.18 0.32 ] ], ); sub P{ return 1 if @_ == 1; my @r = @_; return reduce { $a + ( $r[ $b ] - $r[ $b - 1 ] ) * P( @r[ 0 .. ( $#r - $b ) ] + ) } 0 .. $#r; } my @results = map P( @$_ ), @samples; print "@results"; __END__ P:\test>380259 0.1216 0.0744 0.0624999999999999 0.6541 0.2556

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: recursive formula.
by jdalbec (Deacon) on Aug 06, 2004 at 01:27 UTC
    r10,110,430,930,910,52
    r20,070,310,780,120,18
    r30,190,370,820,150,32
    Shouldn't 0<r1<r2<r3? Otherwise you're multiplying by negative numbers in the formula.
Re: recursive formula.
by BrowserUk (Patriarch) on Aug 06, 2004 at 11:34 UTC

    I believe this to be a correct implementation.

    I've tried to retain the auditability of herveus' implementation by aliasing perl's working variables to names that marry with those used in the formula, whilst avoiding copying lots of arrays.

    For the small numbers in the samples it makes little difference, but by the time you get to a dozen or more, the double duplication and the spliceing in a recursive subroutine become a severe resource drain. If the numbers involved are anything like those typical for biogenetic work, they would become untenable quite quickly.

    Update: Use the non-aliasing version.

    The aliasing version consumes prodigous amount of memory,

    This version happily processes

    #! perl -slw use strict; use List::Util qw[ reduce ]; sub P; sub P{ ## Non-aliasing warn "@_\n"; return $_[ 0 ] if @_ == 1; return reduce { $a += ( $_[$#_ - $b + 1] - $_[$#_ - $b] ) * P2 @_[ 0 .. $b-1, $b+1 .. $#_ ]; } 0, 1 .. $#_; } =do not use this version This version looks pretty and should be logically identical to the abo +ve and it appears to work okay for short lists. But through what I think is a bug in multiplicity Perl the use of c<local> cause huge memory consumption: > 750 MB for an input list of 10 items!? sub P { ## Pretty, aliasing, voracious memory consumer! return $_[ 0 ] if @_ == 1; our( @r, $i, $a, $b, $sigma ); local *r = *_; local *i = *b; local *sigma = *a; my $n = $#r; return reduce{ $sigma += ($r[$n-$i+1] - $r[$n-$i]) * P @r[0 .. $i-1, $i+1 .. +$n]; } 0, 1 .. $n; } =cut my @samples = ( ## r1 r2 r3 [ qw[ 0.11 0.07 0.19 ] ], [ qw[ 0.43 0.31 0.37 ] ], [ qw[ 0.93 0.78 0.82 ] ], [ qw[ 0.91 0.12 0.15 ] ], [ qw[ 0.52 0.32 0.18 ] ], [ qw[ 1.0 1.0 1.0 ] ], [ qw[ 0.5 0.5 0.5 ] ], [ qw[ 0.0 0.0 0.0 ] ], [ qw[ 0.19 0.11 0.07 ] ], [ qw[ 0.43 0.37 0.31 ] ], [ qw[ 0.93 0.82 0.78 ] ], [ qw[ 0.91 0.15 0.12 ] ], ); print "P( @$_ ) = ", P( @$_ ) for @samples; __END__ P( 0.11 0.07 0.19 ) = 0.001232 P( 0.43 0.31 0.37 ) = 0.004644 P( 0.93 0.78 0.82 ) = 0.016833 P( 0.91 0.12 0.15 ) = 0.547183 P( 0.52 0.32 0.18 ) = 0.045552 P( 1.0 1.0 1.0 ) = 0 P( 0.5 0.5 0.5 ) = 0 P( 0.0 0.0 0.0 ) = 0 P( 0.19 0.11 0.07 ) = 0.002128 P( 0.43 0.37 0.31 ) = 0.004644 P( 0.93 0.82 0.78 ) = 0.016833 P( 0.91 0.15 0.12 ) = 0.547183

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: recursive formula.
by BrowserUk (Patriarch) on Aug 06, 2004 at 06:02 UTC

    What should the result be for P( 1, 1, 1 )?

    My interpretation is that each term will be: P( 1 - 1 )P( ... )

    Which as 1 - 1 will always be zero, the results of the multiplication is zero, so the overall results will be zero?

    If this is the case, then herveus' solution is incorrect as given input of ( 1, 1, 1 ), he returns 1.

    If I'm correct, of which there is no guarentee, then I think that the problem lies with his simple expedient of prefixing the list with a 0 to get around the "indices starting from 1" problem, which is causing the function to iterate and recurse once too often.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      You forget that r_0 is always zero, so
      P(1, 1, 1) = (1 - 1) P(1, 1) + (1 - 1) P(1, 1) + (1 - 0) P(1, 1) = P(1, 1) = P(1) = 1
      Ron Steinke rsteinke@w-link.net

        Thanks. That's the source of my misunderstanding. I saw the r0 = 0 reference, but I also saw r running 1 .. n and i running 1 to n.

        I missed that n-i = 0 whenever n=i. Stupid.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon