BrowserUk has asked for the
wisdom of the Perl Monks concerning the following question:
Given 2 arrays of numbers, I want an efficient algorithm for "cancelling out". (There is probably a proper name for this, but I don't know it).
Most easily described by example. Given:
@a = ( 10, 20, 33, 45, 60 );
@b = ( 2, 5, 10, 12, 16, 23, 45 );
I want to cancel values between the arrays to end up with:
@c = ( 5, 33 ); Please note that 5 does not appear in @a.
@d = ( 8, 23 ); And 8 does not appear in @b.
Note. This is a trivial example. Both the sizes of the arrays and the values will be much larger.
Update: This is not the elimination of duplicates or the intersection of arrays. Once any duplicates have been eliminated, it becomes a task of cancelling common factors between @a and @b until no further factors can be removed.
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.  Rule 1 has a caveat!  Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" may be good enough for now, and perfection may be unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
20050816: Retitled by Arunbear, as per Monastery guidelines. Original title: 'Algorithm wanted'.
Re: Algorithm for cancelling common factors between two lists of multiplicands by GrandFather (Cardinal) on Aug 08, 2005 at 19:41 UTC 
Will the arrays have equal numbers of values?
Will there be repeated values in the individual arrays (1, 1, 10, 20, ...)?
Could the arrays fit in (virtual) memory or are they disk based and too big?
Perl is Huffman encoded by design.
Will the arrays have equal numbers of values?
Unimportant.
Will there be repeated values in the individual arrays (1, 1, 10, 20, ...)?
Possibly.
Could the arrays fit in (virtual) memory or are they disk based and too big?
They will fit in memory.
Re: Algorithm for cancelling common factors between two lists of multiplicands by ikegami (Pope) on Aug 08, 2005 at 19:46 UTC 
$c{$_}++ foreach @a;
$c{$_}++ foreach @b;
@c = grep { $c{$_} != 2 } @a;
@d = grep { $c{$_} != 2 } @b;
my %a = map { $_ => 1 } @a;
my %b = map { $_ => 1 } @b;
@c = grep { !$b{$_} } @a;
@d = grep { !$a{$_} } @b;
use Math::BigInt ();
use Math::Big::Factors ();
my $a = Math::BigInt->new(1);
$a *= $_ foreach @a; # 17820000
my $b = Math::BigInt->new(1);
$b *= $_ foreach @b; # 19872000
my $gcd = Math::BigInt::bgcd($a, $b); # 108000
my $c = $a / $gcd; # 165
my $d = $b / $gcd; # 184
my @c = Math::Big::Factors::factors_wheel($c); # 3, 5, 11
my @d = Math::Big::Factors::factors_wheel($d); # 2, 2, 2, 23
Not quite the solution you asked for. Close enough?
Is this a factoring problem? Math::Factor and Math::Big::Factors may have useful routines. In particular Math::Factor has the subroutine "match" that sounds something like what you're trying to do, though as others have stated, it's still not clear what exactly that is.
xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
Re: Algorithm for cancelling common factors between two lists of multiplicands by salva (Monsignor) on Aug 08, 2005 at 19:55 UTC 
$i{$_}++ for (@a);
$i{$_}-- for (@b);
@c = grep { $i{$_} > 0 } @a;
@d = grep { $i{$_} < 0 } @b;
btw, are your sample solution arrays right?

Re: Algorithm for cancelling common factors between two lists of multiplicands by hv (Parson) on Aug 08, 2005 at 19:58 UTC 
Unless the individual values get very large, I think the easiest would be to factor them:
sub combine_factors {
my $list = shift;
my $factors = {};
for (@$list) {
for (factors($_)) {
++$factors->{$_};
}
}
$factors;
}
my $fa = combine_factors(\@a);
my $fb = combine_factors(\@b);
($fa->{$_} ||= 0) -= $fb->{$_} for keys %$fb;
my @c = map +($_) x $fa->{$_}, grep $fa->{$_} > 0, keys %$fa;
my @d = map +($_) x -$fa->{$_}, grep $fa->{$_} < 0, keys %$fa;
This has the disadvantage that it will leave a list of primes: @c == (3, 5, 11), @d == (2, 2, 2, 23). I'm not sure how you'd retain the original composites (or parts thereof), nor how you'd define which composites to keep, so I chose to ignore that aspect of it.
Hugo
{
    my @p; BEGIN { @p = (2, 3) }
    sub nextprime {
        my $p = $p[$#p];
        DIV: while (1) {
            $p += 2;
            my $pc = 0;
            while (1) {
                my $d = $p[$pc++];
                last if $d * $d > $p;
                next DIV unless $p % $d;
            }
            $p[@p] = $p;
            return $p;
        }
    }
    sub factors {
        my $n = shift;
        my ($pc, @result) = (0);
        while ($n > 1) {
            my $d = $p[$pc++] || nextprime();
            if ($d * $d > $n) {
                push @result, $n;
                $n = 1;
            } else {
                while ($n % $d == 0) {
                    $n /= $d;
                    push @result, $d;
                }
            }
        }
        \@result;
    }
}
The rest of the time I use Math::Big*, sometimes with Pari or (lately) GMP.
Hugo
Re: Algorithm for cancelling common factors between two lists of multiplicands by sgifford (Prior) on Aug 08, 2005 at 20:00 UTC 
Hi BrowserUK,
I'm not understanding what you mean by cancelling out. Can you explain a bit more thoroughly? I see you've updated the original node, but it's still not clear to me. If an item in @b is a factor of an item in @a, should both be eliminated? Or if they have a factor in common? Or...?
Also, will the arrays be sorted, as they are in your example?
Re: Algorithm for cancelling common factors between two lists of multiplicands by BrowserUk (Pope) on Aug 08, 2005 at 20:14 UTC 
The same example, worked through:
 @a = ( 10, 20, 33, 45, 60 );
 @b = ( 2, 5, 10, 12, 16, 23, 45 );
Eliminate common elements:
 @a = ( 20, 33, 60 );
 @b = ( 2, 5, 12, 16, 23 );
Cancel 2 with 20 to give 1 and 10 (discard the 1):
 @a = ( 10, 33, 60 );
 @b = ( 5, 12, 16, 23 );
Cancel 5 with 10 to give 1 and 2 (discard the 1):
 @a = ( 2, 33, 60 );
 @b = ( 12, 16, 23 );
Cancel 2 with 12 to give 1 and 6 (discard the 1):
 @a = ( 33, 60 );
 @b = ( 6, 16, 23 );
Cancel 6 with 60 to give 1 and 10 (discard the 1):
 @a = ( 33, 10 );
 @b = ( 16, 23 );
Cancel 16 with 10 to give 8 and 5:
 @a = ( 33, 5 );
 @b = ( 8, 23 );
Result: (Other reductions are possible!)
 @a = ( 33, 5 );
 @b = ( 8, 23 );
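For what it's worth, the manual process above can be sketched in a few lines of Perl. This is only an illustration (plain integer GCD, no Math::BigInt, every pair visited once), and the pairing it happens to choose differs from the hand-worked steps, so it reaches a different but equivalent reduction:

```perl
use strict;
use warnings;

# Euclid's algorithm on native integers.
sub gcd {
    my ( $x, $y ) = @_;
    ( $x, $y ) = ( $y, $x % $y ) while $y;
    return $x;
}

# Divide every pair drawn from the two lists by its GCD, then drop
# the 1's that the cancellations leave behind.
sub cancel {
    my ( $a, $b ) = @_;
    my @a = @$a;
    my @b = @$b;
    for my $i ( 0 .. $#a ) {
        for my $j ( 0 .. $#b ) {
            my $g = gcd( $a[$i], $b[$j] );
            $a[$i] /= $g;
            $b[$j] /= $g;
        }
    }
    return [ grep $_ != 1, @a ], [ grep $_ != 1, @b ];
}

my ( $c, $d ) = cancel( [ 10, 20, 33, 45, 60 ], [ 2, 5, 10, 12, 16, 23, 45 ] );
print "@$c\n";    # 11 15
print "@$d\n";    # 8 23
```

The residues differ from (33, 5) above, but the products match (165 and 184), which is the sense in which any full reduction is equivalent.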
Why not do this:
For each element in an array, factor it into primes (e.g. 10 factors as 2 * 5). Create a hash for the first array where the keys are the prime factors of its elements and the values are the number of times each factor has occurred. Repeat for the second array. "Subtract" the two hashes. Multiplying out each remaining key the number of times given by its value should give you the elements of the resulting array.
Here is my shot at pseudocode:
1. Pick the first element of array @a.
2. Calculate g = gcd(a[0], b[0]), then set a[0] = a[0] / g and b[0] = b[0] / g (just divide both elements by their GCD; the results are integers).
3. If either element becomes 1, pop it out of its list.
4. Repeat step 2 with a[0] against the remaining elements of @b, until all elements of @b have been scanned.
5. Take the second element of @a and repeat the process.
Does this work?
I shall post the code if I get a chance to implement it.
SK
PS: This reminds me of division in high school days ;)

PS: This reminds me of division in high school days ;)
I guess that's where the "cancelling out" phrase comes from :)
Maybe the confusion comes from the fact that most people younger than me never learned to cancel out their equations because they all used calculators from an early age.
They probably don't know what "casting out the nines" is either.
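For anyone who hasn't met it, "casting out the nines" relies on a number being congruent to its digit sum modulo 9, which gives a quick check on hand arithmetic. A minimal sketch:

```perl
use strict;
use warnings;

# Digit sum modulo 9 equals the number modulo 9.
sub cast_out_nines {
    my $n = shift;
    my $sum = 0;
    $sum += $_ for split //, $n;
    return $sum % 9;
}

# Checking 123 * 456 = 56088: the residues of the factors, multiplied
# together, must agree (mod 9) with the residue of the claimed product.
my $lhs = ( cast_out_nines(123) * cast_out_nines(456) ) % 9;
my $rhs = cast_out_nines(56088);
print $lhs == $rhs ? "checks out\n" : "definitely wrong\n";
```

(The check can only catch errors; agreement does not prove the product correct.)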
If the solution in my earlier node doesn't return what you want, this surely will:
use Math::BigInt ();
*gcd = \&Math::BigInt::bgcd;
sub embiggen {
    local $_ = @_ ? $_[0] : $_;
    return Math::BigInt->new($_);
}
my @a = ( 10, 20, 33, 45, 60 );
my @b = ( 2, 5, 10, 12, 16, 23, 45 );
# my %a = map { $_ => 1 } @a;
# my %b = map { $_ => 1 } @b;
#
# my @c = map embiggen,
#     grep { !$b{$_} }
#     @a;
#
# my @d = map embiggen,
#     map { Math::BigInt->new($_) }
#     grep { !$a{$_} }
#     @b;
my @c = @a;
my @d = @b;
foreach my $c (@c) {
    foreach my $d (@d) {
        my $gcd = gcd($c, $d);
        $c /= $gcd;
        $d /= $gcd;
    }
}
@c = grep { $_ != 1 } @c;
@d = grep { $_ != 1 } @d;
or this messier but potentially faster alternative:
use Math::BigInt ();
use Math::Big::Factors ();
*gcd = \&Math::BigInt::bgcd;
*factor = \&Math::Big::Factors::factors_wheel;
sub embiggen {
    local $_ = @_ ? $_[0] : $_;
    return Math::BigInt->new($_);
}
my @a = ( 10, 20, 33, 45, 60 );
my @b = ( 2, 5, 10, 12, 16, 23, 45 );
# my %a = map { $_ => 1 } @a;
# my %b = map { $_ => 1 } @b;
#
# my @c = map embiggen,
#     grep { !$b{$_} }
#     @a;
#
# my @d = map embiggen,
#     grep { !$a{$_} }
#     @b;
my @c = @a;
my @d = @b;
my %c_factors;
foreach my $c_idx (0 .. $#c) {
    my @c_factors = factor($c[$c_idx]);
    foreach my $c_factor (@c_factors) {
        push @{ $c_factors{$c_factor} }, $c_idx;
    }
}
foreach my $d (@d) {
    my @d_factors = factor($d);
    foreach my $d_factor (@d_factors) {
        next unless $c_factors{$d_factor};
        next unless @{ $c_factors{$d_factor} };
        my $c_idx = shift @{ $c_factors{$d_factor} };
        $c[$c_idx] /= $d_factor;
        $d /= $d_factor;
    }
}
@c = grep { $_ != 1 } @c;
@d = grep { $_ != 1 } @d;
Both untested.
Update: I wrote this when we were trying to figure out the exact output. Not relevant anymore.
ikegami,
Your first solution should work correctly. I have not read the second one yet. One thing to note when you test this code is that the result will not match exactly what BrowserUk had, because the cancellation process is somewhat subjective when done by humans as opposed to computers.
Take for example 15 * 3 * 12 / (9 * 17):
I could either do 5 * 12 / 17 or 15 * 4 / 17.
I guess the final solution should match the product of the elements in the array.
Another thing: you might get a small benefit if you check for elements == 1 inside the loop construct and pop them out. I am not sure if that is even possible.
BTW, I think you can add next if $c == 1 or $d == 1; to avoid computing GCDs on 1's.
SK
Re: Algorithm for cancelling common factors between two lists of multiplicands by techra (Pilgrim) on Aug 08, 2005 at 20:32 UTC 
I'm curious when I see a post like this as to why. Not only why do you have this data that you want to crunch in this way, but also, what do you get out of the crunched numbers? I understand that often (myself included) we're restricted by NDAs from actually talking about the application of our work, but something like this just makes me wonder what's going on, and why I never come across any problems where this becomes a question.


Can you provide a sample data set (i.e., matrix) that you consider to be "large"? Also, could you give me an idea of how long it takes to compute P_{cutoff} without using arithmetic optimizations? (I would like to try out a quick Haskell-based implementation I whipped up on a real data set.)
The original problem, framed in terms of arrays, is to reduce a fraction. The equation above is less general: it can be restated as reducing a fraction where the numerator and denominator are both products of factorials.
An example of the original problem is to reduce
(9 * 8 * 8 * 8 * 6 * 4) / (5 * 3 * 2)
An example of the factorial problem is to reduce
(9! * 8! * 8! * 8! * 6! * 4!) / (5! * 3! * 2!)
Do I understand correctly that you are interested in the factorial type of problem?

Re: Algorithm for cancelling common factors between two lists of multiplicands by tlm (Prior) on Aug 09, 2005 at 01:17 UTC 
Memoize the logs of the factorials.
BTW, I posted some FET code earlier that does just that; if there's something wrong with it, please let me know.
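A sketch of that idea (illustrative only, not tlm's posted code): cache ln(n!) as a running sum so each log-factorial is computed once, then combine terms by adding and subtracting logs instead of multiplying large integers. ln_term here is a hypothetical helper for the probability of a single 2x2 table:

```perl
use strict;
use warnings;

{
    my @lnfact = (0);    # ln(0!) = 0
    # Memoized ln(n!): extend the cached running sum only as far as needed.
    sub ln_factorial {
        my $n = shift;
        $lnfact[$_] = $lnfact[ $_ - 1 ] + log($_) for scalar(@lnfact) .. $n;
        return $lnfact[$n];
    }
}

# ln of the hypergeometric probability of one 2x2 table (a b / c d).
sub ln_term {
    my ( $a, $b, $c, $d ) = @_;
    my $n = $a + $b + $c + $d;
    return ln_factorial( $a + $b ) + ln_factorial( $c + $d )
         + ln_factorial( $a + $c ) + ln_factorial( $b + $d )
         - ln_factorial( $n )
         - ln_factorial( $a ) - ln_factorial( $b )
         - ln_factorial( $c ) - ln_factorial( $d );
}

printf "%.6f\n", exp( ln_term( 5, 0, 1, 4 ) );    # 0.023810
```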
print Fishers_Exact::fishers_exact( 989, 9400, 43300, 2400 );
I get these results:
P:\test>FET-tlm.pl
0.99999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
9999999999999999999999999999999999999999999999999[... 950 9s truncated]
99999999999999999999999999999999952665545273822897512768877675
1 trial of _default ( 1.385s total), 1.385s/trial
Now, given that your code needs Math::Pari, and the "simple" solution using that module produced the correct answer in 26 ms, I can't help wondering what all that messing around with Stirling and Gosper approximations bought you?
I also think that if you are going to require Exporter, you might as well let it do its job and skip all that autoload stuff. And I personally find that littering your code with a dozen BEGIN{} blocks makes it very hard to follow.
Re: Algorithm for cancelling common factors between two lists of multiplicands by QM (Vicar) on Aug 09, 2005 at 03:24 UTC 
There are better ways to compute factorial quotients without overflow or loss of precision. But I won't go into that here. Instead I'll address the problem you stated originally.
After examining all of the replies to date, it seems you want to reduce the quotient. Since you're dealing with large numbers (products of factorials), reducing it won't be so easy. Here's a schoolboy method that just relies on breaking the process down into steps, nothing fancy:
First in pseudocode:
1) Factor each term in the numerator and denominator into its prime factors. (Since these are factorials, this is pretty quick).
2) Collect all of the numerator factors into one array, and all of the denominator factors in another array.
3) Sort both arrays.
4) Do a modified merge sort to cancel like terms in both arrays. Keep any terms that don't have a match in the other array.
5) You can compute the quotient to a high degree of accuracy without overflowing or underflowing by judicious choices in the next term to grab.
Now let's see if I can do this in code (untested):
my @num = ( 3, 7, 11 );
my @den = ( 5, 12, 19 );

# get prime factors (you have to write this yourself :)
# it returns the combined list of prime factors for its arguments
my @num_fac = map get_prime_factors_list( 2 .. $_ ), @num;
my @den_fac = map get_prime_factors_list( 2 .. $_ ), @den;

@num_fac = sort { $a <=> $b } @num_fac;
@den_fac = sort { $a <=> $b } @den_fac;

# modified merge sort
my ( $n, $d ) = ( 0, 0 );
my ( @num_fin, @den_fin );
while ( ( $n <= $#num_fac ) and ( $d <= $#den_fac ) )
{
    if ( $num_fac[$n] == $den_fac[$d] )
    {
        # drop both
        $n++; $d++;
    }
    elsif ( $num_fac[$n] < $den_fac[$d] )
    {
        push @num_fin, $num_fac[$n]; $n++;
    }
    else
    {
        push @den_fin, $den_fac[$d]; $d++;
    }
}
# any remaining factors are not shared
push @num_fin, @num_fac[ $n .. $#num_fac ];
push @den_fin, @den_fac[ $d .. $#den_fac ];

# edge cases
@num_fin = (1) unless @num_fin;
@den_fin = (1) unless @den_fin;

# carefully multiply and divide, keeping $q near 1
my $q = shift @num_fin;
while ( @num_fin and @den_fin )
{
    if ( $q >= 1 )
    {
        $q = $q / ( shift @den_fin );
    }
    if ( $q < 1 )
    {
        $q = $q * ( shift @num_fin );
    }
}
# only one of these will be true:
$q = $q * ( shift @num_fin ) while @num_fin;
$q = $q / ( shift @den_fin ) while @den_fin;
You can probably improve this considerably. For instance, instead of keeping each prime factor (including duplicates), only keep a count.
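That last refinement might look like the following sketch, shown on the thread's original arrays rather than on factorials, with a throwaway trial-division factoriser standing in for the real one:

```perl
use strict;
use warnings;

# Trial-division factoriser, purely for illustration.
sub prime_factors {
    my $n = shift;
    my @f;
    for ( my $d = 2 ; $d * $d <= $n ; $d++ ) {
        while ( $n % $d == 0 ) {
            push @f, $d;
            $n /= $d;
        }
    }
    push @f, $n if $n > 1;
    return @f;
}

# One hash of prime => net exponent: add for the numerator,
# subtract for the denominator, and only the imbalance survives.
my %count;
$count{$_}++ for map prime_factors($_), 10, 20, 33, 45, 60;        # numerator
$count{$_}-- for map prime_factors($_), 2, 5, 10, 12, 16, 23, 45;  # denominator

my @num = map { ($_) x  $count{$_} } grep { $count{$_} > 0 } sort { $a <=> $b } keys %count;
my @den = map { ($_) x -$count{$_} } grep { $count{$_} < 0 } sort { $a <=> $b } keys %count;
print "@num\n";    # 3 5 11
print "@den\n";    # 2 2 2 23
```

This matches the fully reduced form quoted earlier in the thread: 165 = 3 * 5 * 11 over 184 = 2 * 2 * 2 * 23.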
QM

Quantum Mechanics: The dreams stuff is made of

Re: Algorithm for cancelling common factors between two lists of multiplicands by Anonymous Monk on Aug 09, 2005 at 12:55 UTC 
Seems to me that this problem requires you to factor numbers. Factoring numbers is considered to be a really hard problem: it's what makes public key encryption secure. No one knows a fast way to factor numbers (well, maybe someone does, but then they have kept it a secret), so if you are going to have big numbers, the factoring will be the bottleneck.

There's big and big. Factoring values of a few thousand or even a thousand million isn't so hard.
Cryptography uses numbers much, much bigger than these.
Re: Algorithm for cancelling common factors between two lists of multiplicands by tlm (Prior) on Aug 09, 2005 at 13:53 UTC 
Here's a another solution. It takes two array refs, makes copies of them, and cancels out common terms, removing all 1's.
sub list_reduce {
    my ( $top, $bot ) = map [ grep $_ != 1, @$_ ], @_;
    die "Bad data\n" if ( grep $_ < 1, @$top, @$bot );
  OUTER:
    for my $i ( reverse 0 .. $#$top ) {
        for my $j ( reverse 0 .. $#$bot ) {
            ( $top->[ $i ], $bot->[ $j ] ) = reduce( $top->[ $i ], $bot->[ $j ] );
            splice @$bot, $j, 1 if $bot->[ $j ] == 1;
            if ( $top->[ $i ] == 1 ) {
                splice @$top, $i, 1;
                next OUTER;
            }
        }
    }
    return ( $top, $bot );
}
sub reduce {
    my ( $top, $bottom ) = @_;
    my $gcd = Math::BigInt::bgcd( $top, $bottom );
    return ( $top / $gcd, $bottom / $gcd );
}
Re: Algorithm for cancelling common factors between two lists of multiplicands by Limbic~Region (Chancellor) on Aug 09, 2005 at 16:26 UTC 
BrowserUk,
The order and pairings you choose to cancel out affect the result. This may prevent optimizations that seek to cancel out terms in a different order (such as largest terms first). For instance, in your example it took six cancellations to get the result. Here is another possibility that doesn't result in a prime factorization but only takes three pairings:
My algorithm doesn't require prime factorization (just GCD).
It doesn't produce the output you wanted, so I didn't bother trying to optimize it WRT choosing pairings. With the lack of optimizations it is likely not that efficient, but it was fun.

It doesn't produce the output you wanted ...
As I mentioned in 482012, the exact reduction isn't important. The results of your 3 manual steps (33,5)(2,4,23) and my 5 manual steps (33,5)(8,23) are completely equivalent.
Adding a dependency upon Inline::C is heavier than Math::Pari, though GCD could be written in Perl, of course.
Your algorithm certainly works. However, using GCD requires multiple O(m*n) passes over the datasets. hv's prime factors approach eliminates the multiple passes and seems, on the small sets I've tried, to get the results more quickly, even with pure Perl factoring code.
Coming up with an efficient, pure Perl implementation of prime factorisation (limiting myself to integers 0 .. 2**32) is my current fun challenge :)

BrowserUk,
I would be curious to see the results if you ever benchmark the GCD approach against the prime factor approach on the problem size you were mentioning.
The GCD approach is O(m*n) * O(GCD algo), but I am not sure how the complexity works out for the prime factorization approach.
You need to calculate the prime factors of each of the product values, right? I.e. if you have to do (100*101*102*103*104) / (57*58*59) then you need the factors of each of those values, correct?
The GCD approach will loop (5 * 3) * O(GCD algo) times. But the prime factor approach needs to calculate factors for 8 values, and that could potentially loop a lot more per value (up to the value itself), so its complexity could exceed the GCD approach's.
I ran the GCD approach and it took about 9 minutes (not a formal benchmark) just to cancel out terms. I don't have any math package to do big multiplications/divisions. I even tried to reduce the inner loop by switching @a and @b, but did not see a huge impact.
Also, I am confused about whether the Haskell code actually computed the example (40000+!), because the execution speed was just unbelievable. Did it lose any precision when you tested it? I guess the approach was similar in cancelling out terms. It makes me wonder why the Perl cancelling takes so much longer than the Haskell implementation.
I will be happy to hear how your final implementation looks.
Thanks,
SK
BrowserUk,
I misinterpreted "Other reductions possible" as "Further reductions possible", given the example and where the commentary appeared. Admittedly, I haven't read the entire thread, and I was under the incorrect impression that for any given input there was only one acceptable output. I was also under the incorrect impression that you were not interested in prime factorization, but in what the result would be if the problem of "cancelling out" had been attacked by hand.
I believe the approach I presented could be made more efficient, though I doubt it would be able to compete with a good prime factorization algorithm. I am familiar with several, some of which may even have been used in this thread, such as the quadratic sieve, but I don't currently have the time to write a Perl implementation. IIRC, numbers smaller than a certain size are best handled by one algorithm and numbers larger than another threshold by a different one, with those falling in the middle being a toss-up.
If I do come up with anything I will be sure to let you know. Thanks for the fun problem.

Coming up with an efficient, pure perl implementation of prime factorisation (limiting myself to integers 0 .. 2**32 ) is my current fun challenge :)
You might have a look at Math::Big::Factors, which I found useful, and is pure Perl. (Note that the docs have a typo, and mention factor_wheel where the function is actually named factors_wheel.)
QM



 Re: Algorithm for cancelling common factors between two lists of multiplicands by tlm (Prior) on Aug 10, 2005 at 02:32 UTC 
From looking at the code you have posted I get the impression that you misunderstand the Fisher's Exact Test. What you are computing in your various implementations is only one term of the sum that makes up the FET statistic.
The definition of the FET statistic I'm referring to here is entirely standard (though there is some variability as to whether the one-sided or two-sided version of this sum is used).
The test asks: if we assume the null hypothesis (independence between the two categorical variables), what is the probability of getting a sampling that is at least as far from the expected value (under the null hypothesis) as the actual data being tested? Unless one of the cells is empty, this formulation entails computing a sum of terms. If you look at the code I posted, I don't compute a single ratio of factorials, but a whole sum of them. The value of this sum can be considerably larger than the value of a single term; in fact it can be 1, depending on the hypothesis being tested and the data at hand.
Update: Here's the relevant passage from the MathWorld page whose formula you cited elsewhere (my emphasis):
To compute the P-value of the test, the tables must then be ordered by some criterion that measures dependence, and those tables that represent equal or greater deviation from independence than the observed table are the ones whose probabilities are added together.
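A sketch of the summation being described (illustrative; this is not tlm's posted implementation, and it orders tables by probability rather than by a separate dependence measure, which is one common choice): enumerate every 2x2 table with the observed margins and add up the probabilities of those no more probable than the observed table.

```perl
use strict;
use warnings;

my @lnf = (0);    # memoized ln(n!)
sub lnfact {
    my $n = shift;
    $lnf[$_] = $lnf[ $_ - 1 ] + log($_) for scalar(@lnf) .. $n;
    return $lnf[$n];
}

# ln probability of the table with top-left cell $x and fixed margins.
sub ln_table {
    my ( $x, $r1, $r2, $c1, $n ) = @_;
    return lnfact($r1) + lnfact($r2) + lnfact($c1) + lnfact( $n - $c1 )
         - lnfact($n) - lnfact($x) - lnfact( $r1 - $x )
         - lnfact( $c1 - $x ) - lnfact( $r2 - $c1 + $x );
}

# Two-sided p-value: sum every table no more probable than the observed one.
sub fishers_exact_2sided {
    my ( $a, $b, $c, $d ) = @_;
    my ( $r1, $r2, $c1 ) = ( $a + $b, $c + $d, $a + $c );
    my $n      = $r1 + $r2;
    my $ln_obs = ln_table( $a, $r1, $r2, $c1, $n );
    my $lo     = $c1 > $r2 ? $c1 - $r2 : 0;
    my $hi     = $r1 < $c1 ? $r1 : $c1;
    my $p      = 0;
    for my $x ( $lo .. $hi ) {
        my $ln = ln_table( $x, $r1, $r2, $c1, $n );
        $p += exp($ln) if $ln <= $ln_obs + 1e-7;    # tolerance for float noise
    }
    return $p;
}

printf "%.6f\n", fishers_exact_2sided( 5, 0, 1, 4 );    # 0.047619
```

For the table (5 0 / 1 4) the one-sided value is 1/42 = 0.0238095 as quoted below; the two-sided sum also picks up the equally improbable opposite-corner table, giving 2/42.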

I am completely aware that all the other terms need to be calculated in order to determine whether the value is significant. I even mentioned this in 482056.
And I guess I could be misusing your code, but when I feed it these values 5 0 4 1, it produces exactly the same output as my "misguided" code. Funnily enough, for those values, it produces the same output as tmoertel's (apparently equally misguided) code.
Is it only when given the larger set of test values 989 9400 43300 2400 that your special code does something extraordinarily clever that I'm too stupid to understand?
$T->start( '[5 0 1 4]' );
print Fishers_Exact::fishers_exact( 5, 0, 1, 4 );
$T->stop( '[5 0 1 4]' );
$T->report;
$T->start( '[989 9400 43300 2400]' );
print Fishers_Exact::fishers_exact( 989, 9400, 43300, 2400 );
$T->stop( '[989 9400 43300 2400]' );
$T->report;
__END__
P:\test>FET-tlm.pl
0.02380952380952380952380952380
1 trial of [5 0 1 4] ( 917us total), 917us/trial
0.9999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
999999999999999999999999999999999999999999999999999999999999999999999
99999999999999999999999999999999952665545273822897512768877675
1 trial of [5 0 1 4] ( 917us total), 917us/trial
1 trial of [989 9400 43300 2400] ( 1.353s total), 1.353s/trial

 Re: Algorithm for cancelling common factors between two lists of multiplicands by QM (Vicar) on Aug 11, 2005 at 23:33 UTC 
I have toyed with this more, and have a pure Perl solution that runs your example in about 7 minutes on my PC; all but 5 seconds of that is the factoring by Math::Big::Factors. I have an idea about prefactoring up to some limit, which I'll report on later.
You might note that 2**32 is not a practical limit, unless most of the terms cancel out. For instance, if you had to compute (2**32)!/(2**31)!, that would be 2**31 terms in the numerator.
Can you comment on the practical limits of your application?
For your example, which is essentially:
10389! 4570! 44289! 11800!
------------------------------
56089! 989! 9400! 43300! 2400!
and comparing results:
8.070604647867604097576877675243109941805e-7030
8.070604647867604097576877675243109941729476950993498765160880e-7030
(Note that MBF actually moved the decimal point and gave the exponent as "-7069"; I've normalized it to match the other result.)
Here's my code:
Update: I ran some variations, here's the time comparisons:
Order = 5, Accuracy = 40, time = 7 min
Order = 6, Accuracy = 60, time = 35 min (!!)
no factoring, Accuracy = 40, time = 28 sec
no factoring, Accuracy = 60, time = 30 sec
no factoring, Accuracy = 40, eval, time = 14.0 sec
no factoring, Accuracy = 60, eval, time = 14.7 sec
The "eval" version changes this code:
foreach my $t ( keys %{$factors} )
{
    my $bft = $ONE->copy()->bmul($t)->bpow($factors->{$t});
    $q->bmul($bft);
}
to this:
my $s = '$q';
foreach my $t ( keys %{$factors} )
{
    next if ( $factors->{$t} == 0 );
    if ( $factors->{$t} > 0 )
    {
        $s .= "->bmul($t)" x $factors->{$t};
    }
    else
    {
        $s .= "->bdiv($t)" x abs( $factors->{$t} );
    }
}
eval $s;
die "Error on eval: $@" if $@;
I tried it with ->bpow(), but that's 2X slower. (I'd offer that ->bpow() should only be used for non-integer powers.)
Seems eval has its uses!
Update 2: I made a typo above on bdiv; I should have used abs. It's been changed, and the timings updated (roughly doubled).
QM


[ 3:24:33.57] P:\test>fet3 989 9400 43300 2400
[989 9400 43300 2400]
0.80706046478676263e-7029
1 trial of _default ( 8.164s total), 8.164s/trial
The precision is limited to that of a standard double, but I played some games with scaling during the final calculation in order to achieve a range beyond a double's 1e±308.
Theoretically, it will now handle some pretty big numbers, but the limiting factor is memory, as I am generating some pretty big lists. I should be able to alleviate that substantially by implementing Limbic~Region's idea of cancelling the factorials against each other before generating the product lists, rather than building all the lists and then cancelling them as I am now. I.e.
10389! 4570! 44289! 11800!
------------------------------
56089! 989! 9400! 43300! 2400!
becomes
44289! 11800! 10389! 4570!
------------------------------
56089! 43300! 9400! 2400! 989!
which reduces to
(10389..9401) (4570..2401)
----------------------------------
(56089..44290) (43300..11801) 989!
where (m..n) indicates the product of the series
In this example that reduces the number of factorisations from 183,406 to 47,444, which ought to make a big difference both in speed and in memory consumed, and these savings would be even greater for larger observations in all but the most pathological of cases.
I've had some trouble getting that to work properly though. Encoding the heuristics that seem obvious by inspection of one or two cases, into a generalised algorithm for all inputs, is eluding me for the moment.
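For what it's worth, here is one way the pairing heuristic might be encoded (a sketch under the assumption that pairing the sorted argument lists largest-with-largest is adequate, which is essentially what sk argues elsewhere in this thread): each pair m!/n! collapses to the range product (n+1 .. m) on whichever side has the larger argument.

```perl
use strict;
use warnings;

# Take two lists of factorial arguments (numerator and denominator),
# sort both descending, pad the shorter with 0 (0! == 1), and pair
# them off.  Each pair survives only as a range product.
sub cancel_factorials {
    my ( $num, $den ) = @_;
    my @n = sort { $b <=> $a } @$num;
    my @d = sort { $b <=> $a } @$den;
    push @n, 0 while @n < @d;
    push @d, 0 while @d < @n;
    my ( @num_ranges, @den_ranges );
    for my $i ( 0 .. $#n ) {
        if    ( $n[$i] > $d[$i] ) { push @num_ranges, [ $d[$i] + 1, $n[$i] ] }
        elsif ( $d[$i] > $n[$i] ) { push @den_ranges, [ $n[$i] + 1, $d[$i] ] }
        # equal arguments cancel completely
    }
    return ( \@num_ranges, \@den_ranges );
}

my ( $top, $bot ) = cancel_factorials(
    [ 10389, 4570, 44289, 11800 ],
    [ 56089, 989, 9400, 43300, 2400 ],
);
print join( ' ', map "($$_[0]..$$_[1])", @$top ), "\n";   # (9401..10389) (2401..4570)
print join( ' ', map "($$_[0]..$$_[1])", @$bot ), "\n";   # (44290..56089) (11801..43300) (1..989)
```

On the example above this reproduces the ranges shown, with 989! appearing as (1..989).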
You might note that 2**32 is not a practical limit, unless most of the terms cancel out.
I agree that 2**32 is very much an upper bound. The main reason for stating it is that it pretty much precludes the benefit of many of the more sophisticated factorisation methods I've read about. It also places a practical upper bound on the number of primes I need.
The code (large because of the embedded primes list)

I am not sure if you had a chance to look at my node Re^7: Algorithm for cancelling common factors between two lists of multiplicands
I guess my approach is the same as Limbic~Region's, as I was using subtraction of the lower factorial terms from the higher ones. I copied the input list from one of your previous examples, and I have 45700 instead of 4570, but aside from that the implementation should be easy to understand.
I guess if you sort the numerator and denominator and subtract them, you should be all set. I have not proved this formally, but here is a stab at a simple proof suggesting that sorting and subtracting should work...
Let the fraction be
X! Y! Z!
--------
a! b! c!
WLOG, let's assume that X > Y > Z and a > b > c. Let's also assume that b > Y and X > a (weaker assumption: Z > c)...
NOTE: if we divide X!/a! we will have (X-a) elements.
To Prove: (X-a) + (Z-c) + (b-Y) is the shortest list one can find, or in other words
(X-a) + (Z-c) + (b-Y) <= (X-p) + (Z-q) + (r-Y) for any p,q,r in permutation(a,b,c)
Proof:
From the above equation, since b > Y and Z > c, r should be equal to either a or b. If r = b then the solution is trivial.
If r = a then we get
(X-a) + (Z-c) + (b-Y) ?<= (X-b) + (Z-c) + (a-Y)
cancelling terms
-a - c + b ?<= -b - c + a
-a + b ?<= -b + a ====> YES
since a > b, we see that r = a does not give the smallest list, so r = b.
Similarly we can also show that (X-a) + (Z-c) + (b-Y) <= (X-a) + (Y-c) + (b-Z)
I don't think this is a rigorous proof of the method, but I sort of feel sorting and subtracting should give us what we need...
cheers
SK
PS: I think there will be 47448 elements and not 47444 as you suggested, as you need to count the first element too...



 




 
Here's my take on it. Highlights: use Eratosthenes' sieve to determine a factor or primality for integers up to the max required; work out when an additional numerator or denominator kicks in (@switch), and pass over the numbers in reverse order, so each need be visited only once; build up two BigInts for numerator and denominator and just switch to BigFloats to divide the two at the end.
The single division at the end takes over 2/3 of the total time, so I'd expect using GMP or Pari for the maths would cut it greatly:
zen% perl t1
807060464786760409757687767524310994172947695099349876516088e-7089
sieve: 0 wallclock secs ( 0.09 usr + 0.02 sys = 0.11 CPU)
build: 4 wallclock secs ( 4.38 usr + 0.00 sys = 4.38 CPU)
divide: 9 wallclock secs ( 8.90 usr + 0.00 sys = 8.90 CPU)
total: 13 wallclock secs (13.37 usr + 0.02 sys = 13.39 CPU)
zen%
Things that made negligible difference: dropping accuracy to 40 (!); initialising my $bii = Math::BigInt->new($i) before the multiply loops; the $r->bsstr call.
I noticed that the numerator consists of 855 prime factors comprising 755 distinct primes; the denominator has 2308 of which 2306 are distinct. No surprise, then, that bpow() isn't a win.
Here's the code:
Hugo