Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re^15: Divide array of integers into most similar value halves (Updated)

by BrowserUk (Pope)
on Sep 04, 2008 at 22:28 UTC ( #709140=note: print w/ replies, xml ) Need Help??


in reply to Re^14: Divide array of integers into most similar value halves
in thread Divide array of integers into most similar value halves

Update: I know know why the problem didn't show up in my version of the script. As I mentioned, I'm using a custom shuffle routine, which does an in-place shuffle. The List::Util version of shuffle is not in-place, hence the last line of the for loop should be @in = shuffle @in; for use with List::Util, not shuffle @in; as for my version. I've update the code below.

I apologise. It is a legit bug. I'm amazed it hasn't shown up before and confused as to why it doesn't show up in my lightly modified version of the same script you posted, but...this fixes it. Till the next one....

sub partition { my( $limit, $aRef ) = @_; my @in = sort{ $a <=> $b } @$aRef; my $target = sum( @in ) >> 1; my( $best, @best ) = 9e99; my $soFar = 0; my @half; for( 1 .. $limit ) { # printf "%6d : [@half] [@in] [@best]\n", abs( $soFar - $target + ); $soFar += $in[ 0 ], push @half, shift @in while @in > 1 and $soFar < $target; return( \@half, \@in ) if $soFar == $target; my $diff = abs( $soFar - $target ); ( $best, @best ) = ( $diff, @half ) if $diff < $best; $soFar -= $half[ 0 ], push @in, shift @half while @half > 1 and $soFar > $target; return( \@half, \@in ) if $soFar == $target; $diff = abs( $soFar - $target ); ( $best, @best ) = ( $diff, @half ) if $diff < $best; srand( $_ ); @in = shuffle @in; ## shuffle @in; ## Only works for an inplace shuffle! } my %seen; $seen{ $_ }++ for @best; return \@best, [ grep{ --$seen{ $_ } < 0 } @$aRef ]; }

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^15: Divide array of integers into most similar value halves (Updated)
Select or Download Code
Re^16: Divide array of integers into most similar value halves
by Pepe (Sexton) on Sep 04, 2008 at 22:46 UTC
    Thank you so much BrowserUk!

    It's working so far in a relatively small number of arrays (1000).
    I already found one script that's giving me no error in tens of thousands on arrays. You can get it from the conversation, posted by 'tilly'.

    For now I think I'll stick to that one. But after two days of trying everything I found here, I believe I must try yours together with tilly's and see if both of you have reached the same nirvana.
    Just for the heck of being able to thank everyone that helped, and post a couple of scripts that work undeniably well.

    My most sincere thanks.

    Pepe
        If you have N random integers of size 0-M, worst case memory use and running time should both scale with my algorithm like O(N*N*M). In some cases memory use might scale a little better than that.

        About a reproducible random shuffle, why not just call srand with some specific value and then shuffle using the fact that rand is now a known, deterministic function? For testing purposes you can combine different combinations of srand and datasets.

      Great! I'm glad you found a solution to your liking.

      However, you have a decision to make. It's a trade off between consistently optimal results and consistently predictable timings. tilly's exhaustive search suffers from similar problems to my own attempt at such. Whilst it will eventually find an optimum solution, sometimes quite quickly, quite frequently it will spend an inordinate amount of time finding it. And an inordinate amount of memory too!

      Below, you will see a comparison of tilly's (latest) and my (latest) code run against the same random data sets with 1e1 thru 1e6 elements. Notice how with 1e3 and 1e5 elements, tilly's code was timed out after a full five minutes where the semi random approach never takes more than ~10 seconds. For the 1e3 case, tilly's code uses over 1GB of ram. For the 1e5 case, it came perilously close to exhausting my 1.5GB of physical memory.

      c:\test>708290-b -T=300 Testing buk with 10 random values in the range 0 .. 10000 Differ +ence:= 26; took 0.002259 seconds Testing tilly with 10 random value sin the range 0 .. 10000 Differ +ence:= 2; took 0.000975 seconds Testing buk with 100 random values in the range 0 .. 10000 Differ +ence:= 1; took 0.039580 seconds Testing tilly with 100 random values in the range 0 .. 10000 Differ +ence:= 1; took 8.175000 seconds Testing buk with 1000 random values in the range 0 .. 10000 Differ +ence:= 1; took 0.070234 seconds Testing tilly with 1000 random values in the range 0 .. 10000 + ******* timed out after 300 seconds Testing buk with 10000 random values in the range 0 .. 10000 Differ +ence:= 0; took 0.203125 seconds Testing tilly with 10000 random values in the range 0 .. 10000 Differ +ence:= 0; took 0.161583 seconds Testing buk with 100000 random values in the range 0 .. 10000 Diffe +rence:= 1; took 3.093750 seconds Testing tilly with 100000 random values in the range 0 .. 10000 + ******* timed out after 300 seconds Testing buk with 1000000 random values in the range 0 .. 10000 Diff +erence:= 0; took 10.578125 seconds Testing tilly with 1000000 random values in the range 0 .. 10000 Diff +erence:= 0; took 21.593750 seconds

      The memory consumption may not be a problem if your data sets are always smallish, but occasionally even with a relatively small dataset of 100 elements it will take longer that 60 seconds to find a solution, and yet only take 18 seconds for 1 million!

      c:\test>708290-b -MAX=1e1 Testing buk with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 8; took 0.002118 seconds Testing tilly with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 8; took 0.001475 seconds Testing buk with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.039413 seconds Testing tilly with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 7.955964 seconds Testing buk with 1000 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.088507 seconds Testing tilly with 1000 random valuesin the range 0 .. 1e1 Terminati +ng on signal SIGINT(2) c:\test>708290-b -MAX=1e1 Testing buk with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.000477 seconds Testing tilly with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.001215 seconds Testing buk with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.046773 seconds Testing tilly with 100 random valuesin the range 0 .. 1e1 c:\test>708290-b -MAX=1e3 Testing buk with 10 random valuesin the range 0 .. 1e3 Differenc +e:= 7; took 0.001884 seconds Testing tilly with 10 random valuesin the range 0 .. 1e3 Differenc +e:= 7; took 0.001118 seconds Testing buk with 100 random valuesin the range 0 .. 1e3 Differenc +e:= 1; took 0.039995 seconds Testing tilly with 100 random valuesin the range 0 .. 1e3 + ******* timed out after 60 seconds Testing buk with 1000 random valuesin the range 0 .. 1e3 Differenc +e:= 1; took 0.078125 seconds Testing tilly with 1000 random valuesin the range 0 .. 1e3 + ******* timed out after 60 seconds Testing buk with 10000 random valuesin the range 0 .. 1e3 Differenc +e:= 0; took 0.203125 seconds Testing tilly with 10000 random valuesin the range 0 .. 1e3 Differenc +e:= 0; took 0.183086 seconds Testing buk with 100000 random valuesin the range 0 .. 1e3 Differen +ce:= 1; took 2.879401 seconds Testing tilly with 100000 random valuesin the range 0 .. 1e3 + ******* timed out after 60 seconds Testing buk with 1000000 random valuesin the range 0 .. 1e3 Differe +nce:= 0; took 8.453125 seconds Testing tilly with 1000000 random valuesin the range 0 .. 1e3 Differe +nce:= 0; took 17.265625 seconds

      I appreciate your sentiments elsewhere that tilly's code is reliable--that's why I chose it to verify my own against:)--but being exhaustive and deterministic makes it easier to test and debug. The non-determinism of a random (genetic) algorithm makes it much harder to test. That's why I've spent the last two days attempting to get a reproducible random shuffle.

      But, for NP-hard problems, Genetic Algorithms have one huge advantage. You can specify how long you are prepared to ever wait (in terms of either iterations or best solution within a time limit) for a solution. That's much harder to arrange with an exhaustive solution.

      There is another possibility. You can combine the two approaches. You start off using the exhaustive approach and either through a time limit, or (and you'd have to consult tilly for details), through some heuristic guesstimate of how long it is likely to take for the given data set, switch to using the genetic algorithm. Perhaps passing the best found so far from the exhaustive attempt into the genetic to see if it can improve it within some specified number of iterations.

      In any case, whichever approach meets your needs, thank you for presenting an extremely interesting (if at times, frustrating:) problem.

      That's only the fourth or fifth GA I've ever attempted, and they are frustratingly hard to verify. I learnt one big thing from your feedback. Using small datasets (3 or 4 elements) is a far better testing strategy that large arrays of random values! Three nested loops cycling through all the possible permutations of 3 x 0 .. 99, would be trivial and fast to verify through brute force, but (as your feedback showed), stands to highlight bugs very quickly.

      Obvious now I seen it in action. But hey! We're never too old to learn something new.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://709140]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (13)
As of 2014-07-14 12:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (259 votes), past polls