http://www.perlmonks.org?node_id=885823

Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

I have a little script that is running my computer out of memory, and I was wondering if there might be a way around this problem. I have been thinking about it for the last few days and haven't come up with anything. Someone a few days ago mentioned something called forking. I can guess at what that is, but for some reason I get the feeling that forking isn't well liked, but I may be wrong there.

#!/usr/bin/perl -l
use strict;
use warnings;
use feature qw(say);
use List::Util qw(sum max);
use lib 'lib';
use Base::Nifty qw(commify);

say "How many children are in the first generation?";
my $generation = <>;
say "How many generations do you want to generate?";
my $generations = <>;
chomp($generation,$generations);

my %generations = (
  1 => $generation,
);

for (1..$generations) {
  my @generation;
  for (1..$generation) {
    my $children = int(rand(6));
    push @generation, $children;
  }
  $generation = sum(@generation);
  $generations{$_ + 1} = $generation;
  @generation = ();
}

my $max_length_generation = length(commify(max(keys %generations)));
my $max_length_children   = length(commify(max(values %generations)));

for (sort {$a <=> $b} keys %generations) {
  printf "%${max_length_generation}s: %${max_length_children}s\n",
    $_, commify($generations{$_});
}

When I run that with 5 initial children and 20 generations, I run out of memory. 5 and 19 doesn't. I have a feeling that my $children = int(rand(6)); is what is killing my memory when the amount of children in a generation is in the billions. By the way, as with all my scripts, this is just a bit of fun, so I am not in any great rush.

Have a cookie and a very nice day!
Lady Aleena

Replies are listed 'Best First'.
Re: A script with a loop is running my computer Out of memory
by davido (Cardinal) on Feb 02, 2011 at 19:51 UTC

    The history of mathematics and (more recently) computer science has been driven by an almost singular goal: finding ways to calculate things with less work. Your calculations are simple, but the work you're doing is hard and significant. It would be better to look again at the model and see if there's a mathematical solution (i.e., a better algorithm) to accomplish the same thing with less work.

    Your current solution is O(c^n), i.e., exponential. I suspect that there is an O(1) solution to the question of population growth rates. What that means is that your current solution requires exponentially more work for each additional generation, while a non-iterative solution probably exists.

    I recall we talked in the CB about this a little. I may have been the person who mentioned forking, but it was in jest. My point was: "If you think this solution crashes your computer, you ought to see what it does if the children are forks." (Which is akin to fork-bombing.) That was intended as a muse, not as a suggestion. The concept behind a fork bomb is that each child spawns a new process, which in turn spawns a few new processes, and so on until the operating system and hardware simply can't accommodate them all. That situation is actually similar to what's happening here, just without forks: you're spawning so many values, and holding onto so many of them, that the system cannot accommodate them all. And to what goal? To solve a problem that can be solved with a mathematical equation that doesn't require repetition.

    I also briefly touched on the concept of Big-O notation. You mentioned that it's an unfamiliar topic, so we might as well discuss it a little here. Think of Big-O notation as a measure of resources consumed. The resource may be computational work (time), complexity, or memory, for starters. Your code is fairly intensive in both time and memory, but we'll focus on how much work is being done.

    I'm going to briefly discuss Big-O, in terms that are oversimplified. Common Big-O notation is written in the format O(x), where 'x' is a formula describing how much work is done in terms of 'n' (n being the number of items being worked on). O(n) means the amount of additional work per additional item stays constant as the number of items grows.

    Other common units include:

    • O(1) => No increased work as n grows.
    • O(log n) => work increases logarithmically as n grows. (ie, work increases, but at a rate slower than the growth of n.)
    • O(n) => Work increases linearly as n grows.
    • O(n^2) => Work increases quadratically.
    • O(n^3) => Work increases cubically.
    • O(c^n) (where c is constant) => Work grows exponentially as n increases.
    • O(n!) => Work grows as a factorial of the size of n.

    So if you look at your program, each new generation requires exponentially more work than the previous generation. This is a pretty 'bad situation' to be in from a programming standpoint. You have to start saying to yourself, can this be reduced to a statistical model that is characterized by a simple formula? If that's the case, your work drops to O(1).
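    To make the O(1) idea concrete, here is a minimal sketch (my own, not from the thread), using the fact that int(rand(6)) averages 2.5 children per person: the expected size of any generation can be computed directly, with no loops at all.

    # O(1) sketch: expected size of generation $g, assuming each person
    # averages 2.5 children (the mean of int(rand(6))).  Constant work,
    # no matter how large $g gets.
    my $initial  = 5;
    my $g        = 20;
    my $expected = $initial * 2.5 ** ($g - 1);
    print "$expected\n";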


    Dave

      There is indeed an analytic approach.

      If a generation contains $n people, then what the script does is add up $n independent random variables, each uniformly distributed over the values 0 through 5.

      For big $n (let's say $n >= 20) the distribution of that sum is very well approximated by a normal distribution (this is the central limit theorem at work).

      Normally distributed random variables can be easily and efficiently implemented with the Box–Muller transform, which turns two uniformly distributed random numbers (what rand returns) into two normally distributed random numbers.
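      A minimal sketch of the transform (the sub name is my own; core Perl has no built-in normal generator):

      use strict;
      use warnings;
      use constant PI => 4 * atan2(1, 1);

      # Box-Muller: two independent uniform(0,1) draws in, two
      # independent standard-normal draws out.
      sub box_muller {
          my $u1 = rand();
          $u1 = rand() while $u1 == 0;    # avoid log(0)
          my $u2 = rand();
          my $r     = sqrt(-2 * log($u1));
          my $theta = 2 * PI * $u2;
          return ($r * cos($theta), $r * sin($theta));
      }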

      If this were a project of mine (and I needed an efficient approach), I'd follow the approach of individual random numbers per population item as long as the population is less than 20, and for higher numbers I'd use the normal distribution approximation.

      If there's interest, I can also show how the mean and variance of the random variable are calculated, but I'm tired right now and I'd rather do it tomorrow :-)

      Update: mean and variance isn't as complicated as I thought.

      The mean is just (0 + 1 + 2 + 3 + 4 + 5) / 6 = 2.5, and the variance is 1/6 * ((0-2.5)**2 + (1 - 2.5)**2 + (2-2.5)**2 + (3-2.5)**2 + (4-2.5)**2 + (5-2.5)**2) = 35/12 = 2.91666666666667. With these parameters you can just use the formulas in the various Wikipedia entries.
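      Putting the pieces together, a sketch under the assumptions above (box_muller is the sub from the earlier snippet): the sum of $n draws has mean 2.5 * $n and variance (35/12) * $n, so the next generation's size can be drawn in a single step.

      # Draw the size of the next generation in O(1), given $n parents.
      # Per-child mean and variance are 2.5 and 35/12, as computed above.
      my $n    = 1_000_000;              # current generation's size
      my ($z)  = box_muller();
      my $next = int( 2.5 * $n + sqrt( (35 / 12) * $n ) * $z + 0.5 );
      $next    = 0 if $next < 0;         # a population can't be negative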

      • O(log n) => work increases logarithmically as n grows. (ie, work increases, but at a rate slower than the growth of n.)
      That's a pretty vague description of log(n)...

      An exact description of log(n) is: for every multiplication of n with a constant factor (like doubling, or 10-fold increase), a fixed amount of extra work gets added.
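      Binary search is the classic illustration of that: doubling the array adds exactly one more halving step. A small sketch:

      # O(log n): each iteration halves the remaining range, so doubling
      # the array costs just one extra iteration.
      sub binary_search {
          my ($target, $sorted) = @_;    # $sorted is a ref to a sorted array
          my ($lo, $hi) = (0, $#$sorted);
          while ($lo <= $hi) {
              my $mid = int(($lo + $hi) / 2);
              return $mid if $sorted->[$mid] == $target;
              if ($sorted->[$mid] < $target) { $lo = $mid + 1 }
              else                           { $hi = $mid - 1 }
          }
          return -1;    # not found
      }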

        That is better. We might also change "a fixed amount of extra work gets added" to "a single unit of extra work gets added." Even that is probably more vague than just coming to an understanding of what logarithmic graphs look like as 'n' increases. Unfortunately, that's sort of difficult to post here. I do like trying to improve the description though. Wikipedia has a great writeup on Big-O notation. Why didn't we have that when I was in college? ;)


        Dave

Re: A script with a loop is running my computer Out of memory
by SuicideJunkie (Vicar) on Feb 02, 2011 at 19:52 UTC

    Billions of scalars would certainly blow out your memory, since you only have a few billion bytes of RAM and each Perl scalar carries overhead beyond its value.

    If you're going to merely sum over all the children in @generation, why not simply do $generations{$outerLoopIndex + 1} += int(rand(6)); directly, and avoid burning all that memory on very temporary values?
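    One possible reading of that suggestion, with variable names adapted from the original script:

    # Accumulate each child count straight into the hash -- no
    # throwaway @generation array holding billions of scalars.
    for my $gen (1 .. $generations) {
        $generations{$gen + 1} = 0;
        $generations{$gen + 1} += int(rand(6)) for 1 .. $generation;
        $generation = $generations{$gen + 1};
    }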

Re: A script with a loop is running my computer Out of memory
by davido (Cardinal) on Feb 02, 2011 at 20:16 UTC

    By the way, one thing I noticed and forgot to mention: you are making the assumption that each individual has up to 5 offspring. That means each couple must have up to 10, or each woman up to 10... or that both men and women can bear up to 5 offspring each.

    Is this a worm population growth chart?

    ;)


    Dave

      The assumption is that a certain number of children are born into each generation, those children have children of their own, and so on. Gender is not taken into account, and the children's spouses are not tallied here. And no, this is not a worm population growth chart. I am just trying to get an idea of how many generations it would take to reach about 250 billion people born, give or take a couple billion. I haven't taken interbreeding into account either; after about a five-generation gap some consider it safe to interbreed, though some consider it safe sooner than that. Doing that, however, would add even more complexity that I haven't tried to think about adding.

      Have a cookie and a very nice day!
      Lady Aleena

        Look, if you know the generational growth rate (i.e., for each person in generation N there are 2.5 people in generation N+1, or whatever), then just start with your 250 billion and work backwards. (2.5 is the average of int(rand(6)).)

        $n = 250_000_000_000;
        $a = 2.5;
        while ($n > 2) {
            print "$n\n";
            $n /= $a;
        }
        ^D
        250000000000
        100000000000
        40000000000
        16000000000
        6400000000
        2560000000
        1024000000
        409600000
        163840000
        65536000
        26214400
        10485760
        4194304
        1677721.6
        671088.64
        268435.456
        107374.1824
        42949.67296
        17179.869184
        6871.9476736
        2748.77906944
        1099.511627776
        439.8046511104
        175.92186044416
        70.368744177664
        28.1474976710656
        11.2589990684262
        4.5035996273705

        There. 28 generations.

        Approximately.
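        Or skip the loop entirely with logarithms (same assumptions: a start of 5 and a growth factor of 2.5):

        # Closed-form generation count: how many factors of 2.5 fit
        # between the starting population and the target.
        my $target = 250_000_000_000;
        my $start  = 5;
        printf "%.1f generations\n", log($target / $start) / log(2.5);
        # prints roughly 26.9 -- about 28 generations counting the first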

Re: A script with a loop is running my computer Out of memory
by jethro (Monsignor) on Feb 02, 2011 at 21:53 UTC

    First, two off-topic optimizations: if you used an array instead of %generations, you wouldn't need to sort the keys (just store generation x in array slot x). Also, max(keys %generations) == $generations.

    Also, you are summing up a lot of random values between 0 and 5 (evenly distributed). The average number of offspring per person you will get is 2.5, if you do this often enough. Change your last line to

    printf "%${max_length_generation}s: %${max_length_children}s %s\n" +,$_,commify($generations{$_}),$generations{$_}/$generations{max($_-1, +1)};

    to see this effect. It also shows how to calculate the next generation directly, if you don't mind getting a clean statistical average instead of your randomly generated numbers.

    By the way, if you want the number of offspring to not be evenly distributed (i.e., families with two kids should be more common than families with five, at least in our century), you might define an array with a different distribution:

    my @distribution=(0,0,1,1,1,2,2,2,2,3,3,4,5);

    and in your loop use the random number to get one random value out of the array:

    my $children = $distribution[rand(@distribution)]; # instead of my $children = int(rand(6));
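    An equivalent way to weight the draw, for what it's worth, is a table of cumulative weights rather than duplicated array entries (the weights below are made up for illustration):

    # Weighted pick without duplicating array entries.
    my %weight = (0 => 3, 1 => 3, 2 => 4, 3 => 2, 4 => 1, 5 => 1);
    my $total  = 0;
    $total    += $_ for values %weight;

    sub pick_children {
        my $roll = rand($total);
        for my $kids (sort { $a <=> $b } keys %weight) {
            return $kids if ($roll -= $weight{$kids}) < 0;
        }
        return 5;    # floating-point edge case fallback
    }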

      Thank you for the idea of a distribution. I toyed with the idea while writing it but thought int(rand(6)) was easier. You are right, though: it should be weighted towards fewer children per person of the previous generation. However, I found a situation where I have to stop the loop: in the first few iterations of the main loop, a generation could end up with 0 children in it, and at that point going any further would be pointless. So while I could still push the generation with 0 children to an array instead of adding another key to the hash, max(keys %generations) != $generations. I am still working on the array-versus-hash idea; I still have to figure out everything I would have to change, and I am just more comfortable dealing with hashes than arrays. Meanwhile, here is the updated script.

      #!/usr/bin/perl -l
      use strict;
      use warnings;
      use feature qw(say);
      use List::Util qw(sum max);
      use lib 'lib';
      use Base::Nifty qw(commify);

      say "How many children are in the first generation?";
      my $generation = <>;
      say "How many generations do you want to generate?";
      my $generations = <>;
      chomp($generation,$generations);

      my %generations = (
        1 => $generation,
      );
      my @distribution = (0,0,0,1,1,1,2,2,2,3,3,4,5,6);

      for (1..$generations) {
        my @generation;
        for (1..$generation) {
          my $children = $distribution[rand(@distribution)];
          push @generation, $children;
        }
        $generation = sum(@generation);
        last if $generation == 0; # was: last if sum(@generation == 0); -- a bug
        $generations{$_ + 1} = $generation;
      }

      my $max_length_generation = length(commify(max(keys %generations)));
      my $max_length_children   = length(commify(max(values %generations)));

      for (sort {$a <=> $b} keys %generations) {
        printf "%${max_length_generation}s: %${max_length_children}s\n",
          $_, commify($generations{$_});
      }
      Have a cookie and a very nice day!
      Lady Aleena
Re: A script with a loop is running my computer Out of memory
by JavaFan (Canon) on Feb 02, 2011 at 23:40 UTC
    Your problem is that for each generation you create an array that holds, for each child, the number of children it gets in the next generation.

    There's no need for that. Far easier is:

    my $people      = ...; # First generation.
    my $generations = ...;
    for (2 .. $generations) {
        $people += int rand 6 for 1 .. $people;
    }
    It can be made faster (by using a formula to calculate the number of children in each generation without the inner loop), but the above is simple, and most of all, won't run out of memory (assuming there's a few Mb available to start perl).
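    One such formula, for what it's worth, is the expected-value shortcut mentioned in other replies (2.5 children per person on average for int rand 6):

    # Deterministic variant: replace the inner loop with the expectation.
    my $people      = 5;
    my $generations = 20;
    $people = int($people * 2.5 + 0.5) for 2 .. $generations;
    print "$people\n";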

      Thanks for that tip, JavaFan. I used it, and you are so right: it is much faster and my memory doesn't run out! Unfortunately, I did run into another error, "Range iterator outside integer range". I haven't quite got that figured out just yet.

      I didn't use your exact fragment, but I think it is close enough.

      #!/usr/bin/perl -l
      use strict;
      use warnings;
      use feature qw(say);
      use List::Util qw(sum max);
      use lib 'lib';
      use Base::Nifty qw(commify);

      say "How many children are in the first generation?";
      my $generation = <>;
      say "How many generations do you want to generate?";
      my $generations = <>;
      chomp($generation,$generations);

      my %generations = (
        1 => $generation,
      );
      my @distribution = (0,0,0,1,1,1,2,2,2,3,3,4,5,6);

      for my $gen (1..$generations) {
        my $children = 0;
        for (1..$generation) {
          $children += $distribution[rand(@distribution)];
        }
        $generation = $children;
        last if ($children == 0);
        $generations{$gen + 1} = $generation;
      }

      my $max_length_generation = length(commify(max(keys %generations)));
      my $max_length_children   = length(commify(max(values %generations)));

      for (sort {$a <=> $b} keys %generations) {
        printf "%${max_length_generation}s: %${max_length_children}s\n",
          $_, commify($generations{$_});
      }

      The += did a whole lot of good! I haven't used that construction much in the past, so I don't automatically think of it. Hopefully one day it will sink in.

      Have a cookie and a very nice day!
      Lady Aleena
Re: A script with a loop is running my computer Out of memory
by wiredrat (Acolyte) on Feb 03, 2011 at 17:03 UTC
    You can also cut the number of 'concurrent' generations. Just keeping 3 generations is a good approximation for a human community. You can keep the number of offspring per generation in an array and keep the array 3 values long. So, borrowing ideas from previous replies:
    my @generations = ( $generation );
    for (2 .. $generations) {
        push @generations, $generations[-1] * 2.5;
        shift @generations if $#generations >= 3;
    }
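    If the goal is the living population under that three-generation window, the running total would then just be (a hypothetical addition):

    use List::Util qw(sum);
    # Everyone alive right now = the (at most) three generations kept.
    my $total_population = sum(@generations);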

      wiredrat, you got me thinking about what I was really after, and you are right that I want the total population. Up in Re^2: A script with a loop is running my computer Out of memory I was talking about the number 250 billion; well, that is the total population I am trying to reach without something clunking out on me. So I did a little finagling and came up with the following, which returns the number born into a generation along with the total population, which includes the previous two generations. So, while I am still storing every generation generated, the current code is faster even with the new addition of total population.

      #!/usr/bin/perl -l
      use strict;
      use warnings;
      use feature qw(say);
      use List::Util qw(sum max);
      use Data::Dumper;
      use lib 'lib';
      use Base::Nifty qw(commify);

      say "How many children are in the first generation?";
      my $generation = <>;
      say "How many generations do you want to generate?";
      my $generations = <>;
      chomp($generation,$generations);

      my %generations = (
        1 => {
          children => $generation,
          total_population => $generation,
        }
      );
      my @distribution = (0,0,0,1,1,1,2,2,2,3,3,4,5,6);

      for my $gen (2..$generations) {
        my $children = 0;
        for (1..$generation) {
          $children += $distribution[rand(@distribution)];
        }
        $generation = $children;
        last if ($children == 0);
        $generations{$gen}{children} = $generation;
        my $total_population = $generations{$gen}{children};
        $total_population += $generations{$gen - 1}{children} if $generations{$gen - 1};
        $total_population += $generations{$gen - 2}{children} if $generations{$gen - 2};
        $generations{$gen}{total_population} = $total_population;
      }

      my @children;
      my @population;
      for (1..$generations) {
        push @children,   $generations{$_}{children};
        push @population, $generations{$_}{total_population};
      }

      my $max_length_generation = length(commify(max(keys %generations)));
      my $max_length_children   = length(commify(max(@children)));
      my $max_length_population = length(commify(max(@population)));

      for (sort {$a <=> $b} keys %generations) {
        printf "%${max_length_generation}s: %${max_length_children}s %${max_length_population}s\n",
          $_, commify($generations{$_}{children}), commify($generations{$_}{total_population});
      }

      Thanks again for the idea!

      Have a cookie and a very nice day!
      Lady Aleena
Re: A script with a loop is running my computer Out of memory
by bart (Canon) on Feb 03, 2011 at 20:50 UTC
    I have a little script that is running my computer out of memory, and I was wondering if there might be a way around this problem.

    Someone a few days ago mentioned something called forking.

    Whatever you do, forking is probably not going to fix an "out of memory" problem. Why? Because at the moment you fork, you make an exact clone of your process. If your original process is using, say, 221MB of memory, then the child process will also be using 221MB.
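    A minimal sketch of why (my own illustration; modern kernels use copy-on-write, so the duplication becomes real as soon as either process writes to its data):

    # fork() gives the child the same logical memory image as the
    # parent, so a bloated parent implies an equally bloated child.
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: sees the same data structures the parent built
        exit 0;
    }
    waitpid($pid, 0);    # parent: reap the child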

    Is your script running in circles, and only running out of memory after a long time? Then a better strategy is to regularly wipe your memory inside the big loop: either restart the program by launching it again and then exiting, or undef your big data structures.
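    For the second option, that could look like this (assuming the hash from the original script):

    # Release the structure's memory back to perl between passes.
    undef %generations;    # or:  %generations = ();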