No garbage collection for my-variables

Dear monks,

recently, I've learned something from ikegami in Re: out of memory problem.

Apparently, the memory that is (directly) occupied by my-variables is never automatically freed. The term "directly" includes the case where a my-variable holds a (very long) string.

As ikegami has demonstrated, a string buffer will grow when needed, but it will never (automatically) shrink or disappear. Apparently this is part of some kind of optimization to avoid constantly reallocating buffers for code that is used often.

I've shown that a buffer will only be reused for the same variable. This means that it will not be reused for a variable in another subroutine. This is were I see a big problem. In large programs there are thousands of lexical variables, and not all of them are used more than a few times, but all of them retain their buffers. Even for a buffer that it reused often, this is not optimal: Once it will have a large chunk of data, it will stay at that size, regardless how small the data is in the next calls.

And I've done my homework and done a super search. As a matter of fact, this topic has been discussed before (Garbage collection of 'my' variables, Re: Tracking Memory Leaks).

The commonly suggested workarounds are:

undef-ing variables after using them. (Actually this is not always practicable, e.g. in the case where you want to return the scalar)
Designing the code so that it works on references or aliases.

Alright, but the problem is that code is generally not designed like this. Of course, you can design your code this way if you plan to handle large data. However, almost all serious projects use external code that they haven't written themselves. In my search for the most obvious example, I found Encode.pm:

Consider this code:

#!/usr/bin/perl
use strict;
use warnings;
use Encode;


sub init {
    encode('utf8', 'x' x 100_000_000);

    return ();
}

print "starting\n";
sleep 5;
print "initializing\n";
init();
print "initialized\n";
sleep 5;
print "cleaning\n";
undef &Encode::encode;
sleep 5;
[download]

If you watch this program's memory consumption, you'll find that it will use approximately 288MB after "initialized" has been printed. After "cleaning" has been printed, the amount will shrink considerably to 98MB. (Actually it will shrink even more if you wipe out the "init" subroutine itself, I guess this is because of the large string constant.)

Responsible is this code in Encode.pm:

sub encode($$;$)
{
    my ($name, $string, $check) = @_;
# ...
    my $octets = $enc->encode($string,$check);
    $_[1] = $string if $check and !($check & LEAVE_SRC());
    return $octets;
}
[download]

Both $string and $octets hold our dear string, and (like I) the author obviously thought that they don't need to free its memory.

I've named the subroutine "init" to suggest that this is code that will only be used at the very start of a long program lifetime, which means that the long string buffer will linger around needlessly.

So, what would I be supposed to do? Don't use Encode and do my character transcoding myself? Or should I actually use "undef" to clean all the subs that I have used? Consider that my initialization code loads an XML configuration file. I'd have to clean most of the namespaces of XML::Simple, XML::Parser and whatelse. And if I actually plan to continue using these modules, I'd have to wipe out "%INC", then require them again, not very nice. (Just an example; I've not really checked these modules, so please don't be offended if you are the author and have considerately undef-ed every variable.)

I've not looked for this optimization in the perl source yet, but I'd really like someone to explain why it is needed. I can agree that it would not be performant to do a lot of malloc/mmap/munmap/brk for every string that is copied, however IMHO there are situations where perl should find some way to realize that any of the following cannot be performant either:

Holding (several) SVs in memory that occupy lots and lots of memory pages
Holding like 1000 SVs in memory where neither has been used more than once

I'll conclude with an example snippet that demonstrates how you can eventually get your computer to use excessive amounts of memory or even swap:

perl -lwe 'my $code = join "", map {
   "sub foo$_ { my \$var = q(x) x 1_000_000; }" } 1..1000;
   eval $code; die if $@;
   for (1..1000) { sleep 1; "foo$_"->() }'
[download]

Because of the "sleep", you can run this snippet and watch it indulge itself by eating one megabyte per second (in GNU, use "top" and press M).

The snippet uses string-eval to generate a lot of subroutines like this:

sub foo1 { my $var = q(x) x 1_000_000; }
sub foo2 { my $var = q(x) x 1_000_000; }
# and so forth...
[download]

then calls them one after another.

Well, that was a large chunk of text now, I've tried to ease your reading by using bold text, I hope that perlmonks' buffers will eventually be freed from this text, and I hope that I haven't missed something obvious.

Back to Meditations