Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Memory leak in unicode substitution

by am12345 (Novice)
on Aug 30, 2019 at 01:25 UTC ( #11105281=perlquestion: print w/replies, xml ) Need Help??

am12345 has asked for the wisdom of the Perl Monks concerning the following question:

I was trying to trace a memory leak in a reasonably large code base we have. In the end I traced it to a very innocuous looking text substitution like this:

$text =~ s/\x{2122}/(TM)/sg;

I distilled this into a simple test case and it seems to really be a pretty heavy memory leak. I even went as far as trying a number of perl docker images from 5.10.1 to 5.30 and it resulted in a memory leak in each one I tried.

How can something so simple be leaking and not be noticed by anyone? I must be missing something.

Here is a no-dependencies test case I tried it with. The wtf_leak() leaks memory, and wtf_noleak() does not. The difference is that one operates on characters and the other on bytes.

root@bbe78e26dc1f:~/rsh# cat memtest-u2a.pl #!/usr/bin/env perl use warnings; use strict; use Encode; my $mem_initial=psmem(); my $mem_last=$mem_initial; print "INITIAL: $mem_initial\n"; foreach my $cycle (0..($ARGV[0] || 200)) { foreach my $i (0..100) { my $u="x"; wtf_leak($u); ### wtf_noleak($u); ### $u =~ s/\x{2122}/(TM)/sg; ### $u =~ s/b/(TM)/sg; } my $mem=psmem(); my $usage_total=$mem - $mem_initial; my $usage_last=$mem - $mem_last; if($usage_last > 0) { print "---------------- CYCLE: $cycle, since-last $usage_last, + total $usage_total, initial $mem_initial, current $mem\n"; } $mem_last=$mem; } my $mem=psmem(); my $usage_total=$mem - $mem_initial; print "FINAL: $mem, LEAKED $usage_total\n"; exit 0; ############################################################## sub psmem { ### chomp(my $mem=`ps -h -o rss -p $$`); chomp(my $mem=`ps -h -o vsz -p $$`); ### dprint "mem=$mem"; return 0 + $mem; } # Leaks memory! # sub wtf_leak { my $text=shift; $text =~ s/\x{2122}/(TM)/sg; return $text; } # DOES NOT leak memory! # my $retm; sub wtf_noleak { my $text=shift; $retm||=Encode::encode('utf8',"\x{2122}"); my $blob=Encode::encode('utf8',$text); $blob=~s/$retm/(TM)/sg; my $ntext = Encode::decode('utf8',$blob); ### print ".... |$text| => |$ntext|\n"; return $ntext; } root@bbe78e26dc1f:~/rsh# perl memtest-u2a.pl INITIAL: 9900 ---------------- CYCLE: 9, since-last 136, total 136, initial 9900, cu +rrent 10036 ---------------- CYCLE: 22, since-last 132, total 260, initial 9900, c +urrent 10160 ---------------- CYCLE: 34, since-last 132, total 392, initial 9900, c +urrent 10292 ---------------- CYCLE: 46, since-last 132, total 516, initial 9900, c +urrent 10416 ---------------- CYCLE: 58, since-last 136, total 644, initial 9900, c +urrent 10544 ---------------- CYCLE: 71, since-last 136, total 772, initial 9900, c +urrent 10672 ---------------- CYCLE: 83, since-last 132, total 896, initial 9900, c +urrent 10796 ---------------- CYCLE: 95, since-last 132, total 1020, initial 9900, +current 10920 ---------------- CYCLE: 108, since-last 136, total 1156, initial 9900, + current 11056 ---------------- CYCLE: 120, since-last 136, total 1284, initial 9900, + current 11184 ---------------- CYCLE: 132, since-last 132, total 1408, initial 9900, + current 11308 ---------------- CYCLE: 145, since-last 132, total 1532, initial 9900, + current 11432 ---------------- CYCLE: 157, since-last 136, total 1668, initial 9900, + current 11568 ---------------- CYCLE: 170, since-last 136, total 1796, initial 9900, + current 11696 ---------------- CYCLE: 182, since-last 132, total 1920, initial 9900, + current 11820 ---------------- CYCLE: 194, since-last 136, total 2048, initial 9900, + current 11948 FINAL: 11940, LEAKED 2040 root@bbe78e26dc1f:~/rsh# perl -v This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-li +nux-gnu Copyright 1987-2019, Larry Wall Perl may be copied only under the terms of either the Artistic License + or the GNU General Public License, which may be found in the Perl 5 source ki +t. Complete documentation for Perl, including FAQ lists, should be found +on this system using "man perl" or "perldoc perl". If you have access to + the Internet, point your browser at http://www.perl.org/, the Perl Home Pa +ge.

Replies are listed 'Best First'.
Re: Memory leak in unicode substitution
by haukex (Bishop) on Aug 30, 2019 at 08:04 UTC

    I can confirm the leak here too (Linux), with something as simple as the following, on Perl versions 5.18 and up, earlier versions don't leak. Definitely report this via perlbug.

    #!/usr/bin/env perl use warnings; use strict; for ( 0..500 ) { for ( 0..1000 ) { (my $x = "x") =~ s/\x{2122}/(TM)/sg; } system("ps -hovsz $$"); }
Re: Memory leak in unicode substitution
by vr (Curate) on Aug 30, 2019 at 09:09 UTC
    "x" =~ / [\x{1234}] /x for 0 .. 100_000; "x" =~ /(?: \x{1234} | \x{1234} )/x for 0 .. 100_000; "\x{4321}" =~ / \x{1234} /x for 0 .. 100_000;

    Curious, the bug doesn't bite if character is put in a class or dummy alternation. Most important, there's no bug if target string is utf8 itself. That's why, I think, it wasn't found sooner. Unicode in regexes most often means Unicode in texts.

      Good catch, can confirm. Surprisingly, simply adding use utf8; doesn't fix it though. Aren't all string literals to be Unicode when utf8 is in effect?


      holli

      You can lead your users to water, but alas, you cannot drown them.
        No. use utf8; just means UTF-8 is used in the source code (both in string literals or identifiers). But even use feature qw( unicode_strings ); doesn't help here.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Memory leak in unicode substitution
by Anonymous Monk on Aug 30, 2019 at 07:30 UTC
    This is definitely a real bug and I can reproduce it. Please report it to p5p, either by using "perlbug" tool or by mailing perlbug@perl.org (note that your email must contain word "perl" somewhere in the body, otherwise it may be eaten by the spam filter).
      Reported.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11105281]
Approved by Athanasius
Front-paged by Discipulus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (5)
As of 2021-03-09 08:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favorite kind of desktop background is:











    Results (132 votes). Check out past polls.

    Notices?