PerlMonks  

Re^13: Memory leak question

by BrowserUk (Pope)
on Oct 06, 2010 at 14:33 UTC (#863797)


in reply to Re^12: Memory leak question
in thread Memory leak question

I believe you are being bitten by regex engine leaks.

Here's what I discovered.

  1. If I replace _iso8601_rx() with the bare minimum to parse the date/time in the test, the memory leaks disappear completely.
    my %cache;
    sub _iso8601_rx {
        my($self,$rx) = @_;
        my $dmt = $$self{'tz'};
        my $dmb = $$dmt{'base'};
        return $cache{ $rx } if exists $cache{ $rx };
    }
    $cache{cdate} = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $cache{ctime} = '(?<h>\d\d):(?<mn>\d\d):(?<s>\d\d)';
    $cache{fulldate} = "$cache{cdate}\\s+$cache{ctime}";
    1;
  2. However, if I change that to using the fully expanded regexes, it goes back to leaking like a sieve:

I thought that it was maybe the use of (so many) named captures, but I tried very hard to make them leak. A single regex with 175,000 named captures; matching /g against a string that contained 10,000 matches for them; in a (v.slow) loop. It grew very large, but once it maxed out, it didn't leak at all.

So then I remembered that I'd seen the regex trie optimisation cause problems with large alternations, but disabling it didn't change things.

Then I thought to try your monster regexes in a standalone script and run them directly on the sample date in a loop:

#! perl
use strict;

my %cache = ( ctime => <<'RXA', cdate => <<'RXB', fulldate => <<'RXC' );
##... monster regex initialisation elided;

my $refull  = qr[$cache{ fulldate }]x;
my $rectime = qr[$cache{ ctime }]x;
my $recdate = qr[$cache{ cdate }]x;

for (1..100e6) {
    "2010-02-01 01:02:03" =~ $refull;
    "2010-02-01 01:02:03" =~ $rectime;
    "2010-02-01 01:02:03" =~ $recdate;
}

it doesn't leak at all. Not a jot.

So, it's not just the monster regexes, but also how they're being used, or how the results are being used, that triggers the leak.

I'm kinda stuck for a direction in which to go now, but I hope that this will help you zero in on the cause. I'll keep looking.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^14: Memory leak question
by SBECK (Monk) on Oct 06, 2010 at 15:16 UTC
    I think you just became the first person in the world who knows Date::Manip parsing internals better than myself!

    Thanks for everything. I think this gives me enough information to track it down, though it's going to take me some time to really digest everything you said, but I think that this is enough to get me a long ways towards fixing the problem.
      I just found out a little bit about the leak.

      Using the original Date::Manip code, there's a line in the _parse_datetime_iso8601 function which looks like:
      ($y,$m,$d,...) = @+{qw(y m d ...)};
      where I just matched on the regexp from _iso8601_rx. If I comment this line out (and just set $y,$m,$d to some static values), there's no leaking. Note that I STILL match the regexp, I simply never refer to the %+ hash.

      Unfortunately, I wasn't able to reproduce this in a simple test script, so I still need to investigate further, but I think this is an interesting result.
      I was able to reproduce the leak in a trivial script, and I think that I'm down to the most basic illustration.
      $a = '(?<a>\d)';
      $b = '(?<b>\d)';
      $rx = qr/(?:${a}${b}|${a}:${b})/;
      #$string = "12";
      $string = "1:2";
      while (1) {
          $string =~ $rx;
          @tmp = @+{qw(a b)};
      }
      This leaks.

      If I modify $rx to include only one of the two choices, it doesn't leak. If I plug in a string which matches the first option (i.e. use the $string = "12" line), it doesn't leak. And if you comment out the @tmp = @+ line so you don't access %+, it doesn't leak.
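A bounded version of the loop makes the three cases above easier to compare side by side. This is only a sketch, not something from the thread: rss_kb() and growth_kb() are hypothetical helpers, the memory reading assumes a Linux /proc filesystem (it reports n/a elsewhere), and the iteration count is arbitrary. Whether any case actually grows will depend on the perl version in use.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): resident set size in kB,
# read from /proc/self/status. Linux only; returns undef elsewhere.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return undef;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)/;
    }
    return undef;
}

# Match $string against $rx $iterations times, reading %+ after each match,
# and return the growth in resident memory (kB), or undef if unmeasurable.
sub growth_kb {
    my ( $rx, $string, $iterations ) = @_;
    my $before = rss_kb();
    for ( 1 .. $iterations ) {
        $string =~ $rx;
        my @tmp = @+{qw(a b)};
    }
    my $after = rss_kb();
    return undef unless defined $before && defined $after;
    return $after - $before;
}

my $ca   = '(?<a>\d)';
my $cb   = '(?<b>\d)';
my $both = qr/(?:${ca}${cb}|${ca}:${cb})/;

# The three cases described above: label, regex, input string.
my @cases = (
    [ 'both branches, second one matches', $both,           '1:2' ],
    [ 'both branches, first one matches',  $both,           '12'  ],
    [ 'single branch',                     qr/${ca}:${cb}/, '1:2' ],
);

for my $case (@cases) {
    my ( $label, $rx, $string ) = @$case;
    my $kb = growth_kb( $rx, $string, 200_000 );
    printf "%-36s %s\n", $label, defined $kb ? "$kb kB" : 'n/a';
}
```

Raising the iteration count should make a real leak stand out from allocator noise, since a leaking case keeps growing while the others level off.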

      At this point, I guess I no longer believe that it is a Date::Manip problem. In other words, I don't think the above script is buggy... I think it points out a bug in perl itself. If you agree, I think I'll pass it on as a perl bug.

      Final (I hope) comment?
        If you agree, I think I'll pass it on as a perl bug.

        Yes, I absolutely agree. And your example demos the bug perfectly.

        Nice to know my instincts weren't too far off--I always suspect new features first. But try as hard as I might I couldn't arrive at the simple demo that leaked. Congratulations on that.

        The downside is you'll have to wait a while for the fix, but at least you now know.



        One final thought :)

        There is another way of doing "named captures", without using the (?<name>...) construct or %+ or %-. It's a bit um ... obscure, and may be slower, but it might allow you to work around the problem in the interim without radically altering your existing code.

        #! perl -slw
        use strict;
        use re qw[ eval ];
        use Data::Dump qw[ pp ];

        my $reY  = '(\d{4})(?{ $match{ y } = $^N })';
        my $reM  = '(\d{2})(?{ $match{ m } = $^N })';
        my $reD  = '(\d{2})(?{ $match{ d } = $^N })';
        my $reH  = '(\d{2})(?{ $match{ h } = $^N })';
        my $reMN = '(\d{2})(?{ $match{ mn } = $^N })';
        my $reS  = '(\d{2})(?{ $match{ s } = $^N })';
        my $reDT = "$reY-$reM-$reD\\s+$reH:$reMN:$reS";

        our %match = ();
        '2010-10-06 20:55:31' =~ $reDT;
        pp \%match;

        __END__
        c:\test>junk57.pl
        { d => "06", h => 20, "m" => 10, mn => 55, "s" => 31, "y" => 2010 }
        Or better still, cut out the middleman and put the captures straight into the named variables themselves (I wish named captures worked this way full stop):
        #! perl -slw
        use strict;
        use re qw[ eval ];
        use Data::Dump qw[ pp ];

        my $reY  = '(\d{4})(?{ $y = $^N })';
        my $reM  = '(\d{2})(?{ $m = $^N })';
        my $reD  = '(\d{2})(?{ $d = $^N })';
        my $reH  = '(\d{2})(?{ $h = $^N })';
        my $reMN = '(\d{2})(?{ $mn = $^N })';
        my $reS  = '(\d{2})(?{ $s = $^N })';
        my $reDT = "$reY-$reM-$reD\\s+$reH:$reMN:$reS";

        local our( $y, $m, $d, $h, $mn, $s );
        '2010-10-06 20:55:31' =~ $reDT;
        print "$y, $m, $d, $h, $mn, $s";

        __END__
        c:\test>junk57.pl
        2010, 10, 06, 20, 55, 31

        Note: The variables referenced inside the (?{ code }) blocks have to be global, but judicious use of local and our makes it reasonably convenient. Also, I've had iffy results using qr// with this. Never really understood why.
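One way the "judicious use of local and our" might look in practice is to keep the globals confined to a single parsing sub. The following is a minimal sketch under that assumption: parse_datetime() is a hypothetical name, not a Date::Manip function, and the pattern is kept as a plain string rather than qr// in line with the caveat above.

```perl
#! perl
use strict;
use warnings;
use re qw[ eval ];    # needed because the code blocks arrive via interpolation

# Each fragment stores its capture in a package variable through $^N.
my $reY  = '(\d{4})(?{ $y = $^N })';
my $reM  = '(\d{2})(?{ $m = $^N })';
my $reD  = '(\d{2})(?{ $d = $^N })';
my $reH  = '(\d{2})(?{ $h = $^N })';
my $reMN = '(\d{2})(?{ $mn = $^N })';
my $reS  = '(\d{2})(?{ $s = $^N })';
my $reDT = "$reY-$reM-$reD\\s+$reH:$reMN:$reS";

# Hypothetical wrapper: returns a hash ref of the fields, or nothing when
# the string does not match. %+ is never touched.
sub parse_datetime {
    my ($string) = @_;

    # 'our' makes the globals visible under strict; 'local' restores
    # their previous values when the sub returns.
    local our ( $y, $m, $d, $h, $mn, $s );

    return unless $string =~ $reDT;
    return { y => $y, m => $m, d => $d, h => $h, mn => $mn, s => $s };
}

my $parsed = parse_datetime('2010-10-06 20:55:31');
print join( ', ', map { "$_=$parsed->{$_}" } qw( y m d h mn s ) ), "\n";
```

Because the assignments happen as the engine walks the pattern, the variables can hold partial values after a failed match, which is why the sketch checks the match result before building the hash.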

        I realise that it would be considerable work to modify all your regex generators to use this method, but hey!

        You can always knock up a few regex to do it for you ;)

