
Re^13: Memory leak question

by BrowserUk (Pope)
on Oct 06, 2010 at 14:33 UTC (#863797)


in reply to Re^12: Memory leak question
in thread Memory leak question

I believe you are being bitten by regex engine leaks.

Here's what I discovered.

  1. If I replace _iso8601_rx() with the bare minimum needed to parse the date/time in the test, the memory leaks disappear completely.
    my %cache;
    sub _iso8601_rx {
        my($self,$rx) = @_;
        my $dmt = $$self{'tz'};
        my $dmb = $$dmt{'base'};
        return $cache{ $rx } if exists $cache{ $rx };
    }
    $cache{cdate}    = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $cache{ctime}    = '(?<h>\d\d):(?<mn>\d\d):(?<s>\d\d)';
    $cache{fulldate} = "$cache{cdate}\\s+$cache{ctime}";
    1;
  2. However, if I change that to using the fully expanded regexes, it goes back to leaking like a sieve:

I thought that it was maybe the use of (so many) named captures, but I tried very hard to make them leak: a single regex with 175,000 named captures, matching /g against a string that contained 10,000 matches for them, in a (very slow) loop. It grew very large, but once it maxed out, it didn't leak at all.

So then I remembered that I'd seen the regex trie optimisation cause problems with large alternations, but disabling it didn't change things.
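For anyone wanting to repeat that check, here is a minimal sketch of one way to switch the trie optimisation off. It assumes the `${^RE_TRIE_MAXBUF}` variable described in perlvar, where a negative value prevents the optimisation; the post does not say which method was actually used. The month alternation and the test string are made up for illustration.

```perl
#! perl
use strict;
use warnings;

# Assumption: a negative ${^RE_TRIE_MAXBUF} prevents the trie optimisation
# (per perlvar). It has to be set before the pattern is compiled.
${^RE_TRIE_MAXBUF} = -1;

# Hypothetical alternation, standing in for the real (much larger) ones.
my @months = qw( jan feb mar apr may jun jul aug sep oct nov dec );
my $alt    = join '|', @months;
my $re     = qr/\b(?<mon>$alt)\b/i;

# The pattern still matches as usual; only the internal optimisation differs.
my $matched = 'Date: 06 Oct 2010' =~ $re ? $+{mon} : undef;
print defined $matched ? "matched: $matched\n" : "no match\n";
```

Because the pattern interpolates `$alt`, it is compiled at run time, after the assignment to `${^RE_TRIE_MAXBUF}` has taken effect.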

Then I thought to try your monster regexes in a standalone script and run them directly on the sample date in a loop:

#! perl
use strict;

my %cache = ( ctime => <<'RXA', cdate => <<'RXB', fulldate => <<'RXC' );
##... monster regex initialisation elided

my $refull  = qr[$cache{ fulldate }]x;
my $rectime = qr[$cache{ ctime }]x;
my $recdate = qr[$cache{ cdate }]x;

for (1..100e6) {
    "2010-02-01 01:02:03" =~ $refull;
    "2010-02-01 01:02:03" =~ $rectime;
    "2010-02-01 01:02:03" =~ $recdate;
}

it doesn't leak at all. Not a jot.

So, it's not just the monster regexes, but also how they are being used, or how the results are being used, that triggers the leak.

I'm kinda stuck for a direction in which to go now, but I hope that this will help you zero in on the cause. I'll keep looking.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^14: Memory leak question
by SBECK (Pilgrim) on Oct 06, 2010 at 15:16 UTC
    I think you just became the first person in the world who knows Date::Manip parsing internals better than myself!

    Thanks for everything. It's going to take me some time to really digest everything you said, but I think this gives me enough information to get a long way towards tracking down and fixing the problem.
      I just found out a little bit about the leak.

      Using the original Date::Manip code, there's a line in the _parse_datetime_iso8601 function which looks like:
      ($y,$m,$d,...) = @+{qw(y m d ...)};
      where I just matched on the regexp from _iso8601_rx. If I comment this line out (and just set $y,$m,$d to some static values), there's no leaking. Note that I STILL match the regexp, I simply never refer to the %+ hash.

      Unfortunately, I wasn't able to reproduce this in a simple test script, so I still need to investigate further, but I think this is an interesting result.
      I was able to reproduce the leak in a trivial script, and I think that I'm down to the most basic illustration.
      $a = '(?<a>\d)';
      $b = '(?<b>\d)';
      $rx = qr/(?:${a}${b}|${a}:${b})/;
      #$string = "12";
      $string = "1:2";
      while (1) {
          $string =~ $rx;
          @tmp = @+{qw(a b)};
      }
      This leaks.

      If I modify $rx to include only one of the two choices, it doesn't leak. If I plug in a string which matches the first option (i.e. use the $string = "12" line), it doesn't leak. And if you comment out the @tmp = @+ line so you don't access %+, it doesn't leak.
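To put numbers on those comparisons, a bounded variant of the script can report how much the process grew. This is only a sketch: `rss_kb` is a hypothetical helper that reads `/proc/self/status`, so it works on Linux only, and the loop count is arbitrary. The variables are renamed from `$a`/`$b` so the script runs under `use strict`.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: resident set size in kB, read from /proc (Linux only).
# Returns undef where /proc/self/status is not available.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Same shape as the demo: both names appear in both alternatives.
my $a_rx   = '(?<a>\d)';
my $b_rx   = '(?<b>\d)';
my $rx     = qr/(?:${a_rx}${b_rx}|${a_rx}:${b_rx})/;
my $string = '1:2';    # matches the second alternative

my $before = rss_kb();
my @tmp;
for ( 1 .. 200_000 ) {
    $string =~ $rx;
    @tmp = @+{qw(a b)};    # comment this out to compare
}
my $after = rss_kb();

if ( defined $before and defined $after ) {
    printf "RSS changed by %d kB over the loop\n", $after - $before;
}
else {
    print "no /proc here; watch the process in a system monitor instead\n";
}
```

Running it once as shown, once with `$string = '12'`, and once with the `%+` access commented out gives three figures to compare.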

      At this point, I guess I no longer believe that it is a Date::Manip problem. In other words, I don't think the above script is buggy... I think it points out a bug in perl itself. If you agree, I think I'll pass it on as a perl bug.

      Final (I hope) comment?

        One final thought :)

        There is another way of doing "named captures", without using the construct or %+ or %-. It's a bit um ... obscure, and may be slower, but it might allow you to workaround the problem in the interim without radically altering your existing code.

        #! perl -slw
        use strict;
        use re qw[ eval ];
        use Data::Dump qw[ pp ];

        my $reY  = '(\d{4})(?{ $match{ y } = $^N })';
        my $reM  = '(\d{2})(?{ $match{ m } = $^N })';
        my $reD  = '(\d{2})(?{ $match{ d } = $^N })';
        my $reH  = '(\d{2})(?{ $match{ h } = $^N })';
        my $reMN = '(\d{2})(?{ $match{ mn } = $^N })';
        my $reS  = '(\d{2})(?{ $match{ s } = $^N })';
        my $reDT = "$reY-$reM-$reD\\s+$reH:$reMN:$reS";

        our %match = ();
        '2010-10-06 20:55:31' =~ $reDT;
        pp \%match;

        __END__
        c:\test>junk57.pl
        { d => "06", h => 20, "m" => 10, mn => 55, "s" => 31, "y" => 2010 }
        Or better still, cut out the middleman and put the captures straight into the named variables themselves (I wish named captures worked this way full stop):
        #! perl -slw
        use strict;
        use re qw[ eval ];
        use Data::Dump qw[ pp ];

        my $reY  = '(\d{4})(?{ $y = $^N })';
        my $reM  = '(\d{2})(?{ $m = $^N })';
        my $reD  = '(\d{2})(?{ $d = $^N })';
        my $reH  = '(\d{2})(?{ $h = $^N })';
        my $reMN = '(\d{2})(?{ $mn = $^N })';
        my $reS  = '(\d{2})(?{ $s = $^N })';
        my $reDT = "$reY-$reM-$reD\\s+$reH:$reMN:$reS";

        local our( $y, $m, $d, $h, $mn, $s );
        '2010-10-06 20:55:31' =~ $reDT;
        print "$y, $m, $d, $h, $mn, $s";

        __END__
        c:\test>junk57.pl
        2010, 10, 06, 20, 55, 31

        Note: The variables referenced inside the (?{ code }) blocks have to be global, but judicious use of local and our makes it reasonably convenient. Also, I've had iffy results using qr// with this. Never really understood why.
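As a rough illustration of that note, the globals can be confined to a parsing routine with `local`, so callers only ever see the returned list. This is a sketch, not Date::Manip code: `parse_date` and `$reDate` are made-up names, and the pattern is kept as a plain string rather than a qr// object.

```perl
#! perl
use strict;
use warnings;
use re qw[ eval ];

# Package globals that the (?{ code }) blocks assign to.
our ( $y, $m, $d );

# Plain string pattern; it is compiled when the match runs.
my $reDate = '(\d{4})(?{ $y = $^N })-(\d{2})(?{ $m = $^N })-(\d{2})(?{ $d = $^N })';

# Hypothetical wrapper: returns ( year, month, day ) or an empty list.
sub parse_date {
    my ($string) = @_;
    local ( $y, $m, $d );    # previous values are restored on exit
    return unless $string =~ $reDate;
    return ( $y, $m, $d );
}

my @parts = parse_date('2010-10-06');
print "@parts\n";            # prints "2010 10 06"
```

The `local` means a failed or nested parse cannot leave stale values behind in the globals.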

        I realise that it would be considerable work to modify all your regex generators to use this method, but hey!

        You can always knock up a few regex to do it for you ;)


        "If you agree, I think I'll pass it on as a perl bug."

        Yes, I absolutely agree. And your example demos the bug perfectly.

        Nice to know my instincts weren't too far off--I always suspect new features first. But try as hard as I might I couldn't arrive at the simple demo that leaked. Congratulations on that.

        The downside is you'll have to wait a while for the fix, but at least you now know.
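While waiting, another possible stopgap for a pattern shaped like the demo is the branch-reset group `(?|...)`, available since perl 5.10, which lets both alternatives fill the same numbered captures so `%+` is never consulted. This is a sketch under that assumption only; it has not been checked here against an affected perl.

```perl
#! perl
use strict;
use warnings;

# Inside (?| ... ) each alternative restarts capture numbering,
# so $1 and $2 are set whichever branch matches.
my $rx = qr/(?|(\d)(\d)|(\d):(\d))/;

my @results;
for my $string ( '12', '1:2' ) {
    # A match in list context returns the captures directly.
    if ( my ( $first, $second ) = $string =~ $rx ) {
        push @results, "$first,$second";
    }
}
print "$_\n" for @results;    # prints "1,2" twice
```

The trade-off is losing the names, so it suits small patterns better than the large generated ones.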


