http://www.perlmonks.org?node_id=810415


in reply to Re^4: Speeding up named capture buffer access
in thread Speeding up named capture buffer access

If the optimisation is that important, couldn't you just ‘pre-compile’ by doing the counting (laboriously and by hand, if necessary) once and taking into account the various possibilities? It's less elegant, but it seems that speed rather than elegance is your primary driver (with elegance a secondary bonus).
$re = qr/(?<h1>...)(?<m1>...)(?<s1>...)|(?<h2>...)(?<m2>...)(?<s2>...)/;
$string =~ $re;
( $h, $m, $s ) = ( $1 || $4, $2 || $5, $3 || $6 );
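To make the counting idea concrete, here is a minimal runnable sketch. The two time formats ("HH:MM:SS" and "HHhMMmSSs") and the name `parse_time` are assumptions made up for illustration; they are not from Date::Manip. It also uses defined-or (`//`) where the snippet above uses `||`, so that a captured "0" would not be skipped.

```perl
use strict;
use warnings;
use 5.010;    # defined-or (//) needs Perl 5.10 or later

# Hypothetical formats: "HH:MM:SS" or "HHhMMmSSs".
# Groups 1-3 belong to the first alternative, groups 4-6 to the second.
my $re = qr/^(?:(\d\d):(\d\d):(\d\d)|(\d\d)h(\d\d)m(\d\d)s)$/;

sub parse_time {
    my ($string) = @_;
    return unless $string =~ $re;

    # Groups in the alternative that did not take part in the match are
    # undef, so defined-or selects whichever side actually matched.
    return ( $1 // $4, $2 // $5, $3 // $6 );
}

printf "%s:%s:%s\n", parse_time('12:34:56');
printf "%s:%s:%s\n", parse_time('07h08m09s');
```

The cost is visible even at this size: adding a group to the first alternative shifts every number used for the second.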

Re^6: Speeding up named capture buffer access
by SBECK (Chaplain) on Dec 01, 2009 at 17:30 UTC

    I'm more interested in elegance (and maintainability) than in the optimization. I rewrote all of the regexps in Date::Manip to use named buffers, and I'm not interested in going back to numbered buffers.

    As an example, in one place in Date::Manip, I match a set of related regular expressions that match various date strings, and there are 23 different possibilities containing 65 different matches between them (NOT all in the same order), so manually counting all of the match positions, while doable, basically renders that code static and unmaintainable... a simple change to the regexps leads to a very tedious and error-prone piece of work to maintain it. I think that's the worst case... but there are several other cases that are almost as bad.

    That said, I want as much optimization as I can, within that constraint, and that's the basis for my question.
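For comparison, here is a small sketch of the named-buffer style described above, assuming Perl 5.10 or later. The patterns and the name `parse_time_named` are hypothetical and much simpler than anything in Date::Manip. The same names are reused in both alternatives, in a different order, and `%+` returns the leftmost defined group carrying each name, so the calling code never counts positions.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Hypothetical formats: "HH:MM:SS" or "SSsMMmHHh" (fields in reverse order).
# The names h, m and s are deliberately reused in both alternatives.
my $re = qr/^(?:(?<h>\d\d):(?<m>\d\d):(?<s>\d\d)|(?<s>\d\d)s(?<m>\d\d)m(?<h>\d\d)h)$/;

sub parse_time_named {
    my ($string) = @_;
    return unless $string =~ $re;

    # $+{name} is the leftmost *defined* group with that name, so it does
    # not matter which alternative matched or where its groups sit.
    return ( $+{h}, $+{m}, $+{s} );
}

printf "%s:%s:%s\n", parse_time_named('12:34:56');
printf "%s:%s:%s\n", parse_time_named('56s34m12h');
```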

      I don't know how it compares for speed — probably slower due to the sub calls — but here's an alternative.

      use strict;
      use warnings;
      use re 'eval';  # Should be scoped better.

      sub rc($) {
          my $ofs = @- + shift;
          return substr($_, $-[$ofs], $+[$ofs] - $-[$ofs])
      }

      sub compile_pat { qr/$_[0]/ }

      my @s_months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
      my $s_months = compile_pat join '|', @s_months;
      local our %s_months = map { $s_months[$_] => $_+1 } 0..$#s_months;

      my @pats = (
          qr/ (\d{4})-(\d{2})-(\d{2}) (?{[ rc-3, rc-2, rc-1 ]})/x,
          qr/ (\d{2})($s_months)(\d{4}) (?{[ rc-1, $s_months{rc-2}, rc-3 ]})/x,
      );

      my $pat = compile_pat join '|', @pats;

      for (qw( 2009-12-01 01Dec2009 01-12-2009 )) {
          local our ($y,$m,$d);
          if (/$pat(?{ ($y,$m,$d) = @{$^R} })/) {
              printf("%s => %04d-%02d-%02d\n", $_,$y,$m,$d);
          } else {
              printf("%s => [No match]\n", $_);
          }
      }

      Bonus: $pat can be calculated once and stored in a file.

        This will take a bit of work to put in, and I'm not sure what the performance will be like, but it's worth at least trying.

      As an example, in one place in Date::Manip, I match a set of related regular expressions that match various date strings, and there are 23 different possibilities containing 65 different matches between them (NOT all in the same order)
      You've mentioned several times the need to work around the fact that you don't know which of many alternatives matched. Would it be possible, instead of
          $string =~ /$re1|$re2/ and ( $h, $m, $s ) = ...
      to do
          $string =~ $re1 and ( $h, $m, $s ) = ... or $string =~ $re2 and ( $h, $m, $s ) = ...
      and just have to worry about the order for individual regexes (rather than trying to find one order that works for all regexes); or does that also fall afoul of the maintainability requirement? Note that this approach means that introducing one new regex involves one simple counting problem, rather than one big counting problem that could interfere with all the old counts.
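One way the try-each-regex-in-turn idea could be sketched is with a table that pairs each regex with the order of its own groups. The formats, the `@formats` table, and the name `parse_time_seq` are all hypothetical, chosen only to keep the example short.

```perl
use strict;
use warnings;

# Each entry pairs a hypothetical regex with the order in which its three
# groups appear, so the counting stays local to that one regex.
my @formats = (
    [ qr/^(\d\d):(\d\d):(\d\d)$/,  [qw(h m s)] ],
    [ qr/^(\d\d)s(\d\d)m(\d\d)h$/, [qw(s m h)] ],
);

sub parse_time_seq {
    my ($string) = @_;
    for my $format (@formats) {
        my ( $re, $order ) = @$format;

        # In list context a successful match returns the captured groups;
        # a failed match returns an empty list, so we move on.
        my @captures = $string =~ $re or next;

        my %field;
        @field{@$order} = @captures;
        return @field{qw(h m s)};
    }
    return;    # nothing matched
}

printf "%s:%s:%s\n", parse_time_seq('12:34:56');
printf "%s:%s:%s\n", parse_time_seq('56s34m12h');
```

Adding a format here means adding one table entry and counting only that entry's groups, at the price of running up to one match attempt per entry.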

        That's how I had it originally... and when you've got 23 different possibilities, it adds unnecessary complexity. There are already 23 possibilities wherever I create the regular expressions, but now there are 23 possibilities wherever I use them as well.

        Worse is that some of the regular expressions are used in multiple places. When I modify a regular expression, I'd like to have it be done in one place (wherever the regexp is created) and not have to worry about it in some other place or places (wherever it's used). As it stands now, I can add new ways to express a date in one place (the routine where I create all my regexps), and they'll automatically go into effect in the various places they might be used.

        Not a big problem of course... but I'm a huge fan of Larry's principle of laziness.