Re^4: Speeding up named capture buffer access

by SBECK (Chaplain)
on Dec 01, 2009 at 16:24 UTC (#810411)


in reply to Re^3: Speeding up named capture buffer access
in thread Speeding up named capture buffer access

I can't do that for the exact reason that I want to use named capture buffers. The regular expression is very complicated. It's basically of the form:

     $re = qr/($re1|$re2|...|$reN)/;
where each of the pieces may match a valid time. But some may match a partial time (perhaps only hours and minutes), some may match a 24-hour time and others may include an AM/PM string, some may include timezone information, and because there are so many ways to express times, some of them may even have the order of the fields changed, so I wouldn't want to depend on the order of the matches always being ($h,$mn,$s).

So, using numbered matches, I could do something like:

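     foreach $re ($re1,$re2,...) {
        # list assignment in scalar context yields the number of captures,
        # so the loop stops at the first regexp that matches
        last  if (($h,$mn,$s) = $string =~ $re);
     }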
except that that won't work because I'm relying on the order of matches (and assuming that there will always be an $h match, etc).

With named capture buffers, I can do this so elegantly. I define each regexp, name the capture buffers (in whatever order they come in) and the named buffer will contain all the ones that actually matched. Maintaining the complicated regexps in Date::Manip is about 100 times easier now!
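To make the idea concrete, here is a minimal sketch of that approach. The two sub-patterns and the `parse_time` helper are hypothetical stand-ins, not the actual Date::Manip regexps; the only real feature relied on is that `%+` (Perl 5.10 and later) exposes just the named buffers that took part in the match.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Hypothetical sub-patterns; note that the fields appear in different orders.
my $re_hms  = qr/(?<h>\d{1,2}):(?<mn>\d{2}):(?<s>\d{2})/;
my $re_ampm = qr/(?<ampm>[AP]M) (?<h>\d{1,2}):(?<mn>\d{2})/i;
my $re      = qr/^(?:$re_hms|$re_ampm)$/;

sub parse_time {
    my ($string) = @_;
    return unless $string =~ $re;

    # %+ lists only the named buffers that are defined after the match.
    # Copy it right away, because the next successful match resets it.
    my %fields = %+;
    return \%fields;
}

for my $string ('10:30:45', 'PM 3:15') {
    my $fields = parse_time($string) or next;
    print "$string: ",
        join(', ', map { "$_=$fields->{$_}" } sort keys %$fields), "\n";
}
```

Adding another alternative only means writing one more sub-pattern with the right names; nothing has to be recounted.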


Replies are listed 'Best First'.
Re^5: Speeding up named capture buffer access
by BrowserUk (Pope) on Dec 01, 2009 at 17:25 UTC

    Okay, I can see what is driving your requirements. One possibility that might prove a little quicker is Alternative-capture-group-numbering*, which allows you to re-use capture numbering within different match alternatives.

    The example given at the reference above is very pertinent to your use. It might at least be worth benchmarking.

    *Unfortunately #anchors no longer seem to work at perldoc since they added that annoying moving menu :(
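For readers who have not met it, alternative capture group numbering is the `(?|...)` branch-reset construct in Perl 5.10 and later. The following is only a sketch with two made-up alternatives and a hypothetical `parse_hms` helper. Note that branch reset renumbers the groups in each alternative but does not reorder them, so it helps only where every alternative captures its fields in the same order.

```perl
use strict;
use warnings;
use 5.010;    # the (?|...) branch-reset construct needs Perl 5.10 or later

# Hypothetical alternatives: H:MM:SS, or a compact HHMMSS form.
# Inside (?|...) each alternative starts numbering its groups at $1 again.
my $re = qr/^(?|(\d{1,2}):(\d{2}):(\d{2})|(\d{2})(\d{2})(\d{2}))$/;

sub parse_hms {
    my ($string) = @_;

    # The list assignment evaluates to the number of captures returned,
    # which is zero when the match fails.
    my ($h, $mn, $s) = $string =~ $re or return;
    return ($h, $mn, $s);
}

for my $string ('9:05:30', '235959') {
    my @hms = parse_hms($string) or next;
    print "$string => @hms\n";
}
```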


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I'm using a dynamic width style override for perldoc, and the anchors work fine for me. (It also rearranges the sidebars and locks the floating bar to the top and generally squeezes out the unneeded fluff.)

      Your anchored link, for example, looks great to me

      How to make perldoc.perl.org resizable too!

        I don't use FF, and only go to perldoc when I want to post a reference as I have it all locally. Opera has a button(*) that just disables all the CSS which when clicked gives me a nice black on white, full width, basic html page with good size and readable fonts.

        But it does bug me when authors break basic html stuff, and force everyone else in the world to work out how to compensate, just so they can play a few stupid tricks and look 'hip'. It's just style over substance.

        (*)It works wonders on many sites: microsoft, the beeb (though it can't correct for their insistence on using fixed-width tabular formatting that only uses half of my screen). Many more are vastly more readable and usable after that simple click.

        I also get bugged by these little footnotes that show up:

        This page is best viewed in an up-to-date web browser with style sheets (CSS) enabled. While you will be able to view the content of this page in your current browser, you will not be able to get the full visual experience. Please consider upgrading your browser software or enabling style sheets (CSS) if you are able to do so.

        I don't go there for some art-fart's idea of a "full visual experience"--I go there to read the news!



      Good suggestion. I'm aware of them... I read up on them as I was learning about named buffers, but I quickly skipped past them in favor of the much nicer named buffers, so I did NOT try to implement them. I'll spend some time looking at them as a possible compromise between the two.

Re^5: Speeding up named capture buffer access
by JadeNB (Chaplain) on Dec 01, 2009 at 16:54 UTC
    If the optimisation is that important, couldn't you just ‘pre-compile’ by doing the counting (laboriously and by hand, if necessary) once and taking into account the various possibilities? It's less elegant, but it seems that speed rather than elegance is your primary driver (with elegance a secondary bonus).
        $re = qr/(?<h1>...)(?<m1>...)(?<s1>...)|(?<h2>...)(?<m2>...)(?<s2>...)/;
        $string =~ $re;
        ( $h, $m, $s ) = ( $1 || $4, $2 || $5, $3 || $6 );

      I'm more interested in elegance (and maintainability) than in the optimization. I rewrote all of the regexps in Date::Manip to use named buffers, and I'm not interested in going back to numbered buffers.

      As an example, in one place in Date::Manip, I match a set of related regular expressions that match various date strings, and there are 23 different possibilities containing 65 different matches between them (NOT all in the same order), so manually counting all of the match positions, while doable, basically renders that code static and unmaintainable... a simple change to the regexps leads to a very tedious and error-prone piece of work to maintain it. I think that's the worst case... but there are several other cases that are almost as bad.

      That said, I want as much optimization as I can, within that constraint, and that's the basis for my question.

        I don't know how it compares for speed — probably slower due to the sub calls — but here's an alternative.

            use strict;
            use warnings;
            use re 'eval'; # Should be scoped better.

            sub rc($) {
                my $ofs = @- + shift;
                return substr($_, $-[$ofs], $+[$ofs] - $-[$ofs])
            }

            sub compile_pat { qr/$_[0]/ }

            my @s_months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
            my $s_months = compile_pat join '|', @s_months;
            local our %s_months = map { $s_months[$_] => $_+1 } 0..$#s_months;

            my @pats = (
                qr/ (\d{4})-(\d{2})-(\d{2}) (?{[ rc-3, rc-2, rc-1 ]})/x,
                qr/ (\d{2})($s_months)(\d{4}) (?{[ rc-1, $s_months{rc-2}, rc-3 ]})/x,
            );

            my $pat = compile_pat join '|', @pats;

            for (qw( 2009-12-01 01Dec2009 01-12-2009 )) {
                local our ($y,$m,$d);
                if (/$pat(?{ ($y,$m,$d) = @{$^R} })/) {
                    printf("%s => %04d-%02d-%02d\n", $_,$y,$m,$d);
                } else {
                    printf("%s => [No match]\n", $_);
                }
            }

        Bonus: $pat can be calculated once and stored in a file.

        As an example, in one place in Date::Manip, I match a set of related regular expressions that match various date strings, and there are 23 different possibilities containing 65 different matches between them (NOT all in the same order)
        You've mentioned several times the need to work around the fact that you don't know which of many alternatives matched. Would it be possible, instead of

            $string =~ /$re1|$re2/ and ( $h, $m, $s ) = ...

        to do

            $string =~ $re1 and ( $h, $m, $s ) = ...
              or $string =~ $re2 and ( $h, $m, $s ) = ...

        and just have to worry about the order for individual regexes (rather than trying to find one order that works for all regexes); or does that also fall afoul of the maintainability requirement? Note that this approach means that introducing one new regex involves one simple counting problem, rather than one big counting problem that could interfere with all the old counts.
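One way to sketch the per-regex idea is to keep each regexp next to the names of its own capture groups, so the counting stays local to a single entry. The `@formats` table, its two patterns, and the `parse_time` helper below are hypothetical illustrations, not code from Date::Manip.

```perl
use strict;
use warnings;

# Hypothetical table: each entry pairs a regexp with the field names of its
# capture groups, in the order they appear in that regexp alone.
my @formats = (
    [ qr/^(\d{1,2}):(\d{2}):(\d{2})$/,  [qw(h mn s)]    ],
    [ qr/^([AP]M) (\d{1,2}):(\d{2})$/i, [qw(ampm h mn)] ],
);

sub parse_time {
    my ($string) = @_;
    for my $format (@formats) {
        my ($re, $names) = @$format;

        # Try the next regexp if this one returns no captures (no match).
        my @values = $string =~ $re or next;

        # Hash slice: map this regexp's positional captures to field names.
        my %fields;
        @fields{@$names} = @values;
        return \%fields;
    }
    return;    # nothing matched
}

for my $string ('10:30:45', 'PM 3:15') {
    my $fields = parse_time($string) or next;
    print "$string: ",
        join(', ', map { "$_=$fields->{$_}" } sort keys %$fields), "\n";
}
```

The trade-off is that the regexps are tried one at a time instead of as a single alternation, so whether it is faster than named buffers would need benchmarking.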
