Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

New regex trick...

by japhy (Canon)
on Jul 22, 2002 at 19:11 UTC ( #184183=perlmeditation: print w/ replies, xml ) Need Help??

I've got a new regex trick up my sleeve, and although I'm proposing a patch for it, I'm not sure it'll be accepted. It's a new pseudo-anchor, \K, which tells the engine to pretend it just started matching. That's not a very good explanation, so let me use an example:
# code to remove '.z' from $str # old way (slow) $str = join '.', 'a' .. 'z'; $str =~ s/(.*)\..*/$1/; # new way (fast) $str = join '.', 'a' .. 'z'; $str =~ s/.*\K\..*//;
I'll post more soon, when I have a patch and what-not, but I hope this gives you a good idea of what this anchor does.

Another way of looking at it is it lets you have a variable-width look-behind at the beginning of your regex... it allows your regex to have a prelude, as it were.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Comment on New regex trick...
Download Code
Re: New regex trick...
by FoxtrotUniform (Prior) on Jul 22, 2002 at 19:18 UTC

    What's the mnemonic for \K?

    Update: Ah, got it.

    This is sort of like Prolog's "cut", isn't it? Match a prefix, then never backtrack back past the \K anchor to satisfy the regex. Can you have multiple \Ks in a single regex? I can see Abigail-II having lots of fun with this.

    --
    The hell with paco, vote for Erudil!
    :wq

      There are two. K = Keep, as in, "keep everything we've just matched", because I envision it being used in substitutions. The other mnemonic is "\K isn't used by anything yet". ;)

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Yes, you can have multiple \K's in a regex, although only the last one really means anything, should the regex succeed. I'm actually curious why you would need more than one in the same branch, but it's perfectly legal.

      But no, it's not cut. Cut is (?>pat). This is just telling the regex engine to pretend it just started matching, that $& is defined starting HERE.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: New regex trick...
by vladb (Vicar) on Jul 22, 2002 at 19:23 UTC
    So, to simply remove '.z' you go in all the trouble of doing the '.*' (glob all) match etc etc. Whereas a shorter version must be faster:
    $str =~ s/\..$//;
    Unless I'm missing some other deeply nested point {griN}

    _____________________
    # Under Construction
      Yours is a very special case, because we know there's only one character after the last ".". Here is a benchmark I ran with my new patch:
      use Benchmark 'cmpthese'; my $str = join '.', 'a' .. 'z'; cmpthese(-5, { old => sub { (my $x = $str) =~ s/(.*)\..*/$1/ }, new => sub { (my $x = $str) =~ s/.*\K\..*// }, }); __END__ new: 109650.19/s (n=578953) old: 45258.35/s (n=241227)
      ... which is an incredible speed increase!

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: New regex trick...
by erikharrison (Deacon) on Jul 22, 2002 at 19:33 UTC

    I understand (I think) why this is fast. What I don't understand is how it works. I'll admit that I am not much of regex hacker, but how does the variable length look behind now how far to look, in the case of quantifiers, especially greedy ones? It seems that the goal of the anchor is to prevent backtracking (hence the speed gains) but how does it now when to stop? If the \K anchor tells the regex engine to watch for the oncoming regex and stop matching there, whats the difference between \K and a minimal match, or negated character class? If the engine doesn't stop at the first instance of, say '.' (as in your example) how do you keep it from backtracking (which is where I'm presuming the speed gains come from)? Or is \K an optimization using sexegers?

    Now, japhy I have absolute faith in your pattern - foo, so I take it on faith that this works. What I'm curious about is how. Am I missing something in thinking that the performance gains here are related to backtracking?

    Cheers,
    Erik

    Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet

      You're thinking too hard. I cheated. My patch just fools the regex engine into thinking it hasn't actually started matching yet. Here's a drawn-out example.
      $str = "abc.def.ghi.jkl"; $str =~ s{ .* # match as much as you can \K # and then pretend HERE is where we start \. .* # then a . and anything else }{}x; # replace with nothing __END__ abc.def.ghi.jkl $& AAAAAAAAAAA .* "abc.def.ghi" \K "" B \. "." BCCC .* ".jkl"
      Does that help you see what I do? My patch consists of a couple lines of support, but this is the beef:
      case KEEP: PL_regstartp[0] = locinput - PL_bostr; break;
      That's what happens when the regex engine encounters the \K. The rest of the patch is just creating the "KEEP" node, and telling toke.c that "\K" is a valid escape sequence.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Okay, I know it's because I'm dumb, but I still don't get it. Please don't yell at me, but why does the \K anchor keep .* from matching .jkl? And if it backtracks like normal, then where does the speed come from? I think that may be the essence of my confusion - why is this faster?

        If I don't get it this time, I'll give up and just trust it ;-).

        Cheers,
        Erik

        Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet

Re: New regex trick...
by broquaint (Abbot) on Jul 22, 2002 at 19:34 UTC
    So \K basically means don't worry, I only started matching *here*, where here is the position where the previous atom of the regex stopped matching e.g
    my $str = "foo bar baz"; # will 'start' matching here ^ $str =~ s< (?: \w+ [ ] )+ \K \w+ >()x; print '[' . $str . ']'; __output__ [foo bar ]
    Or should I just go back to implementing pointers in perl?
    HTH

    _________
    broquaint

    update: added missing whitespace in output (odd how an additional character requires so many to explain ;)

      You've got the right idea (except for a missing whitespace in your output).

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: New regex trick...
by dws (Chancellor) on Jul 22, 2002 at 20:06 UTC
    It's a new pseudo-anchor, \K, which tells the engine to pretend it just started matching.

    What's the impact on pos() if you fake out the starting point? Does pos() still reflect the start of matching, or does it reflect the start of the last \K?

      Present an example that explains your question. pos() refers to the end of the last global match. Are you asking about any interference between \K and \G? I do not foresee any.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        pos() refers to the end of the last global match.

        Oops, you're right. The code fragement I was thinking of used pos() - $delta to get the starting point.

Re: New regex trick...
by japhy (Canon) on Jul 22, 2002 at 20:18 UTC
    If this patch doesn't go through (and I can see reasons why it wouldn't, but I'd like it to go through), the same functionality can be achieved by my new module, Regexp::Keep. It's not on CPAN yet, but it's an XS module that does two things:
    1. filters regexes for the \K escape sequence, and turns them into (?{Regexp::Keep::KEEP})...
    2. and defines an XS function, KEEP(), that does what \K is supposed to do
    Of course, constant-overloading is nasty, so I hope my patch is received well.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Can you post some benchmarks for the module? Or are the ones you posted above from the module (I'm assuming not).
        Here's the code:
        use Benchmark 'cmpthese'; use Regexp::Keep; my $s = "abc.def.ghi.jkl"; cmpthese(-5, { japhy => sub { (my $x = $s) =~ s/.*\K\..*// }, old => sub { (my $x = $s) =~ s/(.*)\..*/$1/ }, });
        In 5.6.1:
        old = 40280.54/s japhy = 74338.79/s
        And with 5.8.0 (using the module, NOT my patch):
        old = 58188.18/s japhy = 102409.25/s
        So there's an appreciable speed-up.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: New regex trick...
by dimmesdale (Friar) on Jul 24, 2002 at 21:45 UTC
    If you're looking for input, then I'd have to say that that's a wonderful idea. It's odd, because I was just wishing there was something like this over the past week -- for one, I think it makes the substitutions look a little bit cleaner as to their true intent (but maybe that's just me).

    Look forward to it.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://184183]
Approved by FoxtrotUniform
Front-paged by tadman
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (6)
As of 2014-08-30 08:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (291 votes), past polls