Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: Re: Ovid, Long Live .*? (dot star question-mark)

by danger (Priest)
on Sep 06, 2001 at 05:46 UTC ( [id://110472]=note: print w/replies, xml ) Need Help??


in reply to Re: Ovid, Long Live .*? (dot star question-mark)
in thread Ovid, Long Live .*? (dot star question-mark)

However, *capturing* just a non-greedy dot star will still suffer from having to test the remaining pattern (outside of the parens) at each step. Thus, the negated character class will perform a lot better in the following:

cc => sub { "this is an amazingly long string" =~ /\s([^l]*)l/ }, ds => sub { "this is an amazingly long string" =~ /\s(.*?)l/ },

However, both approaches *express* different things (they just happen to functionally coincide in the above). For some things, .*? is the right approach, for others, a negated character class is the right approach.

And, to add to japhy's additional warning regarding the stricter meaning of a negated character class, I'll offer another example. For those who do not see the potential difference in meaning and use of each approach, consider the following contrived example: I want to match (and extract) the first two fields of colon separated data, but only when the third field starts with an 'A' (let's not worry about whether split() would be a better approach for a minute):

#!/usr/bin/perl -w use strict; my %data; while(<DATA>){ next unless m/^(.*?):(.*?):A/; # non-greedy DS #next unless m/^([^:]*):([^:]*):A/; # negated CC $data{$1} = $2; } while( my($k,$v) = each %data) { print "$k => $v\n"; } __DATA__ abc:123:A:B def:456:A:C ghi:789:B:A jkl:000:C:C OUTPUT: non-greedy DS: abc => 123 def => 456 ghi => 789:B negated CC: abc => 123 def => 456

The non-greedy DS version doesn't work according the spec (only the first two lines have an 'A' in the 3rd field). That's because dot star part in (.*?): does not say "match only up to the next colon" (as some people occassionally believe it does), it says: "match as few (of *any* characters1) as we can and still have the remainder of the expression match". When the whole pattern is (.*?):, the end result (aside from efficiency) is the same --- but if the pattern that follows is more than a single character, things are not at all the same as a negated character class.

I only wanted to reiterate this because I've often seen beginners and more experienced programmer's make the mistake of thinking that the non-greedy dot star and a negated character class are interchangeable, and they simply aren't.

[1] well 'any character' except a newline, unless /s

Replies are listed 'Best First'.
Re: Re: Re: Ovid, Long Live .*? (dot star question-mark)
by japhy (Canon) on Sep 06, 2001 at 05:56 UTC
    Wow. That sucks about capturing. I guess the engine is really specific about that optimization. There's got to be a reason that capturing is so bad like that.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Well if you want to make that capture fast, you need to guarantee that at no point in the RE will you use a backreference back to that captured pattern.

      While danger's pattern could be safely optimized, the following one cannot be:

      /foo(.*?)bar and stuff then \1 here/
      Remember that matching arbitrary REs with internal backreferences is an NP complete problem. Matching REs without is not, and an efficient solution is a DFA. As I explained to you, it would be possible to make NFAs efficient on patterns that are solvable by a DFA. But as soon as you start in on capturing and backreferences, you have to throw out pretty much all hope for non-trivial optimizations.
Capturing .* and .*? is fast now. :)
by japhy (Canon) on Sep 06, 2001 at 20:32 UTC
    I found out why captuuring them does not activate the optimisation. It has to do with the regex tree. A regex like /.*?foo/ becomes:
    1 MINMOD(2) 2 STAR(4) 3 REG_ANY(0) 4 EXACT <foo>(6) 6 END
    whereas the tree for /(.*?)foo/ is:
    1 OPEN1(3) 3 MINMOD(4) 4 STAR(6) 5 REG_ANY(0) 6 CLOSE1(8) 8 EXACT <foo>(10) 10 END
    In the non-capturing version, the node directly after STAR is EXACT. In the capturing version, the node is CLOSE. I patched regexec.c to allow for a CLOSE in between the two.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://110472]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (9)
As of 2024-04-23 21:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found