Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Matching and nonmatching multiple regexps at once

by Yary (Pilgrim)
on Apr 20, 2012 at 14:44 UTC ( [id://966203]=perlmeditation: print w/replies, xml ) Need Help??

We all know how to match "this or that"-

/this|that/

And we can match "this and not that" simply-

/this/ && !/that/

Once in a while I'm faced with a library that wants me to provide a regexp, and I need it to test a "this and not that" pair of patterns. (If it's my own code, it will accept a coderef like sub {/this/ && !/that/} as a condition, and all is well...) One easy-to-explain workaround is to use a code-conditional pattern:

m:^(?(?{/this/ && !/that/})|(?!)):

or, more verbosely,

m: # use : as delims so / in code won't hurt ^ # Anchor it at start, don't need to backtrack (? # Start a conditional test (?{/this/ && !/that/}) # What to test- # By itself, a code assertion always succeeds, # That's why we wrap the code inside a conditional |(?!) # Simplest "T|F" - match succeed, nonmatch fails ) # End the condition :x

And it works. Still, as I was settling down one evening a few nights ago, I got to thinking of alternatives. We can use look-ahead assertions:

/(?=.*this)(?!.*that)/s

which is fairly easy for me to read. It doesn't need an anchor, the sub-patterns will cause a match or fail at the first position- in fact the anchor slows it down since it is a useless extra operation in this regexp. The s modifier at the end is so the .* won't stop at a line break.

Then I had a thought going back to alternation: /this|that/, and a class I had on logic. Another way of saying "this and that" is "not(not this or not that)". Which leads to

/(?!(?!.*this)|(?=.*that))/s

There might be a better way of expressing it, but my time is limited...

Speaking of time, I tossed them all into a benchmarking script, and ran them on the machine I'm typing from which runs perl 5.12.1, and am a little surprised by the results.

#!/usr/bin/env perl use Benchmark 'cmpthese'; my %ways = ( '3calls' => sub {/a/ && !/b/ && /c/}, code => sub {m:^(?(?{/a/ && !/b/ && /c/})|(?!)):}, pos_look => sub {/(?=.*a)(?!.*b)(?=.*c)/s}, neg_look => sub {/(?!(?!.*a)|(?=.*b)|(?!.*c))/s}, neg_anch => sub {/^(?!(?!.*a)|(?=.*b)|(?!.*c))/s} ); # Minimal check that the regexps do what they're supposed to while (my ($way, $sub)=each %ways) { die "$way failed to match\n" unless ($_ = 'cat') && &$sub; die "$way had a false positive\n" if ($_= 'cab') && &$sub; } print "For matching:\n"; $_="find your cat"; cmpthese(-1, \%ways); print "For non-matching:\n"; $_="don't find a cab"; cmpthese(-1, \%ways); __END__ Rate code neg_look neg_anch pos_anch pos_look 3calls code 674093/s -- -52% -53% -63% -63% -87% neg_look 1417081/s 110% -- -1% -21% -22% -73% neg_anch 1429846/s 112% 1% -- -21% -21% -73% pos_anch 1798791/s 167% 27% 26% -- -1% -66% pos_look 1810887/s 169% 28% 27% 1% -- -66% 3calls 5338256/s 692% 277% 273% 197% 195% -- For non-matching: Rate code neg_anch neg_look pos_anch pos_look 3calls code 674093/s -- -55% -57% -63% -65% -88% neg_anch 1495898/s 122% -- -5% -17% -22% -73% neg_look 1570933/s 133% 5% -- -13% -18% -72% pos_anch 1806441/s 168% 21% 15% -- -5% -67% pos_look 1906963/s 183% 27% 21% 6% -- -65% 3calls 5522908/s 719% 269% 252% 206% 190% --

What surprises me is both how much faster the three separate matches 3calls is than any single-matching alternative, and how much slower the "code" regexp is. The quick /a/ && !/b/ is due to the optimizing that the simpler patterns can undergo, but I guess that's dwarfed by the overhead of setting up an eval-inside-a-regexp.

And you, dear readers, have you any other favorite ways of saying "match this and not that" in a single regexp?

Replies are listed 'Best First'.
Re: Matching and nonmatching multiple regexps at once
by BrowserUk (Patriarch) on Apr 20, 2012 at 15:19 UTC

    For single characters, add this to your benchmark, it is faster than the 3-regex method:

    x => sub {/^(?:[ac]|[^b])+$/s},

    I realise that yours are more general and should handle more than single characters, but if that is your target application, you shouldn't be using single chars in your benchmark as it is quite likely to give you false expectations.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Matching and nonmatching multiple regexps at once
by eyepopslikeamosquito (Archbishop) on Apr 21, 2012 at 00:53 UTC

    This topic is discussed in detail in the Perl Cookbook recipe 6.18 "Expressing AND, OR, and NOT in a Single Pattern".

Re: Matching and nonmatching multiple regexps at once
by JavaFan (Canon) on Apr 20, 2012 at 15:18 UTC
    /that (*COMMIT) (*FAIL) | this .* that (*COMMIT) (*FAIL) | this/xs;
    Note that this doesn't work correctly on 5.12 (adding some parens around 'this .* that' makes it work though), but it does work correctly in 5.14.2.
Re: Matching and nonmatching multiple regexps at once
by Yary (Pilgrim) on Apr 22, 2012 at 14:40 UTC

    All excellent comments from JavaFan, BrowserUk, and eyepopslikeamosquito.

    I hadn't considered the COMMIT/FAIL, partly because this came up when writing tests for some perl 5.8 code. It won't find overlapping matches, unlike the other tests. And I have to think about it more when used in the more general case. Still I'd like to use and try it out, so I'm adding it to the test suite.

    As for BrowserUk's observation, yes character classes are preferable when looking for single characters, but this was meant for less trivial "atoms"... I will re-write the tests to be more realistic.

    I'm glad that one of my solutions is exactly what The Perl Cookbook recommends, and the cookbook explains it better and talks about overlapping vs non-overlapping cases. Still, we have come up with some variations it does not cover.

    New benchmark, taking the above into account, making the things to match/not match a little easier to modify:

    #!/usr/bin/env perl use Benchmark 'cmpthese'; my ($t1,$f,$t2)=(qr/people/,qr/babi?es/,qr/health(?:y|ful)/); my %ways = ( '3calls' => sub {/$t1/o && !/$f/o && /$t2/o}, code => sub {m:^(?(?{/$t1/o && !/$f/o && /$t2/o})|(?!)):}, pos_look => sub {/(?=.*$t1)(?!.*$f)(?=.*$t2)/so}, pos_anch => sub {/^(?=.*$t1)(?!.*$f)(?=.*$t2)/so}, neg_look => sub {/(?!(?!.*$t1)|(?=.*$f)|(?!.*$t2))/so}, neg_anch => sub {/^(?!(?!.*$t1)|(?=.*$f)|(?!.*$t2))/so}, commit => sub {/$f(*COMMIT)(*FAIL)|^(?=.*$t1)(?=.*$t2).*$f(*COMMIT) +(*FAIL)|(?=.*$t1)(?=.*$t2)/so} ); while (my ($way, $sub)=each %ways) { die "$way failed to match\n" unless ($_ = 'healthy people') && &$sub +; die "$way had a false positive\n" if ($_= 'healthy people, including + babies') && &$sub; } print "For matching:\n"; $_="Our people eat heathful meals."; cmpthese(-1, \%ways); print "For non-matching:\n"; $_="Healthy people have more babes in arms"; cmpthese(-1, \%ways); __END__

    For matching:

    Rate commit code neg_look neg_anch pos_look pos_anch 3calls
    commit406208/s---40%-64%-66%-69%-70%-91%
    code675893/s66%---40%-44%-49%-50%-85%
    neg_look1128170/s178%67%---7%-14%-16%-75%
    neg_anch1212346/s198%79%7%---8%-10%-73%
    pos_look1312820/s223%94%16%8%---2%-71%
    pos_anch1344229/s231%99%19%11%2%---71%
    3calls4567476/s1024%576%305%277%248%240%--

    For non-matching:

    Ratecommitcodeneg_anchneg_lookpos_lookpos_anch3calls
    commit418623/s---36%-65%-68%-68%-70%-90%
    code654350/s56%---45%-49%-50%-54%-85%
    neg_anch1180978/s182%80%---9%-10%-17%-73%
    neg_look1294154/s209%98%10%---1%-9%-70%
    pos_look1308383/s213%100%11%1%---8%-70%
    pos_anch1418108/s239%117%20%10%8%---68%
    3calls4365151/s943%567%270%237%234%208%--

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://966203]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (2)
As of 2024-04-26 07:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found