We all know how to match "this or that"-
/this|that/
And we can match "this and not that" simply-
/this/ && !/that/
Once in a while I'm faced with a library that wants me to provide a regexp, and I need it to test a "this and not that" pair of patterns. (If it's my own code, it will accept a coderef like sub {/this/ && !/that/} as a condition, and all is well...) One easy-to-explain workaround is to use a code-conditional pattern:
m:^(?(?{/this/ && !/that/})|(?!)):
or, more verbosely,
m: # use : as delims so / in code won't hurt
^ # Anchor it at start, don't need to backtrack
(? # Start a conditional test
(?{/this/ && !/that/}) # What to test-
# By itself, a code assertion always succeeds,
# That's why we wrap the code inside a conditional
|(?!) # Simplest "T|F" - match succeed, nonmatch fails
) # End the condition
:x
And it works. Still, as I was settling down one evening a few nights ago, I got to thinking of alternatives. We can use look-ahead assertions:
/(?=.*this)(?!.*that)/s
which is fairly easy for me to read. It doesn't need an anchor, the sub-patterns will cause a match or fail at the first position- in fact the anchor slows it down since it is a useless extra operation in this regexp. The s modifier at the end is so the .* won't stop at a line break.
Then I had a thought going back to alternation: /this|that/, and a class I had on logic. Another way of saying "this and that" is "not(not this or not that)". Which leads to
/(?!(?!.*this)|(?=.*that))/s
There might be a better way of expressing it, but my time is limited...
Speaking of time, I tossed them all into a benchmarking script, and ran them on the machine I'm typing from which runs perl 5.12.1, and am a little surprised by the results.
#!/usr/bin/env perl
use Benchmark 'cmpthese';
my %ways =
(
'3calls' => sub {/a/ && !/b/ && /c/},
code => sub {m:^(?(?{/a/ && !/b/ && /c/})|(?!)):},
pos_look => sub {/(?=.*a)(?!.*b)(?=.*c)/s},
neg_look => sub {/(?!(?!.*a)|(?=.*b)|(?!.*c))/s},
neg_anch => sub {/^(?!(?!.*a)|(?=.*b)|(?!.*c))/s}
);
# Minimal check that the regexps do what they're supposed to
while (my ($way, $sub)=each %ways) {
die "$way failed to match\n" unless ($_ = 'cat') && &$sub;
die "$way had a false positive\n" if ($_= 'cab') && &$sub;
}
print "For matching:\n";
$_="find your cat";
cmpthese(-1, \%ways);
print "For non-matching:\n";
$_="don't find a cab";
cmpthese(-1, \%ways);
__END__
Rate code neg_look neg_anch pos_anch pos_look 3calls
code 674093/s -- -52% -53% -63% -63% -87%
neg_look 1417081/s 110% -- -1% -21% -22% -73%
neg_anch 1429846/s 112% 1% -- -21% -21% -73%
pos_anch 1798791/s 167% 27% 26% -- -1% -66%
pos_look 1810887/s 169% 28% 27% 1% -- -66%
3calls 5338256/s 692% 277% 273% 197% 195% --
For non-matching:
Rate code neg_anch neg_look pos_anch pos_look 3calls
code 674093/s -- -55% -57% -63% -65% -88%
neg_anch 1495898/s 122% -- -5% -17% -22% -73%
neg_look 1570933/s 133% 5% -- -13% -18% -72%
pos_anch 1806441/s 168% 21% 15% -- -5% -67%
pos_look 1906963/s 183% 27% 21% 6% -- -65%
3calls 5522908/s 719% 269% 252% 206% 190% --
What surprises me is both how much faster the three separate matches 3calls is than any single-matching alternative, and how much slower the "code" regexp is. The quick /a/ && !/b/ is due to the optimizing that the simpler patterns can undergo, but I guess that's dwarfed by the overhead of setting up an eval-inside-a-regexp.
And you, dear readers, have you any other favorite ways of saying "match this and not that" in a single regexp?
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.