perlmeditation
Yary
<p>We all know how to match "this or that"-<p>
<code>/this|that/</code>
<p>And we can match "this and not that" simply-<p>
<code>/this/ && !/that/</code>
<p>Once in a while I'm faced with a library that wants me to provide a regexp, and I need it to test a "this and not that" pair of patterns. (If it's my own code, it will accept a coderef like <code>sub {/this/ && !/that/}</code> as a condition, and all is well...) One easy-to-explain workaround is to use a code-conditional pattern:</p>
<code>m:^(?(?{/this/ && !/that/})|(?!)):</code>
<p>or, more verbosely,</p>
<code>m: # use : as delims so / in code won't hurt
^ # Anchor it at start, don't need to backtrack
(? # Start a conditional test
(?{/this/ && !/that/}) # What to test-
# By itself, a code assertion always succeeds,
# That's why we wrap the code inside a conditional
|(?!) # Simplest "T|F" - match succeed, nonmatch fails
) # End the condition
:x
</code>
<p>And it works. Still, as I was settling down one evening a few nights ago, I got to thinking of alternatives. We can use look-ahead assertions:</p>
<code>/(?=.*this)(?!.*that)/s</code>
<p>which is fairly easy for me to read. It doesn't need an anchor, the sub-patterns will cause a match or fail at the first position- in fact the anchor slows it down since it is a useless extra operation in this regexp. The <code>s</code> modifier at the end is so the <code>.*</code> won't stop at a line break.</p>
<p>Then I had a thought going back to alternation: <code>/this|that/</code>, and a class I had on logic. Another way of saying "this and that" is "not(not this or not that)". Which leads to</p>
<code>/(?!(?!.*this)|(?=.*that))/s</code>
<p>There might be a better way of expressing it, but my time is limited...</p>
<p>Speaking of time, I tossed them all into a benchmarking script, and ran them on the machine I'm typing from which runs perl 5.12.1, and am a little surprised by the results.</p>
<code>
#!/usr/bin/env perl
use Benchmark 'cmpthese';
my %ways =
(
'3calls' => sub {/a/ && !/b/ && /c/},
code => sub {m:^(?(?{/a/ && !/b/ && /c/})|(?!)):},
pos_look => sub {/(?=.*a)(?!.*b)(?=.*c)/s},
neg_look => sub {/(?!(?!.*a)|(?=.*b)|(?!.*c))/s},
neg_anch => sub {/^(?!(?!.*a)|(?=.*b)|(?!.*c))/s}
);
# Minimal check that the regexps do what they're supposed to
while (my ($way, $sub)=each %ways) {
die "$way failed to match\n" unless ($_ = 'cat') && &$sub;
die "$way had a false positive\n" if ($_= 'cab') && &$sub;
}
print "For matching:\n";
$_="find your cat";
cmpthese(-1, \%ways);
print "For non-matching:\n";
$_="don't find a cab";
cmpthese(-1, \%ways);
__END__
Rate code neg_look neg_anch pos_anch pos_look 3calls
code 674093/s -- -52% -53% -63% -63% -87%
neg_look 1417081/s 110% -- -1% -21% -22% -73%
neg_anch 1429846/s 112% 1% -- -21% -21% -73%
pos_anch 1798791/s 167% 27% 26% -- -1% -66%
pos_look 1810887/s 169% 28% 27% 1% -- -66%
3calls 5338256/s 692% 277% 273% 197% 195% --
For non-matching:
Rate code neg_anch neg_look pos_anch pos_look 3calls
code 674093/s -- -55% -57% -63% -65% -88%
neg_anch 1495898/s 122% -- -5% -17% -22% -73%
neg_look 1570933/s 133% 5% -- -13% -18% -72%
pos_anch 1806441/s 168% 21% 15% -- -5% -67%
pos_look 1906963/s 183% 27% 21% 6% -- -65%
3calls 5522908/s 719% 269% 252% 206% 190% --
</code>
<p>What surprises me is both how much faster the three separate matches <code>3calls</code> is than any single-matching alternative, and how much slower the "<code>code</code>" regexp is. The quick <code>/a/ && !/b/</code> is due to the optimizing that the simpler patterns can undergo, but I guess that's dwarfed by the overhead of setting up an eval-inside-a-regexp.</p>
<p>And you, dear readers, have you any other favorite ways of saying "match this and not that" in a single regexp?</p>