The cost of unchecked best practices
by moritz (Cardinal) on Mar 19, 2008 at 11:22 UTC
A recent node caught my attention: a fellow monk restructured a regex, and used a different "quoting" mechanism for special characters: [|] instead of \|.
I know that TheDamian's "Perl Best Practices" recommends that, at least for whitespace in regexes with the /x modifier.
IMHO this is quite dangerous, because it changes how the regex is treated. Only very slightly, but a character class with one element isn't handled identically to a literal character by perl, even though both match the same strings.
Another piece of advice in PBP is to use the modifiers /s and /m on all regexes, and if you really mean "everything but a newline", to say so explicitly with [^\n].
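For readers who don't have the /s behaviour in their head, here is a minimal sketch of what that advice means in practice. The test string and variable names are made up for illustration and are not from PBP or from the node mentioned above.

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Without /s, the dot stops at the first newline.
my ($plain) = $text =~ /\A(.*)/;

# With /s, the dot matches the newline as well.
my ($dotall) = $text =~ /\A(.*)/s;

# The explicit class gives the "no newline" behaviour even under /s.
my ($explicit) = $text =~ /\A([^\n]*)/s;

printf "plain=%d dotall=%d explicit=%d chars\n",
    length $plain, length $dotall, length $explicit;
```

So under /s, [^\n] is the spelling that keeps the old meaning of the dot, which is why the two forms get compared here.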
I crafted a regex and a test string that show the differences (tested with perl 5.8.8 on CentOS).
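The original regex, test string and timings are not reproduced here. As a rough stand-in, a comparison along these lines could be set up with the core Benchmark module. The patterns, the test string and the variant names below are my own assumptions, chosen only so that the literal pipe never occurs in the input, and the numbers will depend on the perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: a long string that never contains the pipe,
# so every variant has to fail.
my $text = 'a' x 2_000;

# Four hypothetical variants: escaped literal vs. one-element class,
# and dot vs. explicit "not a newline" class.
my %patterns = (
    'dot_escaped'      => qr/.+\|/,
    'dot_class'        => qr/.+[|]/,
    'negclass_escaped' => qr/[^\n]+\|/,
    'negclass_class'   => qr/[^\n]+[|]/,
);

# Wrap each pattern in a sub that performs one match attempt.
my %tests;
for my $name (keys %patterns) {
    my $re = $patterns{$name};
    $tests{$name} = sub { my $matched = $text =~ $re; return $matched };
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%tests);
```

All four patterns accept and reject exactly the same strings, so any difference in the table comes from how the engine handles them, not from what they match.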
I don't have to comment on the speed difference, it's obvious.
What might not be obvious is the reason: the regex engine is smart enough to factor out string literals in the regex, then uses a fast search for literal substrings (the same algorithm that index uses), and anchors the regex to occurrences of this literal substring.
This optimization is not performed for char classes with single entries.
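To make the idea concrete, here is a hand-rolled sketch of the same strategy done in user code: find the literal with index first, and only run the regex parts around each hit. This is just an illustration of the principle, not how the engine is actually implemented, and find_with_prefilter and the \d+\|\d+ pattern are names and choices I made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: find all matches of /\d+\|\d+/ by locating the
# literal "|" with index() and checking the digits on either side.
sub find_with_prefilter {
    my ($text) = @_;
    my @found;
    my $pos = 0;
    while ((my $hit = index($text, '|', $pos)) >= 0) {
        my $before = substr($text, 0, $hit);   # everything left of the pipe
        my $after  = substr($text, $hit + 1);  # everything right of the pipe
        if ($before =~ /(\d+)\z/) {
            my $left = $1;                     # save before the next match
            if ($after =~ /\A(\d+)/) {
                push @found, "$left|$1";
            }
        }
        $pos = $hit + 1;                       # continue after this pipe
    }
    return @found;
}

print join(', ', find_with_prefilter('x 12|34 y | z 5|6')), "\n";
```

If the literal never occurs, the loop body never runs and no regex is tried at all, which is the cheap early exit that a one-element character class does not get.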
The difference between . and [^\n] seems to be due to an optimization that is specific to the dot char.
So what you can learn from this is: Don't apply "Best Practices" without fully understanding the underlying mechanisms.
The regex and test data were chosen to show the difference, and it might not affect "real world" regexes to that extent. Most of the time. But if you're not careful (and out of luck), you might be hit by this.