http://www.perlmonks.org?node_id=674973

A recent node caught my attention: a fellow monk restructured a regex, and used a different "quoting" mechanism for special characters: [|] instead of \|.

I know that TheDamian's "Perl Best Practices" recommends that, at least for whitespace in regexes with the /x modifier.

IMHO this is quite dangerous, because it changes the semantics. Only very slightly, but a character class with one element isn't handled identically to a literal char in perl.
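To make the two spellings concrete, here is a minimal sketch (the sample string and variable names are made up for illustration). Both forms accept the same input; the difference I'm worried about is in how the engine treats them internally, which this snippet does not show.

```perl
use strict;
use warnings;

# Hypothetical sample data, only for illustration.
my $record = 'alpha|beta|gamma';

my @by_escape = split /\|/,  $record;   # backslash-escaped pipe
my @by_class  = split /[|]/, $record;   # single-element character class

print "escape: @by_escape\n";
print "class:  @by_class\n";
```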

Another piece of advice in PBP is to use the modifiers /s and /m on all regexes, and if you really mean "everything but a newline", you should say that explicitly with [^\n].
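A small sketch of what that advice means in practice (the test string is my own, not from PBP): under /s the dot also matches a newline, so [^\n] is the explicit way to keep the old meaning.

```perl
use strict;
use warnings;

# Illustrative two-line string.
my $text = "foo\nbar";

my ($dot_plain) = $text =~ /(.+)/;        # no /s: dot stops at the newline
my ($dot_s)     = $text =~ /(.+)/s;       # /s: dot matches the newline too
my ($not_nl)    = $text =~ /([^\n]+)/s;   # explicit class, unaffected by /s

printf "plain dot matched %d chars\n", length $dot_plain;
printf "dot with /s matched %d chars\n", length $dot_s;
printf "[^\\n] with /s matched %d chars\n", length $not_nl;
```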

I crafted a regex and a test string that show the differences:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $line = ('a' x 500) . ' ' . ('a' x 20);
    use Benchmark qw( cmpthese );
    cmpthese -2, {
        literal     => sub {$line =~ /a .{1,10} \ /x },
        class       => sub {$line =~ /a .{1,10} [ ]/x},
        class_nodot => sub {$line =~ /a [^\n]{1,10} [ ]/smx },
    };
    __END__
                    Rate class_nodot       class     literal
    class_nodot   2530/s          --        -26%       -100%
    class         3413/s         35%          --       -100%
    literal     718209/s      28289%      20942%          --

(tested with perl 5.8.8 on CentOS).

I don't have to comment on the speed difference, it's obvious.

What might not be obvious is the reason: the regex engine is smart enough to factor out string literals in the regex. It then uses a fast search for those literal substrings (the same algorithm that index uses), and anchors the regex to occurrences of the literal substring.
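As a rough imitation of that idea in plain Perl (this is a hand-written sketch, not what the engine does internally; match_with_prefilter is a name I made up): look for the required literal with index first, and only run the regex when it is present.

```perl
use strict;
use warnings;

my $line = ('a' x 500) . ' ' . ('a' x 20);

# Hypothetical helper: reject strings that cannot match because the
# required literal space is missing, before the regex is even tried.
sub match_with_prefilter {
    my ($str) = @_;
    return 0 if index($str, ' ') < 0;          # cheap literal-substring check
    return $str =~ /a .{1,10} \ /x ? 1 : 0;    # full regex only when needed
}

print "with space:    ", match_with_prefilter($line), "\n";
print "without space: ", match_with_prefilter('a' x 520), "\n";
```

The real optimization goes further, since it also uses the position of the literal to decide where to start matching, but the early rejection is the part that is easy to show by hand.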

This optimization is not performed for char classes with single entries (at least not in the perl version I tested).
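If you want to check what your own perl does, the re module has a regmust function that reports the fixed strings the optimizer found. Note the assumptions: regmust needs perl 5.10 or newer (so not the 5.8.8 used above), fixed_strings is just a wrapper name I picked, and I'm not quoting output because it may differ between perl versions.

```perl
use strict;
use warnings;
use re qw(regmust);   # assumes perl 5.10 or newer

# Hypothetical wrapper: return the longest anchored and floating fixed
# strings that the optimizer extracted from a compiled regex.
sub fixed_strings {
    my ($qr) = @_;
    my ($anchored, $floating) = regmust($qr);
    return { anchored => $anchored, floating => $floating };
}

for my $qr (qr/a .{1,10} \ /x, qr/a .{1,10} [ ]/x) {
    my $fs = fixed_strings($qr);
    printf "%-25s anchored=%s floating=%s\n", $qr,
        map { defined $_ ? "'$_'" : 'undef' } @{$fs}{qw(anchored floating)};
}
```

If the class version reports the same fixed strings as the literal version, your perl treats the single-element class like a literal and the slowdown shown above should not apply there.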

The difference between . and [^\n] seems to be due to an optimization that is specific to the dot char.

So what you can learn from this is: Don't apply "Best Practices" without fully understanding the underlying mechanisms.

The regex and test data were chosen to show the difference, and it might not affect "real world" regexes to that extent. Most of the time. But if you're not careful (and out of luck), you might be hit by this.