http://www.perlmonks.org?node_id=674973

A recent node caught my attention: a fellow monk restructured a regex, and used a different "quoting" mechanism for special characters: [|] instead of \|.

I know that TheDamian's "Perl Best Practices" recommends that, at least for whitespace in regexes with the /x modifier.

IMHO this is quite dangerous, because it changes the semantics. Only very slightly, but a character class with one element isn't handled identically to a literal char in perl.

Another piece of advice in PBP is to use the modifiers /s and /m on all regexes, and if you really mean "everything but a newline", to say that explicitly.
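
A minimal sketch (my own illustration, not from the book) of what that advice means in practice:

#!/usr/bin/perl
use strict;
use warnings;

my $text = "foo\nbar";

# Without /s the dot refuses to match the newline:
print "plain dot:   ", ( $text =~ /foo.bar/  ? "match" : "no match" ), "\n";

# With /s the dot really matches any character, newline included:
print "dot with /s: ", ( $text =~ /foo.bar/s ? "match" : "no match" ), "\n";

# If you really mean "anything but a newline", say so explicitly:
print "explicit:    ", ( $text =~ /foo[^\n]bar/s ? "match" : "no match" ), "\n";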

I crafted a regex and a test string that show the differences:

#!/usr/bin/perl
use strict;
use warnings;

my $line = ('a' x 500) . ' ' . ('a' x 20);

use Benchmark qw( cmpthese );

cmpthese -2, {
    literal     => sub { $line =~ /a .{1,10} \ /x },
    class       => sub { $line =~ /a .{1,10} [ ]/x },
    class_nodot => sub { $line =~ /a [^\n]{1,10} [ ]/smx },
};
__END__
                Rate class_nodot   class  literal
class_nodot   2530/s          --    -26%    -100%
class         3413/s         35%      --    -100%
literal     718209/s      28289%  20942%       --

(tested with perl 5.8.8 on CentOS).

I don't have to comment on the speed difference; it's obvious.

What might not be obvious is the reason: the regex engine is smart enough to factor out string literals in the regex, then uses a fast search for literal substrings (the same algorithm that index uses), and anchors the regex to occurrences of this literal substring.

This optimization is not performed for char classes with single entries.
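
If you want to see what the engine factored out of a particular pattern, the re pragma can dump its analysis; a rough sketch of how to look (the exact output differs between perl versions):

#!/usr/bin/perl
use strict;
use warnings;
use re 'debug';    # dump the compiled program and optimizer info to STDERR

# When the literal-substring optimization applies, the dump contains a
# line like:  anchored "a" at 0 floating " " at 2..11 ... minlen 3
my $with_literal = qr/a .{1,10} \ /x;
my $with_class   = qr/a .{1,10} [ ]/x;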

The difference between . and [^\n] seems to be an optimization that is specific to the dot character.

So what you can learn from this is: Don't apply "Best Practices" without fully understanding the underlying mechanisms.

The regex and test data were chosen to show the difference, and it might not affect "real world" regexes to that extent, most of the time. But if you're not careful (and out of luck), you might be hit by this.

Re: The cost of unchecked best practices
by ikegami (Patriarch) on Mar 19, 2008 at 11:40 UTC

    Off topic: Perhaps single char positive character classes should be optimized to be equivalent to a literal character? Perhaps something can be done about single char negative character classes too.

    Update: Corion sent me a message saying he thought this was already done in 5.10, so I put it to the test. Indeed, it is.

    Test code used (same as the OP, except it avoids the extra sub calls):

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark qw( cmpthese );

    our $line = ('a' x 500) . ' ' . ('a' x 20);

    sub code { "use strict; use warnings; $_[0]; 1;" }

    cmpthese -2, {
        literal     => code(' our $line =~ / a .{1,10} \ /x; '),
        class       => code(' our $line =~ / a .{1,10} [ ] /x; '),
        class_nodot => code(' our $line =~ / a [^\n]{1,10} [ ] /smx; '),
    };
    >c:\progs\perl588\bin\perl 674979.pl
                    Rate class_nodot   class  literal
    class_nodot   2206/s          --    -44%    -100%
    class         3908/s         77%      --    -100%
    literal     805341/s      36409%  20507%       --

    >c:\progs\perl5100\bin\perl 674979.pl
                    Rate class_nodot  literal    class
    class_nodot 618394/s          --     -12%     -13%
    literal     704734/s         14%       --      -1%
    class       708360/s         15%       1%       --
      Does that mean that, apart from these optimizations, the regex engine actually got slower in other cases? I see that the benchmark for "literal" dropped from 8E5/s to 7E5/s. That's a speed drop of about 12%. I assume these benchmarks were run on the same computer...
        Yes, same computer. I just reran the tests (3 times each, one after the other), and got the same results as before:
        >c:\progs\perl588\bin\perl 674979.pl
                        Rate class_nodot   class  literal
        class_nodot   3037/s          --    -24%    -100%
        class         4008/s         32%      --    -100%
        literal     859954/s      28218%  21354%       --

        >c:\progs\perl588\bin\perl 674979.pl
                        Rate class_nodot   class  literal
        class_nodot   3041/s          --    -26%    -100%
        class         4087/s         34%      --    -100%
        literal     868745/s      28464%  21158%       --

        >c:\progs\perl588\bin\perl 674979.pl
                        Rate class_nodot   class  literal
        class_nodot   3041/s          --    -26%    -100%
        class         4117/s         35%      --    -100%
        literal     834248/s      27329%  20162%       --

        >c:\progs\perl5100\bin\perl 674979.pl
                        Rate class_nodot  literal    class
        class_nodot 667282/s          --     -10%     -11%
        literal     741037/s         11%       --      -1%
        class       747754/s         12%       1%       --

        >c:\progs\perl5100\bin\perl 674979.pl
                        Rate class_nodot  literal    class
        class_nodot 667704/s          --     -10%     -12%
        literal     742924/s         11%       --      -2%
        class       758776/s         14%       2%       --

        >c:\progs\perl5100\bin\perl 674979.pl
                        Rate class_nodot  literal    class
        class_nodot 657342/s          --     -11%     -14%
        literal     740217/s         13%       --      -3%
        class       766213/s         17%       4%       --

        So yes, something is slower. Not necessarily the regexp, but it's likely.

Re: The cost of unchecked best practices
by Corion (Patriarch) on Mar 19, 2008 at 11:47 UTC

    This is not (as much) the case on 5.10:

    Q:\>perl -w tmp.pl
                    Rate class_nodot  literal    class
    class_nodot 459753/s          --     -11%     -12%
    literal     518377/s         13%       --      -1%
    class       523687/s         14%       1%       --

    Q:\>perl -v

    This is perl, v5.10.0 built for MSWin32-x86-multi-thread

    Copyright 1987-2007, Larry Wall

    Perl may be copied only under the terms of either the Artistic License
    or the GNU General Public License, which may be found in the Perl 5
    source kit.

    Complete documentation for Perl, including FAQ lists, should be found
    on this system using "man perl" or "perldoc perl".  If you have access
    to the Internet, point your browser at http://www.perl.org/, the Perl
    Home Page.

    And at least as far as readability is concerned, I prefer writing the class instead of backslashing stuff, myself.

      Thank you, good to know that the regex engine has been improved so much for 5.10.

      It's interesting that both your benchmark and ikegami's show the 'class' example to be slightly faster than the 'literal' one.

      Any ideas why? Could somebody repeat the benchmark with more iterations please, to check if it's really faster (and not just randomness)? I have no 5.10 available here, sorry.
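
      For whoever repeats it: the knob is the first argument to cmpthese; a rough sketch based on the benchmark above:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw( cmpthese );

      my $line = ( 'a' x 500 ) . ' ' . ( 'a' x 20 );

      # A negative first argument means "run each case for at least that
      # many CPU seconds", so -10 gives many more iterations (and steadier
      # numbers) than the -2 used above:
      cmpthese( -10, {
          literal => sub { $line =~ /a .{1,10} \ /x },
          class   => sub { $line =~ /a .{1,10} [ ]/x },
      } );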

        For perl5.10

        perl -Mre=debug -e "qr/a .{1,10} \ /x;qr/a .{1,10} [ ]/x;"

        gives

        Compiling REx "a .{1,10} \ "
        Final program:
           1: EXACT <a> (3)
           3: CURLY {1,10} (6)
           5:   REG_ANY (0)
           6: EXACT < > (8)
           8: END (0)
        anchored "a" at 0 floating " " at 2..11 (checking floating) minlen 3
        Compiling REx "a .{1,10} [ ]"
        Final program:
           1: EXACT <a> (3)
           3: CURLY {1,10} (6)
           5:   REG_ANY (0)
           6: EXACT < > (8)
           8: END (0)

        So the two patterns compile to identical programs under 5.10. And indeed, when I take out class_nodot, I get:

                      Rate  literal   class
        literal   531978/s       --     -1%
        class     539600/s       1%      --
        A 1% difference is meaningless. Many factors can shift a benchmark by a few percent; seeing fluctuations of 5% is not unusual when running the same benchmark repeatedly.
Re: The cost of unchecked best practices
by Tux (Canon) on Mar 19, 2008 at 12:04 UTC

    That might also depend on your architecture, as my 5.8.8 doesn't show those huge differences on Linux, but YMMV:

    v5.8.8 built for x86_64-linux
                     Rate class_nodot   class  literal
    class_nodot  865566/s          --    -17%     -19%
    class       1047607/s         21%      --      -2%
    literal     1068831/s         23%      2%       --

    v5.11.0 DEVEL33542 built for x86_64-linux-thread-multi
                    Rate class_nodot  literal    class
    class_nodot 782508/s          --      -9%     -10%
    literal     861502/s         10%       --      -0%
    class       865566/s         11%       0%       --

    v5.10.0 built for i686-linux-64int
                    Rate    class class_nodot  literal
    class       808928/s       --         -2%     -17%
    class_nodot 822085/s       2%          --     -15%
    literal     969189/s      20%         18%       --

    v5.8.8 built for i686-linux-64int
                     Rate class_nodot   class  literal
    class_nodot    3937/s          --    -23%    -100%
    class          5111/s         30%      --    -100%
    literal     1065584/s      26967%  20750%       --

    v5.8.7 built for i686-linux-64int
                    Rate class_nodot   class  literal
    class_nodot   2576/s          --    -32%     -99%
    class         3811/s         48%      --     -99%
    literal     340787/s      13131%   8841%       --

    v5.8.7 built for IA64.ARCHREV_0-LP64
                    Rate class_nodot   class  literal
    class_nodot   2169/s          --    -18%    -100%
    class         2646/s         22%      --     -99%
    literal     463693/s      21282%  17425%       --

    v5.11.0 DEVEL33495 built for IA64.ARCHREV_0
                    Rate class_nodot  literal    class
    class_nodot 435916/s          --      -9%     -10%
    literal     480089/s         10%       --      -1%
    class       483507/s         11%       1%       --

    v5.10.0 built for IA64.ARCHREV_0
                    Rate class_nodot   class  literal
    class_nodot 233984/s          --      -8%      -8%
    class       254449/s          9%       --      -0%
    literal     254481/s          9%       0%       --

    v5.10.0 built for IA64.ARCHREV_0-LP64
                    Rate class_nodot  literal    class
    class_nodot 269216/s          --      -9%     -10%
    literal     296479/s         10%       --      -1%
    class       299316/s         11%       1%       --

    v5.8.8 built for PA-RISC2.0
                    Rate class_nodot   class  literal
    class_nodot     852/s          --    -26%    -100%
    class          1146/s         34%      --    -100%
    literal     328648/s      38456%  28589%       --

    v5.8.8 built for PA-RISC2.0-LP64
                    Rate class_nodot   class  literal
    class_nodot     901/s          --    -35%    -100%
    class          1393/s         55%      --    -100%
    literal     326127/s      36080%  23315%       --

    ++ for this post BTW. PBP is meant to make you think, not to be applied blindfolded. This is one of those cases.


    Enjoy, Have FUN! H.Merijn
Re: The cost of unchecked best practices
by dragonchild (Archbishop) on Mar 19, 2008 at 14:31 UTC
    The reason for the advice should be mentioned. Best practices exist because of real-world issues that have come up again and again. In this case, the reason for the single-char class is changing requirements when regexes are used as a parsing tool. The other regex best practices fall under the same category. If your regex changes so that now both : and ; are delimiters, then a single-char class is easier to maintain, with a lower chance of error.
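
    A tiny sketch of that maintenance argument (the delimiters here are just for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $record = 'foo:bar;baz';

    # Original requirement: colon is the only delimiter.
    my @colon_only = split /[:]/, $record;    # ("foo", "bar;baz")

    # Requirement change: semicolon becomes a delimiter too.
    # The character class grows by a single character ...
    my @both = split /[;:]/, $record;         # ("foo", "bar", "baz")
    print "@both\n";

    # ... whereas the single-literal version /:/ would have to be
    # rewritten as an alternation or turned into a class anyway.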

    That's all best practices are - things that reduce your chance of error when changing code. It's no different from the C best practice of changing

    if ( foo == 5 )
        call_function( foo );
    into
    if ( 5 == foo ) {
        call_function( foo );
    }
    Not much of a difference at first glance, but those "minor changes" took thousands of man-hours to coalesce and we ignore them at our peril. Don't blindly accept, but look deeper before suggesting a rejection.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      I'll bite: why make that change? Is it so that if you accidentally slip and use "=" you will get an error for trying to assign to a literal?

        There are two changes there. One, yes, is to catch s/==/=/ by making it an error (for Perl, that error is "Can't modify constant item in scalar assignment"). The other is to add braces. This is so that later it's easy to add another line that's within the branch. Otherwise, you might have this:

        if ( 5 == foo )
            call_function( foo );
            call_other_function( foo );

        ...which actually means this:

        if ( 5 == foo ) {
            call_function( foo );
        }
        call_other_function( foo );
        That's exactly the reason. Many people do the same thing in Perl, too, considering we use the same assignment and equals operators.
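
        A minimal Perl sketch of the same safeguard (my own toy example):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $foo = 3;

        # The slip the reversed form guards against:
        #     if ( $foo = 5 ) { ... }   # always true, silently overwrites $foo
        # (perl does warn "Found = in conditional, should be ==" here, but
        # only because the right-hand side is a constant)

        # With the operands reversed, the same slip is a hard compile error:
        #     if ( 5 = $foo ) { ... }   # "Can't modify constant item in scalar assignment"

        if ( 5 == $foo ) {
            print "foo is five\n";
        }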

        Personally, I try to remember to do so during an initial version and I might make the swap if I'm already rewriting or refactoring code. I don't go through existing, working code looking for that and making the change. Changing something that's not in error is error prone in itself, and why take the time?

        Update: I completely missed the braces added at first. That's a good change, and it is another one I try to stick to in C and PHP (which have the same single-statement rule for conditionals) for initial versions. Still, unless I have some reason to change working code, I won't. When I change the code around that section, I'd probably go ahead and add the braces whether or not I was adding another statement. I wouldn't just go into the source and add the braces and not change anything else.

        As an aside, the "blue" language (as opposed to the "Blue" language, apparently) has a single-expression rule for conditionals. All of its conditionals are done using the 'condition ? true : false' operators common to C and Perl, with the false case being purely optional. I'd say braces (or parentheses) are the way to go there, too, unless you're working with a very short single-line statement.

Re: The cost of unchecked best practices
by kyle (Abbot) on Mar 19, 2008 at 15:18 UTC

    Let me say first that I find this kind of discussion of internals and performance impact interesting. A character class with one item is not the same as a literal one item. Fascinating!

    That said, I find this kind of statement irritating:

    Don't apply "Best Practices" without fully understanding the underlying mechanisms.

    The "underlying mechanisms" go pretty deep. You seem to be making reference to the implementation of Perl, in C, but there are mechanisms underlying that too. Should one also understand the underlying mechanisms of the C compiler? There's the OS. There's the hardware. There's physical science that explains how the hardware works.

    These are all underlying mechanisms. There are performance differences in simple arithmetic from architecture to architecture. There are differences from version to version of the compiler. I don't think that it's worth my time to study all these things to figure out the very best way to implement something. I don't feel obligated to plumb the depths of computing to justify my decision to use one practice or another.

    What I think is important is readability and maintainability (after, of course, functionality). The best way to write code is the way that makes it easiest for the programmer and the programmers who come after. At some point, performance becomes a functionality issue. At that point, it's time to do some profiling and optimize where it counts. Before that point, time spent optimizing is time wasted.

    Part of the point of best practices is precisely so that one does not have to learn every detail of what underlies the advice given. Call me selfish and lazy if you like, but I want the benefit of others' experiences without paying the price of those experiences myself. I admit those experiences have value that I might perhaps miss sometimes, but they also have a cost in time and energy. Here again, I don't feel obligated to go exploring a place if someone has already made me a map.

      The "underlying mechanisms" go pretty deep. You seem to be making reference to the implementation of Perl, in C, but there are mechanisms underlying that too. Should one also understand the underlying mechanisms of the C compiler? There's the OS. There's the hardware. There's physical science that explains how the hardware works.

      That's one of the things that makes programming a science and art: you have to know which abstraction levels you care about.

      To decide that, you have to know them, or at least know most of them a little bit.

      Let's stay with the regex example: if you do more than just very simple regexes, you have to know that perl's regex engine is an NFA (which means the order of alternations is important), and that it matches greedily by default.
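
      A toy example of the alternation-order point (my own sketch):

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $str = 'foobar';

      # Alternatives are tried left to right, and the first one that lets
      # the whole pattern succeed wins, even if a later alternative would
      # have matched more:
      $str =~ /(foo|foobar)/ and print "short first: $1\n";   # "foo"
      $str =~ /(foobar|foo)/ and print "long first:  $1\n";   # "foobar"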

      It also helps to know that any kind of anchoring results in a performance benefit.

      For some programmers that might be enough to know, for others it isn't.

      Suppose you decide for a larger team which of the regex best practices to apply. In that case I think you should know more about the implementation and its optimizations.

      If you just decide for yourself, then you don't need to know more, but you shouldn't complain about slow regexes when you don't.

      Every attempt to say "I'll inform myself until I reach abstraction level $x, and then no more" is bound to fail at some point, and your curiosity should prevent you from stopping at one level ;-)

      I agree that most micro-optimizations are not worth having, but runtime differences of two orders of magnitude are worth noting nonetheless.

      Readability shouldn't suffer, so perhaps I should write my regexes with an interpolated variable, as in my $space = " "; m/... $space/;, or use the \N notation (see the sketch below).
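
      Roughly what I have in mind, as an untested sketch (assuming by the \N notation I mean the charnames \N{...} named-character escape):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use charnames ':full';    # makes \N{SPACE} available in patterns

      my $line = ( 'a' x 500 ) . ' ' . ( 'a' x 20 );

      # A well-named variable keeps the space visible; note that /x would
      # strip the interpolated space like any other whitespace, so this
      # variant goes without /x:
      my $space = q{ };
      print "interpolated: ",
          ( $line =~ /a.{1,10}$space/ ? "match" : "no match" ), "\n";

      # The named-character escape survives /x and still compiles down to
      # a plain literal character:
      print "named char:   ",
          ( $line =~ / a .{1,10} \N{SPACE} /x ? "match" : "no match" ), "\n";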

      IMHO sticking to best practices without understanding them makes coding something mechanical, and poses the risk of cargo cult programming.

      My view might be influenced by the fact that programming is mostly a hobby for me, and I want it to be fun above all else; apart from this forum, most code that I maintain is my own.

Re: The cost of unchecked best practices (fully understand)
by tye (Sage) on Mar 20, 2008 at 03:11 UTC
    Don't apply "Best Practices" without fully understanding the underlying mechanisms.

    But feel free to disparage a "best practice" w/o fully understanding the underlying mechanisms so long as you have one case in a trivial (and contrived) benchmark and don't seem to have any clue about the reason for the best practice?

    I don't see any mention of these changes causing your application to suddenly run unacceptably slowly. Sounds like premature micro-optimization. Only being able to run that contrived regex several thousand times per second seems like something likely to have no perceptible impact on performance in most situations. Sure, there can be cases where such a dramatic-looking benchmark result for a micro operation can have real impact on practical performance, but those cases are much rarer than many people seem to believe.

    Yes, I knew that [x] has been (until recently) not as optimized by Perl as a literal character. I also knew that [$] doesn't DWIM (sadly), and quite a bit about why (one of the more interesting parts of Perl parsing). But you certainly won't find me claiming to "fully understand the underlying mechanisms" of those constructs.

    If you yelled at me for making the described changes and had this as justification... Actually, my reaction (eventually) would be to strongly encourage you to spend some effort to more fully understand why premature micro-optimization so very rarely pays off and better understand why it is so widely recognized as the opposite of a "best practice".

    I guess your bolded statement particularly bothers me because most changes I make to software don't involve me "fully understanding the underlying mechanisms". And the point of a "best practice" is to prevent me from having to waste time trying to fully understand yet another thing. Don't make changes to software unless you have some confidence that your change isn't making things worse. And test your changes, including for acceptable performance.

    So you've tried to set the bar way too high for making changes to software. And trying to set a higher bar for "best practices" is the opposite of what should usually be done.

    - tye        

Re: The cost of unchecked best practices
by ww (Archbishop) on Mar 19, 2008 at 12:40 UTC

    And, with ikegami's test code, run on Ubuntu with perl, v5.8.7 built for i486-linux-gnu-thread-multi (with 1 registered patch...), there's significant variation when the repeat counts (the x values) are varied:

    our $line = ('a' x 5500) . ' ' . ('a' x 120);

    ww@GIG:~/pl_test$ perl 674979.pl
                    Rate class_nodot   class  literal
    class_nodot    242/s          --    -30%    -100%
    class          344/s         42%      --    -100%
    literal     106283/s      43872%  30824%       --

    whereas, with ikegami's original, our $line = ('a' x 500) . ' ' . ('a' x 20);

    ww@GIG:~/pl_test$ perl 674979.pl
                    Rate class_nodot   class  literal
    class_nodot   2647/s          --    -30%    -100%
    class         3793/s         43%      --     -99%
    literal     672122/s      25292%  17620%       --
Re: The cost of unchecked best practices
by Pancho (Pilgrim) on Mar 19, 2008 at 14:13 UTC

    Since I am currently reading Conway's PBP book, this discussion has been very profitable, as it underscores a point from the book: namely, that best practices are guidelines, not hard rules.

    However, Moritz, your point is also a valuable reminder that adhering to a set of guidelines should be done actively, considering what is in fact going on, rather than passively (i.e. blindly) relying on an expert's advice. That's a trap which, as a noob in a crunch, you can easily fall into.

    Thanks,

    Pancho

      I'm not sure blindly following an expert's advice is always a bad idea.

      I am not saying it is good to cargo cult into your programs everything you see an expert do, without understanding the mechanisms behind it or its context. But consider that some things during the learning process are going to be over the head of the learner.

      When I first started learning Perl, I used strict and warnings because PerlMonks told me to. I was not at a level, however, to appreciate many of the reasons why it was a good idea. Yet, overall it helped me.

      Even today, I have mastery of only a very small subset of the Perl language. I try out all sorts of stuff I see here that would have been beyond my capabilities to come up with on my own, with the intent of increasing my understanding. Recently I've been playing around with test-first design. Never tried it, might like it, might not, but were I not playing around in the edges of my knowledge I wouldn't learn anything.

      I think that it is more important to be in the process of learning about your practices than it is to not "apply "Best Practices" without fully understanding the underlying mechanisms." (as the OP stated). If I limited myself only to what I understood fully, I'm not even sure how much would be left. I guess I could, uh, add and subtract numeric literals. No, wait, just integer literals, floats still do wonky things that throw me curveballs :).

Re: The cost of unchecked best practices
by sundialsvc4 (Abbot) on Mar 19, 2008 at 13:02 UTC

    The general admonition, “don't blindly apply ‘advice’ that you don't understand or haven't verified for yourself” is, nonetheless, a good one...

    Sometimes a seemingly insignificant difference can have very big implications to the computer...

Re: The cost of unchecked best practices
by sundialsvc4 (Abbot) on Mar 20, 2008 at 16:06 UTC

    In the end, this is a profession, not a religion. We're paid to deliver usable tools and solutions, using Perl to do it, and we have to find our own balance-point per situation. There are no silver bullets. There are no open books.

      Nice point again, sundial. I follow, I think, everything you wrote except for the last line.

      I thought it was rather that there are as many open books as you need or want, or reasonably have time for?

      Care to elaborate on the final sentence?