http://www.perlmonks.org?node_id=642055

eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:

From some code I inherited recently:

if ( $line =~ /$DELIMITER/ ) { ...
Now, this is an accident waiting to happen -- what if $DELIMITER contains a regex metachar? I suppose the obvious fix is:
if ( $line =~ /\Q$DELIMITER/ ) { ...
though perhaps this is better/faster:
if ( index($line, $DELIMITER) >= 0 ) { ...
How would you do it?
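To make the concern concrete, here is a minimal sketch comparing the three forms. The delimiter '|' and the sample line are assumptions chosen for illustration, not values from the inherited code.

```perl
use strict;
use warnings;

# Hypothetical sample data: a delimiter that happens to be a regex metacharacter.
my $DELIMITER = '|';
my $line      = 'no delimiter on this line';

# Interpolated raw, '|' is an alternation of two empty branches,
# so the match succeeds on any string at all.
my $raw_match = ( $line =~ /$DELIMITER/ ) ? 1 : 0;

# \Q escapes the metacharacter, so only a literal '|' can match.
my $quoted_match = ( $line =~ /\Q$DELIMITER/ ) ? 1 : 0;

# index() never interprets its argument as a pattern.
my $index_match = ( index( $line, $DELIMITER ) >= 0 ) ? 1 : 0;

print "raw=$raw_match quoted=$quoted_match index=$index_match\n";
# prints: raw=1 quoted=0 index=0
```

Only the unquoted regex reports a delimiter that is not actually there.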

Re: Style question: regex versus string builtin function
by shmem (Chancellor) on Oct 02, 2007 at 07:53 UTC
    If $DELIMITER was dynamic and could contain a regex, I'd use a m//, otherwise index.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Style question: regex versus string builtin function
by johngg (Canon) on Oct 02, 2007 at 08:57 UTC
    If $DELIMITER was static and was being tested for more than once in the code I might consider making a compiled regex.

    my $rxDELIMITER = qr{\Q$DELIMITER\E};
    ...
    if ( $line =~ $rxDELIMITER ) { ...

    I probably reach for regexen too quickly without even considering the use of index. I suspect I'm not the only one with a bit of a blind spot there.

    Cheers,

    JohnGG

Re: Style question: regex versus string builtin function
by throop (Chaplain) on Oct 02, 2007 at 11:47 UTC
    Use index. Even after using \Q, there are other odd cases lurking. From perlreref
    If 'pattern' is an empty string, the last successfully matched regex is used.
    Also
    You cannot include a literal $ or @ within a \Q sequence. An unescaped $ or @ interpolates the corresponding variable, while escaping will cause the literal string \$ to be matched. You'll need to write something like m/\Quser\E\@\Qhost/.
    The real 'style' question here, though, is which form is most maintainable and most understandable when somebody looks at it two years from now. And this use of $DELIMITER is going to be rather opaque in either case. Therefore, the most important element of style here is a generous set of comments explaining why $DELIMITER was broken out as a separate variable (or constant).

    throop

    Update: lidden's point is well taken; even a zero-width assertion like \Q keeps the pattern from being empty. But see the discussion that follows.

      But 'pattern' is not an empty string after using \Q.
        Silly enough, this still counts as empty. In general I think the way empty regexes work is just bad design. It should only trigger if the regex is empty at the literal code level, not after all kinds of expansion have been done on the stuff between the delimiters.

        A \Q does not "fill" an empty regex.

        use Test::More 'tests' => 5;
        ok( 'foo' =~ //, 'empty regex matches' );
        ok( 'foo' =~ /foo/, '/foo/ matches' );
        ok( !('bar' =~ //), 'repeated match of foo' );
        ok( !('bar' =~ /\Q/), 'repeated match with \\Q' );
        my $empty = '';
        ok( !('bar' =~ /\Q$empty/), 'interpolated empty string same as \\Q' );

      a zero-width assertion like \Q

      \Q isn't an assertion. It's like \U et al. and works in all interpolating quote operators. It's just that one almost always sees it with the regexp operators. An example:

      print "\U\Qfoo.bar";
      __END__
      FOO\.BAR
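As a further illustration of \Q being a string escape rather than a regex construct, this small sketch compares it with the quotemeta builtin outside any regex. The input string is an assumption chosen for the example.

```perl
use strict;
use warnings;

my $literal = 'foo.bar';    # hypothetical input containing a metacharacter

# \Q inside a double-quoted string and quotemeta() yield the same escaped text.
my $via_escape   = "\Q$literal\E";
my $via_function = quotemeta $literal;

print "$via_escape\n";    # prints: foo\.bar
print $via_escape eq $via_function ? "same\n" : "different\n";    # prints: same
```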

      lodin

Re: Style question: regex versus string builtin function
by thospel (Hermit) on Oct 02, 2007 at 11:33 UTC
    I'd definitely go for the regex. If I read that code later, I'd have to think for half a second about the index code to see that it's not a "where is this needle" but a "does the needle exist anywhere", while a regex immediately gives that kind of association. If index is faster, that is an implementation detail. If we care, we should just fix the perl optimization code to make them equivalent. But by default, clarity, not speed, is the goal of writing code.
      that it's a "does the needle exist anywhere", while a regex immediately gives that kind of association

      g, that doesn't work for me.

      Regexes are inherently more complex to use than the index function. There are the various regular expression dialects, there are the modifiers, and there are the global variables upon which they may trample.

      But, like others, I tend to reach for the match operator.

      Be well,
      rir

Re: Style question: regex versus string builtin function
by lima1 (Curate) on Oct 02, 2007 at 11:57 UTC
    I use index when I need the match position, otherwise a regex. And it seems that index is NOT faster. Even code like
    my $pos; if ( $line =~ $regex ) { $pos = length $`; }
    which gets the match position with a regex is slightly faster (but much uglier of course):

    Update: For better ways of getting the match position, see How do I retrieve the position of the first occurrence of a match?.

    Benchmark code:

    Benchmark results:
                        Rate index regex_pos regex regex_compiled_pos regex_compiled
    index              450/s    --      -38%  -39%               -40%           -41%
    regex_pos          728/s   62%        --   -2%                -3%            -5%
    regex              741/s   65%        2%    --                -1%            -3%
    regex_compiled_pos 749/s   66%        3%    1%                 --            -2%
    regex_compiled     763/s   70%        5%    3%                 2%             --
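For readers who want to run a comparison of this kind themselves, here is a minimal sketch using the core Benchmark module. It is not the code behind the figures above: the test string, the delimiter, and the three candidate names are assumptions, and results will vary with the perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data, not the data behind the figures above.
my $DELIMITER = '|';
my $line      = ( 'x' x 200 ) . $DELIMITER . ( 'y' x 200 );
my $compiled  = qr/\Q$DELIMITER\E/;

# Each candidate answers the same question: does the delimiter occur in $line?
my %candidates = (
    index          => sub { index( $line, $DELIMITER ) >= 0 ? 1 : 0 },
    regex          => sub { $line =~ /\Q$DELIMITER\E/ ? 1 : 0 },
    regex_compiled => sub { $line =~ $compiled ? 1 : 0 },
);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, \%candidates );
```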
      There are some issues with using $`; check perlre.
      What you want is m// and then pos; this will be faster.

      Oha

      Update: check tye's note below.

        Make that m//g (note the 'g') in a scalar context and then pos.

        - tye        
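A minimal sketch of that suggestion, assuming a hypothetical delimiter and input line: with the g modifier in scalar context, pos reports where the match ended, and $-[0] (the first element of the builtin @- array) gives where it began, without touching $`.

```perl
use strict;
use warnings;

my $DELIMITER = '|';          # hypothetical delimiter
my $line      = 'abc|def';    # hypothetical input

my ( $start, $end );

# In scalar context, m//g records where the match ended; pos() reads it back.
if ( $line =~ /\Q$DELIMITER\E/g ) {
    $end   = pos $line;    # offset just past the match
    $start = $-[0];        # offset where the match began
}

print "start=$start end=$end\n";    # prints: start=3 end=4
```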

        Well, you must be careful when you use match variables, especially when you work with big strings. But they aren't slow per se:

        Update: Thank you all for your comments and suggestions (here and in the CB)! See How do I get what is to the left of my match? for an updated benchmark and better explanations.

Re: Style question: regex versus string builtin function
by apl (Monsignor) on Oct 02, 2007 at 09:45 UTC
    I'd definitely use index. It's the simplest tool for this problem.
Re: Style question: regex versus string builtin function
by graff (Chancellor) on Oct 02, 2007 at 13:04 UTC
    ... what if $DELIMITER contains a regex metachar?

    What if the intention is that metacharacters in the variable should be used as such? What to use depends on what the intention is.

    For cases where "TMTOWTDI" really applies, the choice of approach is not likely to matter all that much (except to those who are compelled to optimize). For cases where literal-vs.-metachar handling means a difference between success vs. error (or ability vs. inability to do a task), one tool will be better than the other, and whichever one is right, you still have to provide some safeguards and checks to try to handle all contingencies as best you can.

Re: Style question: regex versus string builtin function
by talexb (Chancellor) on Oct 02, 2007 at 17:33 UTC

    While I don't doubt that index is faster, I like your first solution better, simply because it's more Perl-ish. You're seeing if a particular delimiter appears on a line.

    The alternative would (for me) require I look up how index works -- it's a logical function to have in a language, I just don't think I've ever used it, so I'm not sure what the parameters are or what it returns.
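For reference, a short sketch of how index behaves; the sample string is an assumption. It takes the string, the substring, and an optional starting offset, and returns the zero-based offset of the first occurrence, or -1 if there is none.

```perl
use strict;
use warnings;

my $line = 'a,b,c';    # hypothetical input

my $first  = index( $line, ',' );       # 1: zero-based offset of the first comma
my $second = index( $line, ',', 2 );    # 3: search starts at offset 2
my $none   = index( $line, ';' );       # -1: substring not present

print "$first $second $none\n";    # prints: 1 3 -1
```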

    That's just my preference.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds