http://www.perlmonks.org?node_id=450727


in reply to Re: Tokenizing and qr// <=> /g interplay
in thread Tokenizing and qr// <=> /g interplay

OK, consider it asked. I presume that the most useful thing about qr// is that it allows you to pass regular expressions around as arguments to functions and such, but honestly my lack of experience with it leaves me wondering.

Re^3: Tokenizing and qr// <=> /g interplay
by japhy (Canon) on Apr 23, 2005 at 17:09 UTC
    When you have variables in a regex, Perl examines the contents of those variables to see if the overall representation of the regex has changed:
    for ("ab", "cd") {
        if ($str =~ /$_/) { ... }
    }
    In the above code, the regex 'ab' is compiled and executed, and then the regex 'cd' is compiled and executed. Compare that with:
    ($x, $y) = ("ab", "c");
    for (1, 2) {
        if ($str =~ /$x$y/) { ... }
        ($x, $y) = ("a", "bc");
    }
    Here, even though $x and $y change, the ACTUAL regex ('abc') does not change, so the regex is compiled only once. The process that Perl does internally is this:
    1. take regex at this opcode
    2. interpolate any variables
    3. compare with previous value of the regex at this opcode
    4. compile if different
    5. execute this regex
    When you use the /o modifier, it tells Perl that after it has compiled the regex, it should SKIP steps 2-4 of this process, meaning that the regex at this opcode will NEVER change.
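The five steps can be imitated in plain Perl. The following is only a toy model under stated assumptions: `match_at_opcode` and its cache variables are hypothetical names standing in for a single match op, and perl's real internals differ. It shows that a recompile happens only when the interpolated text changes:

```perl
use strict;
use warnings;

# Toy model of one match opcode: it remembers the last interpolated pattern
# text and the regex compiled from it. This imitates the listed steps only;
# it is not how perl is implemented internally.
my ($last_text, $last_rx);
my $compiles = 0;

sub match_at_opcode {
    my ($str, @parts) = @_;
    my $text = join '', @parts;                         # step 2: interpolate
    if (!defined $last_text || $text ne $last_text) {   # step 3: compare
        $last_rx   = qr/$text/;                         # step 4: compile if different
        $last_text = $text;
        $compiles++;
    }
    return $str =~ $last_rx ? 1 : 0;                    # step 5: execute
}

match_at_opcode("xabcx", "ab", "c");   # compiles 'abc'
match_at_opcode("xabcx", "a", "bc");   # same text 'abc', no recompile
match_at_opcode("xabcx", "cd");        # text changed, recompile
print "compiles: $compiles\n";         # compiles: 2
```

In this model, /o would correspond to never entering the comparison again once `$last_rx` has been set.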

    So what is qr// good for? Consider this:

    my @strings = make_10_strings();
    for (@strings) {
        for my $p ('x+', 'yz?y', 'xz+y') {
            if ($_ =~ $p) { handle($_) }
        }
    }
    This code compiles a grand total of 30 regexes. Why? Because for each string in @strings we've got three patterns to execute, and because the contents of $p have changed each time $_ =~ $p is encountered, the regex is compared and recompiled each time. Now sure, you could reverse the order of the loops, but that would result in the calls to handle() happening in a different order.

    So enter the qr//.

    When Perl sees a regex comprised solely of a single variable, Perl checks to see if that variable is a Regexp object (what qr// returns). If it is, Perl knows that the regex has already been compiled, so it simply uses the compiled regex in the place of the regex. That means doing:

    my @strings = make_10_strings();
    for (@strings) {
        for my $p (qr/x+/, qr/yz?y/, qr/xz+y/) {
            if ($_ =~ $p) { handle($_) }
        }
    }
    is considerably faster. There is no additional compilation happening. It's probably even better to move the qr// values into an array, but that might be moot since they're made of constant strings in this example. The point is, the use of qr// in a looping construct is the primary benefit it offers. Yes, it helps break a regex up into pieces too, but that's just a matter of convenience.
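A runnable sketch of the precompiled version, with the patterns moved into an array as suggested. The sample strings and the `@handled` list are stand-ins for `make_10_strings()` and `handle()`, which the post does not define:

```perl
use strict;
use warnings;

# Compile each pattern exactly once, before any looping starts.
my @patterns = (qr/x+/, qr/yz?y/, qr/xz+y/);

# Stand-in for make_10_strings(); any list of strings will do.
my @strings = ('xx', 'yzy', 'xzzy', 'abc');

# Stand-in for handle(): record which strings were handled, in order.
my @handled;
for my $s (@strings) {
    for my $p (@patterns) {
        # $p is a lone Regexp object here, with no extra pattern text around it.
        push @handled, $s if $s =~ $p;
    }
}

print join(',', @handled), "\n";   # xx,yzy,xzzy,xzzy
```

The strings stay in the outer loop, so the handling order is the same as in the string-pattern version.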

    Be warned that the benefit of qr// objects is lost if there is additional text in the pattern match. I mean that $foo =~ /^$rx_obj$/ suffers from the same problem as $foo =~ /^$text$/.
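One way to keep the benefit when anchors or other text are needed is to build the larger pattern once with another qr// and then match against that object alone. A minimal sketch; `$rx_obj` is from the post, `$anchored` is a name introduced here, and the exact stringified form printed depends on the perl version:

```perl
use strict;
use warnings;

my $rx_obj = qr/abc/i;

# Interpolating a Regexp object into a larger pattern uses its string form,
# a non-capturing group that carries its own flags.
print "stringified: $rx_obj\n";

# Pay for the combination once, outside any loop...
my $anchored = qr/^$rx_obj$/;

# ...then match against the lone object inside the loop.
my @results;
for my $foo ('abc', 'ABC', 'xabc') {
    push @results, ($foo =~ $anchored ? "match" : "no match");
}
print join(', ', @results), "\n";   # match, match, no match
```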

    _____________________________________________________
    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      When Perl sees a regex comprised solely of a single variable, Perl checks to see if that variable is a Regexp object (what qr// returns). If it is, Perl knows that the regex has already been compiled, so it simply uses the compiled regex in the place of the regex.

      Really? That is very interesting. So if I have:

      my $true  = qr/y(?:es|up)?|1|enabled?/i;
      my $false = qr/n(?:o(?:pe)?)?|0|disabled?/i;
      die "Need boolean input" unless /^(?:$true|$false)$/;
      if (/$true/) {
          do_stuff();
      }
      you're saying that in the die's unless clause, it will need to completely recompile the regex? That is not my interpretation, but I could be completely wrong here.

      My assumption is that both $true and $false are compiled once, and only once, and the unless modifier above would not need to recompile either one.

      Even if that is the case, I use code like the above because I like to be able to reuse a common set of criteria for truth and falseness across many expressions. Sometimes, as in the die statement above, it is for validation that the value is something (i.e., not a typo: if someone had "y]", we'd not accidentally treat that as a false value, we'd simply reject it so the user could fix the typo). At other times, such as the if statement above, it is just to see which one it was. That goes to the OP's question on why it's useful, somewhat in agreement with other posts here. I'm just showing a concrete example of real, live production code where I use this construct.

        It won't need to recompile the entire regex each time, because $true and $false haven't changed, but Perl will need to compare the string representation of the regex with the representation of it the last time it was at this opcode. And the first time through, yes, it will compile the regex.
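The reuse described above can be combined with precompilation by building the anchored validators once from the named pieces. This is a sketch, not the poster's production code: `$is_boolean`, `$is_true`, and `parse_bool` are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

my $true  = qr/y(?:es|up)?|1|enabled?/i;
my $false = qr/n(?:o(?:pe)?)?|0|disabled?/i;

# Hypothetical helpers: anchored forms built once from the named pieces.
my $is_boolean = qr/^(?:$true|$false)$/;
my $is_true    = qr/^$true$/;

# Returns 1 or 0 for recognised input, undef for anything else (e.g. a typo).
sub parse_bool {
    my ($input) = @_;
    return undef unless $input =~ $is_boolean;
    return $input =~ $is_true ? 1 : 0;
}

for my $input ('Yes', 'nope', 'ENABLED', '0', 'y]') {
    my $value = parse_bool($input);
    printf "%-8s => %s\n", $input, defined $value ? $value : 'rejected';
}
```

Each match inside `parse_bool` is against a single Regexp object, which is the situation the earlier reply describes as avoiding the per-match string comparison.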
        _____________________________________________________
        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Hi, I was just refactoring some code and saw a possible opportunity to use this advice. But instead of having multiple strings in @strings to process, I leave all the lines from my file joined together as one giant string with embedded \n chars. From this angle, since I only have to use each regex once across all the strings (they are 'joined' into one string), I won't benefit from qr.

      my ( $crummy, $good );
      foreach my $crummy_good_ar ( @corrections_to_make ) {
          ( $crummy, $good ) = @$crummy_good_ar;
          $file_in_string_form =~ s/\b(\Q$crummy\E)\b/$good/ig;
      }
      However, as you can see (?) from the example above, I have lots of crummy/good switchouts to do, and is my plodding approach above the best that can be expected?

      P.S. Can you clarify/update what you meant by:
      the benefit of qr// objects is lost if there is additional text in the pattern match

      I think you are saying that a precompiled/qr regex used in a follow-on regex will have to be recompiled if you snap additional text on to the qr'd variable, because the overall text of the new regex will be different. Although at least one would still have the benefit of 'concentrated regex logic' within the qr'd variable?

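        I can answer your first question by answering your second one. If you made all the first elements of your array references Regexp objects (note that /i has to go on the qr// itself, because flags on the enclosing s/// are not applied inside an interpolated Regexp object):
        $_->[0] = qr/\b(\Q$_->[0]\E)\b/i for @corrections_to_make;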
        then you could do your loop as
        for my $crummy_good_ar (@corrections_to_make) {
            my ($crummy, $good) = @$crummy_good_ar;
            $file_in_string_form =~ s/$crummy/$good/ig;
        }
        This way, even if you end up looping over THAT code, you'd still be dealing with already-compiled regexes. As soon as you put additional text into a regex with qr// in it:
        my $rx = qr/abc/;
        if ($str =~ /^($rx)$/) { ... }
        Perl has to do the "compare physical regex forms" test. Only if the qr// object is all alone will it have the entire benefits it was made for.
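Put together as a runnable sketch of the substitution loop with precompiled search terms. The word list and sample text are invented for illustration, and /i is placed on the qr// so that matching stays case-insensitive, as in the original s///ig:

```perl
use strict;
use warnings;

# Hypothetical corrections: [ misspelling, replacement ].
my @corrections_to_make = ( [ 'teh', 'the' ], [ 'recieve', 'receive' ] );
my $file_in_string_form = "Teh cat will recieve teh prize.\nOtherteh stays.\n";

# Precompile each search term once, replacing the string with a Regexp object.
for my $pair (@corrections_to_make) {
    my $word = $pair->[0];
    $pair->[0] = qr/\b\Q$word\E\b/i;
}

for my $pair (@corrections_to_make) {
    my ($crummy, $good) = @$pair;
    # The pattern is a lone Regexp object with no additional text around it.
    $file_in_string_form =~ s/$crummy/$good/g;
}

print $file_in_string_form;
```

The word boundaries leave "Otherteh" untouched, and the replacement text is inserted as given, so "Teh" becomes "the" rather than "The".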

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re^3: Tokenizing and qr// <=> /g interplay
by MarkusLaker (Beadle) on Apr 23, 2005 at 16:34 UTC
    Another use for qr// is to break up unmanageably complex regular expressions into simpler, named, self-contained pieces. (There's a direct parallel here with subs, which do the same for 'ordinary' Perl code. In fact, you can consider a named regex to be just a function written with a funny-looking syntax: its input is a string and its output is either a Boolean value or one or more strings, depending on whether it captures anything.)

    Here's an example from a code-filtering assertions module (yes, another one) that's not yet tested thoroughly enough to submit to CPAN: