http://www.perlmonks.org?node_id=62398

MrNobo1024 has asked for the wisdom of the Perl Monks concerning the following question:

'foo' =~ m/.*/; print eval <STDIN>;
I ran this program, and entered the text '$&'. It printed out 'foo'. If you don't use $`, $&, or $' in your program at all, they aren't set on a regex match. There was no way for Perl to know I was going to enter $& into STDIN, so why did it set it? Does this mean that Perl is psychic?

Re (tilly) 1: Perl is psychic?!
by tilly (Archbishop) on Mar 06, 2001 at 06:38 UTC
    Excellent question!

    Since everybody else seems to have missed your (subtle) point by quoting irrelevant documentation that you clearly understood in great detail, allow me to repeat your point. Perl is supposed to have an important optimization. If you never use $&, $`, and $' in your script, Perl is not supposed to calculate them ever. This is important because it makes matches against long strings an order of magnitude faster. If you use them ever, they are calculated from then on. Caveat programmer. (I don't use them, ever. I wish I could make attempting to use them optionally fatal just to smoke out people who use them, but I can't.)
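
    For anyone who wants the same three pieces without ever mentioning $&, $` or $' in the script, here is a minimal sketch. It assumes a perl new enough to have the @- and @+ offset arrays (5.6 or later), and match_parts is a made-up helper name, not anything built in.

```perl
use strict;
use warnings;

# Hypothetical helper: return (prematch, match, postmatch) for the first
# match of $re in $str, or an empty list when there is no match.
# It never mentions $`, $& or $', so it cannot trigger their penalty.
sub match_parts {
    my ($str, $re) = @_;
    return unless $str =~ $re;

    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ($start, $end) = ($-[0], $+[0]);

    return (
        substr($str, 0, $start),              # what $` would hold
        substr($str, $start, $end - $start),  # what $& would hold
        substr($str, $end),                   # what $' would hold
    );
}

my ($pre, $match, $post) = match_parts('string', qr/ri/);
print "pre=[$pre] match=[$match] post=[$post]\n";
```

    The substr() calls only copy the pieces that are actually asked for, and only for this one match.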

    With this optimization there should be no way that the above code will work since when you do the match, Perl is dealing with a script that has no $&, $`, or $' in it. And so when it goes to display the answer, the necessary data should not exist yet. But you run it and it does.

    For the record I ran it under 5.004, and got the output that you describe. I ran it under 5.005 and got no output at all as you would expect. I ran it under a slightly modified 5.6 and got a segmentation fault. (Not good, but in this case understandable.) A slight modification of your code to test $' and $` had similar results. With 5.005 when I look at perldelta I see that there were a number of changes to the RE engine including the following:

    Changes in Perl code using RE engine: More optimizations to s/longer/short/; study() was not working; /blah/ may be optimized to an analogue of index() if $& $` $' not seen; Unneeded copying of matched-against string removed; Only matched part of the string is copying if $` $' were not seen;
    The last 2 items sound like the behaviour fix. I guess that the optimization wasn't really being done in 5.004, or it was done but not done as fully as it was done later.

    For the record I was seriously impressed with Ruby's optimization for this case. What they did is lazily calculate $&, $', and $` as needed. You only pay on the matches where you use those, or on cases where you try to modify a string in place that you matched against before you go to match again. Don't use it in one place, and you pay no price there even if you use it elsewhere. I tried, but couldn't find a way to break it. I suspect that this approach (which is much cleaner) would be harder to do in Perl. Still it was a nice surprise...
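
    To make the "lazy" idea concrete, here is a rough Perl-level sketch. It is not how Ruby (or perl) implements anything internally: lazy_match is a hypothetical helper, it needs 5.10 or later for //=, and the closure keeps its own copy of the subject string, so it illustrates only the deferred extraction, not the copy avoidance.

```perl
use strict;
use warnings;

# Hypothetical helper: remember the subject string and the match offsets,
# but defer each substr() until that part is actually requested.
sub lazy_match {
    my ($str, $re) = @_;
    return unless $str =~ $re;

    my ($start, $end) = ($-[0], $+[0]);
    my %cache;    # each part is computed at most once

    return +{
        pre   => sub { $cache{pre}   //= substr($str, 0, $start) },
        match => sub { $cache{match} //= substr($str, $start, $end - $start) },
        post  => sub { $cache{post}  //= substr($str, $end) },
    };
}

my $m = lazy_match('string', qr/ri/);
print $m->{match}->(), "\n";    # only the matched part is extracted here
```

    A match whose parts are never requested does no substr() work at all, which is the property being praised above.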

    UPDATE
    This seems to be very, very specific to the code. I actually assumed I knew what should happen and wanted to check $` and $' as well, so I changed the code to

    'string' =~ /ri/; print eval <STDIN>;
    for my tests. As confirmed on several platforms in chatter, the behaviour switches between versions of Perl. But the original code snippet always seems to work, and I have not a clue how or why.
      Woo. This one's got me interested.

      I've tested this on perl 5.004_04 for sun-solaris, perls 5.004_05 and 5.6 for i686-linux (redhat) and even ActiveState's 5.6.0 for Win32 and _all_ of them show the same behaviour.
      What causes the difference between two variations on this bit of code is whether or not the pattern is plain text (as it says above /blah/ may be optimized to an analogue of index()). If there's no regex compilation then $& causes Segmentation faults.

      Using
      use re 'debug';
      shows that the regex isn't re-evaluated when the $& is entered on STDIN, but it does state explicitly Omitting $` $& $' support. Must say I'm at a bit of a loss as to where the value does come from.

      If I were to go out on a limb a bit, I would guess that the penalty from using $&, etc. in your code arises because perl has to link them into plain text matches as well as compiled regexes. That is, $&, etc. are always there for fully compiled regexes, but index() doesn't normally return the pre-match, match and post-match strings, so the "analogue of index()" requires a bit more work to produce them.
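
      To show what that "bit more work" looks like at the Perl level, here is a small sketch of a plain text match done with index(). It only illustrates the idea; it says nothing about what the regex engine does internally, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my ($haystack, $needle) = ('string', 'ri');
my ($pre, $match, $post);

# index() only reports where the needle starts (-1 if it is absent).
my $pos = index($haystack, $needle);

if ($pos >= 0) {
    # The pre-match, match and post-match strings take three extra
    # substr() calls on top of the search itself.
    $pre   = substr($haystack, 0, $pos);
    $match = substr($haystack, $pos, length($needle));
    $post  = substr($haystack, $pos + length($needle));
    print "pre=[$pre] match=[$match] post=[$post]\n";
}
```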

      Where's japhy? I get the feeling he'll know :o)

      There's a bunch of tests and re 'debug' output below if you're interested:
      use re 'debug'; 'foo' =~ m/.*/; print eval <STDIN>;
      This gives the following output:
      Compiling REx `.*'
      size 3 first at 2
         1: STAR(3)
         2:   REG_ANY(0)
         3: END(0)
      anchored(MBOL) implicit minlen 0
      Omitting $` $& $' support.
      
      EXECUTING...
      
      Matching REx `.*' against `foo'
        Setting an EVAL scope, savestack=3
         0 <> <foo>             |  1:  STAR
                                 REG_ANY can match 3 times out of 32767...
        Setting an EVAL scope, savestack=3
         3 <foo> <>             |  3:    END
      Match successful!
      
      All of that is printed before it waits for the input. It explicitly says that it's omitting $&, etc. support, yet when you do enter $& it still gives the expected answer:
      Freeing REx: `.*'
      foo
      
      If you use a plain text match (like tilly suggested with /ri/ in 'string'), you don't get this result at all, as perl doesn't handle the match in the same way. It "guesses" the result, presumably using a more index()-like way of making the match:
      use re 'debug'; 'foo' =~ m/o/; print eval <STDIN>;
      gives the output:
      $ perl reg
      Compiling REx `o'
      size 3 first at 1
      rarest char o at 0
         1: EXACT <o>(3)
         3: END(0) 
      anchored `o' at 0 (checking anchored isall) minlen 1
      Omitting $` $& $' support.
      
      EXECUTING...
      
      Guessing start of match, REx `o' against `foo'...
      Found anchored substr `o' at offset 1...
      Guessed: match at offset 1
      $&
      Segmentation fault (core dumped)
      
      $` and $' don't have quite such drastic effects; they simply print blank.
      The extra level of compilation that look(ahead|behind)s give the regex also allows $& to produce the required result:
      use re 'debug'; 'foo' =~ m/(?<=f)o(?=o)/; print eval <STDIN>;
      Giving:
      $ perl reg
      Compiling REx `(?<=f)o(?=o)'
      size 15 first at 1
      rarest char o at 0
         1: IFMATCH[-1](7)
         3:   EXACT <f>(5)
         5:   SUCCEED(0)
         6:   TAIL(7)
         7: EXACT <o>(9)
         9: IFMATCH[-0](15)
        11:   EXACT <o>(13)
        13:   SUCCEED(0)
        14:   TAIL(15)
        15: END(0)
      anchored `o' at 0 (checking anchored) minlen 1
      Omitting $` $& $' support.
      
      EXECUTING...
      
      Guessing start of match, REx `(?<=f)o(?=o)' against `foo'...
      Found anchored substr `o' at offset 1...
      Guessed: match at offset 1
      Matching REx `(?<=f)o(?=o)' against `oo'
        Setting an EVAL scope, savestack=3
         1 <f> <oo>             |  1:  IFMATCH[-1]
         0 <> <foo>             |  3:    EXACT <f>
         1 <f> <oo>             |  5:    SUCCEED
                                    could match...
         1 <f> <oo>             |  7:  EXACT <o>
         2 <fo> <o>             |  9:  IFMATCH[-0]
         2 <fo> <o>             | 11:    EXACT <o>
         3 <foo> <>             | 13:    SUCCEED
                                    could match...
         2 <fo> <o>             | 15:  END
      Match successful!
      $&
      Freeing REx: `(?<=f)o(?=o)'
      o
      
      I'm curious to know if perl would attempt to re-execute the last regexp inside the eval block to get $&?
      Does that sound at all plausible? If so, would that mean that evaling on $&, $' or $` would remove their associated penalties?
Re: Perl is psychic?!
by petral (Curate) on Mar 07, 2001 at 02:31 UTC
    This is something like another bug someone posted recently (I guess in the cb since I can't find it anywhere). Combining them just for fun:
    > perl -lwe '() = ($_ = "abc") =~ /(c)/; $_ = "def"; print eval <>'
    $&
    f
    >
    Seems as if there's a pointer squirreled away somewhere deep in perl that was never removed just because it was never accessed. When there's not supposed to be anything there (and isn't in the normal place), somehow this shows through.

    update: Could note, of course, that these needn't be considered bugs. Just because a program does something apparently semi-sensible for 'undefined behavior' doesn't mean one _has_ to beat on the poor thing till it stops. If this were any language but Perl one would expect the compiler/interpreter to simply throw up (or at least throw up its hands). In perl, it just gets tossed into the "Doctor it hurts when I do this. -- Then don't do that" bin.

    p
(jptxs)Re: Perl is psychic?!
by jptxs (Curate) on Mar 06, 2001 at 05:06 UTC
    according to the Perl5 Pocket Ref, $& is the string matched by the last successful pattern match. Since your regex .* matches anything $& is set to that by the first line, which matches foo and that's what you print in your eval - it eval's $& and finds 'foo' there. I think... =)
    "A man's maturity -- consists in having found again the seriousness one had as a child, at play." --Nietzsche
      Yes, but Perl doesn't set $& if you don't use it, and it was impossible to know that it would be used...
        Whoever voted down the above node completely missed the point. MrNobo1024 is completely correct in saying that if Perl worked as documented as far back as, say, Camel 2 then it should not have had enough information to calculate $&.
Re: Perl is psychic?!
by KM (Priest) on Mar 06, 2001 at 05:13 UTC
    Yes, Perl is psychic... but not in this case. If you look at perlvar you will see:

     $&      The string matched by the last successful pattern
                   match (not counting any matches hidden within a
                   BLOCK or eval() enclosed by the current BLOCK).
                   (Mnemonic: like & in some editors.)  This variable
                   is read-only and dynamically scoped to the current
                   BLOCK.
    

    So, since .* matched 'foo', $& is set when you use it. If you make your pattern /\d.*/ you will find you will get no output in your same test case.

    (root@frodo):/tmp>
    # cat t.pl
    'foo' =~ m/\d.*/;
    print eval <STDIN>;
    (root@frodo):/tmp>
    # perl t.pl
    $&
    (root@frodo):/tmp>
    #
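
    The test above turns on whether the match succeeded at all, so it may help to check the return value of the match directly instead of inferring it from $&. A minimal sketch, with made-up labels and using @- and @+ (5.6 or later) rather than $& itself:

```perl
use strict;
use warnings;

my @results;

# Each case is [ label, pattern ]; the labels are just for the output.
for my $case ([ 'dot-star' => qr/.*/ ], [ 'digit first' => qr/\d.*/ ]) {
    my ($label, $re) = @$case;

    if ('foo' =~ $re) {
        # Offsets of the whole match; only meaningful after a success.
        my $matched = substr('foo', $-[0], $+[0] - $-[0]);
        push @results, "$label: matched '$matched'";
    }
    else {
        push @results, "$label: no match";
    }
}

print "$_\n" for @results;
```

    Neither branch says anything about whether $& was saved, which is the question the original post is actually asking.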
    

    Cheers,
    KM

Re: Perl is psychic?!
by mkmcconn (Chaplain) on Mar 07, 2001 at 01:31 UTC
    This introduces several new ideas to me, so I played with it for more than a quarter hour, at a console command-line. I tried in 5.005_003 and in 5.6, evaluating a second eval(), getting the same behavior as for the first eval(). I guess it clarifies the behavior, and hopefully it's contributory to an interesting thread.
    > perl -wle ' q(foo) =~ m/.*/; eval <>;'
    print $&; q(snarf) =~ m/.*/ ; eval <>;
    #prints 'foo', not 'snarf' and waits for input;

    And, I think this is amusing:

    > perl -le ' my $incr = 0; q( print $incr++, $& and " stew" =~ /.*/ and eval $& until $incr > 10) =~ /.*/; eval <>;'
    eval $&;
    # prints '0 ( guesswhat) '..'10 stew' (versions differ on -w)

    mkmcconn
    edited after first posting, to simplify examples