PerlMonks  

Perl Bug in Regex Code Block?

by Hofmator (Curate)
on Sep 03, 2001 at 15:45 UTC

Hofmator has asked for the wisdom of the Perl Monks concerning the following question:

Playing around with regexes (abusing them :) on the weekend I came across the following (on Perl 5.6.1, ActiveState Build 626). Executing the same regex on the same string in a loop multiple times yields different results for the first run and the remaining runs. I'm running on Win2K, but I don't think this plays a role here. The code:

#!/usr/bin/perl
use strict;
use warnings;
use re qw/eval /;

my $pattern = q/(.)(?{ print ++$counts[0]; })^/;
my $line = 'ab';
for (0..2) {
    my @counts = (0);
    print "$_: ";
    # $pattern .= '(?=.)';
    $line =~ /$pattern/;
    print "; \@counts = (", join(', ', @counts), ")\n";
}
print "\@main::counts = (", join(', ', @::counts), ")\n";

This prints - apart from the warning about the last line:

0: 12; @counts = (2)
1: 34; @counts = (0)
2: 56; @counts = (0)
@main::counts = ()
which means, it works the first time as expected but the next times my @counts doesn't get modified by the regex. However, inside the regex the variable seems to retain its value from execution to execution.

When using a package variable by changing my @counts to our @counts the program works as expected and prints:

0: 12; @counts = (2)
1: 12; @counts = (2)
2: 12; @counts = (2)
@main::counts = (2)

When uncommenting the $pattern .= line (and going back to my), effectively changing the pattern in every loop (remark: this does not affect the working of the regex!), the code also works as expected, printing:

0: 12; @counts = (2)
1: 12; @counts = (2)
2: 12; @counts = (2)
@main::counts = ()

My question - is this a known bug? Is it a bug at all or might I have overlooked a (well) documented feature ;-) and how does this behave in other versions of perl?

-- Hofmator

Replies are listed 'Best First'.
Re: Perl Bug in Regex Code Block?
by japhy (Canon) on Sep 03, 2001 at 17:22 UTC
    Your regex is only being compiled once, and in this compilation, it makes note of the variable you're using. Thus, it creates an "accidental" closure. Here is my proof:
    ### update: fixed
    ### thanks Hof -- I condensed working code poorly :(
    use re 'eval';
    my @r;
    my $p = q/.(?{ ++$x[0] })^/;
    for (0..2) {
        my @x = (0);
        "ab" =~ $p;
        push @r, \@x;
    }
    print "$_->[0]" for @r;
    That code prints 600. If, however, you cause the regex to change, such that it requires recompilation, the binding to the previous @x is gone, and the new @x is bound.
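
As a rough analogy only (this uses an ordinary closure, not the regex engine), the same "bind once, keep hitting the first variable" effect can be sketched in plain Perl. The name $bump is made up for illustration; the sub is created on the first pass only, standing in for the one-time regex compile, and is called twice per pass to mimic the two passes through the code block.

```perl
use strict;
use warnings;

my (@r, $bump);
for (0 .. 2) {
    my @x = (0);
    # Created only on the first pass, so it captures the first @x for good.
    $bump ||= sub { ++$x[0] };
    # Two calls per pass, like the code block running twice on "ab".
    $bump->() for 1 .. 2;
    push @r, \@x;
}
my $result = join '', map { $_->[0] } @r;
print "$result\n";   # prints 600
```

Every call after the first pass still increments the first @x, so the later arrays stay at 0.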

    If you were to use qr// instead, you'd be changing the global array.

    You're doing some funny-looking scope-crufting. I'd stay away from it if I were you. This situation is the sort of thing I fear having to write about and explain in my book.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Thanks for the explanation, japhy++! Playing around with your code I think I understand it now and how I accidentally created a closure. I still have some questions, though.

      • Why is the regex not recompiled? I'm not using the /o modifier. I thought perl recompiles a regex /$p/ which contains a variable interpolation every time. And is there a way to force a recompile?
      • I was not trying to do anything funny with the different scopes. What I want is execute some code which manipulates a variable in a regex. And I'd like to use a lexical variable so that I don't pollute the global namespace. Is there a way to do that? Taking the my @x declaration out of the loop like this
        my @x;
        for (0..2) {
            @x = (0);
            "ab" =~ $p;
            push @r, \@x;
        }
        fixes it here but what if the whole thing is in a subroutine, then I can't call it more than once, can I?
      • I think I'm not wrong in saying that this is slightly underdocumented ... especially since 5.6.0 seems to behave differently as others have posted here.

      -- Hofmator

        Last thing first: it's not documented because code evaluation is experimental. It's a very iffy thing, and it changes quickly and silently.

        Second thing second: use a local array, and copy its contents to a lexical one. I know you don't want to use a global array, but I'm telling you that you should. This is an example from my book:

        "12180" =~ m{
            (?{ local @n = () })
            (?: (\d) (?{ local @n = (@n, $1) }) )+
            \d
            (?{ @d = @n })
        }x;
        We make a local array that things happen to, and then we copy it to our real array at the end of the regex. In your case, you might want to do:
        local @n;
        /(.)(?{ ++$n[0] })^/;
        @d = @n;
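
A minimal sketch of that "package array plus local, then copy to a lexical" idea wrapped in a subroutine, so it can be called more than once. The helper name count_as and the simplified pattern are assumptions made for illustration, not code from the thread; the pattern is a literal, so no use re 'eval' is needed.

```perl
use strict;
use warnings;

our @n;   # package array that the regex code block updates

# count_as is a hypothetical helper: it counts the leading run of "a"s.
sub count_as {
    my ($text) = @_;
    local @n = (0);                     # fresh value, restored when the sub returns
    $text =~ /(?:a(?{ $n[0]++ }))+/;    # code block runs once per matched "a"
    my @d = @n;                         # copy the result into a lexical
    return $d[0];
}

print count_as('aaa'), " ", count_as('aa'), "\n";   # prints "3 2"
```

Because @n is localized per call, each call starts from zero and nothing leaks into the caller's view of @n.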
        First thing last: regex compilation is an interesting thing. Here is code that compiles the regex twice:
        $p = '\w+-\d+';
        /$p/;
        /$p/;
        And here's code that only compiles it once:
        $p = '\w+-\d+';
        for $i (1,2) { /$p/ }
        The secret is this (and pertains to regexes with variables in them, for they're not compiled until run-time): for each compilation op-code in the syntax tree, Perl keeps a string representation of the regex. The next time the compilation op-code is gotten to, the NEW string representation is compared with the previous one. If they are the same, the regex doesn't need recompilation. If they are different, it does need to be recompiled.

        Now, if you've heard "if you have a regex, and it has variables in it, and the variables change, the regex has to be recompiled" that's technically incorrect:

        ($x,$y) = ('a+', 'b');
        for (1,2) {
            /$x$y/;
            ($x,$y) = ('a', '+b');
        }
        The two variables comprising our regex have changed, but the regex ends up being the same. Sneaky, eh?
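
The string comparison described above can be modelled in a few lines. This is only a toy model of one match op's cache, not perl's actual internals; make_match_op and its two returned closures are hypothetical names invented for the sketch.

```perl
use strict;
use warnings;

# Toy model: remember the last pattern string and recompile only if it differs.
sub make_match_op {
    my ($last_src, $compiled);
    my $compiles = 0;
    my $match = sub {
        my ($src, $text) = @_;
        if (!defined $last_src || $src ne $last_src) {
            $compiled = qr/$src/;   # "recompile" because the string changed
            $last_src = $src;
            $compiles++;
        }
        return $text =~ $compiled ? 1 : 0;
    };
    return ($match, sub { $compiles });
}

my ($match, $compile_count) = make_match_op();
my ($x, $y) = ('a+', 'b');
for (1, 2) {
    $match->("$x$y", 'caab');
    ($x, $y) = ('a', '+b');         # variables change, joined string stays "a+b"
}
print "compiles: ", $compile_count->(), "\n";   # compiles: 1
```

Both passes produce the string "a+b", so the model compiles only once even though $x and $y changed in between.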

        I can't take credit for figuring this out on my own -- a couple months ago, Dominus gave me the hint about the string representation. Now I understand.

        So that answers your question, I think.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Perl Bug in Regex Code Block?
by stefan k (Curate) on Sep 03, 2001 at 16:53 UTC
    Hi,
    not that I fully understand what you're doing there, but it seems to me that you're fiddling with scopes in a very unintuitive way (at least to me). What is the scope of counts whenever you're referring to it? Within the regexp it should be the global one, shouldn't it? It is first used outside the for-loop; but OTOH it isn't declared using my, so how could it pass use strict??
    Well then, running the same code under 5.6.0/Linux results in:
    Name "main::counts" used only once: possible typo at ./re-code.pl line 20.
    0: 12; @counts = (0)
    1: 34; @counts = (0)
    2: 56; @counts = (0)
    @main::counts = (6)
    Then uncommenting the pattern line (your third example) yields exactly the same results. Changing my to our I get the same result as you get.
    You're simply throwing away the warning we get in line 20. Is this a clever thing to do?

    blblblblblblblblblblb

    You know what? I'm even more confused than before I started studying the code. At least I could present results from another perl version, as you wished.

    Regards... Stefan
    you begin bashing the string with a +42 regexp of confusion

      OK, to your first problem

      It [@counts] is first used outside the for-loop
      this is inside a single quoted string. It is the same as if I put it directly into the regex in the loop or declare it inside the loop. The reason I'm doing it this way is that I want to change this pattern inside the loop for my third testcase.

      Name "main::counts" used only once: possible typo at ./re-code.pl line 20.
      0: 12; @counts = (0)
      1: 34; @counts = (0)
      2: 56; @counts = (0)
      @main::counts = (6)
      This is very interesting. As I interpret it, the code block inside the regex uses the global variable @main::counts which is otherwise blocked from view inside the loop by the my @counts declaration. A simple equivalent example
      {
          my $num = 0;
          $main::num = 5;    # this instead of the regex
          print $num;        # prints 0
      }
      print $num;            # prints 5
      # or under use strict
      print $main::num;      # prints 5 as well
      Makes perfect sense. However with 5.6.1 you seem to be able to use lexical variables from the enclosing scope, but this is where the bug comes in. It works the first time but doesn't work the next times.

      btw, the warning can be ignored in this case

      -- Hofmator

Re: Perl Bug in Regex Code Block?
by demerphq (Chancellor) on Sep 03, 2001 at 17:38 UTC
    Ok, well I am running 5.6.0 AS 623 and I get different output for your code:
    0: 12; @counts = (0)
    1: 34; @counts = (0)
    2: 56; @counts = (0)
    @main::counts = (6)
    Which says to me that perl is using the dynamic variable inside the regex eval. (Incidentally the docs do say that this is an experimental feature and may not work appropriately. Also they mention localization so I suspect this is maybe intended.) Also if you change the my to a local it produces the desired results.
    Just ran it on AS 628 and it produces the results you said it did, although it worked as expected under our and local. My money says this is a bug.
    But this all gets weirder.
    When I change the code (barely, and I only did this under 623) to
    #!/usr/bin/perl
    use strict;
    use warnings;
    use re qw/eval /;

    my $line = 'ab';
    my $pattern = q/(.)(?{print ++$counts[0]})^/;
    for (0..2) {
        my @counts = (0);
        print "$_: ";
        $line =~ /$pattern/;
        print "; my \@counts = (@counts)\n";
    }
    {
        no strict;
        no warnings;
        print "our \@counts = (".join(",",@counts).")\n";
        #print "our \@counts = (@counts)\n";
    }
    I get the same result again. Now uncomment the last print line and run it again. I get a
    In string, @counts now must be written as \@counts at .\counts.pl line 22, near "our \@counts = (@counts"
    Execution of .\counts.pl aborted due to compilation errors.
    Which to me doesn't make any sense at all. It should die in both cases because the dynamic @counts is not declared, or in neither, but not like this.

    And I have another point of weirdness to note in the regex you are using you have placed a '^' caret at the END of the regex, which for some reason makes your print statement fire twice. If I remove the ^ it prints once. Either way I don't see what is going on here at all....

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      Some answers to your questions

      • Concerning the error message in connection with the last print statement ... I cannot reproduce that, it works fine both ways (commented and uncommented). With strict and warnings I get
        Possible unintended interpolation of @counts in string at bug line 20.
        Global symbol "@counts" requires explicit package name at bug line 19.
        Global symbol "@counts" requires explicit package name at bug line 20.
        and it dies as expected. Adding the explicit @main:: solves the problem altogether.
      • And I have another point of weirdness to note in the regex you are using you have placed a '^' caret at the END of the regex
        This is weirdness, you are right, and it is actually not necessary for the thing in question here. It is a left-over from the code where I originally encountered the problem. But I can explain the behaviour ... consider this simpler regex "ab" =~ /.^/; it matches any character and after that the beginning of the line, so it can never match! Nevertheless the regex tries to match. First the a, then it sees that that doesn't work out and so tries the b, after which it fails. If we now sneak in a code block like this "ab" =~ /.(?{print 'hello!'})^/; the regex passes this block twice! And you can do very nice things with that (see e.g. my twiddle code) ... the original code came from a nonogram solver which I will post here in a couple of weeks (I have to find time to clean up the code a bit :)
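
One way to check the "block is passed twice" claim on your own perl is to count the passes rather than print from the block. This is a sketch under assumptions: block_runs and $hits are names invented here, and a package variable is used so that the closure issue discussed elsewhere in this thread cannot interfere.

```perl
use strict;
use warnings;

our $hits;   # package variable, so the code block's binding is unambiguous

# block_runs is a hypothetical helper: returns (matched?, times the block ran).
sub block_runs {
    my ($text) = @_;
    local $hits = 0;
    # "." consumes a character, the block counts the pass, then "^" fails.
    my $matched = $text =~ /.(?{ $hits++ })^/ ? 1 : 0;
    return ($matched, $hits);
}

my ($matched, $runs) = block_runs('ab');
print "matched=$matched runs=$runs\n";
```

The match itself fails, as explained above; the thread reports two passes through the block for 'ab', one for each starting position where the dot can match.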

      Update: I forgot to mention use re 'debug'. It is always helpful when you don't understand a pattern match.

      -- Hofmator

Re: Perl Bug in Regex Code Block?
by MZSanford (Curate) on Sep 03, 2001 at 16:38 UTC
    I may be confused, but I think the problem is with the:
    my @counts = (0);
    ... specifically, the my. When you use my, you are creating a variable which will disappear when it goes out of scope. Since the for (0..2) {} loop is the current scope, when it completes, @counts is destroyed. This is fixed by not using a my inside of the for loop (as you have seen), and is a very perlish thing.
    can't sleep clowns will eat me
    -- MZSanford

      When you use my, you are creating a variable which will disappear when it goes out of scope.
      I'm well aware of that ... but the regex is taking place inside this scope and so the lexical variables should be accessible inside the regex. This works the first time as expected but it doesn't work on the second and third iteration of the loop.

      Maybe you have misunderstood my question, I'm not confused that the last line of my code doesn't print anything. It was only included for the (working) run with our instead of my. I want to know, why it's changing its behaviour inside the loop.

      I hope this clarifies my problem ...

      -- Hofmator
