Regex capture consumed by non-capturing match

by ribasushi (Monk)
on Jul 19, 2007 at 20:36 UTC (#627614=perlquestion)
ribasushi has asked for the wisdom of the Perl Monks concerning the following question:

Hello honorable Monks, today I uncovered a very subtle bug in one of my programs, and although I fixed it, I have no idea what is actually going on. Here is an example program that demonstrates what happens:

    #!/usr/bin/perl

    my $test = 'abc , def';

    for my $sub (qw/trim_start trim_end trim_start trim_end/) {
        $test =~ /([\s\w]+),([\s\w]+)/;
        $sub->($1, $2);
    }

    sub trim_start {
        while (@_) {
            my $string = shift @_;
            $string =~ s/\A\s+//ms;
            print "$string\n";
        }
    }

    sub trim_end {
        while (@_) {
            my $string = shift @_;
            $string =~ s/\s+\z//ms;
            print "$string\n";
        }
    }

What is the difference between the two regexes? Why does only trim_end() destroy $2? I solved my problem by doing:
my @args = @_; while (@args)...
but I still would like to know what causes this.
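A minimal sketch of that workaround, assuming the function returns its results instead of printing them. The name trim_end_copy is hypothetical and only used for illustration:

```perl
use strict;
use warnings;

# Hypothetical variant of trim_end that snapshots its arguments
# before any regex runs inside the function.
sub trim_end_copy {
    my @args = @_;              # copies the values, so $1/$2 are read right away
    my @out;
    while (@args) {
        my $string = shift @args;
        $string =~ s/\s+\z//;   # a successful match here no longer affects @args
        push @out, $string;
    }
    return @out;
}

my $test = 'abc , def';
$test =~ /([\s\w]+),([\s\w]+)/ or die "no match";

my @trimmed = trim_end_copy($1, $2);
print join('|', @trimmed), "\n";
```

The copy is taken before the first substitution, so both captured values are already stored in plain lexicals by the time a match succeeds.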

Thank you in advance.

Replies are listed 'Best First'.
Re: Regex capture consumed by non-capturing match
by ikegami (Pope) on Jul 19, 2007 at 20:44 UTC

    The difference between trim_start and trim_end is that the first pass of the loop in trim_start never matches while it always matches in trim_end. Change your input to ' abc ,def' and you'll see the same problem in both trim_start and trim_end.

    On a successful match, $1 and $2 are cleared and so are $_[0] and $_[1] (since they are aliased to $1 and $2). On an unsuccessful match, $1 and $2 are left untouched.
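The aliasing described above can be observed with a small sketch. The helper second_arg_survives is a hypothetical name; it runs the same substitution as trim_end on a copy of its first argument and then checks whether $_[1] is still defined:

```perl
use strict;
use warnings;

# Hypothetical helper: trims a copy of the first argument, then reports
# whether the second argument (still an alias to $2) is defined.
sub second_arg_survives {
    my $first = $_[0];          # copy of the first argument only
    $first =~ s/\s+\z//;        # succeeds only if there is trailing whitespace
    return defined $_[1] ? 1 : 0;
}

my @results;
for my $input ('abc , def', 'abc, def') {
    $input =~ /([\s\w]+),([\s\w]+)/ or die "no match";
    push @results, second_arg_survives($1, $2);
}

# First input: $1 is 'abc ' so the substitution matches and $2 is lost.
# Second input: $1 is 'abc' so the substitution fails and $2 survives.
print "@results\n";
```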

    Passing a global as an argument is bad, especially when that global is changed by the function to which it is being passed. The best solution is to pass a copy of the global to the function. This can be done by simply changing the call to

    $sub->("$1", "$2");
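As a rough illustration of why the quoted form helps, the sketch below keeps the function working directly on @_ and changes only the call site. The name trim_end_list is hypothetical, and it returns its results rather than printing them:

```perl
use strict;
use warnings;

# Same shape as the original trim_end: it shifts values straight off @_.
sub trim_end_list {
    my @out;
    while (@_) {
        my $string = shift @_;
        $string =~ s/\s+\z//;
        push @out, $string;
    }
    return @out;
}

my $test = 'abc , def';
$test =~ /([\s\w]+),([\s\w]+)/ or die "no match";

# "$1" and "$2" are new strings, so @_ holds copies rather than aliases.
my @trimmed = trim_end_list("$1", "$2");
print join('|', @trimmed), "\n";
```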

    Update: Added explanation.

      I feel dumb. Thank you for pointing this out.
      As far as your suggestion goes, I prefer to fix the function to make a copy of @_ instead of expecting the user of the function to remember this subtle behavior.

      Thanks again!

        You could fix it in both places. I would definitely consider passing a global a bug.

        May I also suggest an alternate implementation?

        #!/usr/bin/perl

        sub trim_start {
            for (my $s = @_ ? $_[0] : $_) {
                s/\A\s+//m;
                return $_;
            }
        }

        sub trim_end {
            for (my $s = @_ ? $_[0] : $_) {
                s/\s+\z//m;
                return $_;
            }
        }

        {
            my $test = 'abc , def';
            $test =~ /([\s\w]+),([\s\w]+)/;
            my @words = map trim_start, map trim_end, "$1", "$2";
        }

        The advantage of that implementation is that it's flexible as to how it's called.

        • It can be used to trim a single value:

          print(trim_start(trim_end($var)), "\n");
        • It can be used to trim a list of values:

          my @trimed = map trim_start, map trim_end, @untrimmed;
Re: Regex capture consumed by non-capturing match
by GrandFather (Sage) on Jul 19, 2007 at 21:40 UTC

    We tell the children and we tell the children: use strict; use warnings;. In this case turning on warnings generates a whole bunch of

    Use of uninitialized value in concatenation (.) or string at noname1.pl line 23.

    warnings. A clue perhaps? Indeed, an unsubtle clue! A hit you about the ears and do something clue. First thing to notice is that all the warnings are in trim_end, so remove trim_start and try again - same result. Ok, remove some cruft and the code looks like:

    use strict;
    use warnings;

    my $test = 'abc , def';
    $test =~ /([\s\w]+),([\s\w]+)/;

    for ($1, $2) {
        my $string = $_;
        $string =~ s/\s+\z//ms;
        print "$string\n";
    }

    and generates (omitting the warnings):


    Now look at what the code does. It performs a regular expression match, setting $1 and $2. It then (in the loop) performs two more matches, where the first succeeds and the second sees an undefined value. At this point it is worth noting from perlre that:

    The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first.

    but the first match in the loop is successful so $2 goes out of scope - becomes undefined. Capeesh?
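Both halves of that perlre sentence can be seen in a short sketch. The variable names ($other, @seen) are illustrative only; the pattern and test string are the ones from the question:

```perl
use strict;
use warnings;

my $test = 'abc , def';
$test =~ /([\s\w]+),([\s\w]+)/ or die "no match";

my $other = 'xyz';
my @seen;
{
    # A failed match leaves the captures of the earlier match alone.
    $other =~ /(\d+)/;
    push @seen, defined $2 ? 'defined' : 'undef';

    # A successful match replaces them; it has no second group, so $2 is undef.
    $other =~ /y/;
    push @seen, defined $2 ? 'defined' : 'undef';
}
# Leaving the enclosing block brings back the captures of the outer match.
push @seen, defined $2 ? 'defined' : 'undef';

print "@seen\n";
```

Note that the successful match does not need capturing groups of its own to reset $2.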

    Remember: use those strictures.

    DWIM is Perl's answer to Gödel
      I always prefer the form:

      my ($foo, $bar) = ($test =~ /(capt-pat-1)...(capt-pat-2)/);

      Doesn't get clobbered and is quite readable I think.


        Yes, except that it is quite hard to write an if that checks if the match succeeded at all. I omitted this in the example, but in the real code this was the case:
        if ($string =~ /re/) {
            my ($a, $b, $c) = trim_func($1, $2, $3);
        }
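For what it's worth, the list-assignment form can also serve as the condition of an if, because a list assignment in boolean context yields the number of elements on its right-hand side, which is zero when the match fails. A minimal sketch, where parse_pair is a hypothetical name and the pattern is the one from the original question:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the two trimmed halves, or an empty list
# when the pattern does not match.
sub parse_pair {
    my ($string) = @_;
    if (my ($left, $right) = $string =~ /([\s\w]+),([\s\w]+)/) {
        # $left and $right are plain lexicals, so later matches cannot clear them.
        s/\s+\z// for $left, $right;
        return ($left, $right);
    }
    return;
}

my @ok  = parse_pair('abc , def');
my @bad = parse_pair('no comma here');

print scalar(@ok), " values on success, ", scalar(@bad), " on failure\n";
```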
      The children have been using strictures since day one. They were omitted from the example to keep it as small as possible. And besides they do not yield any information except what I already know - matches get undefined. So Teacher, leave those kids alone, thank you very much.
      As for the fine perlre, I misread it as "until the next successful capturing match", which, as was pointed out, does not seem to be the case.

        We are pleased that the children are taking note of their lessons. ;)

        I assumed that you were not using strictures because you didn't mention the warning and "uninitialized value" warnings are not subtle. I see now that you meant to imply that the mechanism of the bug was subtle, rather than that the effects were subtle or the location of the faulty code was difficult to determine. My apologies for the misunderstanding.

        Note that we only know what you tell us. We didn't know from your node what the problem was - you didn't even present the output generated, nor the output you expected, let alone that you knew that the issue was an uninitialized value.

        DWIM is Perl's answer to Gödel
Re: Regex capture consumed by non-capturing match
by Fletch (Chancellor) on Jul 19, 2007 at 20:48 UTC

    Just a wild guess, but I'd say there's some weird interaction going on between the aliased copies of $1 and $2 that are being passed as arguments and the subsequent s/// operators clobbering the existing contents thereof.