PerlMonks  

Empty pattern in regex

by choroba (Cardinal)
on Oct 18, 2023 at 20:20 UTC ( [id://11155031]=perlquestion )

choroba has asked for the wisdom of the Perl Monks concerning the following question:

Dear brothers and sisters in Perl.

The perlop says:

If the pattern evaluates to the empty string, the last successfully executed regular expression is used instead.

This seems to be true in a simple example:

$ perl -wle '"a" =~ /a/; // and print for qw( a b a b a b )'
a
a
a

The first match is successful, and all the a's that match the same regex are printed.

It behaves correctly when combined with the flip flop operator, too, making it possible to treat the boundary lines specially:

$ perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { print unless // }'
e
f
g

It doesn't work as expected when combined with next, though:

$ perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless //; print }'
d
f
g
h

Why is "f" printed?

When combined with -Mre=debug, it shows

...
Matching REx "d" against "d"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave 0
  Found anchored substr "d" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Matching REx "h" against "d"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave -1
  Did not find anchored substr "h"...
Match rejected by optimizer
Matching REx "d" against "d"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave 0
  Found anchored substr "d" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Matching REx "h" against "e"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave -1
  Did not find anchored substr "h"...
Match rejected by optimizer
Matching REx "d" against "e"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave -1
  Did not find anchored substr "d"...
Match rejected by optimizer
Matching REx "h" against "f"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..1] gave -1
  Did not find anchored substr "h"...
Match rejected by optimizer
Matching REx "" against "f"                                  (*)
   0 <> <f>                  |   0| 1:NOTHING(2)
   0 <> <f>                  |   0| 2:END(0)
Match successful!
...

Why is there the empty regex (see (*))? Is the magic of // somehow broken by next? Is this the expected behaviour and is it documented anywhere?

Update: When combined with continue, the output changes.

$ perl -le 'print for a .. z' | perl -ne 'if (/d/ .. /h/) { next unless //; print }} continue {'
d
e
f
g
h

Note that the continue part is empty!

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Replies are listed 'Best First'.
Re: Empty pattern in regex
by hv (Prior) on Oct 19, 2023 at 00:40 UTC

    Why is "f" printed?

    I would have expected the question 'why are "f" and "g" printed'. Do you agree that printing "g" is also surprising, for the same reason? (If not, I may be misunderstanding random parts of your post.)

    Why is there the empty regex (see (*))?

    I don't know, seems very odd to me. I suggest reporting it as a possible bug.

    It seems possible that since the last successfully matched regexp was /d/, and the last attempted match against that regexp was a fail, it may have somehow marked it as no longer successfully matched; but that doesn't explain the change of behaviour when you add the empty continue block.

    I suspect rather that it is a scoping bug: I'm not sure if the docs make this clear, but it is intended to use the last successfully matched regexp visible to the current scope. Thus:

    % perl -wle '"a" =~ /a/; { "b" =~ /b/ } "ab" =~ // and print $&'
    a
    %

    FWIW p5p mostly regards the empty regexp behaviour as a misfeature reluctantly spared the axe only because of the constraints of backward compatibility - it is very rare to see anyone actually trying to make use of it. But since we have it, it certainly ought to work as advertised.

          > but it is intended to use the last successfully matched regexp

      Whatever its intent, it is one confusing puppy. //; always matches, always returns TRUE, but it never changes $&. $& is always whatever the previous regex set it to, whether it matched or not, effectively a NOP. Consider this:

      $_ = 'Hello Perl';
      say '$_ = \'Hello Perl\';';
      print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

      # unsuccessful match
      /Python/;
      print "No match, \/Python\/\n";
      print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

      # successful match
      if (//) {
          print "No nothing, if (\/\/) {\n";
          print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
      } else {
          print "\/\/ unsuccessfull match";
      }

      # successful match, no captures
      /Perl/;
      print "Match \/Perl\/, No captures\n";
      print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

      # successful match, empty pattern
      if (//) {
          print "No nothing, if (\/\/) {\n";
          print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
      } else {
          print "\/\/ unsuccessfull match";
      }

      # successful match, no captures
      /Perl/;
      print "Match \/Perl\/, No captures\n";
      print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

      # successful match, no pattern, empty parens
      if (/()/) { #//;
          print "No nothing, if \(\/\(\)\/\) {\n";
          print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";
      } else {
          print "\/\/ unsuccessfull match";
      }

      Which results in:

        $_ = 'Hello Perl';
        $1: 
        $2: 
        $3: 
        $&: 
      
        No match, /Python/
        $1:  
        $2: 
        $3: 
        $&: 
      
        No nothing, if (//) {
        $1: 
        $2: 
        $3: 
        $&: 
      
        Match /Perl/, No captures
        $1: 
        $2: 
        $3: 
        $&: Perl
      
        No nothing, if (//) {
        $1: 
        $2: 
        $3: 
        $&: Perl
      
        Match /Perl/, No captures
        $1: 
        $2: 
        $3: 
        $&: Perl
        
        No nothing, if (/()/) {
        $1: 
        $2: 
        $3: 
        $&: 
      

      What is or was the purpose of this construction? How would one use it?

        > What is or was the purpose of this construction? How would one use it?

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        $_ = 'abacad';
        say "/a(.)/";
        if (/a(.)/g) {
            say "\$1: $1";
            say "\$&: $&";
        } else {
            say 'No match';
        }

        for my $try (1 .. 3) {
            say "//";
            if (//g) {
                say "\$1: $1";
                say "\$&: $&";
            } else {
                say 'No match';
            }
        }

        Output:

        /a(.)/
        $1: b
        $&: ab
        //
        $1: c
        $&: ac
        //
        $1: d
        $&: ad
        //
        No match

        Update: If I remember correctly, this was the original reason the feature was introduced:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        my $x = 'found 11';
        my $y = 'found 12';
        if ($x =~ /found (\d+)/ && $y =~ //) {  # No need to repeat the long regex! Yay!
            say "Found $1.";
        }
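For comparison, here is a minimal sketch (not part of the original post) of one way to avoid repeating a long regex without relying on the empty pattern: compile it once with qr// and reuse the compiled object. The names $found_re and @numbers are made up for this illustration.

```perl
use strict;
use warnings;

# Compile the pattern once; the variable name is illustrative only.
my $found_re = qr/found (\d+)/;

my @numbers;
for my $string ('found 11', 'found 12', 'nothing here') {
    # Read $1 only when the match on this string succeeded.
    push @numbers, $1 if $string =~ $found_re;
}

print "Found @numbers.\n";
```

Unlike //, the compiled regex does not depend on which match happened to succeed last, so an unrelated match in between cannot change what is being tested.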

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        // always matches, always returns TRUE, but it never changes $&.

        That is not correct, in either aspect:

        % perl -wle '"a" =~ /a/; "b" =~ // or print "did not match, did not return TRUE"'
        did not match, did not return TRUE
        % perl -wle '"a" =~ /.*/; q{$& changed} =~ // and print $&'
        $& changed
        %
Re: Empty pattern in regex
by jo37 (Deacon) on Oct 19, 2023 at 07:07 UTC

    Maybe it's a bug, maybe it's an obscure feature. Anyway, this is a fragile construct for border-checking of the flip-flop operator as it will be broken by a regex match within the if-block.

    The required information for such a check is provided by the flip-flop operator itself: it returns the current "loop number", with "E0" appended to the final loop call.

    Here is a more robust version:

    perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'
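To make the return value described above visible, the following small sketch (an illustration, not taken from the thread) records what the flip-flop yields for each item. The array name @seen is arbitrary.

```perl
use strict;
use warnings;

my @seen;
for ('a' .. 'j') {
    # In scalar context ".." is the flip-flop: it returns an empty string
    # outside the range and a sequence number inside it.
    my $ff = /d/ .. /h/;
    push @seen, "$_=$ff" if $ff;
}

# The last item of the range carries the "E0" suffix.
print "@seen\n";    # d=1 e=2 f=3 g=4 h=5E0
```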

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Empty pattern in regex [updated]
by jo37 (Deacon) on Oct 25, 2023 at 20:11 UTC

    I think it's a bug. It has nothing to do with the flip-flop operator and it seems to be caused by jumping out of a block. Consider this example that emulates a flip flop and uses goto instead of next.

    #!/usr/bin/perl
    use v5.24;
    use warnings;

    my ($first, $last);
    while (<DATA>) {
        chomp;
        $first ||= /d/;
        undef($first) if $last ||= /h/;
        if ($first || $last) {
            undef $last;
            #goto ewhile unless //;
            goto eif unless //;
            say;
            eif:
        }
        ewhile:
    }

    __DATA__
    c
    d
    e
    f
    g
    h
    i

    goto eif:
    d
    h

    goto ewhile:
    d
    f
    g
    h

    Jumping to the end of the current block produces the expected result, while jumping to the end of the while loop reproduces choroba's strange results. The jump out of the block seems to clear the "last successful match" causing // to be taken as an always matching empty pattern.

    However, I'd prefer to check the flip-flop's return value as this works in all circumstances, even for if(foo($_) .. bar($_)) {...}.
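As a rough sketch of that approach with arbitrary predicates instead of regexes: starts_range and ends_range below are hypothetical stand-ins for foo and bar, and the boundary test is the same one applied to the flip-flop's return value elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical predicates standing in for foo() and bar().
sub starts_range { return $_[0] eq 'd' }
sub ends_range   { return $_[0] eq 'h' }

my @borders;
for my $item ('a' .. 'j') {
    # The scalar assignment keeps ".." in scalar context, so it acts as a flip-flop.
    if (my $ff = (starts_range($item) .. ends_range($item))) {
        # First item: sequence number 1. Last item: value ends in "E0".
        push @borders, $item if $ff == 1 || $ff =~ /E0$/;
    }
}

print "@borders\n";    # d h
```

No regex match is needed to identify the boundaries here, so the result does not depend on the state of the last successful match.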

    Update: 26.10.2023

    Here is a much simpler example demonstrating the behaviour without any flip-flop behaviour. A jump out of a block transforms the empty pattern // from the last successful matching pattern to a true empty pattern.

    #!/usr/bin/perl
    use v5.24;
    use warnings;

    for my $label ('inner', 'outer') {
        say "goto $label";
        for ('c' .. 'g') {
            say "loop: $_";
            say "/d/ matched" if /d/;
            {
                goto $label unless //;
                say "// matched";
                inner:
            }
            outer:
        }
        say '';
    }

    __DATA__
    goto inner
    loop: c
    // matched
    loop: d
    /d/ matched
    // matched
    loop: e
    loop: f
    loop: g

    goto outer
    loop: c
    // matched
    loop: d
    /d/ matched
    // matched
    loop: e
    loop: f
    // matched
    loop: g
    // matched

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      Hello jo37,

      Do you still think it's a bug even though we seem to be able to get what we want with some Perl trickery? You duplicated the issue choroba raised, but we did get the answer he expected.

      // has been around at least since 5.6: it is discussed in Programming Perl, 3rd edition, as well as in the 4th (5.14), unfortunately without examples. The one in perlop is clear but, to me, not suggestive of a real use case. I'm still looking for a definitive use case, or at least a realistic one.

      I've come up with two. The first might be used by a grammarian or linguist researching comparative languages. The second extracts the string between HTML tags, although I also show how to do this with a much simpler plain-old regex. Methinks it's a stretch to use // when there are other ways to do a thing, but TIMTOWTDI. One of my examples parses a string while the other uses an array; if (/this/../that/) {... almost demands an array.

      I would really like to hear a war story or two about how // was used to solve some really gnarly problem. Here be my two examples:

      #!/usr/bin/env -S perl -w
      ##!/usr/bin/env -S perl -wd

      use v5.30.0;
      use strict;
      use List::AllUtils qw( reduce );

      my ($slurpee, $length, $sum);
      {
          local $/;
          ($slurpee) = <DATA>;
      }
      $length = length $slurpee;

      my @regexes = (
          [ qr/[A-Z]/, "uppercase characters",  0 ],
          [ qr/[a-z]/, "lowercase characters",  0 ],
          [ qr/\d/,    "digits",                0 ],
          [ qr/\s/,    "whitespace characters", 0 ],
          #
          # Note: $ must be \$, and - must be first to avoid range interpretation.
          #
          [ qr/[-~`!@#\$%^&*()_+={}\[\]|\\:;"'<>,.?\/]/, "punctuation characters", 0 ],
      );

      #for my $c (split //, $slurpee) { print $c; }

      for my $case (@regexes) {
          say "seeding // with: $case->[0]";
          "Aa5: " =~ $case->[0];    # seed the // iteration
          say "matched: '$&'" if $&;
          for (split //, $slurpee) {
              // and $case->[2]++;
          }
      }

      for my $case (@regexes) { printf("%4d %s\n", $case->[2], $case->[1]); }

      $sum = reduce { $a + $b } (map $_->[2], @regexes);
      printf(" sum and length: %3d and %3d\n", $sum, $length);

      say "\nNow extract the string between HTML tags with //...";

      my $str = "Before tag<i>between tags</i>after tag";
      say "\n$str";
      $str =~ s{ (?: (?<= \w) (?= <) | (?<= >) (?= \w) ) }{ }xg;    # insert whitespace
      say $str;

      my @tokens = split / /, $str;
      say "Tokens...\n";
      for (@tokens) { say };

      my $between;
      for (@tokens) {
          if (/<\w>/../<\/\w>/) {
              $between .= "$_ " unless // and $&;
          }
      }
      chop $between if $between;
      say "'$between'";

      $str = "\n'Before tag<i>between tags</i>after tag'";
      say $str;
      say "Parse it again with...";
      my $regex = qr/ (<\w+>) (.*) (<\/\w+>) /x;
      say $regex;
      $str =~ $regex;
      say "\$1: '$1'";
      say "\$2: '$2'";
      say "\$3: '$3'";

      exit(0);

      __END__
      Last night I dreamt I went to Manderley again. This will come as a surprise to Daphne since she
      did not write these lines. Here is a line containing stuff ,?- ! : that should/must be
      deleted/// ; : ! before using it as a one-time-pad.
      A one-time-pad should contain only characters, no punctuation, no parentheticals like
      (this is bogus) or [(this is bogus, too)], or {also this}; no contractions, such as I'll or it's
      or digits such as 0, 123, -75 or 8 P.M., and no numbers, such as $1,234.69. If you want to use
      numbers in your message, spell them out; one-hundred dollars and sixty-nine cents, or theeepm.
      These non-alpha characters in the one-time-pad will be discarded, but they must be entered
      eactly as represented in the book used as the pad. Let the encoding program decide what to use
      and what to skip.
      Some of the text is from "Rebecca", an out-of copyright but not out-of-print fictional work that
      can be freely downloaded as an eBook from Project Gutenberg. I use it as the raw source for
      one-time pads in a cryptologic research study; i.e., extract potential pad bits from somewhere
      in the text, randomly chosen with seek from EOF. Munge the characters, encrypt the message and
      delete the characters used for the pad. Since both encoder and decoder use the same seek
      expression, both pads are guaranteed to be identical, and since the characters used to create
      the pad are deleted, never to be seen again, the pad is guaranteed to be used exactly once.
      Does not scale for large organizations but works flawlessly for a small group of conspirators.

      O U T P U T
        seeding // with: (?^u:[A-Z])
        matched: 'A'
        seeding // with: (?^u:[a-z])
        matched: 'a'
        seeding // with: (?^u:\d)
        matched: '5'
        seeding // with: (?^u:\s)
        matched: ' '
        seeding // with: (?^u:[-~`!@#\$%^&*()_+={}\\|\\:;"'<>,.?/])
        matched: ':'
          26 uppercase characters
        1168 lowercase characters
          13 digits
         283 whitespace characters
          80 punctuation characters
         sum and length: 1570 and 1570
      
        Now extract the string between HTML tags with //...
      
        Before tag<i>between tags</i>after tag
        Before tag <i> between tags </i> after tag
      
        Tokens...
      
        Before
        tag
        <i>
        between
        tags
        </i>
        after
        tag
        'between tags'
      
        'Before tag<i>between tags</i>after tag'
        Parse it again with...
        (?^ux: (<\w+>) (.*) (</\w+>) )
        $1: '<i>'
        $2: 'between tags'
        $3: '</i>'
      

        Hello perlboy_emeritus,

        to be more explicit in this issue, I do not only think it's a bug, I am absolutely convinced it is. Some remarks:

        • Having a workaround for a bug does in no way mean it is not a bug.
        • Using $& in this scenario is dangerous, as it is affected by the very same bug. See extended example below.
        • I cannot find anything in your code that would trigger the bug. This is fine and TIMTOWTDI.
        • perlop is very precise in The empty pattern "//":
          If the *PATTERN* evaluates to the empty string, the last *successfully* matched regular expression is used instead. (...) If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match). (...)
          As you can see from my example, // does not behave as described if there was a successful match and there happens a jump out of an inner block where // was applied. This clears $& and resets // to the genuine empty pattern.

        #!/usr/bin/perl
        use v5.24;
        use warnings;

        for my $label ('inner', 'outer') {
            say "goto $label";
            for ('c' .. 'g') {
                say "loop: $_";
                say "/d/ matched" if /d/;
                say "\$&: '$&'" if defined $&;
                {
                    goto $label unless //;
                    say "// matched";
                    inner:
                }
                outer:
            }
            say '';
        }

        __DATA__
        goto inner
        loop: c
        // matched
        loop: d
        /d/ matched
        $&: 'd'
        // matched
        loop: e
        $&: 'd'
        loop: f
        $&: 'd'
        loop: g
        $&: 'd'

        goto outer
        loop: c
        // matched
        loop: d
        /d/ matched
        $&: 'd'
        // matched
        loop: e
        $&: 'd'
        loop: f
        // matched
        loop: g
        // matched

        Greetings,
        -jo

        $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Empty pattern in regex
by jo37 (Deacon) on Oct 30, 2023 at 20:08 UTC

    Filed a bugreport.

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Empty pattern in regex
by perlboy_emeritus (Scribe) on Oct 19, 2023 at 00:15 UTC

    Per that perlop discussion I wrapped // in quotes with 'm' and tried:

    perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless "m//"; print }'

    and got:

      % perl -le 'print for a .. z' | perl  -nle 'if (/d/ .. /h/) { next unless "m//"; print }'
      d
      e
      f
      g
      h
    

    And then I pedantically did:

    for my $c ( 'a'..'z') { next unless ($c =~ /[d-h]/); say $& if $&; }

    and got:

      d
      e
      f
      g
      h
    

    I guess I don't understand. Isn't 'd e f g h' what is expected? I've never really trusted one-liners. Brian Foy wrote an interesting piece on SO, to wit:

    https://stackoverflow.com/questions/22652393/regex-1-variable-reset

    except his example using //; did not work for me. He expected all vars to be cleared but when I ran his code:

    # The regex capture variables are only reset on the next successful match.
    # This way, Perl saves a lot of time by not affecting variables when matches
    # fail. As such, only use those variables with a guard, to wit:
    # if ( /abc/ ) {   # this tests for /abc/ success and now it's OK to use $&
    #     ...
    # }
    # Here's an extended demonstration, with a special surprise at the end:

    say "First long example...\n";
    $_ = 'Hello Perl';
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match
    /(P)(erl)/;
    print "First match\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # unsuccessful match
    /(P)(ython)/;
    print "Failed capture\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match again
    /(Pe)(r)(l)/;
    print "Three captures\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, fewer captures
    /(Perl)/;
    print "One capture\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, no captures
    /Perl/;
    print "No captures\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    # successful match, no pattern, special case
    //;
    print "No nothing\n";
    print "\$1: $1\n\$2: $2\n\$3: $3\n\$&: $&\n\n";

    I got:

      $1: 
      $2: 
      $3: 
      $&: 
    
      First match
      $1: P
      $2: erl
      $3: 
      $&: Perl
    
      Failed capture
      $1: P
      $2: erl
      $3: 
      $&: Perl
    
      Three captures
      $1: Pe
      $2: r
      $3: l
      $&: Perl
    
      One capture
      $1: Perl
      $2: 
      $3: 
      $&: Perl
    
      No captures
      $1: 
      $2: 
      $3: 
      $&: Perl
    
      No nothing
      $1: 
      $2: 
      $3: 
      $&: Perl
    

    As you can see in 'No nothing' $& was not cleared for me as it was for him, as he reported in that piece. I don't trust using $n, $`, $& or $' unless I explicitly test for TRUE after the regex executes. Am I being overly paranoid?
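A minimal sketch of the guarded style described above (an illustration, not code from the thread; the sample strings and the @report array are made up): the match variables are read only inside the branch where the match is known to have succeeded.

```perl
use strict;
use warnings;

my @report;
for my $text ('Hello Perl', 'Hello Python') {
    if ($text =~ /(P)(erl)/) {
        # Safe: the match just succeeded, so $1, $2 and $& belong to it.
        push @report, "$text => $1|$2|$&";
    }
    else {
        # Do not read $1 or $& here; they do not describe this string.
        push @report, "$text => no match";
    }
}

print "$_\n" for @report;
```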

      > Per that perlop discussion I wrapped // in quotes with m and tried:

      Which discussion? unless "m//" is the same as unless "1", it's just a string.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        I tried "m//" as reported on your 'next' expression and got 'd e f g h', as expected. From my perlop on 5.36.

            The empty pattern "//"
                    If the *PATTERN* evaluates to the empty string, the last
                    *successfully* matched regular expression is used instead. In
                    this case, only the "g" and "c" flags on the empty pattern are
                    honored; the other flags are taken from the original pattern. If
                    no match has previously succeeded, this will (silently) act
                    instead as a genuine empty pattern (which will always match).
        
                    Note that it's possible to confuse Perl into thinking "//" (the
                    empty regex) is really "//" (the defined-or operator). Perl is
                    usually pretty good about this, but some pathological cases
                    might trigger this, such as "$x///" (is that "($x) / (//)" or
                    "$x // /"?) and "print $fh //" ("print $fh(//" or
                    "print($fh //"?). In all of these examples, Perl will assume you
                    meant defined-or. If you meant the empty regex, just use
                    parentheses or spaces to disambiguate, or even prefix the empty
                    regex with an "m" (so "//" becomes "m//").
        
Re: Empty pattern in regex
by perlboy_emeritus (Scribe) on Oct 23, 2023 at 17:33 UTC

    Hello choroba,

    I don't like to give up without exhausting all avenues of research, and for me Perl is enjoyment and therapy (needed in this world we live in, and this age). This issue may now have dropped off the radars of the other participants, but not mine. jo37 came up with:

    perl -le 'print for a .. h' | perl -nle 'if (my $ff = (/d/ .. /h/)) { next unless $ff =~ /(?:^1|E0)$/; print }'
    d
    h

    which troubles me because of that alternation and the absence of //, which I think is/was your point. Mine are, granted, after debugging with strategic print statements:

    perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless // and $_ eq $&; print; }'
    d
    h

    or

    perl -le 'print for a .. z' | perl -nle 'if (/d/ .. /h/) { next unless $_ eq $& and //; print; }'
    d
    h

    and is a short-circuit operator so it works either way. Does this do what you expected it to do?

    Regards, Will

      It was not my intention to cause any troubles. $ff =~ /(?:^1|E0)$/ can be rewritten as $ff == 1 || $ff =~ /E0$/. When the second operand of the flip-flop operator becomes true, the return value gets an E0 appended. This does not change the value in numeric context as it is just one of its floating point representations. In string context it is distinguishable from all the other values, though.
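The numeric-versus-string distinction can be checked in isolation. In this small sketch the value '5E0' is hard-coded as an example of what the flip-flop returns on the fifth and final line of a range; the variable names are illustrative.

```perl
use strict;
use warnings;

my $final = '5E0';    # example flip-flop value on the last line of a range

my $numeric_equal = ($final == 5)     ? 1 : 0;   # "5E0" means 5 * 10**0
my $string_equal  = ($final eq '5')   ? 1 : 0;   # the strings differ
my $marks_end     = ($final =~ /E0$/) ? 1 : 0;   # suffix identifies the end

printf "numeric: %d, string: %d, end marker: %d\n",
    $numeric_equal, $string_equal, $marks_end;
# numeric: 1, string: 0, end marker: 1
```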

      HTH

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
      The $_ eq $& is an interesting idea. Note that it only works because of -l, otherwise \n would have been included in $_ but not $&.
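A short sketch of that point (illustrative only, with made-up variable names): the same comparison is run on a line that still has its newline and on a chomped copy, which is what -l provides for each input line.

```perl
use strict;
use warnings;

my $raw = "d\n";                  # a line as read without -l
my $raw_equal = 0;
$raw_equal = 1 if $raw =~ /d/ and $raw eq $&;

chomp(my $chomped = $raw);        # what -l does to each input line
my $chomped_equal = 0;
$chomped_equal = 1 if $chomped =~ /d/ and $chomped eq $&;

print "with newline: $raw_equal, chomped: $chomped_equal\n";
# with newline: 0, chomped: 1
```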

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Empty pattern in regex
by Anonymous Monk on Oct 19, 2023 at 09:17 UTC

    It behaves correctly

    it's a warning not to use it ;)

Node Type: perlquestion [id://11155031]
Approved by johngg
Front-paged by Corion