PerlMonks  

Re^5: regex doubt on excluding

by Athanasius (Archbishop)
on Apr 22, 2014 at 09:33 UTC


in reply to Re^4: regex doubt on excluding
in thread regex doubt on excluding

Hello SuicideJunkie, and thanks for the answer. Unfortunately, I’m still confused. :-(

From your explanation, I would expect that making the whitespace match non-greedy would prevent the intermediate newline(s) from being eliminated. But it doesn’t (see below). Here is my current understanding (obviously flawed) of what should happen:

  • ^ and $ are zero-width assertions, so when they feature in a match the newline they follow/precede is not substituted. For example:

    18:14 >perl -wE "my $s = qq[\n\n\n]; my $t = $s =~ s{$}{}gmr; say $s eq $t;"
    1
    18:14 >

  • \s*? matches zero or more whitespace characters (including newline) non-greedily.

  • With the /g modifier in effect, whenever a match succeeds the regex engine begins looking for the next match one character past where the last successful match began.

Given these assumptions, I would expect that the regex /^\s*?$/ would match the string "a\n\n\nb" as follows: First, ^ matches after the first newline. Since \s*? is non-greedy, the regex engine looks for the shortest match satisfying \s*?$, and finds it in the zero-length string between the first two newlines. This it replaces with another zero-length string. It then starts looking for the next match with ^ matching after the second newline. Again, it finds and replaces a zero-length string. Finally, ^ matches after the final newline, but no match is found. Result: the string is unchanged. However:

    #! perl
    use strict;
    use warnings;

    my $s = "a\n\n\nb";
    my $t = $s =~ s{^\s*?$}{}gmr;
    printf "%s\n", $s eq $t ? 'success' : 'fail';
    print ">$s<\n";
    print "[$t]\n";

Output:

    18:29 >perl 902_SoPW.pl
    fail
    >a


    b<
    [a

    b]
    18:29 >

One of the newlines is being deleted, so my understanding must be wrong somewhere.

I did try adding use re 'debug'; but I’m only just learning to interpret the output. I think the relevant part is:

    ...
    Guessed: match at offset 0
       2 <a%n> <%n%nb>      |  1:MBOL(2)
       2 <a%n> <%n%nb>      |  2:MINMOD(3)
       2 <a%n> <%n%nb>      |  3:STAR(5)
       2 <a%n> <%n%nb>      |  5:  MEOL(6)
       2 <a%n> <%n%nb>      |  6:  END(0)
    Match possible, but length=0 is smaller than requested=1, failing!
                               POSIXD[\s] can match 1 times out of 1...
       3 <a%n%n> <%nb>      |  5:  MEOL(6)
       3 <a%n%n> <%nb>      |  6:  END(0)
    Match successful!
    ...

which seems to show that the match I would expect (empty line) is rejected, but I don’t know why it is.

What am I missing?

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Replies are listed 'Best First'.
Re^6: regex doubt on excluding
by Anonymous Monk on Apr 22, 2014 at 10:58 UTC

    What am I missing?

    I think rxrx :) pos , @- and @+

    So it matches the zero-length string, which doesn't advance the position; then it matches one newline at the same position, thus advancing the position; then it matches the zero-length string again, and that's the end of the matches.

    "a\n\n\nb"
    s(2)e(2)pos(2)len(0) ("a\n", "", "\n\nb")
    s(2)e(3)pos(3)len(1) ("a\n", "\n", "\nb")
    s(3)e(3)pos(3)len(0) ("a\n\n", "", "\nb")
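For anyone who wants to reproduce a trace like the one above, here is a minimal sketch. It is not the script that produced the table; `trace_matches` is a hypothetical helper name, and the sketch assumes a scalar-context `//g` loop walks the string the same way `s///g` does. It only relies on `pos`, `@-` and `@+`.

```perl
use strict;
use warnings;

# Collect one record per match of the pattern from this thread, using the
# same fields as the table above: start, end, pos and length.
# (trace_matches is a hypothetical name chosen for this illustration.)
sub trace_matches {
    my ($s) = @_;
    my @rows;
    while ($s =~ /^\s*?$/gm) {
        # $-[0] and $+[0] hold the start and end offsets of the whole match;
        # pos($s) is where the next //g match attempt will begin.
        push @rows, sprintf 's(%d)e(%d)pos(%d)len(%d)',
            $-[0], $+[0], pos($s), $+[0] - $-[0];
    }
    return @rows;
}

print "$_\n" for trace_matches("a\n\n\nb");
```

The second record is the interesting one: it starts at the same offset as the first but has length 1, which is the newline that ends up being deleted.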

    I think that makes sense :)

      I’ve finally found some documentation which sheds light on this (and it’s only taken me 4 months!). From perlre#Repeated-Patterns-Matching-a-Zero-length-Substring:

      The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero.

      I was wrong in thinking that the search position advances after a successful match. It does advance to the position immediately following the last match, but when that match was of zero length the “advance” is zero. But Perl’s regex engine prevents an infinite loop of zero-length matches by applying the rule quoted above.
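A short sketch of that rule in action, for readers who want to try it. The first substitution is the example given in the same perlre section; the second is the one from this thread. The variable names are my own.

```perl
use strict;
use warnings;

# Example from the perlre section quoted above: \w?? prefers the empty
# match, but an empty match may not directly follow another empty match,
# so the engine is forced to consume one character before the next one.
my $word = 'bar';
(my $marked = $word) =~ s/\w??/<$&>/g;
print "$marked\n";    # perlre documents the result as <><b><><a><><r><>

# The substitution from this thread: after the empty match between the
# first two newlines, the next match at that position must be non-empty,
# so \s*? is forced to take one newline.
my $s = "a\n\n\nb";
(my $t = $s) =~ s/^\s*?$//gm;
printf "newlines before: %d, after: %d\n", ($s =~ tr/\n//), ($t =~ tr/\n//);
```

In both cases the non-greedy quantifier would happily match nothing every time; it is only the prohibition on consecutive zero-length matches that makes it consume a character.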

      Thanks to Anonymous Monk for the useful analysis.

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
