Re^7: Understanding a portion of perlretut


Re^8: Understanding a portion of perlretut
by choroba (Cardinal) on Dec 10, 2015 at 10:21 UTC

how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx

After the first match (A is where the match started, B is where the repeated capture group ended and TGA began)

XXXxxxTGAxxTGAxxxxTGAxx
^     ^
|     |
A     B

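the matching starts at B + 3, just past the matched TGA. Zero repetitions of \w\w\w don't work here, because the text that follows is xxTGAx rather than TGA, so the engine tries longer and longer strings, three characters at a time, until it finds a TGA: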

XXXxxxTGAxxTGAxxxxTGAxx
         ^        ^
         |        |
         A        B

The next search will again start at B + 3, and fail on the remaining xx.

But, with the capture group of length 1, you always match the nearest group, because the (f)*? tries longer and longer strings. Maybe what's confusing here is that expanding the group by one character is similar to the engine advancing the starting position after a match failure?
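To see the A and B positions from the diagrams as numbers, here is a minimal sketch. The array name @hits is mine and not from the thread; the sketch relies only on the standard @- and @+ offset arrays and on pos.

```perl
use strict;
use warnings;

my $dna = 'XXXxxxTGAxxTGAxxxxTGAxx';
my @hits;    # one "A B pos" row per successful match (name is illustrative)

while ($dna =~ / (\w\w\w)*? TGA /gx) {
    # $-[0]     : offset where the whole match began (A in the diagrams)
    # $+[0] - 3 : offset where the three-character TGA began (B)
    # pos($dna) : offset where the next /g search will start
    push @hits, join ' ', $-[0], $+[0] - 3, pos($dna);
}

print "A B pos\n";
print "$_\n" for @hits;
```

The second row starts at offset 9, so the TGA at offset 11 is stepped over: it is not a whole number of triplets away from where that search began.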

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^9: Understanding a portion of perlretut

by Athanasius (Archbishop) on Dec 10, 2015 at 12:46 UTC

Hello choroba,

Thanks for the explanation, and I’m sorry to be obtuse but — I still don’t understand. :-( Consider:

#! perl -l
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

while ($s =~ /(\w\w\w)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

print '-----';

while ($s =~ /(abc)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

Output:

22:35 >perl 1476_SoPW.pl
Found match abcXYZ at pos: 12
-----
Use of uninitialized value $1 in print at 1476_SoPW.pl line 28.
Found match XYZ at pos: 5
Found match abcXYZ at pos: 12

22:35 >

The first capture in each regex is 3 characters wide, but the first regex matches only the second occurrence of XYZ whereas the second regex also matches the first occurrence, with (abc)*? matching zero times. Why the difference in behaviour? In particular, why does (\w\w\w)*? not also match zero times?

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^10: Understanding a portion of perlretut

by choroba (Cardinal) on Dec 10, 2015 at 13:02 UTC

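In the first case, the engine starts from the left at A, and (\w\w\w)*? has to consume three triplets before XYZ matches at B:

uvXYZdabcXYZfg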
^        ^   
|        |
A        B

The engine then starts to match at B + 3, just past XYZ, and finds no further match.

In the second case, the engine starts from the left as well, but finds no match:

uvXYZdabcXYZfg
^
|
A

So, it moves to A + 1 (still no match), and then A + 2, where it can match with (abc)* repeating zero times:

uvXYZdabcXYZfg
  ^
  |
 A=B

After matching, it continues (because of /g) to B + 3 (no match), and at B + 4 it finally succeeds with

uvXYZdabcXYZfg
      ^  ^
      |  |
      A  B

Better now?
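The walk-through above can be checked with a small sketch that records A (start of the match), B (start of XYZ) and pos for each regex. The helper name trace is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "A B pos" row per /g match of $re in $str.
sub trace {
    my ($str, $re) = @_;
    my @rows;
    while ($str =~ /$re/g) {
        # $-[0] is where the match began (A), $-[2] is where XYZ began (B)
        push @rows, join ' ', $-[0], $-[2], pos($str);
    }
    return @rows;
}

my $s = 'uvXYZdabcXYZfg';

my @first  = trace($s, qr/(\w\w\w)*?(XYZ)/);
my @second = trace($s, qr/(abc)*?(XYZ)/);

print "first:  $_\n" for @first;
print "second: $_\n" for @second;
```

With \w\w\w the very first start position (offset 0) can be stretched until it reaches an XYZ, so the XYZ at offset 2 is swallowed on the way. With abc nothing can be consumed at offsets 0 and 1, so the engine has to move the start position forward instead.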


Re^11: Understanding a portion of perlretut

by Athanasius (Archbishop) on Dec 10, 2015 at 13:27 UTC

Re^12: Understanding a portion of perlretut

by Corion (Patriarch) on Dec 10, 2015 at 13:30 UTC

Re^8: Understanding a portion of perlretut
by AnomalousMonk (Archbishop) on Dec 10, 2015 at 22:22 UTC

Is your Supplemental question meant rhetorically?

It was meant rhetorically, but I'm glad you enjoyed it!

... my $s = 'abCdefC'; while ($s =~ / (f)*? C /gx) { ... }
...
I actually don't understand how this can ever, logically, match with more than zero, since zero is possible and less greedy than 1??

I think choroba has already well addressed the issues you raised in the paragraph following the one from which this is quoted, but let me try to address this one specifically — insofar as I understand what's going on and assuming I understand your question!

In the code below, I think we're both happy that the (f)*? capture group acting before the first 'C' in the string is allowed not to match at all, and in that case the value of the capture variable ($1 in the code) is undef. I think we can agree that if the group expression were changed to (f*?) it would also match, capturing the empty string to $1.

The second 'C' in the string is preceded by an 'f'. Why do both (f)*? and (f*?) capture the 'f' when they can be satisfied with nothing and need not be satisfied with anything more than nothing (i.e., they both do lazy matching)?

Here's my story. If the RE matches nothing at offset 5 (the 'f'), it must then match a 'C' at offset 5, which is already occupied by an 'f', in order to satisfy the overall regex! The RE must first "consume" the 'f' at offset 5 before it can advance to match the 'C' at offset 6 for an overall match.

c:\@Work\Perl>perl -wMstrict -le
"use Data::Dump qw(pp);
 ;;
 my $s = 'abCdefC';
 print '01234567';
 print $s;
 print '------------';
 ;;
 while ($s =~ / (f)*? C /gx) {
   printf qq{matched (f) of %s then C at pos of %d \n},
     pp($1), pos $s;
   }
 ;;
 print '------------';
 ;;
 while ($s =~ / (f*?) C /gx) {
   printf qq{matched (f) of %s then C at pos of %d \n},
     pp($1), pos $s;
   }
"
01234567
abCdefC
------------
matched (f) of undef then C at pos of 3
matched (f) of "f" then C at pos of 7
------------
matched (f) of "" then C at pos of 3
matched (f) of "f" then C at pos of 7

But here's a non-rhetorical question. In the code below, notice that there is a peculiar double-step at pos 3. The 'f' at offset 2 is first not captured (either as undef or as the empty string), then captured. I don't get it: a non-zero-width match is never a necessity for an overall match. Why not just step over the 'f' at offset 2 in the same way all the other characters are stepped over?

c:\@Work\Perl>perl -wMstrict -le
"use Data::Dump qw(pp);
 ;;
 my $s = 'xxfxx';
 print '012345';
 print $s;
 print '------------';
 ;;
 while ($s =~ / (f)*? /gx) {
   printf qq{matched (f)*? of %s, starting at offset of %s, ending at pos of %d \n},
     pp($1), pp($-[1]), pos $s;
   }
 ;;
 print '------------';
 ;;
 while ($s =~ / (f*?) /gx) {
   printf qq{matched (f*?) of %s, starting at offset of %s, ending at pos of %d \n},
     pp($1), pp($-[1]), pos $s;
   }
"
012345
xxfxx
------------
matched (f)*? of undef, starting at offset of undef, ending at pos of 0
matched (f)*? of undef, starting at offset of undef, ending at pos of 1
matched (f)*? of undef, starting at offset of undef, ending at pos of 2
matched (f)*? of "f", starting at offset of 2, ending at pos of 3
matched (f)*? of undef, starting at offset of undef, ending at pos of 3
matched (f)*? of undef, starting at offset of undef, ending at pos of 4
matched (f)*? of undef, starting at offset of undef, ending at pos of 5
------------
matched (f*?) of "", starting at offset of 0, ending at pos of 0
matched (f*?) of "", starting at offset of 1, ending at pos of 1
matched (f*?) of "", starting at offset of 2, ending at pos of 2
matched (f*?) of "f", starting at offset of 2, ending at pos of 3
matched (f*?) of "", starting at offset of 3, ending at pos of 3
matched (f*?) of "", starting at offset of 4, ending at pos of 4
matched (f*?) of "", starting at offset of 5, ending at pos of 5

except $1 always undef

Update: Consider also the second code example with a string of 'xxfffxx' for similar perplexity.

Give a man a fish: <%-{-{-{-<


Re^9: Understanding a portion of perlretut

by choroba (Cardinal) on Dec 10, 2015 at 22:51 UTC

(|f+)

Running a simplified code with use re 'debug' gives the answer:

use re 'debug';
$s = 'xfx';
while ($s =~ / (f*?) /gx) {
    # loop body not shown in the original post
}
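If the debug trace is too noisy, a smaller sketch can record just the start and end offset of every match. As far as I can tell, the double-step comes from the rule described in perlre under "Repeated Patterns Matching a Zero-length Substring": the match that follows a zero-length match is not allowed to be zero-length at the same position, so the lazy f*? is forced to take the 'f'. The array name @spans is mine.

```perl
use strict;
use warnings;

my $s = 'xfx';
my @spans;    # "start-end" offsets of each successive match (illustrative name)

while ($s =~ / (f*?) /gx) {
    # $-[0] and $+[0] are the start and end offsets of the whole match
    push @spans, join '-', $-[0], $+[0];
}

print "@spans\n";
```

A span such as 1-1 immediately followed by 1-2 is the double-step: first the empty match in front of the 'f', then, because another empty match at the same position is ruled out, the 'f' itself.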

