PerlMonks  

mysteries of regex substring matching

by smile4me (Beadle)
on Jan 15, 2021 at 21:42 UTC ( #11126976=perlquestion )

smile4me has asked for the wisdom of the Perl Monks concerning the following question:

We all know that "In list context, a regex match returns a list of captured substrings." And we also know that "Numeric quantifiers express the number of times an atom may match. {n} means that a match must occur exactly n times." So can a numeric quantifier be combined with a capture group so that every repetition is captured? With four separate groups, each substring is returned:

perl -E '$s = q[AAD34017837201D98AAED18778DEF993]; say length($s), " ", $s; @m = $s =~ /(....)(....)(....)(....)(.+)/; say "", join("-",@m);'
# 32 AAD34017837201D98AAED18778DEF993
# AAD3-4017-8372-01D9-8AAED18778DEF993

In contrast, the following regex uses a numeric quantifier but does not work as above:

perl -E '$s = q[AAD34017837201D98AAED18778DEF993]; say length($s), " ", $s; @m = $s =~ /(....){4}(.+)/; say "", join("-",@m);'
# 32 AAD34017837201D98AAED18778DEF993
# 01D9-8AAED18778DEF993

So, is there a way to use capture groups to match multiple times like the separate groups do in the first example?

Replies are listed 'Best First'.
Re: mysteries of regex substring matching
by haukex (Bishop) on Jan 15, 2021 at 22:50 UTC
    So, is there a way to use capture groups to match multiple times like the separate groups do in the first example?

    Not really (Update: at least within a single regex, LanX already showed /g), unless you mess around with some of the more advanced regex features like maybe (?{})/(??{}). But before going down that route, this feels like an XY Problem to me; for example, for the task you show, unpack may be better, e.g. unpack("(A4)4A*","AAD34017837201D98AAED18778DEF993")
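For reference, here is the unpack suggestion above as a small standalone script. This is only a sketch of that one-liner; the variable names are illustrative. Note that the A format strips trailing whitespace from each field, so a would be the safer choice if the data may contain spaces.

```perl
use strict;
use warnings;

my $s = 'AAD34017837201D98AAED18778DEF993';

# (A4)4 : a group of one 4-character field, repeated 4 times
# A*    : everything that is left, as one final field
# (whitespace inside an unpack template is ignored)
my @parts = unpack '(A4)4 A*', $s;

print join('-', @parts), "\n";    # AAD3-4017-8372-01D9-8AAED18778DEF993
```

The repeat count and field width live in the template string, so they can be interpolated if they need to vary.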

    Minor edits.

Re: mysteries of regex substring matching (updated)
by AnomalousMonk (Bishop) on Jan 15, 2021 at 23:14 UTC

    I think unpack is better for this as haukex has suggested, but here's a pure-regex solution:

    Win8 Strawberry 5.8.9.5 (32) Fri 01/15/2021 18:04:55
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -MData::Dump=dd
    my $s = q(AAAAbbbbCCCCddddXeeeeeeeeeeX);

    my @caps = $s =~ m{ (?<! \A .{16}) \G .{4} | .* }xmsg;
    dd \@caps;
    ^Z
    ["AAAA", "bbbb", "CCCC", "dddd", "XeeeeeeeeeeX", ""]

    The trick is to have an unambiguous look-around anchor.

    Update: Another variation:

    Win8 Strawberry 5.8.9.5 (32) Fri 01/15/2021 18:19:20
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -MData::Dump=dd
    my $s = q(AAAAbbbbCCCCddddXeeeeeeeeeeX);

    my $n = 4;
    my $m = 3;

    my @caps = $s =~ m{ (?<! \A (?: .{$n}){$m}) \G .{$n} | .* }xmsg;
    dd \@caps;
    ^Z
    ["AAAA", "bbbb", "CCCC", "ddddXeeeeeeeeeeX", ""]
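The two variations above could be wrapped in a small helper along these lines. This is a sketch, not part of the original post: the name chunks_then_rest and its parameters are made up for illustration, and it uses .+ instead of .* in the second alternative, which (as far as I can tell) avoids the trailing empty string in the result.

```perl
use strict;
use warnings;

# Hypothetical helper: return $count chunks of $width characters from the
# front of $str, followed by whatever remains as one last element.
sub chunks_then_rest {
    my ($str, $width, $count) = @_;
    my $lead = $width * $count;    # length of the fixed-width prefix

    # First alternative: a chunk, allowed only while we are NOT exactly
    # $lead characters past the start. \G keeps the chunks contiguous.
    # Second alternative: once the look-behind blocks, take the rest.
    return $str =~ m{ (?<! \A .{$lead} ) \G .{$width} | .+ }xmsg;
}

my @caps = chunks_then_rest('AAAAbbbbCCCCddddXeeeeeeeeeeX', 4, 4);
print join('|', @caps), "\n";    # AAAA|bbbb|CCCC|dddd|XeeeeeeeeeeX
```

Since the look-behind has to be fixed-width and reasonably short, this probably only suits small prefixes; for anything bigger, unpack or substr seems the safer route.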


    Give a man a fish:  <%-{-{-{-<

      Hi AnomalousMonk,

      I'm sure we discussed this technique before, but I can't find it in the archives.

      Do you remember a thread? :)

      update

      Well, I deciphered it in the meantime: it's not operating on match groups but relies on the /g modifier (with /x for readability).

      In list context, /g restarts the search where the last match ended and returns all matches until the pattern fails.

      DB<139> p
      0123456789abcdefghijklmnopqrstuvwxyz
      DB<139> x m{ .... }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      4  'ghij'
      5  'klmn'
      6  'opqr'
      7  'stuv'
      8  'wxyz'
      DB<140>

      With a negative look-behind (?<! ) we can filter out all matches after the initial 4:

      DB<140> x m{ (?<! (?: .... ){4} ) .... }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      DB<141>

      An alternation | then picks up the remainder once the look-behind makes the first branch fail:

      DB<142> x m{ (?<! (?: .... ){4} ) .... | .+ }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      4  'ghijklmnopqrstuvwxyz'
      DB<143>
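The same three steps can be run outside the debugger. This is just a sketch that rebuilds the 36-character test string used above; the variable names are made up.

```perl
use strict;
use warnings;

my $str = join '', 0 .. 9, 'a' .. 'z';    # 0123456789abcdefghijklmnopqrstuvwxyz

# 1. /g in list context: every non-overlapping 4-character match
my @all = $str =~ m{ .... }xg;

# 2. the look-behind rejects any position with 16 characters in front of it
my @first = $str =~ m{ (?<! (?: .... ){4} ) .... }xg;

# 3. the alternation collects what is left after the first branch gives up
my @split = $str =~ m{ (?<! (?: .... ){4} ) .... | .+ }xg;

print scalar(@all), "\n";    # 9
print "@first\n";            # 0123 4567 89ab cdef
print "@split\n";            # 0123 4567 89ab cdef ghijklmnopqrstuvwxyz
```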

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      Thanks AnomalousMonk++ for spotting :)

Re: mysteries of regex substring matching
by LanX (Cardinal) on Jan 15, 2021 at 22:34 UTC
    > So, is there a way to use capture groups to match multiple times like the separate groups do in the first example?

    It depends what your goal is.

    If the goal is simply to use a quantifier like {4}, the answer is no: a quantified group is still a single capture group, and only its last iteration is kept. That's why you get 01D9 for the first group. Not a mystery.
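A quick way to see this, sketched with the string from the original question: the number of values returned is fixed by the number of parenthesis pairs in the pattern, not by the quantifier.

```perl
use strict;
use warnings;

my $s = 'AAD34017837201D98AAED18778DEF993';

# Two pairs of parentheses in the pattern, so two captures come back,
# no matter how often the first group iterates.
my @m = $s =~ /(....){4}(.+)/;

print scalar(@m), "\n";       # 2
print join('-', @m), "\n";    # 01D9-8AAED18778DEF993
```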

    But there are numerous workarounds I can think of.

    Like

    • using the /g modifier in a loop /(....)/g
    • reading each capture as it is made with embedded Perl code like /(?:(....)(?{print $1})){4}/

    updates
    • minor corrections
    • links to perlretut

    Cheers Rolf

      > using the /g modifier in a loop /(....)/g

      demo:

      DB<53> $x = q[AAD34017837201D98AAED18778DEF993];
      DB<54> $x =~ m/(....)/g and say $1 for 1..4
      AAD3
      4017
      8372
      01D9
      DB<55>
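As a script, the loop might look like the sketch below. The \G anchor, the /c flag and the use of pos() to fetch the remainder are my additions, not part of the demo above: \G keeps the chunks contiguous, and /c stops a failed match from resetting pos().

```perl
use strict;
use warnings;

my $x = 'AAD34017837201D98AAED18778DEF993';

# Scalar-context /g: each successful match continues where the last ended.
my @head;
while (@head < 4 and $x =~ /\G(....)/gc) {
    push @head, $1;
}

# pos($x) is the offset just past the last successful match.
my $rest = substr $x, pos($x) // 0;

print join('-', @head, $rest), "\n";    # AAD3-4017-8372-01D9-8AAED18778DEF993
```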

      Cheers Rolf

      > reading each capture as it is made with embedded Perl code like /(?:(....)(?{print $1})){4}/

      demo: (using /x for clarity)

      DB<66> $x = q[AAD34017837201D98AAED18778DEF993];
      DB<67> $x =~ m/(?: (....) (?{say $1}) ) {4} /x
      AAD3
      4017
      8372
      01D9
      DB<68>

      update

      I think this explains your "mystery": all four iterations match, but only the last one is kept in $1.
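Instead of printing, the code block can push each iteration into an array. A rough sketch follows; the array name @seen is made up, and I use a package variable since lexicals inside (?{ }) were unreliable on older perls. Keep in mind that the block also runs for iterations the engine later backtracks over, which is harmless here only because nothing follows the quantified group.

```perl
use strict;
use warnings;

my $x = 'AAD34017837201D98AAED18778DEF993';

our @seen = ();    # filled from inside the regex

# The code block sits after the closing paren of the capture group,
# so $1 already holds the current iteration when it runs.
$x =~ m/ (?: (....) (?{ push @seen, $1 }) ){4} /x;

print join('-', @seen), "\n";    # AAD3-4017-8372-01D9
```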

      Cheers Rolf

Re: mysteries of regex substring matching
by LanX (Cardinal) on Jan 15, 2021 at 23:15 UTC
    For completeness:

    Another workaround was so trivial that I skipped it in my previous answer.

    You can build regexes from smaller ones with string interpolation:

    DB<73> p $capt4 = '(....)' x 4
    (....)(....)(....)(....)
    DB<74> x $x =~ m/ $capt4 (.+) /x
    0  'AAD3'
    1  4017
    2  8372
    3  '01D9'
    4  '8AAED18778DEF993'
    DB<75>

    I think that's the most readable solution.
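With the width and count as variables, the same idea could be written like this. Again only a sketch: $width, $count and the leading ^ anchor are my own choices, not something from the demo above.

```perl
use strict;
use warnings;

my $x = 'AAD34017837201D98AAED18778DEF993';

my ($width, $count) = (4, 4);

# The repetition operator x builds one real capture group per chunk.
my $groups = "(.{$width})" x $count;    # (.{4})(.{4})(.{4})(.{4})
my $re     = qr/^$groups(.+)/s;

my @m = $x =~ $re;
print join('-', @m), "\n";    # AAD3-4017-8372-01D9-8AAED18778DEF993
```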

    Cheers Rolf

      Same approach, but as a one liner:

      DB<5> p
      0123456789012345678901234567890123456789
      DB<5> x / @{[ '(....)' x4 ]} (.+) /x
      0  0123
      1  4567
      2  8901
      3  2345
      4  678901234567890123456789
      DB<6>

      Cheers Rolf

Node Type: perlquestion [id://11126976]
Approved by marto
Front-paged by davies