PerlMonks  

mysteries of regex substring matching

by smile4me (Beadle)
on Jan 15, 2021 at 21:42 UTC (#11126976)

smile4me has asked for the wisdom of the Perl Monks concerning the following question:

We all know that "In list context, a regex match returns a list of captured substrings." And, we also know "Numeric quantifiers express the number of times an atom may match. {n} means that a match must occur exactly n times." So can the numeric quantifier work with the captured substrings?

perl -E '$s = q[AAD34017837201D98AAED18778DEF993]; say length($s), " ", $s; @m = $s =~ /(....)(....)(....)(....)(.+)/; say "", join("-",@m);'
# 32 AAD34017837201D98AAED18778DEF993
# AAD3-4017-8372-01D9-8AAED18778DEF993

In contrast, the following regex uses a numeric quantifier but does not work as above:

perl -E '$s = q[AAD34017837201D98AAED18778DEF993]; say length($s), " ", $s; @m = $s =~ /(....){4}(.+)/; say "", join("-",@m);'
# 32 AAD34017837201D98AAED18778DEF993
# 01D9-8AAED18778DEF993

So, is there a way to use capture groups to match multiple times, like the separate groups do in the first example?

Replies are listed 'Best First'.
Re: mysteries of regex substring matching
by haukex (Bishop) on Jan 15, 2021 at 22:50 UTC
    So, is there a way to use capture groups to match multiple times, like the separate groups do in the first example?

    Not really (Update: at least within a single regex, LanX already showed /g), unless you mess around with some of the more advanced regex features like maybe (?{})/(??{}). But before going down that route, this feels like an XY Problem to me; for example, for the task you show, unpack may be better, e.g. unpack("(A4)4A*","AAD34017837201D98AAED18778DEF993")
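    For reference, the unpack suggestion above written out as a small script. This is only a sketch using the sample string from the question; the variable names are mine, and keep in mind that the "A" format strips trailing whitespace from each field, which does not matter for this data.

```perl
use strict;
use warnings;
use feature 'say';

my $s = 'AAD34017837201D98AAED18778DEF993';

# "(A4)4" reads four 4-character fields; "A*" takes whatever is left.
# Whitespace inside the template is ignored and only there for readability.
my @parts = unpack '(A4)4 A*', $s;

say join '-', @parts;   # AAD3-4017-8372-01D9-8AAED18778DEF993
```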

    Minor edits.

Re: mysteries of regex substring matching
by LanX (Cardinal) on Jan 15, 2021 at 22:34 UTC
    > So, is there a way to use capture groups to match multiple times, like the separate groups do in the first example?

    It depends what your goal is.

    If it's simply about using a quantifier {4}, the answer is no, because only the last match is kept for the single first group; that's why you end up with 01D9. Not a mystery.
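    A small sketch that makes this visible, using the string from the question (variable names are mine): the number of values returned in list context follows the number of capture groups written in the pattern, not the number of times a group was repeated.

```perl
use strict;
use warnings;
use feature 'say';

my $s = 'AAD34017837201D98AAED18778DEF993';

# Two pairs of capturing parentheses in the pattern, so two values come back.
# The quantified group matched four times, but $1 only holds the last one.
my @m = $s =~ /(....){4}(.+)/;

say scalar @m;      # 2
say join '-', @m;   # 01D9-8AAED18778DEF993
```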

    But there are numerous workarounds I can think of.

    Like

    • using the /g modifier in a loop /(....)/g
    • reading the last capture with embedded Perl code like /(....(?{print $1})){4}/

    updates
    • minor corrections
    • links to perlretut

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      > using the /g modifier in a loop /(....)/g

      demo:

      DB<53> $x = q[AAD34017837201D98AAED18778DEF993];
      DB<54> $x =~ m/(....)/g and say $1 for 1..4
      AAD3
      4017
      8372
      01D9
      DB<55>
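      The same idea as a standalone script, in case someone also wants the remainder of the string. This is just a sketch: I am assuming \G together with /gc here, so that each scalar-context match continues exactly where the previous one stopped and a failed match does not reset pos().

```perl
use strict;
use warnings;
use feature 'say';

my $s = 'AAD34017837201D98AAED18778DEF993';

my @parts;
for (1 .. 4) {
    # Scalar-context /g advances pos($s); \G anchors at that position.
    last unless $s =~ /\G(....)/gc;
    push @parts, $1;
}

# Whatever follows the four chunks becomes the final element.
push @parts, $1 if $s =~ /\G(.+)/gc;

say join '-', @parts;   # AAD3-4017-8372-01D9-8AAED18778DEF993
```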

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      > reading the last capture with embedded Perl code like /(....(?{print $1})){4}/

      demo: (using /x for clarification)

      DB<66> $x = q[AAD34017837201D98AAED18778DEF993];
      DB<67> $x =~ m/(?: (....) (?{say $1}) ) {4} /x
      AAD3
      4017
      8372
      01D9
      DB<68>
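      Instead of printing, the embedded code can also collect the intermediate values. A sketch only (the @chunks array is my own name): it works for this input because the pattern matches without backtracking. If the regex had to backtrack, the pushes would not be undone, so treat it as an illustration rather than a robust solution.

```perl
use strict;
use warnings;
use feature 'say';

my $s = 'AAD34017837201D98AAED18778DEF993';

my @chunks;

# The (?{ ... }) block runs each time the inner group has just matched,
# so $1 still holds the current repetition at that moment.
my ($last, $rest) = $s =~ /(?: (....) (?{ push @chunks, $1 }) ){4} (.+)/x;

say join '-', @chunks;   # AAD3-4017-8372-01D9
say $last;               # 01D9
say $rest;               # 8AAED18778DEF993
```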

      update

      I think this explains your "mystery": they all match, but only the last one is kept in $1.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: mysteries of regex substring matching (updated)
by AnomalousMonk (Bishop) on Jan 15, 2021 at 23:14 UTC

    I think unpack is better for this as haukex has suggested, but here's a pure-regex solution:

    Win8 Strawberry 5.8.9.5 (32) Fri 01/15/2021 18:04:55
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -MData::Dump=dd
    my $s = q(AAAAbbbbCCCCddddXeeeeeeeeeeX);
    my @caps = $s =~ m{ (?<! \A .{16}) \G .{4} | .* }xmsg;
    dd \@caps;
    ^Z
    ["AAAA", "bbbb", "CCCC", "dddd", "XeeeeeeeeeeX", ""]
    The trick is to have an unambiguous look-around anchor.

    Update: Another variation:

    Win8 Strawberry 5.8.9.5 (32) Fri 01/15/2021 18:19:20
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -MData::Dump=dd
    my $s = q(AAAAbbbbCCCCddddXeeeeeeeeeeX);
    my $n = 4;
    my $m = 3;
    my @caps = $s =~ m{ (?<! \A (?: .{$n}){$m}) \G .{$n} | .* }xmsg;
    dd \@caps;
    ^Z
    ["AAAA", "bbbb", "CCCC", "ddddXeeeeeeeeeeX", ""]
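    Both runs end with an empty string because .* can still match zero characters at the very end of the input. A possible tweak, shown only as a sketch with $m set to 4, is to use .+ for the tail so the trailing empty match goes away:

```perl
use strict;
use warnings;
use feature 'say';

my $s = q(AAAAbbbbCCCCddddXeeeeeeeeeeX);
my $n = 4;   # chunk width
my $m = 4;   # number of chunks before the tail

# The look-behind blocks the chunk alternative once $n * $m characters
# have been consumed; ".+" then takes the tail and cannot match the
# empty string at the end of the input.
my @caps = $s =~ m{ (?<! \A (?: .{$n}){$m}) \G .{$n} | .+ }xmsg;

say join '-', @caps;   # AAAA-bbbb-CCCC-dddd-XeeeeeeeeeeX
```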


    Give a man a fish:  <%-{-{-{-<

      Hi AnomalousMonk,

      I'm sure we discussed this technique before, but I can't find it in the archives.

      Do you remember a thread? :)

      update

      Well, I deciphered it in the meantime: it's not operating on match groups but on the /g modifier (combined with /x).

      That repeats the search from where the last match ended and, in list context, returns all results until it fails.

      DB<139> p 0123456789abcdefghijklmnopqrstuvwxyz
      DB<139> x m{ .... }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      4  'ghij'
      5  'klmn'
      6  'opqr'
      7  'stuv'
      8  'wxyz'
      DB<140>

      With a negative look-behind (?<! ) we can filter out all matches after the initial 4:

      DB<140> x m{ (?<! (?: .... ){4} ) .... }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      DB<141>

      An alternation | then picks up the rest of the string once the look-behind fails:

      DB<142> x m{ (?<! (?: .... ){4} ) .... | .+ }xg
      0  0123
      1  4567
      2  '89ab'
      3  'cdef'
      4  'ghijklmnopqrstuvwxyz'
      DB<143>

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      ) thanks AnomalousMonk++ for spotting :)

Re: mysteries of regex substring matching
by LanX (Cardinal) on Jan 15, 2021 at 23:15 UTC
    For completeness:

    Another workaround was so trivial that I skipped it in my previous answer.

    You can build regexes from smaller ones with string interpolation:

    DB<73> p $capt4 = '(....)' x 4
    (....)(....)(....)(....)
    DB<74> x $x =~ m/ $capt4 (.+) /x
    0  'AAD3'
    1  4017
    2  8372
    3  '01D9'
    4  '8AAED18778DEF993'
    DB<75>

    I think that's the most readable solution.
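    If the width and the count vary, the same interpolation trick can be wrapped in a small helper. This is a sketch only: split_chunks is a made-up name, not something from a module, and it simply returns the empty list when the string is too short to match.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: build "(.{W})" repeated N times and interpolate it.
sub split_chunks {
    my ($str, $width, $count) = @_;
    my $group   = sprintf '(.{%d})', $width;   # one capture group per chunk
    my $pattern = $group x $count;             # e.g. (.{4})(.{4})(.{4})(.{4})
    # In list context the match returns all captures, or () on failure.
    return $str =~ /\A${pattern}(.+)/s;
}

my @m = split_chunks('AAD34017837201D98AAED18778DEF993', 4, 4);
say join '-', @m;   # AAD3-4017-8372-01D9-8AAED18778DEF993
```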

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Same approach, but as a one liner:

      DB<5> p 0123456789012345678901234567890123456789
      DB<5> x / @{[ '(....)' x4 ]} (.+) /x
      0  0123
      1  4567
      2  8901
      3  2345
      4  678901234567890123456789
      DB<6>

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Node Type: perlquestion [id://11126976]
Approved by marto
Front-paged by davies