PerlMonks  

Arbitrary number of captures in a regular expression

by grinder (Bishop)
on Sep 23, 2007 at 20:33 UTC (#640610=perlquestion)
grinder has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I may be overlooking something, but as far as I can tell, given the following strings (simplified from a real-world example, but sufficient for the purpose):

foo m 1 m 2 m 3 m 4 bar
foo m 2 m 4 m 7 bar
foo m 1 bar

There may be any number of m digit sequences following 'foo'. Is it true that you cannot group and capture an arbitrary number of times in a single pattern and get back all the captures? That is, the following does not work:

my (@match) = ($str =~ /^foo (?:m (\d+) )+bar/);

That only captures the final digit group. I think the only way out is to use a pile of code and do something like

my @match;
if ($str =~ /^foo /) {
    while ($str =~ /m (\d+) /g) {
        push @match, $1;
    }
}

There must be a simpler way, but I can't see it right now. Worse, that solution doesn't strike me as being particularly well anchored to what follows 'foo'.

The chain of m digit+ could be anywhere in the string, for instance, after 'bar', and it would be incorrect to match and capture them. Thanks for any suggestions and insights you may have.
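To make both complaints concrete, here is a minimal sketch. The sample string in `$stray` and the variable names are illustrative assumptions, not part of the original question. It shows that a quantified capture group keeps only its last iteration, and that the check-then-loop version happily picks up an `m digit` sequence sitting after 'bar'.

```perl
use strict;
use warnings;

my $str = 'foo m 1 m 2 m 3 m 4 bar';

# A quantified capture group keeps only the text of its last iteration.
my @last_only = $str =~ /^foo (?:m (\d+) )+bar/;
print "repeated group: (@last_only)\n";

# The check-then-loop version is not tied to the 'foo ... bar' span:
# the /g loop scans the whole string, including what follows 'bar'.
my $stray = 'foo m 1 bar m 9 ';    # hypothetical input with a stray group
my @loop;
if ($stray =~ /^foo /) {
    while ($stray =~ /m (\d+) /g) {
        push @loop, $1;
    }
}
print "unanchored loop: (@loop)\n";
```

The second result contains the stray 9 as well as the wanted 1, which is exactly the anchoring problem described above.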

• another intruder with the mooring in the heart of the Perl

Re: Arbitrary number of captures in a regular expression
by Anno (Deacon) on Sep 23, 2007 at 20:50 UTC
    Take a two-step approach: First isolate the core part of the string between "foo" and "bar". Use a global match to extract all numeric sequences from the core in one more step.
    while ( <DATA> ) {
        my ( $core) = /^foo (.*) bar$/;
        my @nums = $core =~ /m (\d+)/g;
        print "$core -> (@nums)\n";
    }
    __DATA__
    foo m 1 m 2 m 3 m 4 bar
    foo m 2 m 4 m 7 bar
    foo m 1 bar
    That assumes there is only one delimiting pair of "foo" ... "bar" per string.

    Anno

      I agree on the two-step approach (at least for now; see note 1 below), but I'd reverse the specificity of the regexes: first validate with a specific regex, then split the data you want out of what the first regex captured. The second regex can be simpler because the data has already been validated:
      while (<DATA>) {
          my ($params) = /^foo ((?:m \d+ *)*) bar$/;
          my @nums = $params =~ /\d+/g;
          print "$params -> (@nums)\n";
      }
      __DATA__
      foo m 1 m 2 m 3 m 4 bar
      foo m 2 m 4 m 7 bar
      foo m 1 bar

      1. Looking at demerphq's slides on his work on extended regexes in Perl 5.10, it might become possible to have a simpler, single-step solution in the near future. See %-.

      But maybe I'm just dreaming.

Re: Arbitrary number of captures in a regular expression
by Sidhekin (Priest) on Sep 23, 2007 at 20:59 UTC

    ... and if you still want a one-step approach ...

    This must be a job for the /g modifier and its side-kick, \G:

    my (@match) = $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;

    (The negative lookbehind is there to prevent matching strings starting with the m \d+ pattern; the positive lookahead is there to prevent matching strings that don't properly close with bar.)

    With a little test case, it looks like this:

    my @test = (
        'foo m 1 m 2 m 3 m 4 bar',
        'foo m 2 m 4 m 7 bar',
        'foo m 1 bar',
        'm 2 foo m 1 bar',
        'foo m 1 c 2 bar',
        'foo m 1 bar m 2',
        'foo m 1 m 5 m 7',
    );
    for my $str (@test) {
        my (@match) = $str =~ /(?:^foo |(?<!^)\G)m (\d+) (?=(?:m \d+ )*bar)/g;
        local $" = ', ';
        print "'$str' => (@match)\n";
    }

    ... and outputs like this:

    'foo m 1 m 2 m 3 m 4 bar' => (1, 2, 3, 4)
    'foo m 2 m 4 m 7 bar' => (2, 4, 7)
    'foo m 1 bar' => (1)
    'm 2 foo m 1 bar' => ()
    'foo m 1 c 2 bar' => ()
    'foo m 1 bar m 2' => (1)
    'foo m 1 m 5 m 7' => ()

    Update: Almost missed the "bar" requirement. Fixed now, right?

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

      For those for whom '\G' is deep into 'executable line noise' country:

      The \G anchor forces the next match to start where the last match left off. Use \G analogously to ^ at the beginning of a string. ^ matches only at the beginning of a string; \G matches only at the point where the previous /g match ended, that is, at the 'beginning' of what is left once earlier matches have chewed off the front of the string.
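A small sketch may help here. It is not from the thread; the string 'aaXaa' and the variable names are made up for illustration. It contrasts a plain /g match, which may skip ahead to find the next hit, with a \G-anchored one, which has to continue exactly where the previous match stopped.

```perl
use strict;
use warnings;

my $s = 'aaXaa';    # hypothetical sample: two a's, an intruder, two more a's

# Without \G, /g is free to skip over the 'X' to find the next 'a'.
my @free = $s =~ /a/g;

# With \G, every match must begin where the previous one ended,
# so the scan stops at the first character that does not fit.
my @anchored = $s =~ /\Ga/g;

print scalar(@free), " free matches, ", scalar(@anchored), " anchored matches\n";

# In scalar context, pos() reports where the last successful /g match stopped,
# which is the position \G will insist on next time.
my $stopped = ($s =~ /\Ga+/g) ? pos($s) : undef;
print "next \\G position: $stopped\n";
```

That "must continue from here" behaviour is what keeps Sidhekin's pattern from wandering off to an `m digit` group later in the string.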

      perlfaq5 has more detail. (The internal hyperlink at perldoc.perl.org is broken; apparently the backslash discombobulated the escapeHTML routines. But this link will get you there.) The other piece of the puzzle is the '(?='. This handy expression, the 'zero-width positive lookahead' (along with its evil twin '(?!'), is explained in more detail at perlretut.

      You may also want to review Non-capturing-groupings.

      Let's take Sidhekin's piece of work apart, and not be quite so terse. As perlretut says:

      Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the //x modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form
      So using the x modifier, the heart of Sidhekin's code becomes
      # We're hunting the (properly bracketed)
      $str =~               # 'm \d+' occurrences. They must be
      / (?:^foo\s           #  - preceded by an initial foo
          |                 #    OR
          (?<!^)\G)         #  - the end of a previous successful match
                            #  - but not the beginning of the string
        m \s (\d+) \s       # Here's the guy we really want.
                            # But he must be followed by the right stuff
        (?=                 # Lookahead says he must be followed by:
          (?:m \s \d+ \s)*  # Any number of m \d+ groups.
          bar)/xg;          # Finally terminated with bar (though not
                            # necessarily the end of string.)
      Notice that whitespace is not significant when using the //x modifier, so where Sidhekin used a single blank space, I had to use a '\s'.

      This is a straightforward way for a programmer to do a greedy capture in the middle of the string. Realize, though, that it is not the most straightforward way for the computer. For each 'm \d+' expression in the string, the computer

      - starts at the current 'beginning' of the string
      - matches the 'm \d+' at the current position
      - matches (fore) the foo and all the 'm \d+' before the current position
      - matches (aft) all the remaining 'm \d+' and the final bar
      - and THROWS AWAY the fore and aft matches (they're non-capturing)

      This is a trivial amount of extra work on a single line. But if you are attempting to do something similar by, say, matching across line breaks and pattern searching a set of 120-page MS-Word documents, you may notice some performance problems.

      Update: added detail

      I'd do something like this myself. Except I'd probably not use lookahead and instead would approach it a different way. (I might even follow up with some code later if I get some time.)

      ---
      $world=~s/war/peace/g

        I'd do something like this myself. Except I'd probably not use lookahead and instead would approach it a different way.

        I was annoyed with the lookahead myself, but it's unlikely to be a big deal, and I could not at the time see any way to avoid it. After some thinking, however, I believe I see a way to avoid looking ahead more than once -- just include it in the first alternation, which is matched precisely once on a successful match (anchored to the beginning of the string, and the only alternation that can match there):

        my (@match) = $str =~ /(?:^foo (?=(?:m \d+ )+bar)|(?<!^)\G)m (\d+) /g;

        ... or, in the less-terse form:

        my (@match) = $str =~ /
            (?: ^foo\ (?= (?:m\ \d+\ )+bar)   # overall match from ^foo
              | (?<!^) \G                     # or continue from not-^
            )
            m\ (\d+)\                         # grab each digit sequence
        /xg;

        I think that's the best I got. Match that? :)

        print "Just another Perl ${\(trickster and hacker)},"
        The Sidhekin proves Sidhe did it!

Re: Arbitrary number of captures in a regular expression
by BrowserUk (Pope) on Sep 24, 2007 at 03:43 UTC

    You can avoid the nested loop by capturing the repeated elements as a single match and then splitting it:

    my @match;
    while( <DATA> ) {
        m[^foo ((?:m \d+ )+)] and @match = split '(?<=\d) (?=m)', $1;
        print "@match";
    }

    Whether that is in any way 'better' is your call.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Arbitrary number of captures in a regular expression
by sfink (Deacon) on Sep 24, 2007 at 03:59 UTC
    First match the surrounding strings, ending the match just before the "body". Then start just after that header and pull out all the digits.
    if (/^foo (?=.*bar$)/g) {
        @matches = /\Gm (\d+) /g;
    }
Re: Arbitrary number of captures in a regular expression
by ikegami (Pope) on Sep 24, 2007 at 05:03 UTC

    It is possible to do with one regexp.

    local our @matches;
    $str =~ /
        (?{ [] })
        ^foo
        (?: \s+ m \s+ (\d+)
            (?{ [ @{$^R}, $^N ] })
        )+
        \s+ bar
        (?{ @matches = @{$^R} })
    /x;

    Didn't say it was nice.

Re: Arbitrary number of captures in a regular expression
by bsb (Priest) on Sep 25, 2007 at 10:19 UTC
    It's not fancy or general, but it's simple and may be adaptable to your purpose:
    $ms = '(?: m (\d+))?' x 5; # 1)
    print join(",", /^foo$ms bar$/ ),"\n" for (
        'foo m 1 m 2 m 3 m 4 bar',
        'foo m 2 m 4 m 7 bar',
        'foo m 1 bar',
    );
    __END__
    1,2,3,4,
    2,4,7,,
    1,,,,
    1) Where "any number" is equal to 5...
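The trailing commas in that output come from the optional groups that did not take part in the match, which return undef. A possible refinement, sketched here under the same fixed upper bound (the helper name `extract_nums` and the `$max` variable are my own, not bsb's), is to filter those out with grep:

```perl
use strict;
use warnings;

my $max = 5;                           # assumed upper bound on 'm N' groups
my $ms  = '(?: m (\d+))?' x $max;      # five optional capture groups
my $re  = qr/^foo$ms bar$/;

# Hypothetical helper: returns only the groups that actually matched.
sub extract_nums {
    my ($str) = @_;
    return grep { defined } $str =~ $re;
}

for my $str ('foo m 1 m 2 m 3 m 4 bar', 'foo m 2 m 4 m 7 bar', 'foo m 1 bar') {
    my @nums = extract_nums($str);
    print "$str -> (@nums)\n";
}
```

This also keeps things quiet under `use warnings`, since join no longer sees undefined values. The limit of five is still baked in, so a string with six groups simply fails to match.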
Re: Arbitrary number of captures in a regular expression
by TimToady (Parson) on Sep 25, 2007 at 16:38 UTC
    This is an unfortunate consequence of the fact that semantics of Perl 5 regex were designed by a young idiot. They got an older (and possibly wiser) idiot to design Perl 6 regex, so you should be able to say something like:
    $str ~~ mm/foo [ m (\d+) ]* bar/;
    my @matches = @$0;
    Indeed, the notion that repeating groups shouldn't throw away all but the final capture is fundamental to making Perl 6 regexes powerful enough to parse Perl 6. In a similar vein, the ordinary scalar comma operator does not throw away its left argument anymore in Perl 6 either. Perl 6 has pretty much cleaned out all the dirty little spots where Perl 5 has "return the last one" semantics.

    Update: I also forgot the 'bar'...

      TimToady:  This is an unfortunate consequence of the fact that semantics of Perl 5 regex were designed by a young idiot.

      I came to PM late and possibly didn't get the whole problem connected w/above topic. Wouldn't this be very simple to solve by code assertion?

      Did I miss something?
      my @foobar = (
          'foo m 1 m 2 m 3 m 4 bar',
          'foo m 2 m 4 m 7 bar',
          'foo m 1 bar'
      );
      my ($cnt, @match) = (0, ());
      /^foo (?:m (\d+)(?{push @{$match[$cnt]}, $^N}) )+bar(?{++$cnt})/ for @foobar;
      print map "@$_\n", @match;

      Or is it a no-go here to put code into regexes?

      Regards
      mwa
        I came to PM late and possibly didn't get the whole problem connected w/above topic.
        That's what the "in thread" link up at the top is for, I believe... :-)

        And, in fact, ikegami already posted a code assertion solution earlier in the thread.

        Wouldn't this be very simple to solve by code assertion?
        I suspect your definition of "very simple" must be different from mine. It's nice to have an escape hatch like code assertions for when the basic mechanism is insufficient (and indeed, Perl 6 provides more such escape mechanisms and also makes them easier to use), but it would be even better if the basic capture mechanism did what you wanted it to do. That's my idea of simple.

        Using lexical (my) variables from outside the regex in (?{ ... }) is dangerous. Your code will break if it's moved to a function. Use package (our/use vars) variables instead.

        Also, it's unsafe to modify @match at the point where you did modify it. If any backtracking occurs through the (?{ ... }) that changes @match, you won't get the correct result. Now, the only time your code backtracks is when the match is unsuccessful. Even if you realized that and found it acceptable, you're playing with fire, for the smallest change to the regexp can change that.
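The hazard can be seen with a string that contains 'bar' but should not match. The sketch below is mine, not from either poster: the sample string and the names `@pushed` and `@safe` are assumptions chosen for illustration. It runs the push-style pattern and a `$^R`-style pattern (same shape as the one posted earlier in the thread) against the same bad input.

```perl
use strict;
use warnings;

our (@pushed, @safe);    # package variables, as recommended above

my $bad = 'foo m 1 m 2 x bar';    # hypothetical input that must NOT match

# Unsafe: push is a side effect the engine never undoes on backtracking,
# so values collected before the failure stay in the array.
my $ok1 = $bad =~ /^foo (?:m (\d+)(?{ push @pushed, $^N }) )+bar/;

# Safer: thread the list through $^R, which the engine restores when it
# backtracks, and copy it out only after the whole pattern has matched.
my $ok2 = $bad =~ /
    (?{ [] })
    ^foo
    (?: \s+ m \s+ (\d+) (?{ [ @{$^R}, $^N ] }) )+
    \s+ bar
    (?{ @safe = @{$^R} })
/x;

printf "push style: matched=%d, left behind (%s)\n", ($ok1 ? 1 : 0), "@pushed";
printf "\$^R style:  matched=%d, left behind (%s)\n", ($ok2 ? 1 : 0), "@safe";
```

Neither pattern matches, but only the push style leaves captured values lying around after the failure, which is the kind of wrong result the warning above is about.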

        See earlier post Re: Arbitrary number of captures in a regular expression for the safe approach.

      At what cost tho? Maintaining that array and rolling it back during backtracking must impose a runtime cost for what IMO is not all that common a use case.

      ---
      $world=~s/war/peace/g

        At what cost tho? Maintaining that array and rolling it back during backtracking must impose a runtime cost for what IMO is not all that common a use case.
        Er, you're falling into Perl-5-Think here. The very fact that I used parens means that I do want to capture the array. If I didn't, I'd have used square brackets for the groupings I didn't want to capture. In Perl 6 we made it just as easy to not capture as it is to capture, so there's no need to guess about use cases in advance. You just write it how you want it.

Node Type: perlquestion [id://640610]
Approved by kyle
Front-paged by Joost