http://www.perlmonks.org?node_id=1085808

vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I'm quite embarrassed, but I cannot figure out what's going on here:

#!perl
use strict;
use warnings;

my $data = <DATA>;
chomp $data;
my @f = ($data =~ m!((\w+),+)+!g);
print join("\t", @f) . "\n";

__DATA__
qwerty,asd,zxcvbnm,fgh,jkl,uiop,

Output:

uiop, uiop

I was expecting to receive many elements in the array: every word twice (one with and one without the comma), not just the last one.

What am I missing?

Re: Iterations in regex
by toolic (Bishop) on May 12, 2014 at 15:44 UTC
    Don't use the last +
    my @f = ($data =~ m!((\w+),+)!g);

    YAPE::Regex::Explain

    The regular expression:

    (?-imsx:((\w+),+)+)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1 (1 or more times
                               (matching the most amount possible)):
    ----------------------------------------------------------------------
        (                        group and capture to \2:
    ----------------------------------------------------------------------
          \w+                      word characters (a-z, A-Z, 0-9, _) (1
                                   or more times (matching the most
                                   amount possible))
    ----------------------------------------------------------------------
        )                        end of \2
    ----------------------------------------------------------------------
        ,+                       ',' (1 or more times (matching the most
                                 amount possible))
    ----------------------------------------------------------------------
      )+                       end of \1 (NOTE: because you are using a
                               quantifier on this capture, only the LAST
                               repetition of the captured pattern will be
                               stored in \1)
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
Re: Iterations in regex
by LanX (Saint) on May 12, 2014 at 16:08 UTC
    toolic is right: with the last plus you ARE matching all words already in the first "iteration" (so /g is useless), but only the last matches can be returned (the earlier ones are overwritten).

    Without the + the /g will produce multiple attempts and return matches for each one.
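
To make the difference concrete, here is a small sketch that runs both patterns against the string from the original post. The variable names @with_plus and @without_plus are made up for this illustration.

```perl
#!perl
use strict;
use warnings;

my $data = 'qwerty,asd,zxcvbnm,fgh,jkl,uiop,';

# Outer + present: a single match consumes the whole string, so each
# capture group keeps only the text of its last repetition.
my @with_plus = ($data =~ m!((\w+),+)+!g);

# Outer + removed: /g restarts after every word and returns both
# capture groups for each individual match.
my @without_plus = ($data =~ m!((\w+),+)!g);

print scalar(@with_plus),    " elements: @with_plus\n";
print scalar(@without_plus), " elements: @without_plus\n";
```

The first list should hold just the two captures from the final repetition, while the second holds a pair of captures per word.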

    BTW: did you really mean ,+ ??? Looks weird...

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thanks to both of you. I've tried adding and removing the g modifier, but never tried without the last operator.

      The sample code was a simplification of my real problem, where I'm trying to capture one specific record from one kind of table in a set of HTML documents, where each field has its own line in the source.

      Doing it that way, I had to split the original regex in two:

      1. one to identify the required record by the value of its first field
      2. another to capture the data fields

      BTW, the original regex was something like this:

      my ($k, @f) = ($h =~ m!<td.*?>(required_\d+_\d+.txt)</td>\s+(<td.*?> +(.+?)</td>\s+)+</tr>!m);

      Then, the ",+" actually meant whitespace "\s+", but I wanted to make them visible in the output. ;-)
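
A rough sketch of the two-step split described above, under some assumptions: the HTML sample and the $rest variable are invented for illustration (the real documents are not shown in this thread), and the pattern uses [^>]* and an escaped dot in place of the original .*? and bare dot so that the /s modifier cannot let a match run across cells.

```perl
#!perl
use strict;
use warnings;

# Hypothetical sample: one field per line, as described in the post.
my $h = <<'HTML';
<table>
<tr>
<td class="k">other_1_2.txt</td>
<td> aaa</td>
<td> bbb</td>
</tr>
<tr>
<td class="k">required_10_20.txt</td>
<td> first</td>
<td> second</td>
<td> third</td>
</tr>
</table>
HTML

# Step 1: find the required record by its first field and keep the
# remainder of that row (everything up to the closing </tr>).
my ($k, $rest) = $h =~ m!<td[^>]*>(required_\d+_\d+\.txt)</td>\s+(.*?)</tr>!s
    or die "record not found\n";

# Step 2: capture the data fields. The group has no quantifier, so /g
# returns one capture per field instead of only the last one.
my @f = $rest =~ m!<td[^>]*> +(.+?)</td>\s+!g;

print join("\t", $k, @f), "\n";
```

Because the second pattern is applied only to the row found in step 1, fields from other records cannot leak into @f.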