http://www.perlmonks.org?node_id=1004225

wanna_code_perl has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

An aspect of this regex has me stumped:

my @r; # Results, array refs for each row record
$text = fetch_that_blob();
$text =~ s/ (.+?)                        (?:\s*\n)+
            (?: (\d+ (?:\sPSI)?)         (?:\s*\n)+ ){4}
          /push @r, [ $1, $2, $3, $4, $5 ]/esgx;

Of course, this only captures the description and the last number (in $1 and $2), since Perl keeps only the final iteration's match for a capture group inside a quantifier.
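To see that behaviour in isolation, here is a minimal sketch. The sample string is made up to resemble the data format and is not the real blob:

```perl
use strict;
use warnings;

# Hypothetical sample: one description followed by four readings.
my $text = "Station 1\n50 PSI\n59 PSI\n69 PSI\n74 PSI\n";

my @caps;
if ( $text =~ / (.+?) (?:\s*\n)+ (?: (\d+ (?:\sPSI)?) (?:\s*\n)+ ){4} /sx ) {
    # The pattern has only two capture groups. Group 2 is overwritten on
    # every pass through {4}, so it ends up holding the last reading.
    @caps = ( $1, $2 );
}

print join( ' | ', @caps ), "\n";    # prints: Station 1 | 74 PSI
```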

I can cut/paste line 2 of the regexp a bunch of times to account for some arbitrary maximum number of numbers and it works, but it's sort of ugly.

Data Format

Station 1
50 PSI
59 PSI
69 PSI
74 PSI
B Block
49
63
70
80
96

Note that the data may contain arbitrary newlines between lines, and whitespace at the end of lines. The numbers will sometimes include units, which will always be a literal ' PSI'. The basic pattern is (Description, number, ...).

The data comes into Perl as a single blob.

The Question That's in Here Somewhere

I could also just split /(\s*\n)+/ or such and iterate on lines building up records as I go, but is there no way I can build @r with one regex?

I know, $me->has('cake') && $me->eat('cake'); :-)

Re: Capturing all instances of a repeating sub-pattern in regex
by kennethk (Abbot) on Nov 16, 2012 at 18:41 UTC

    It seems odd that you are doing all this in a substitution rather than iterating over the results of the match, e.g. while ($text =~ /$re/g) {...}, particularly given that you are already resorting to some fairly complex machinery, including the e modifier. However, you could do this in one fell swoop by capturing the entire block you want to reparse, and using a sub-regex in list context with the g modifier to return the full list:

    $text =~ s/ (.+?)                          (?:\s*\n)+
                ((?: \d+ (?:\sPSI)?            (?:\s*\n)+ ){4})
              /push @r, [ $1, $2 =~ m|\d+ (?:\sPSI)?|xg ]/esgx;

    If I were the maintenance guy that followed you, I would not think friendly thoughts when I saw that.

    You could get something a little more maintainable by explicitly expanding your {4} terms:

    $text =~ s/ (.+?)                        (?:\s*\n)+
                (?: (\d+ (?:\sPSI)?)         (?:\s*\n)+ )
                (?: (\d+ (?:\sPSI)?)         (?:\s*\n)+ )
                (?: (\d+ (?:\sPSI)?)         (?:\s*\n)+ )
                (?: (\d+ (?:\sPSI)?)         (?:\s*\n)+ )
              /push @r, [ $1, $2, $3, $4, $5 ]/esgx;

    which could be refactored into

    my $sub_re = qr/ (\d+ (?:\sPSI)?) (?:\s*\n)+ /x;
    $text =~ s/ (.+?) (?:\s*\n)+
                $sub_re
                $sub_re
                $sub_re
                $sub_re
              /push @r, [ $1, $2, $3, $4, $5 ]/esgx;

    None of this changes the fact that you have a fundamental obfuscation by using a substitution instead of a loop.
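For comparison, a rough sketch of the loop form. It rests on assumptions that are not in the original post: descriptions never start with a digit, the blob ends with a newline, and {4} is relaxed to "one or more" readings. The sample data is made up to match the described format, and $num_re is a name chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data (assumed, not the real blob).
my $text = "Station 1\n50 PSI \n\n59 PSI\n69 PSI\n74 PSI\n"
         . "B Block\n49\n63\n70\n80\n96\n";

# One reading, with the optional unit.
my $num_re = qr/ \d+ (?:\sPSI)? /x;

my @r;
while ( $text =~ / ^ ([^\d\s].*?)              (?:\s*\n)+
                   ( (?: $num_re               (?:\s*\n)+ )+ )
                 /mgx ) {
    # Copy the captures first so the inner match cannot disturb them.
    my ( $desc, $block ) = ( $1, $2 );

    # With no capture groups, m//g in list context returns every match.
    push @r, [ $desc, $block =~ /$num_re/g ];
}

print join( ', ', @$_ ), "\n" for @r;
# prints:
# Station 1, 50 PSI, 59 PSI, 69 PSI, 74 PSI
# B Block, 49, 63, 70, 80, 96
```

The outer match only finds each record; the inner list-context match is what collects every reading, which a single quantified capture group cannot do.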


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Thanks for this reply. I appreciate the comments and that, even though you had reservations about doing this in production code (which I would too, and this is not!), you had a go at answering the question I asked.

      I'll share a bit more, just in case you're wondering what I'm thinking by going with an approach like this. It's really just a personal challenge to see if (and how) I could tackle something a different way than I normally would. Similar to obfu/JAPH/golf in that it's amusing and enlightening. I've never tried anything quite so arcane as to capture every instance of a repeating multi-line pattern within a pattern. I only got so far with it before I got stumped, which is why I naturally came here for sage advice next.

      That's the main reason, anyway. The data itself was from a personal engineering project that quite literally exploded (and set off a few car alarms) after I got what I needed out of it. It's hence sort of unlikely I'll ever need to maintain this code, and even then, the much more difficult task would be rebuilding the thing, not to mention the municipal permit. :-)

Re: Capturing all instances of a repeating sub-pattern in regex
by space_monk (Chaplain) on Nov 17, 2012 at 08:36 UTC

    This question strikes me as an example of how, just because TIMTOWTDI, it doesn't mean you should choose the most complex way.

    Choosing a less ambitious approach will give you more maintainable Perl, and I doubt it will hurt your program's performance much.
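One possible shape for that less ambitious route is the split-and-iterate idea from the question, sketched below. The sample data is made up, and the sketch assumes any line that is not a bare number (optionally followed by ' PSI') starts a new record. Note the non-capturing group in the split pattern: with capturing parentheses, split would also return the separators.

```perl
use strict;
use warnings;

# Hypothetical sample data (assumed, not the real blob).
my $text = "Station 1\n50 PSI \n\n59 PSI\n69 PSI\n74 PSI\n"
         . "B Block\n49\n63\n70\n80\n96\n";

my @r;
for my $line ( split /(?:\s*\n)+/, $text ) {
    next unless length $line;    # skip an empty leading field, if any

    if ( $line =~ /^\d+(?:\sPSI)?$/ ) {
        # A reading: attach it to the most recent record, if there is one.
        push @{ $r[-1] }, $line if @r;
    }
    else {
        # Anything else is taken to be a description and starts a record.
        push @r, [$line];
    }
}

print join( ', ', @$_ ), "\n" for @r;
# prints:
# Station 1, 50 PSI, 59 PSI, 69 PSI, 74 PSI
# B Block, 49, 63, 70, 80, 96
```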

    A Monk aims to give answers to those who have none, and to learn from those who know more.