Update: the question was overhauled to emphasize the main topic, which is the node title, and because my first posting caused much more confusion than it should have. As the question already had many replies and votes, I posted the rewrite here as a reply, following the advice of other monks.

I am looking for a solution for the following problem: given an arbitrary regex (like qr/Title: (.*?), Author: (\w+) (\w+)$/) with an arbitrary number of groups (not known beforehand), how do I get ($1, $2, ...) in a generic way?

I envisaged a solution using @- and @+ and wrote the following piece of code. (See perlvar.)

    # return ($1, $2, ...) matched against $s
    sub _groups {
        my $s = shift;
        my @groups;
        foreach my $i (1..$#-) {
            push @groups, substr($s, $-[$i], $+[$i] - $-[$i]);
        }
        return @groups;
    }

Then I can write:

    if (/$re/mgc) {
        @groups = _groups($_);   # ($1, $2, ...)
    }
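For anyone who wants to try the idea in isolation, here is a minimal self-contained sketch of the same @- / @+ technique. The name groups_from_offsets is mine, and so are two small departures from _groups above: it loops to $#+ rather than $#-, and it returns undef for a group that did not take part in the match instead of calling substr on an undefined offset.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild ($1, $2, ...) for the last successful
# match against $s from the offset arrays @- and @+ (see perlvar).
sub groups_from_offsets {
    my ($s) = @_;
    my @groups;
    # $#+ counts the groups of the last successfully matched pattern,
    # so trailing groups that did not participate are kept as undef.
    foreach my $i (1 .. $#+) {
        if (defined $-[$i]) {
            push @groups, substr($s, $-[$i], $+[$i] - $-[$i]);
        }
        else {
            push @groups, undef;
        }
    }
    return @groups;
}

my $line = "Title: The Moor's Last Sigh, Author: Salman Rushdie";
my $re   = qr/Title: (.*?), Author: (\w+) (\w+)$/;

my @groups;
if ($line =~ /$re/gc) {
    @groups = groups_from_offsets($line);
}
print join(' | ', @groups), "\n";
```

The helper has to be called before any other successful match happens, because @- and @+ always describe the most recent one.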

The question is: Is there a better way to do this?


Why, for Heaven's sake, do I think I need to get ($1, $2, ...)?

Read more if you care.

I am writing code to extract pieces from a larger text in a flexible way. This is to be accomplished by a data-driven approach, based on a set of regexes.

For example, it must be able to extract the title, author and publisher from this snippet, in the right order.

    Title: The Moor's Last Sigh
    Author: Salman Rushdie
    Publisher: Foo
    Title: The God of Small Things
    Author: Arundhati Roy
    Publisher: Bar

(Note. The input text is not expected to be as tidy as this example all the time: there may be gobs of stuff to be ignored or skipped between the pieces of information that matter, like tags, whitespace, etc.)

As a simplified application of this, I wrote some code that looks like:

    my $text = THE EXAMPLE TEXT ABOVE ...

    # /m is set on the qr// objects themselves: flags on the enclosing
    # match do not reach inside an interpolated qr//, so without it
    # $ would not match at the end of each line
    my $re_title     = qr/Title: (.*?)$/m;
    my $re_author    = qr/Author: (\w+) (\w+)$/m;
    my $re_publisher = qr/Publisher: (.*?)$/m;

    my @answers;
    {
        my %book;
        if ($text =~ /$re_title/mgc) {
            $book{title} = $1;
        }
        if ($text =~ /$re_author/mgc) {
            $book{author} = [ $1, $2 ];
        }
        if ($text =~ /$re_publisher/mgc) {
            $book{publisher} = $1;
        }
        push @answers, \%book;
    }
    {
        my %book;
        if ($text =~ /$re_title/mgc) {
            $book{title} = $1;
        }
        ...

(Note. The real code is not meant to be a maintenance nightmare like the piece above. This piece looks odd, with its detached regexes, because it will be abstracted: the regexes and some of the control flow will come from data structures. What will remain is the way the text is processed.)
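To give an idea of what "regexes coming from data structures" could look like, here is a rough sketch. The @spec layout (field name plus compiled regex) and the rule "one group becomes a scalar, several become an array ref" are assumptions made for this illustration, not the design of the real code.

```perl
use strict;
use warnings;

my $text = join "\n",
    "Title: The Moor's Last Sigh",    "Author: Salman Rushdie", "Publisher: Foo",
    "Title: The God of Small Things", "Author: Arundhati Roy",  "Publisher: Bar",
    "";

# Hypothetical spec: the fields to extract, in the order they must appear.
my @spec = (
    [ title     => qr/Title: (.*?)$/m ],
    [ author    => qr/Author: (\w+) (\w+)$/m ],
    [ publisher => qr/Publisher: (.*?)$/m ],
);

my @answers;
RECORD: while (1) {
    my %book;
    for my $step (@spec) {
        my ($field, $re) = @$step;
        # Scalar-context //gc: one match, resuming at pos($text).
        last RECORD unless $text =~ /$re/gc;
        # Number of groups is unknown here, so rebuild them from @- and @+.
        my @groups = map { substr($text, $-[$_], $+[$_] - $-[$_]) } 1 .. $#-;
        $book{$field} = @groups == 1 ? $groups[0] : \@groups;
    }
    push @answers, \%book;
}

for my $book (@answers) {
    print "$book->{title} / @{ $book->{author} } / $book->{publisher}\n";
}
```

The loop stops as soon as one step fails to match, so a trailing incomplete record is simply dropped.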

The main issue here is that the modifier /gc is used to get the scanner behavior mentioned in Regexp Quote Like Operators. With it, after a match, it is possible to resume the scan from the point where the last regex left off. It also avoids building one complex regex, which would become even more complex once I depart from this simplified approach of matching regexes in sequence and start implementing things like loops, conditionals and alternations.
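A small sketch of that scanner behavior, using a two-line sample cut from the example text. The variable names are only for illustration. The point is that a scalar-context //g match records where it stopped in pos(), and /c keeps that position even when a later match fails.

```perl
use strict;
use warnings;

my $text = "Title: The Moor's Last Sigh\nAuthor: Salman Rushdie\n";

my ($title, $pos_after_title, $publisher, $pos_after_failure, @author);

# Scalar-context //g matches once and remembers where it stopped.
if ($text =~ /Title: (.*?)$/mgc) {
    $title           = $1;
    $pos_after_title = pos($text);
}

# There is no publisher in this sample, so this match fails.
# Without /c the failure would reset pos($text); with /c it stays put.
if ($text =~ /Publisher: (.*?)$/mgc) {
    $publisher = $1;
}
$pos_after_failure = pos($text);

# The next pattern therefore resumes right after the title.
if ($text =~ /Author: (\w+) (\w+)$/mgc) {
    @author = ($1, $2);
}

print "title: $title\n";
print "pos after title: $pos_after_title, after failed match: $pos_after_failure\n";
print "author: @author\n";
```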

The problem is that, to get all captured groups, I cannot call $text =~ /$re/mgc in list context, or /g will loop over the whole text and consume more of the input than I want. For example, with the example text above and

    if (@groups = $text =~ /Title: (.*?)$/mgc) {
        $book{title} = $1;
    }

The array @groups will hold ("The Moor's Last Sigh", "The God of Small Things") and leave pos($text) right before Author: Arundhati Roy (and then Salman Rushdie would be lost :). So I have to call $text =~ /$re/mgc in scalar context to get the scanner-like behavior, and that left me wanting a way to get all the groups for an arbitrary regex. That is the reason for this question.
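The difference between the two contexts can be checked with a short script. This is only a sketch over the example text; the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

my $text = join "\n",
    "Title: The Moor's Last Sigh",    "Author: Salman Rushdie", "Publisher: Foo",
    "Title: The God of Small Things", "Author: Arundhati Roy",  "Publisher: Bar",
    "";

# List context: //g runs to exhaustion and returns every captured title.
my @all_titles = $text =~ /Title: (.*?)$/mgc;

# Because of /c, pos($text) is left after the last title, so the next
# scalar-context match only sees what follows the second title.
my @author_after_list;
if ($text =~ /Author: (\w+) (\w+)$/mgc) {
    @author_after_list = ($1, $2);
}

# Scalar context: start over and take one match at a time.
pos($text) = 0;
my ($first_title, @author_after_scalar);
if ($text =~ /Title: (.*?)$/mgc) {
    $first_title = $1;
}
if ($text =~ /Author: (\w+) (\w+)$/mgc) {
    @author_after_scalar = ($1, $2);
}

print "list context titles: ", join(' | ', @all_titles), "\n";
print "author found after list-context match: @author_after_list\n";
print "scalar context title: $first_title\n";
print "author found after scalar-context match: @author_after_scalar\n";
```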

Note 1. Before the rephrasing of this question, educated_foo answered with a nice alternative (at Re: How to get ($1, $2, ...)?) for _groups and almut proposed a two-step process (at Re: How to get ($1, $2, ...)?) that is also in line with the node problem. I thank all the other mongers who replied, and eric256, who inspired me to rewrite this question.

Note 2. Yeah, there are modules like Text::Scraper and Text::Template to do things like that, but they are not quite the same. Sometimes one needs to try to reinvent some wheels, even if it is just to have confidence in the wheels someone else made.

Note 3. demerphq pointed out that there is no way to do that in current production perls, only in blead or with a little XS for earlier versions. The best thing he could think of without using XS is: my @array=eval '($'.join(',$',1..$#-).')'; Thanks.
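Wrapped in a sub, demerphq's string-eval trick could look like the sketch below. The name groups_via_eval and the early return for patterns without groups are my additions. The eval only ever sees a string built from integers, so nothing from the input text is evaluated.

```perl
use strict;
use warnings;

# Hypothetical helper: must be called right after a successful match.
sub groups_via_eval {
    my $last = $#-;                 # index of the last matched group
    return () if $last < 1;         # no groups: '($)' would not compile
    my $list = '($' . join(',$', 1 .. $last) . ')';   # e.g. '($1,$2,$3)'
    my @groups = eval $list;
    return @groups;
}

my $line = "Title: The God of Small Things, Author: Arundhati Roy";
my $re   = qr/Title: (.*?), Author: (\w+) (\w+)$/;

my @groups;
if ($line =~ /$re/gc) {
    @groups = groups_via_eval();
}
print join(' | ', @groups), "\n";
```

As far as I know, perlvar for perl 5.26 and later also documents @{^CAPTURE}, which exposes the capture groups of the last successful match as an array. It is not used in the sketch so that the code also runs on older perls.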

In reply to Rewrite for "How to get ($1, $2, ...)? by ferreira
in thread How to get ($1, $2, ...)? by ferreira
