http://www.perlmonks.org?node_id=600423

ferreira has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a code that looks like this:

my $text = <<TEXT; Title: The Moor's Last Sigh, Author: Salman Rushdie Title: The God of Small Things, Author: Arundhati Roy TEXT my @answers; my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here while ($text =~ /$re/mgc) { my %ans; push @answers, [ $1, $2, $3 ]; # that's the main point! } use Data::Dump qw(dump); print dump(\@answers);

And then I got the answer I want in @answers, namely:

( ["The Moor's Last Sigh", "Salman", "Rushdie"], ["The God of Small Things", "Arundhati", "Roy"], )

Now imagine I want to run the same code against other inputs and with other regexes driving the extraction. These regexes can have an arbitrary number of capturing groups. I can't get all groups by using $text = /$re/mgc in list context, because it will make a loop over the text gathering $1, $2, ... of each match into a list (that is, $1, $2, ..., $1, $2, ..., ...). The code above is actually a simplification and it should work on a context more like this:

if ($text =~ /$re1/mgc) { ... if ($text =~ /$re2/mgc) { ... if ($text =~ /$re1/mgc) { ... if ($text =~ /$re2/mgc) { ...

So the /g behavior in list context is inappropriate. To get ($1, $2, ...) in a generic form, the only way I envisaged was to use @- and @+ and to employ a piece of code like this:

# return ($1, $2, ...) matched against $s sub _groups { my $s = shift; my @groups; foreach my $i (1..$#-) { push @groups, substr($s, $-[$i], $+[$i] - $-[$i]); } return @groups }

and then:

push @answers, [ _groups($text) ]; # [ ($1, $2, ...) ]

My question is: Is there actually a better way to do this?

Replies are listed 'Best First'.
Re: How to get ($1, $2, ...)?
by wfsp (Abbot) on Feb 16, 2007 at 15:07 UTC
    Perhaps put your reg exes in a loop an array?
    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @res = ( qr/Title: (.*?), Author: (\w+) (\w+)$/, qr/Title: (.*?), Author: (\w+) (\w+) Publisher: (\w+)$/, qr/Title: (.*?), Author: (\w+) (\w+) Publisher: (\w+) Year: (\w+)$/, ); my @answers; while (my $line = <DATA>){ for my $re (@res){ my @results; if (@results = $line =~ /$re/){ push @answers, [@results]; } } } print Dumper \@answers; __DATA__ Title: The Moor's Last Sigh, Author: Salman Rushdie Title: The God of Small Things, Author: Arundhati Roy Title: one, Author: two three Publisher: four Title: five, Author: six seven Publisher: eight Year: nine
    output:
    $VAR1 = [ [ 'The Moor\'s Last Sigh', 'Salman', 'Rushdie' ], [ 'The God of Small Things', 'Arundhati', 'Roy' ], [ 'one', 'two', 'three', 'four' ], [ 'five', 'six', 'seven', 'eight', 'nine' ] ];
    updated:
    tinkered with the format of the output
    update 2:
    forgot to update the code. :-( Thanks to Tanktalus for spotting it.
      Reproducing a bit of your code:
      my @answers; while (my $line = <DATA>){ for my $re (@res){ my @results; if (@results = $line =~ /$re/){ push @answers, ["@results"];
      Why the quotes around @results? They weren't in the version that produced the output you're showing.
      } } }
      You're also making an unnecessary copy of the array @results. Its scope is the loop body, so you have a new one each time through. Just take the reference:
      # ... for my $re (@res){ my @results; push @answers, \ @results if @results = $line =~ $re; } # ...

      Anno

        Oh, or even
        # ... push @answers, grep @$_, map [ $line =~ $_], @res; # ...
        instead of the for loop over @res.

        I realize I'm expanding on a non-solution to the original question. It's art for art's sake, if that's allowed.

        Anno

      That won't do. I am interested in the order of the regexes and in resuming from where other left. If one uses /$re/, the search will be reset each time. In turn, with /$re/gc I may write code to look for things such as /Title: (.*?)$/, Author: (.*?), and Publisher: (.*?), but will not accept if they come out of order (like "Publisher... Title... Author...").

      I have been thinking that I should have phrased this question differently, asking directly for a way to get ($1, $2, ...) in a generic manner and then showing the code for sub _groups. The background that inspired me to formulate the problem could be added as a complement, without obscuring what I was looking for.

        If you really need to hang on to the //gmc regex construct then you could opt to include the regex's as alternatives. Afterwards split the grouped result per regex based on field position in the group.
        Update:
        - note that the order of the alternatives influences which one will match first each time (and that's what you wanted right?)
        - since the total regex is just one expression your program will examine the text only once -> performance gain

        See below for an example to get the idea.

        #!/usr/bin/perl use strict; use warnings; my $text = <<TEXT; Title: The Moor's Last Sigh, Author: Salman Rushdie Title: The God of Small Things, Author: Arundhati Roy Title: A very special title, Author: varianf varians TEXT my @answers; my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here my $re2= qr/Title: (.*?special.*?), Author: (\w+) (\w+)$/; my (@MatchAll) = ($text =~ /$re2|$re/mgc); my (@Match1,@Match2); for (my $i=0;$i<@MatchAll;$i=$i+6) { defined $MatchAll[$i] && push @Match2, $MatchAll[$i..$i+2]; defined $MatchAll[$i+3] && push @Match1, $MatchAll[$i+3..$i+5]; } Output: $ perl reg.pl .$VAR1 = [ 'A very special title', 'varianf', 'varians' ]; $VAR1 = [ 'The Moor\'s Last Sigh', 'Salman', 'Rushdie', 'The God of Small Things', 'Arundhati', 'Roy' ];
        P.S.: I hardcoded the boundaries for the captured fields to shortcut the coding here. Naturally this part could/should be coded more flexible if you deal with a lot of regex's.

        Since he is comparing line by line instead of the whole doc all at once, it doesn't matter that the next regex starts at the begging even if the last one matched. I know that often my problem isn't getting perl to do what i want, it is thinking i want perl to do one thing when realy there is a better solution. That's why it is good you provide your actual problem because someone might see a solution you are missing, or at very least the insight into the problem will allow people to agree you are doing it the best way, either way you get good information!


        ___________
        Eric Hodges
        I'd try a combination of m//g in scalar context and using the \G marker. If necessary, you can control where it matches by setting pos().

        Sorry for not presenting a coded solution, I don't understand your problem well enough to give one.

        Anno

Re: How to get ($1, $2, ...)?
by rinceWind (Monsignor) on Feb 16, 2007 at 16:20 UTC

    I very rarely use $1, $2, etc. these days, as I find they're a source of bugs for the unwary. This is mainly owing to the very special type of binding, a consequence of which is that, in the event of a failure to match, they retain their previous values.

    Instead, I capture to lexical variables. In your case, a single lexical array will do:

    my @answers; my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here while (my @results = $text =~ /$re/mgc) { my %ans; push @answers, \@results; }

    --

    Oh Lord, won’t you burn me a Knoppix CD ?
    My friends all rate Windows, I must disagree.
    Your powers of persuasion will set them all free,
    So oh Lord, won’t you burn me a Knoppix CD ?
    (Missquoting Janis Joplin)

Re: How to get ($1, $2, ...)?
by fenLisesi (Priest) on Feb 16, 2007 at 13:47 UTC
    Wouldn't you solve your problem by handling a line at a time?
      Wouldn't you solve your problem by handling a line at a time?

      That's almost what I do, but I need the /gc modifier to restart from where the last regex left. Then, I can't use the match in list context as explained and so I don't see an easy way to get ($1, $2, ...) for free.

Re: How to get ($1, $2, ...)?
by educated_foo (Vicar) on Feb 16, 2007 at 14:30 UTC
    Perhaps this (without strict)?
    @groups = grep defined, map $$_, 1..$last_capture
      Please don't brainlessly use defined on the list of values of $1, $2 etc, because you're now throwing away information about which pairs of parens were involved in the actual match. Sometimes that's what you want, but usually, it's not.

      And eric256 has the right idea for using @- and/or @+ to find out how many paren pairs were involved.

      You could use these arrays, one item per array, together with substr to extract the matches without symbolic references, but then you'd have to know what variable the match was against, and it may not have been changed in the meantime, for example by using the regexp in s///g, which would foul up the result. So I think the symbolic references for the captures for a reasonably elegant approach.

      my @groups; {no strict 'refs'; @groups = map $$_, 1..$#-;}

      Nice Educated_foo!

      Update: Removed use of grep, didn't notice it in there the first time ;)


      ___________
      Eric Hodges
Re: How to get ($1, $2, ...)?
by almut (Canon) on Feb 16, 2007 at 17:28 UTC

    Presuming I'm understanding your problem correctly :) I'd do a two step matching process. For this, you'd need additional outer parentheses around each regex for the m//mgc step (in scalar context). Then you could do something like

    my $text = <<TEXT; A B 1 2 a b c TEXT my @rxs = ( qr/(\s*(\w)\s+(\w)\s*$)/, qr/(\s*(\d)\s+(\d)\s*$)/, qr/(\s*([a-z])\s+([a-z])\s+([a-z])\s*$)/, ); my @result; for my $rx (@rxs) { if ($text =~ /$rx/mgc) { my $match = $1; # match again to extract submatches my @grps = $match =~ $rx; shift @grps; # remove $1 (outer parens) push @result, [ @grps ]; } } use Data::Dumper; print Dumper \@result;

    Prints:

    $VAR1 = [ [ 'A', 'B' ], [ '1', '2' ], [ 'a', 'b', 'c' ] ];

    Maybe not the best approach performance-wise, but at least reasonably easy to code and maintain...

    Alternatively, you could also simply use $& (update: look for the first WARNING: in perldoc perlre for why you might want to avoid $&, $`, $'), i.e.

    ... my @rxs = ( qr/\s*(\w)\s+(\w)\s*$/, qr/\s*(\d)\s+(\d)\s*$/, qr/\s*([a-z])\s+([a-z])\s+([a-z])\s*$/, ); my @result; for my $rx (@rxs) { if ($text =~ /$rx/mgc) { my $match = $&; # match again to extract submatches my @grps = $match =~ $rx; push @result, [ @grps ]; } }
Rewrite for "How to get ($1, $2, ...)?
by ferreira (Chaplain) on Feb 16, 2007 at 18:31 UTC

    Update: the question was overhauled to emphasize the main topic which is the node title and because my first posting caused much more confusion than it should. As the question had already many replies and votes, I posted it here as a reply following the advices of other monks.

    I am looking for a solution for the following problem: given an arbitrary regex (like qr/Title: (.*?), Author: (\w+) (\w+)$/) with an arbitrary number of groups (not known beforehand), how do I get ($1, $2, ...) in a generic way?

    I envisaged a solution using @- and @+ and wrote the following piece of code. (See perlvar.)

    # return ($1, $2, ...) matched against $s sub _groups { my $s = shift; my @groups; foreach my $i (1..$#-) { push @groups, substr($s, $-[$i], $+[$i] - $-[$i]); } return @groups }

    Then I can write:

    if (/$re/mgc) { @groups = _groups($_); # ($1, $2, ...) }

    The question is: There is a better way to do this?

    Background

    Why, for Heaven's sake, I think I need to get these ($1, $2, ...)?

    Read more if you care.

    Note 1. Before the rephrasing of this question, educated_foo answered with a nice alternative (at Re: How to get ($1, $2, ...)?) for _groups and almut proposed a two-step process (at Re: How to get ($1, $2, ...)?) also in line with the node problem. I thank all other mongers that replied and eric256 that inspired me to rewrite this question.

    Note 2. Yeah, there are modules like Text::Scraper, Text::Template to things like that, but they are not quite the same. Sometimes one needs to try to reinvent some wheels, even if it is just to have confidence on the wheels someone else made.

    Note 3. demerphq pointed there is no way to do that in current production perls. Only in blead or with a little XS for earlier versions. The best thing he think of without using XS is: my @array=eval '($'.join(',$',1..$#-).')'; Thanks.

      I don't know if this will help or not, but if you split it line by line and then trigger a new book every time you see the title, you get the same scanner like behavior without using /mgc. If you might have multiple fields in one line then it might be hard to use, but maybe some combination of the two methods would let you find a book and then use the @fields = $str =~ /$re/mgs code on just one book section at a time. For all I know you might be able to split on a boundary before "Title" and then have each book as a chunk to then run your multiple regexs on without fear of them leaking over to the next book. Good Luck! ;)

      use strict; use warnings; use Data::Dumper; my $test =<<HERE; Title: The Moor's Last Sigh Author: Salman Rushdie Publisher: Foo asdf asdf asdf a d f d a sf as Title: The God of Small Things Author: Arundhati Roy Publisher: Bar HERE my @lines = split /\n/, $test; my $re_title = qr/Title: (.*?)$/; my $re_author = qr/Author: (\w+) (\w+)$/; my $re_publisher = qr/Publisher: (.*?)$/; my @answers; my $book; for my $line (@lines) { if ($line =~ /$re_title/) { #if this is a title line then the previous book is done being +scanned # so push the previous book onto answers and and clear out %bo +ok push @answers, $book if $book; $book = {}; $book->{title} = $1; } elsif ($line =~ /$re_author/) { $book->{author} = [ $1, $2 ]; } elsif ($line =~ /$re_publisher/) { $book->{publisher} = $1; } } #push the final book push @answers, $book; print Dumper(\@answers);

      ___________
      Eric Hodges

        This is kinda a fun little challenge and so I played with it more. Below is a mix of several different solutions here, i know they probably wont help you but since i've worked it out and tested it some i figured i'd share:

        use strict; use warnings; use Data::Dumper; my $test =<<HERE; Title: The Moor's Last Sigh Author: Salman Rushdie Publisher: Foo Title: The God of Small Things Author: Arundhati Roy Publisher: Bar Title: The Moor's Last Sigh, Author: Salman Rushdie HERE my @books = split /(?=Title:)/, $test; my @res = ( [ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+), Year: (\w ++)$/, sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3]; $b->{publisher} = $4; $b->{year} = $5; } ], [ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+)$/, sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3]; $b->{publisher} = $4 } ], [ qr/^Title: (.*?), Author: (\w+) (\w+)$/, + sub { my $b = shift; $b->{title} = $1; $b->{author} = [$2,$3] } ], [ qr/^Title: (.*?)$/, sub { my $b = shift; $b->{title} = $ +1; } ], [ qr/^Author: (\w+) (\w+)$/, sub { my $b = shift; $b->{author} = [ +$1,$2];}], [ qr/^Publisher: (.*?)$/, sub { my $b = shift; $b->{publisher}= $ +1; } ], ); my @answers; for my $book_src (@books) { my $book = {}; for my $re (@res) { my $reg = $re->[0]; if ($book_src =~ /$reg/mgc){ &{$re->[1]}($book); } } push @answers, $book; } print Dumper(\@answers); __END__ $VAR1 = [ { 'author' => [ 'Salman', 'Rushdie' ], 'title' => 'The Moor\'s Last Sigh', 'publisher' => 'Foo' }, { 'author' => [ 'Arundhati', 'Roy' ], 'title' => 'The God of Small Things', 'publisher' => 'Bar' }, { 'author' => [ 'Salman', 'Rushdie' ], 'title' => 'The Moor\'s Last Sigh' } ];

        ___________
        Eric Hodges
      Even after reading and re-reading your Update, I still don't quite get your problem.
      1. Do your patterns span multiple lines, or must each pattern match within a single line?
      2. Do your patterns always start matching at the beginning of a line, or do you sometimes pickup a pattern halfway through a line?
      If the patterns can span multiple lines, when you apply the first pattern, it's may skip over text that will match the second (third, fourth...) pattern. Won't it?

      Also, unless the patterns are contstrained to start matching at the beginning of a line, it sounds like you're heading for a wildly inefficient and slow routine. Or am I misunderstanding you?

      throop

Benchmark - How to get ($1, $2, ...)?
by ferreira (Chaplain) on Feb 17, 2007 at 18:55 UTC

    Ok. To conclude my explorations on the issue of this node, something which can be eventually useful for others, I summarized the three solutions to "get ( $1, $2, ... )" favoured in this thread and made a little benchmark on them.

    There is the solution which uses @- and @+: no messing with magic $<n> variables, but needs to know about the variable it was matched against.

    # @groups = groups1($s) sub groups1 { return map { substr $_[0], $-[$_], $+[$_] - $-[$_] } 1..$#- }

    There is the beautiful solution by educated_foo with symbolic references:

    # educated_foo sub groups2 { no strict 'refs'; return map { $$_ } 1..$#- }

    And the quick solution given by demerphq:

    # demerphq sub groups3 { return eval '($'.join(',$',1..$#-).')' }

    which happens to use eval.

    The general result for 5.8.8 looks like

    s/iter demerphq @- and @- educated_foo demerphq 6.65 -- -44% -50% @- and @- 3.72 79% -- -10% educated_foo 3.34 99% 11% --

    which suggests eval imposes a high performance penalty, making it half as fast as the educated_foo's version, which is followed closely by the solution using @- and @+ (but this never surpasses the former). This was tested on four 5.8.8 architectures: i386-freebsd-64int, cygwin, MSWin32-x86-multi-thread (vanilla-perl), MSWin32-x86-multi-thread.

    Results has shown greater variances at other versions and architectures, like v5.8.2 built for PA-RISC1.1-thread-multi, v5.8.7 built for i686-linux-thread-multi, v5.8.4 built for MSWin32-x86-multi-thread, but the order kept the same. The winner was faster than the 'eval' version with percentages ranging from 65% to 145%. But as the interpreter development has seen a lot of changes up to the code of 5.8.8, I preferred to concentrate on 5.8.8 perls. Maybe 5.8.8 tested on different processor architectures like PA-RISC and PPC may reveal more trade-offs than this partial benchmark.

    The benchmark code used was: