ferreira has asked for the wisdom of the Perl Monks concerning the following question:
I wrote a code that looks like this:
my $text = <<TEXT;
Title: The Moor's Last Sigh, Author: Salman Rushdie
Title: The God of Small Things, Author: Arundhati Roy
TEXT
my @answers;
my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here
while ($text =~ /$re/mgc) {
my %ans;
push @answers, [ $1, $2, $3 ]; # that's the main point!
}
use Data::Dump qw(dump);
print dump(\@answers);
And then I got the answer I want in @answers, namely:
(
["The Moor's Last Sigh", "Salman", "Rushdie"],
["The God of Small Things", "Arundhati", "Roy"],
)
Now imagine I want to run the same code against other inputs and with other regexes driving the extraction. These regexes can have an arbitrary number of capturing groups. I can't get all groups by using $text = /$re/mgc in list context, because it will make a loop over the text gathering $1, $2, ... of each match into a list (that is, $1, $2, ..., $1, $2, ..., ...). The code above is actually a simplification and it should work on a context more like this:
if ($text =~ /$re1/mgc) { ...
if ($text =~ /$re2/mgc) { ...
if ($text =~ /$re1/mgc) { ...
if ($text =~ /$re2/mgc) { ...
So the /g behavior in list context is inappropriate. To get ($1, $2, ...) in a generic form, the only way I envisaged was to use @- and @+ and to employ a piece of code like this:
# return ($1, $2, ...) matched against $s
sub _groups {
my $s = shift;
my @groups;
foreach my $i (1..$#-) {
push @groups, substr($s, $-[$i], $+[$i] - $-[$i]);
}
return @groups
}
and then:
push @answers, [ _groups($text) ]; # [ ($1, $2, ...) ]
My question is: Is there actually a better way to do this?
Re: How to get ($1, $2, ...)?
by wfsp (Abbot) on Feb 16, 2007 at 15:07 UTC
|
Perhaps put your reg exes in a loop an array?
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @res = (
qr/Title: (.*?), Author: (\w+) (\w+)$/,
qr/Title: (.*?), Author: (\w+) (\w+) Publisher: (\w+)$/,
qr/Title: (.*?), Author: (\w+) (\w+) Publisher: (\w+) Year: (\w+)$/,
);
my @answers;
while (my $line = <DATA>){
for my $re (@res){
my @results;
if (@results = $line =~ /$re/){
push @answers, [@results];
}
}
}
print Dumper \@answers;
__DATA__
Title: The Moor's Last Sigh, Author: Salman Rushdie
Title: The God of Small Things, Author: Arundhati Roy
Title: one, Author: two three Publisher: four
Title: five, Author: six seven Publisher: eight Year: nine
output:
$VAR1 = [
[
'The Moor\'s Last Sigh',
'Salman',
'Rushdie'
],
[
'The God of Small Things',
'Arundhati',
'Roy'
],
[
'one',
'two',
'three',
'four'
],
[
'five',
'six',
'seven',
'eight',
'nine'
]
];
updated:
tinkered with the format of the output
update 2:
forgot to update the code. :-( Thanks to Tanktalus for spotting it. | [reply] [d/l] [select] |
|
Reproducing a bit of your code:
my @answers;
while (my $line = <DATA>){
for my $re (@res){
my @results;
if (@results = $line =~ /$re/){
push @answers, ["@results"];
Why the quotes around @results? They weren't in the version that produced
the output you're showing.
}
}
}
You're also making an unnecessary copy of the array @results. Its scope is the loop body, so you have a new one each time through. Just take the reference:
# ...
for my $re (@res){
my @results;
push @answers, \ @results if @results = $line =~ $re;
}
# ...
Anno | [reply] [d/l] [select] |
|
# ...
push @answers, grep @$_, map [ $line =~ $_], @res;
# ...
instead of the for loop over @res.
I realize I'm expanding on a non-solution to the original question. It's art for art's sake, if that's allowed.
Anno | [reply] [d/l] [select] |
|
That won't do. I am interested in the order of the regexes and in resuming from where other left. If one uses /$re/, the search will be reset each time. In turn, with /$re/gc I may write code to look for things such as /Title: (.*?)$/, Author: (.*?), and Publisher: (.*?), but will not accept if they come out of order (like "Publisher... Title... Author...").
I have been thinking that I should have phrased this question differently, asking directly for a way to get ($1, $2, ...) in a generic manner and then showing the code for sub _groups. The background that inspired me to formulate the problem could be added as a complement, without obscuring what I was looking for.
| [reply] [d/l] [select] |
|
#!/usr/bin/perl
use strict;
use warnings;
my $text = <<TEXT;
Title: The Moor's Last Sigh, Author: Salman Rushdie
Title: The God of Small Things, Author: Arundhati Roy
Title: A very special title, Author: varianf varians
TEXT
my @answers;
my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here
my $re2= qr/Title: (.*?special.*?), Author: (\w+) (\w+)$/;
my (@MatchAll) = ($text =~ /$re2|$re/mgc);
my (@Match1,@Match2);
for (my $i=0;$i<@MatchAll;$i=$i+6) {
defined $MatchAll[$i] && push @Match2, $MatchAll[$i..$i+2];
defined $MatchAll[$i+3] && push @Match1, $MatchAll[$i+3..$i+5];
}
Output:
$ perl reg.pl
.$VAR1 = [
'A very special title',
'varianf',
'varians'
];
$VAR1 = [
'The Moor\'s Last Sigh',
'Salman',
'Rushdie',
'The God of Small Things',
'Arundhati',
'Roy'
];
P.S.: I hardcoded the boundaries for the captured fields to shortcut the coding here. Naturally this part could/should be coded more flexible if you deal with a lot of regex's. | [reply] [d/l] |
|
Since he is comparing line by line instead of the whole doc all at once, it doesn't matter that the next regex starts at the begging even if the last one matched. I know that often my problem isn't getting perl to do what i want, it is thinking i want perl to do one thing when realy there is a better solution. That's why it is good you provide your actual problem because someone might see a solution you are missing, or at very least the insight into the problem will allow people to agree you are doing it the best way, either way you get good information!
| [reply] |
|
| [reply] [d/l] [select] |
Re: How to get ($1, $2, ...)?
by rinceWind (Monsignor) on Feb 16, 2007 at 16:20 UTC
|
I very rarely use $1, $2, etc. these days, as I find they're a source of bugs for the unwary. This is mainly owing to the very special type of binding, a consequence of which is that, in the event of a failure to match, they retain their previous values.
Instead, I capture to lexical variables. In your case, a single lexical array will do:
my @answers;
my $re = qr/Title: (.*?), Author: (\w+) (\w+)$/; # 3 groups here
while (my @results = $text =~ /$re/mgc) {
my %ans;
push @answers, \@results;
}
--
Oh Lord, won’t you burn me a Knoppix CD ?
My friends all rate Windows, I must disagree.
Your powers of persuasion will set them all free,
So oh Lord, won’t you burn me a Knoppix CD ? (Missquoting Janis Joplin)
| [reply] [d/l] |
Re: How to get ($1, $2, ...)?
by fenLisesi (Priest) on Feb 16, 2007 at 13:47 UTC
|
Wouldn't you solve your problem by handling a line at a time? | [reply] |
|
Wouldn't you solve your problem by handling a line at a time?
That's almost what I do, but I need the /gc modifier to restart from where the last regex left. Then, I can't use the match in list context as explained and so I don't see an easy way to get ($1, $2, ...) for free.
| [reply] [d/l] |
Re: How to get ($1, $2, ...)?
by educated_foo (Vicar) on Feb 16, 2007 at 14:30 UTC
|
Perhaps this (without strict)?
@groups = grep defined, map $$_, 1..$last_capture
| [reply] [d/l] |
|
Please don't brainlessly use defined on the list of values of $1, $2 etc, because you're now throwing away information about which pairs of parens were involved in the actual match. Sometimes that's what you want, but usually, it's not.
And eric256 has the right idea for using @- and/or @+ to find out how many paren pairs were involved.
You could use these arrays, one item per array, together with substr to extract the matches without symbolic references, but then you'd have to know what variable the match was against, and it may not have been changed in the meantime, for example by using the regexp in s///g, which would foul up the result. So I think the symbolic references for the captures for a reasonably elegant approach.
| [reply] [d/l] |
|
my @groups;
{no strict 'refs'; @groups = map $$_, 1..$#-;}
Nice Educated_foo!
Update: Removed use of grep, didn't notice it in there the first time ;)
| [reply] [d/l] |
Re: How to get ($1, $2, ...)?
by almut (Canon) on Feb 16, 2007 at 17:28 UTC
|
Presuming I'm understanding your problem correctly :) I'd do a
two step matching process. For this, you'd need additional outer
parentheses around each regex for the m//mgc step (in scalar context).
Then you could do something like
my $text = <<TEXT;
A B
1 2
a b c
TEXT
my @rxs = (
qr/(\s*(\w)\s+(\w)\s*$)/,
qr/(\s*(\d)\s+(\d)\s*$)/,
qr/(\s*([a-z])\s+([a-z])\s+([a-z])\s*$)/,
);
my @result;
for my $rx (@rxs) {
if ($text =~ /$rx/mgc) {
my $match = $1;
# match again to extract submatches
my @grps = $match =~ $rx;
shift @grps; # remove $1 (outer parens)
push @result, [ @grps ];
}
}
use Data::Dumper;
print Dumper \@result;
Prints:
$VAR1 = [
[
'A',
'B'
],
[
'1',
'2'
],
[
'a',
'b',
'c'
]
];
Maybe not the best approach performance-wise, but at least reasonably easy to code and maintain...
Alternatively, you could also simply use $& (update: look for the first WARNING: in perldoc perlre for why you might want to avoid $&, $`, $'), i.e.
...
my @rxs = (
qr/\s*(\w)\s+(\w)\s*$/,
qr/\s*(\d)\s+(\d)\s*$/,
qr/\s*([a-z])\s+([a-z])\s+([a-z])\s*$/,
);
my @result;
for my $rx (@rxs) {
if ($text =~ /$rx/mgc) {
my $match = $&;
# match again to extract submatches
my @grps = $match =~ $rx;
push @result, [ @grps ];
}
}
| [reply] [d/l] [select] |
Rewrite for "How to get ($1, $2, ...)?
by ferreira (Chaplain) on Feb 16, 2007 at 18:31 UTC
|
Update: the question was overhauled to emphasize the main topic
which is the node title and because my first posting caused
much more confusion than it should. As the question had already many replies and votes, I posted it here as a reply following the advices of other monks.
I am looking for a solution for the following problem: given
an arbitrary regex (like qr/Title: (.*?), Author: (\w+) (\w+)$/)
with an arbitrary number of groups (not known beforehand),
how do I get ($1, $2, ...) in a generic way?
I envisaged a solution using @- and @+ and wrote the
following piece of code. (See perlvar.)
# return ($1, $2, ...) matched against $s
sub _groups {
my $s = shift;
my @groups;
foreach my $i (1..$#-) {
push @groups, substr($s, $-[$i], $+[$i] - $-[$i]);
}
return @groups
}
Then I can write:
if (/$re/mgc) {
@groups = _groups($_); # ($1, $2, ...)
}
The question is: There is a better way to do this?
Background
Why, for Heaven's sake, I think I need to get these ($1, $2, ...)?
Read more if you care.
Note 1. Before the rephrasing of this question,
educated_foo answered with a nice alternative (at Re: How to get ($1, $2, ...)?) for
_groups
and almut proposed a two-step process (at Re: How to get ($1, $2, ...)?) also in line with
the node problem. I thank all other mongers that replied and
eric256 that inspired me to rewrite this question.
Note 2. Yeah, there are modules like
Text::Scraper, Text::Template to things
like that, but they are not quite the same.
Sometimes one needs to try to reinvent some wheels, even if it
is just to have confidence on the wheels someone else made.
Note 3. demerphq pointed there is no way to do that in current production perls.
Only in blead or with a little XS for earlier versions. The best thing he think of without using XS is: my @array=eval '($'.join(',$',1..$#-).')'; Thanks.
| [reply] [d/l] [select] |
|
I don't know if this will help or not, but if you split it line by line and then trigger a new book every time you see the title, you get the same scanner like behavior without using /mgc. If you might have multiple fields in one line then it might be hard to use, but maybe some combination of the two methods would let you find a book and then use the @fields = $str =~ /$re/mgs code on just one book section at a time. For all I know you might be able to split on a boundary before "Title" and then have each book as a chunk to then run your multiple regexs on without fear of them leaking over to the next book. Good Luck! ;)
use strict;
use warnings;
use Data::Dumper;
my $test =<<HERE;
Title: The Moor's Last Sigh
Author: Salman Rushdie
Publisher: Foo
asdf
asdf
asdf a d
f
d a
sf
as
Title: The God of Small Things
Author: Arundhati Roy
Publisher: Bar
HERE
my @lines = split /\n/, $test;
my $re_title = qr/Title: (.*?)$/;
my $re_author = qr/Author: (\w+) (\w+)$/;
my $re_publisher = qr/Publisher: (.*?)$/;
my @answers;
my $book;
for my $line (@lines) {
if ($line =~ /$re_title/) {
#if this is a title line then the previous book is done being
+scanned
# so push the previous book onto answers and and clear out %bo
+ok
push @answers, $book if $book;
$book = {};
$book->{title} = $1;
} elsif ($line =~ /$re_author/) {
$book->{author} = [ $1, $2 ];
} elsif ($line =~ /$re_publisher/) {
$book->{publisher} = $1;
}
}
#push the final book
push @answers, $book;
print Dumper(\@answers);
| [reply] [d/l] [select] |
|
use strict;
use warnings;
use Data::Dumper;
my $test =<<HERE;
Title: The Moor's Last Sigh
Author: Salman Rushdie
Publisher: Foo
Title: The God of Small Things
Author: Arundhati Roy
Publisher: Bar
Title: The Moor's Last Sigh, Author: Salman Rushdie
HERE
my @books = split /(?=Title:)/, $test;
my @res = (
[ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+), Year: (\w
++)$/,
sub {
my $b = shift;
$b->{title} = $1;
$b->{author} = [$2,$3];
$b->{publisher} = $4;
$b->{year} = $5;
}
],
[ qr/^Title: (.*?), Author: (\w+) (\w+), Publisher: (\w+)$/,
sub {
my $b = shift;
$b->{title} = $1;
$b->{author} = [$2,$3];
$b->{publisher} = $4
}
],
[ qr/^Title: (.*?), Author: (\w+) (\w+)$/,
+
sub {
my $b = shift;
$b->{title} = $1;
$b->{author} = [$2,$3]
}
],
[ qr/^Title: (.*?)$/, sub { my $b = shift; $b->{title} = $
+1; } ],
[ qr/^Author: (\w+) (\w+)$/, sub { my $b = shift; $b->{author} = [
+$1,$2];}],
[ qr/^Publisher: (.*?)$/, sub { my $b = shift; $b->{publisher}= $
+1; } ],
);
my @answers;
for my $book_src (@books) {
my $book = {};
for my $re (@res) {
my $reg = $re->[0];
if ($book_src =~ /$reg/mgc){
&{$re->[1]}($book);
}
}
push @answers, $book;
}
print Dumper(\@answers);
__END__
$VAR1 = [
{
'author' => [
'Salman',
'Rushdie'
],
'title' => 'The Moor\'s Last Sigh',
'publisher' => 'Foo'
},
{
'author' => [
'Arundhati',
'Roy'
],
'title' => 'The God of Small Things',
'publisher' => 'Bar'
},
{
'author' => [
'Salman',
'Rushdie'
],
'title' => 'The Moor\'s Last Sigh'
}
];
| [reply] [d/l] |
|
| [reply] |
Benchmark - How to get ($1, $2, ...)?
by ferreira (Chaplain) on Feb 17, 2007 at 18:55 UTC
|
Ok. To conclude my explorations on the issue of this node, something which can be eventually useful for others, I summarized the three solutions to "get ( $1, $2, ... )" favoured in this thread and made a little benchmark on them.
There is the solution which uses @- and @+: no messing with magic $<n> variables, but needs to know about the variable it was matched against.
# @groups = groups1($s)
sub groups1 {
return map { substr $_[0], $-[$_], $+[$_] - $-[$_] } 1..$#-
}
There is the beautiful solution by educated_foo with symbolic references:
# educated_foo
sub groups2 {
no strict 'refs'; return map { $$_ } 1..$#-
}
And the quick solution given by demerphq:
# demerphq
sub groups3 {
return eval '($'.join(',$',1..$#-).')'
}
which happens to use eval.
The general result for 5.8.8 looks like
s/iter demerphq @- and @- educated_foo
demerphq 6.65 -- -44% -50%
@- and @- 3.72 79% -- -10%
educated_foo 3.34 99% 11% --
which suggests eval imposes a high performance penalty, making it half as fast as the educated_foo's version, which is followed closely by the solution using
@- and @+ (but this never surpasses the former). This was tested on four 5.8.8 architectures: i386-freebsd-64int, cygwin, MSWin32-x86-multi-thread (vanilla-perl), MSWin32-x86-multi-thread.
Results has shown greater variances at other versions and architectures, like v5.8.2 built for PA-RISC1.1-thread-multi, v5.8.7 built for i686-linux-thread-multi, v5.8.4 built for MSWin32-x86-multi-thread, but the order kept the same. The winner was faster than the 'eval' version with percentages ranging from 65% to 145%. But as the interpreter development has seen a lot of changes up to the code of 5.8.8, I preferred to concentrate on 5.8.8 perls. Maybe 5.8.8 tested on different processor architectures like PA-RISC and PPC may reveal more trade-offs than this partial benchmark.
The benchmark code used was:
| [reply] [d/l] [select] |
|
|