Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Salutations, dearest monks.
I am trying to write a parsery sort of thing, but ran into a Perl syntax problem. This is the layout of the code I have:
for my $syn (@syntax) {
my ($re, $cb) = @$syn;
if (my (@matches) = ($line =~ $re)) {
$cb->(@matches);
last;
}
}
As you can see, I have an array of possible syntax elements (@syntax) that each houses a regexp ($re) and then a callback function ($cb). The callback function expects the regexp's capture groups as arguments.
It then occurred to me that the code won't run if the regexp has no capture groups!
I can, of course, say if ($line =~ $re) { ... } but then I lose the captures. I need the captures.
I need a) to know that the regexp matched, and b) the capture groups returned by the regexp.
What to do? Is there a syntax that allows both? Do I run my regexp twice? Or do I just add a dummy capture group into every regexp?
Re: How to know that a regexp matched, and get its capture groups?
by tybalt89 (Monsignor) on Jan 09, 2023 at 19:44 UTC
if ( $line =~ $re ) {
$cb->(@{^CAPTURE});
last;
}
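To illustrate the snippet above, here is a minimal sketch, assuming Perl 5.26 or later. The helper name first_match and the sample @syntax table are made up for the example and are not from the original question. The point it tries to show is that @{^CAPTURE} holds only the real capture groups of the last successful match, so a pattern without groups hands the callback an empty list rather than the placeholder 1.

```perl
use v5.26;    # @{^CAPTURE} needs Perl 5.26 or later (also enables strict)
use warnings;

# Hypothetical dispatch table: [ pattern, callback ] pairs.
my @syntax = (
    [ qr/^(\w+)=(\w+)$/, sub { "assign:" . join(",", @_) } ],
    [ qr/^quit$/,        sub { "quit:" . scalar(@_) } ],
);

# Run the callback of the first pattern that matches $line.
# Returns the callback's result, or undef if nothing matched.
sub first_match {
    my ($line) = @_;
    for my $syn (@syntax) {
        my ($re, $cb) = @$syn;
        if ($line =~ $re) {
            # Empty list when $re has no capture groups.
            return $cb->(@{^CAPTURE});
        }
    }
    return;
}
```

The match itself is tested in scalar context, so "did it match" and "what was captured" are answered separately and the pattern only runs once.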
Note that you need 5.26+ for @{^CAPTURE}.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
if ( my (@caps) = ($line =~ $re) ) {
no warnings 'uninitialized';
@caps = () if $caps[0] ne $1; # reset pseudo captures
$cb->(@caps);
last;
}
/update
This should be backward compatible
my (@matches) = ($line =~ $re);
if (defined $&) {
$cb->(@matches);
last;
}
# tests...
use v5.12;
use warnings;
for my $str ("AB","") {
say "****** str=<$str>";
for my $re ( qr/../, qr/(.)(.)/, q/XY/, q/(X)Y/, q// ) {
say "--- re=<$re>";
my @captures = $str =~ $re;
if ( defined $& ) {
say "matched"
} else {
say "no match"
}
if (defined $1) {
say "with captures <@captures>";
} else {
say "no captures";
}
}
}
****** str=<AB>
--- re=<(?^u:..)>
matched
no captures
--- re=<(?^u:(.)(.))>
matched
with captures <A B>
--- re=<XY>
no match
no captures
--- re=<(X)Y>
no match
no captures
--- re=<>
matched
no captures
****** str=<>
--- re=<(?^u:..)>
no match
no captures
--- re=<(?^u:(.)(.))>
no match
no captures
--- re=<XY>
no match
no captures
--- re=<(X)Y>
no match
no captures
--- re=<>
matched
no captures
Re: How to know that a regexp matched, and get its capture groups?
by Corion (Patriarch) on Jan 09, 2023 at 19:36 UTC
Re: How to know that a regexp matched, and get its capture groups?
by NERDVANA (Hermit) on Jan 09, 2023 at 22:56 UTC
As tybalt89 wrote, @{^CAPTURE} is what you're looking for, but don't forget named captures and %+. From the perlvar documentation:
For example, $+{foo} is equivalent to $1 after the following match:
'foo' =~ /(?<foo>foo)/;
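As a small sketch of how named captures could feed a callback-style parser: the helper name parse_pair and the key=value input format are assumptions made up for this example. It copies %+ immediately, because the hash reflects the last successful match and changes as soon as another pattern matches.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key=value" and return the named captures
# as a hash reference, or undef if the line does not match.
sub parse_pair {
    my ($line) = @_;
    return unless $line =~ /^(?<key>\w+)=(?<value>\w+)$/;
    # Copy %+ right away: it is tied to the last successful match.
    my %caps = %+;
    return \%caps;
}
```

A callback that receives such a hash does not have to care about the position of each group, which helps when patterns get reordered or extended later.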
The next cool feature of perl for parsing that you should probably be aware of is "pos" and "\G" and the /c regex switch. As it happens, you're in luck, because David Raab just wrote a blog post fully explaining it! (just saw that in Perl Weekly email earlier today)
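For readers who have not used these yet, a minimal sketch of the pos / \G / /gc combination follows. The tokenize function and its three token kinds are invented for illustration. \G anchors each attempt at the current pos, /g advances pos on success, and /c keeps pos where it was when an attempt fails, so the next alternative can be tried at the same offset.

```perl
use strict;
use warnings;

# Hypothetical mini-tokenizer: numbers, words and "+" signs.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    while (pos($input) < length($input)) {
        if    ($input =~ /\G\s+/gc)         { next }                      # skip whitespace
        elsif ($input =~ /\G(\d+)/gc)       { push @tokens, "NUM($1)" }
        elsif ($input =~ /\G([A-Za-z]+)/gc) { push @tokens, "WORD($1)" }
        elsif ($input =~ /\G\+/gc)          { push @tokens, "PLUS" }
        else {
            # Nothing matched at this offset; pos is unchanged thanks to /c.
            die "Syntax error at offset " . pos($input) . "\n";
        }
    }
    return @tokens;
}
```

Each match is in scalar context, so every call consumes exactly one token and the loop stays in control of when to stop.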
And if that wasn't enough, along your parsing journey you might discover it's a bit slow to iterate through a bunch of @syntax items at each point along the parse (as in dozens or more; fewer than 10 is probably fine the way you are doing it). When you come to this problem, the solution is to dynamically build a string of code that looks like this:
sub {
/\G (?:
... (?{ code1(...); }) # pattern 1, handler for pattern 1
| ... (?{ code2(...); }) # pattern 2, handler for pattern 2
| ... (?{ code3(...); }) # and so on
)/gcx;
}
You then need to eval that to ensure perl compiles it. (qr// notation is not guaranteed to compile it, and usually doesn't)
sub parse {
my $input= shift;
my $code= ... # assemble regex sub text like above
my $lexer= eval $code
or die "BUG: syntax error in generated code: $@";
local $_= $input;
&$lexer || die "Syntax error at '" . substr($_, pos, 10) . "'"
while pos < length;
}
and then you've reached about the highest performance Perl can give you for parsing! The final speedup is to let perl do the looping for you by putting (...)++ on the regex you built (++ ensures that perl doesn't try to backtrack) but then you lose the ability to stop the loop and it runs until all input is exhausted.
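A runnable sketch of the generated-lexer idea described above, under some assumptions: the @rules table, the token names, the @tokens package variable and the lex function are all hypothetical, and each rule's pattern is assumed to contain no capture groups of its own. Every alternative is wrapped in one group, and the embedded code block reads that group's text through $^N.

```perl
use strict;
use warnings;

# Hypothetical rule table: [ token name, pattern source without groups ].
my @rules = (
    [ NUM   => '[0-9]+' ],
    [ IDENT => '[A-Za-z_][A-Za-z_0-9]*' ],
    [ WS    => '\s+' ],
);

our @tokens;    # filled by the embedded code blocks

# One alternative per rule: "(pattern) (?{ record the token })".
# $^N is the text of the most recently closed group.
my $alts = join "\n      | ",
    map { "($_->[1]) (?{ push \@main::tokens, ['$_->[0]', \$^N] })" } @rules;

my $code = "sub {\n    /\\G (?:\n        $alts\n    )/gcx;\n}\n";

my $lexer = eval $code
    or die "BUG: syntax error in generated code: $@";

# Tokenize $input and return "NAME:text" strings.
sub lex {
    my ($input) = @_;
    local @tokens = ();
    local $_ = $input;
    pos($_) = 0;
    while (pos($_) < length($_)) {
        $lexer->()
            or die "Syntax error at '" . substr($_, pos($_), 10) . "'\n";
    }
    return map { "$_->[0]:$_->[1]" } @tokens;
}
```

Because the pattern is a literal inside the eval'd source, the code blocks are compiled together with the sub. This sketch keeps the outer while loop, so it does not use the (...)++ variant mentioned above.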
Re: How to know that a regexp matched, and get its capture groups? (updated)
by haukex (Archbishop) on Jan 10, 2023 at 10:08 UTC
The others have already given you some ideas for better parsing. However, you're mistaken on the premise of the question:
It then occurred to me that the code won't run if the regexp has no capture groups! ... I need the captures.
The code will still run - a regex in list context without /g and without capture groups will return the list (1) if it matched, so the assignment will evaluate to true. See also.
use warnings;
use strict;
use Data::Dump;
my @syntax = ( [qr/cd/, sub { dd "callback", \@_ }] );
my $line = "abcdef";
for my $syn (@syntax) {
my ($re, $cb) = @$syn;
if (my (@matches) = ($line =~ $re)) {
$cb->(@matches);
last;
}
}
__END__
("callback", [1])
Update: And $#+ will give you the number of capture groups present in the last successful match (see also).
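Putting the two observations together, here is a hedged sketch of how $#+ could be used to drop the placeholder (1). The helper name real_captures is made up for this example; it should also work on Perls older than 5.26, since it does not rely on @{^CAPTURE}.

```perl
use strict;
use warnings;

# Hypothetical helper: returns undef if $re did not match $line, otherwise
# an array reference holding only the real capture groups (possibly empty).
sub real_captures {
    my ($line, $re) = @_;
    my @m = $line =~ $re
        or return;
    my $groups = $#+;            # number of capture groups in the last successful match
    return $groups ? [@m] : [];  # drop the placeholder (1) when there are no groups
}
```

In the original loop this would read something like "if (my $caps = real_captures($line, $re)) { $cb->(@$caps); last; }", since an empty array reference is still a true value.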
Re: How to know that a regexp matched, and get its capture groups?
by GrandFather (Saint) on Jan 09, 2023 at 21:09 UTC
Parsers can be tricky. You may be interested in looking at a parsing tool such as Marpa::R2 to do most of the heavy lifting for you so that you can concentrate on syntax and output from the parser.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: How to know that a regexp matched, and get its capture groups?
by LanX (Sage) on Jan 09, 2023 at 20:34 UTC
I'm confused...
> It then occurred to me that the code won't run if the regexp has no capture groups!
Why do you use the if if you don't want to test if there was no @match? (empty list-assignments are false)
and if you need the if b/c you wanna handle a no-match separately, why don't you use an else branch?
edit
OK, I think your problem is that $re can match, and hence be true, without any internal syntax defining (capture) groups.
Hence check if (@match) separately, whether with surrounding if for the $re or not depends on the desired logic...
update
After some testing, I like tybalt89's solution (Re: How to know that a regexp matched, and get its capture groups?) the most.
> I'm confused...
I need to find the first regexp that is stored in @syntax that matches, and discard the rest. When one matches, I need the capture groups, if any.
But I'll take the other answer, that a regexp returns true even without capture groups. I could not quickly find any mention of return values in either perlre or perlsyn, both rather hefty pages, before asking. Was probably looking in the wrong place anyway... Ah, yes, I found it in perlop now:
Matching in list context
If the "/g" option is not used, "m//" in list context returns a
list consisting of the subexpressions matched by the
parentheses in the pattern, that is, ($1, $2, $3...) (Note
that here $1 etc. are also set). When there are no parentheses
in the pattern, the return value is the list "(1)" for success.
With or without parentheses, an empty list is returned upon
failure.
I didn't expect this many answers, to be honest...
> I didn't expect this many answers, to be honest...
Because the problem is not easy to grasp; normally one knows beforehand whether captures are expected.
And the documentation is accurate.
My last solution here should fix the fake capture issue in a straightforward way, without any performance or version penalty.
Re: How to know that a regexp matched, and get its capture groups?
by ikegami (Patriarch) on Jan 10, 2023 at 13:56 UTC
It then occurred to me that the code won't run if the regexp has no capture groups!
It will run. The match returns 1 in such circumstances.
$ perl -Mv5.14 -e'if ( my @m = "b" =~ /(a)/ ) { say "@m"; }'
$ perl -Mv5.14 -e'if ( my @m = "a" =~ /(a)/ ) { say "@m"; }'
a
$ perl -Mv5.14 -e'if ( my @m = "b" =~ /a/ ) { say "@m"; }'
$ perl -Mv5.14 -e'if ( my @m = "a" =~ /a/ ) { say "@m"; }'
1