PerlMonks  

How to know that a regexp matched, and get its capture groups?

by Anonymous Monk
on Jan 09, 2023 at 19:31 UTC ( #11149467=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Salutations, dearest monks.

I am trying to write a parsery sort of thing, but ran into a Perl syntax problem. This is the layout of the code I have:

for my $syn (@syntax) {
    my ($re, $cb) = @$syn;
    if (my (@matches) = ($line =~ $re)) {
        $cb->(@matches);
        last;
    }
}

As you can see, I have an array of possible syntax elements (@syntax) that each houses a regexp ($re) and then a callback function ($cb). The callback function expects the regexp's capture groups as arguments.

It then occurred to me that the code won't run if the regexp has no capture groups!

I can, of course, say if ($line =~ $re) { ... } but then I lose the captures. I need the captures.

I need a) to know that the regexp matched, and b) the capture groups returned by the regexp.

What to do? Is there a syntax that allows both? Do I run my regexp twice? Or do I just add a dummy capture group into every regexp?

Replies are listed 'Best First'.
Re: How to know that a regexp matched, and get its capture groups?
by tybalt89 (Monsignor) on Jan 09, 2023 at 19:44 UTC
    if ( $line =~ $re ) {
        $cb->(@{^CAPTURE});
        last;
    }
      Note that you need 5.26+ for @{^CAPTURE}.
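To make the suggestion concrete, here is a minimal sketch of the original loop rewritten around @{^CAPTURE}. The rule table, the sample lines, and the @seen array are hypothetical and exist only for illustration; the point is that a pattern without capture groups hands the callback an empty list rather than a stray 1.

```perl
use v5.26;    # @{^CAPTURE} needs Perl 5.26 or later; this also enables strict
use warnings;

# Hypothetical rule table: [ regexp, callback ] pairs, as in the question.
my @seen;
my @syntax = (
    [ qr/^(\w+)=(\d+)$/, sub { push @seen, [ assign => @_ ] } ],
    [ qr/^quit$/,        sub { push @seen, [ quit   => @_ ] } ],
);

for my $line ('x=42', 'quit') {
    for my $syn (@syntax) {
        my ($re, $cb) = @$syn;
        if ($line =~ $re) {
            # @{^CAPTURE} holds ($1, $2, ...) of the last successful match
            # and is empty when the pattern has no capture groups.
            $cb->(@{^CAPTURE});
            last;
        }
    }
}

say join ' ', @$_ for @seen;
```

The first rule receives the two captures, the second receives no arguments at all.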

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        update

        I don't recommend the following code anymore

        rather

        if ( my (@caps) = ($line =~ $re) ) {
            no warnings 'uninitialized';
            @caps = () if $caps[0] ne $1;    # reset pseudo captures
            $cb->(@caps);
            last;
        }
        /update

        This should be backward compatible

        my (@matches) = ($line =~ $re);
        if (defined $&) {
            $cb->(@matches);
            last;
        }

        # tests...

        use v5.12;
        use warnings;

        for my $str ("AB","") {
            say "****** str=<$str>";
            for my $re ( qr/../, qr/(.)(.)/, q/XY/, q/(X)Y/, q// ) {
                say "--- re=<$re>";
                my @captures = $str =~ $re;
                if ( defined $& ) { say "matched" } else { say "no match" }
                if (defined $1) {
                    say "with captures <@captures>";
                } else {
                    say "no captures";
                }
            }
        }

        ****** str=<AB>
        --- re=<(?^u:..)>
        matched
        no captures
        --- re=<(?^u:(.)(.))>
        matched
        with captures <A B>
        --- re=<XY>
        no match
        no captures
        --- re=<(X)Y>
        no match
        no captures
        --- re=<>
        matched
        no captures
        ****** str=<>
        --- re=<(?^u:..)>
        no match
        no captures
        --- re=<(?^u:(.)(.))>
        no match
        no captures
        --- re=<XY>
        no match
        no captures
        --- re=<(X)Y>
        no match
        no captures
        --- re=<>
        matched
        no captures

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

Re: How to know that a regexp matched, and get its capture groups?
by Corion (Patriarch) on Jan 09, 2023 at 19:36 UTC

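    The offsets of the captures also live in the @- and @+ arrays (start and end of each group, respectively), so the captured text can be recovered from them. See perlvar on them.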
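As a rough illustration of how @- and @+ can be used, the sketch below rebuilds the captures with substr from the start and end offsets. The sample string and pattern are made up for the example.

```perl
use strict;
use warnings;

my $line = 'key = value';    # hypothetical input
my @caps;

if ($line =~ /(\w+)\s*=\s*(\w+)/) {
    # $#+ is the number of capture groups in the last successful match.
    for my $i (1 .. $#+) {
        # $-[$i] is the start offset of group $i, $+[$i] its end offset.
        # A group that did not take part in the match has undefined offsets.
        push @caps, defined $-[$i]
            ? substr($line, $-[$i], $+[$i] - $-[$i])
            : undef;
    }
}

print "@caps\n";
```

For a pattern without groups the loop body never runs, so @caps stays empty.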

Re: How to know that a regexp matched, and get its capture groups?
by NERDVANA (Hermit) on Jan 09, 2023 at 22:56 UTC
    As tybalt89 wrote, @{^CAPTURE} is what you're looking for, but don't forget named captures and %+. From the perlvar documentation:

    For example, $+{foo} is equivalent to $1 after the following match:

    'foo' =~ /(?<foo>foo)/;
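A small sketch of named captures in practice, assuming a made-up "name=value" input. Copying %+ into an ordinary hash right after the match keeps the values around even if a later match overwrites the capture variables.

```perl
use strict;
use warnings;

my %pair;
if ('width=80' =~ /^(?<name>\w+)=(?<value>\d+)$/) {
    # %+ maps each named group to its captured text; copy it while
    # the match is still the most recent successful one.
    %pair = %+;
}

print "$pair{name} -> $pair{value}\n";
```

A callback could then take a hash of named fields instead of a positional list, which tends to be easier to maintain once patterns grow.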

    The next cool feature of perl for parsing that you should probably be aware of is "pos" and "\G" and the /c regex switch. As it happens, you're in luck, because David Raab just wrote a blog post fully explaining it! (just saw that in Perl Weekly email earlier today)
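For readers who have not seen pos, \G and /c together, here is a minimal tokenizer sketch. The token kinds and the input string are assumptions chosen for the example; it is not taken from the blog post mentioned above.

```perl
use strict;
use warnings;

my $input = 'foo 42 bar';    # hypothetical input
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    # \G anchors each attempt at the current pos($input);
    # /g advances pos on success, /c keeps pos unchanged on failure.
    if    ($input =~ /\G\s+/gc)      { next }
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "num:$1" }
    elsif ($input =~ /\G([a-z]+)/gc) { push @tokens, "word:$1" }
    else  { die "Syntax error at offset " . pos($input) }
}

print "@tokens\n";
```

Each branch consumes at least one character, so the loop is guaranteed to either advance or die.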

    And if that wasn't enough, along your parsing journey you might discover it's a bit slow to iterate through a bunch of @syntax items at each point along the parse (as in dozens or more; fewer than 10 is probably fine the way you are doing it). When you come to this problem, the solution is to dynamically build a string of code that looks like this:

    sub {
      /\G (?:
          ... (?{ code1(...); })  # pattern 1, handler for pattern 1
        | ... (?{ code2(...); })  # pattern 2, handler for pattern 2
        | ... (?{ code3(...); })  # and so on
      )/gcx;
    }
    You then need to eval that to ensure perl compiles it. (qr// notation is not guaranteed to compile it, and usually doesn't)
    sub parse {
      my $input= shift;
      my $code= ... # assemble regex sub text like above
      my $lexer= eval $code
        or die "BUG: syntax error in generated code: $@";
      local $_= $input;
      &$lexer || die "Syntax error at '" . substr($_, pos, 10) . "'"
        while pos < length;
    }
    and then you've reached about the highest performance Perl can give you for parsing! The final speedup is to let perl do the looping for you by putting (...)++ on the regex you built (++ ensures that perl doesn't try to backtrack), but then you lose the ability to stop the loop and it runs until all input is exhausted.
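The following is one possible way to flesh out the generated-lexer idea, offered as a sketch rather than as the code the reply had in mind. The @rules table, the emit helper, and the use of a branch reset group (?| ... ) so that every alternative sees its first group as $1 are all assumptions made for this example.

```perl
use strict;
use warnings;

my @tokens;
sub emit { push @tokens, join ':', @_ }    # hypothetical handler

# Hypothetical rules: [ pattern source, handler source ].
my @rules = (
    [ '(\d+)',    'emit(num  => $1)' ],
    [ '([a-z]+)', 'emit(word => $1)' ],
    [ '\s+',      '1' ],                   # skip whitespace
);

# Assemble the source of one sub holding one big alternation.
# (?| ... ) is a branch reset: group numbering restarts in each branch.
my $code = 'sub { /\G (?| '
    . join(' | ', map { "$_->[0] (?{ $_->[1]; })" } @rules)
    . ' )/gcx }';

# String eval compiles the regex and its embedded code blocks once.
my $lexer = eval $code
    or die "BUG: syntax error in generated code: $@";

local $_ = 'foo 42 bar';
pos($_) = 0;
while (pos($_) < length $_) {
    $lexer->() or die "Syntax error at '" . substr($_, pos($_), 10) . "'";
}

print "@tokens\n";
```

Note that a (?{ ... }) block runs as soon as the regex engine reaches it, so handlers with side effects are best placed at the very end of each alternative, as here, where nothing can fail afterwards.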
Re: How to know that a regexp matched, and get its capture groups? (updated)
by haukex (Archbishop) on Jan 10, 2023 at 10:08 UTC

    The others have already given you some ideas for better parsing. However, you're mistaken on the premise of the question:

    It then occurred to me that the code won't run if the regexp has no capture groups! ... I need the captures.

    The code will still run: a regex match in list context, without /g and without capture groups, returns the list (1) if it matched, so the assignment evaluates to true.

    use warnings;
    use strict;
    use Data::Dump;

    my @syntax = ( [qr/cd/, sub { dd "callback", \@_ }] );
    my $line = "abcdef";
    for my $syn (@syntax) {
        my ($re, $cb) = @$syn;
        if (my (@matches) = ($line =~ $re)) {
            $cb->(@matches);
            last;
        }
    }
    __END__
    ("callback", [1])

    Update: And $#+ will give you the number of capture groups present in the last successful match (documented under @+ in perlvar).
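Putting the two observations together, a sketch like the one below can tell "matched without groups" apart from "matched and captured the string 1". The helper name match_captures is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns undef on no match, otherwise an array
# reference holding the real captures (empty if the pattern has none).
sub match_captures {
    my ($line, $re) = @_;
    my @caps = $line =~ $re
        or return;              # no match
    @caps = () if $#+ == 0;     # no groups: drop the placeholder (1)
    return \@caps;
}

my $plain   = match_captures('abcdef', qr/cd/);
my $grouped = match_captures('abcdef', qr/(c)(d)/);
my $failed  = match_captures('abcdef', qr/xy/);

printf "plain: %d, grouped: %s, failed: %s\n",
    scalar @$plain, "@$grouped", defined $failed ? 'matched' : 'no match';
```

In the original loop, the callback would then be invoked with @$caps whenever the helper returns a defined value.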

Re: How to know that a regexp matched, and get its capture groups?
by GrandFather (Saint) on Jan 09, 2023 at 21:09 UTC

    Parsers can be tricky. You may be interested in looking at a parsing tool such as Marpa::R2 to do most of the heavy lifting for you so that you can concentrate on syntax and output from the parser.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: How to know that a regexp matched, and get its capture groups?
by LanX (Sage) on Jan 09, 2023 at 20:34 UTC
    I'm confused...

    > It then occurred to me that the code won't run if the regexp has no capture groups!

    Why do you use the if if you don't want to test if there was no @match? (empty list-assignments are false)

    and if you need the if b/c you wanna handle a no-match separately, why don't you use an else branch?

    edit

    OK, I think your problem is that $re can match (and hence be true) without any internal syntax defining (capture) groups.

    Hence check if (@match) separately; whether you wrap that in a surrounding if for the $re or not depends on the desired logic...

    update

    After some testing, I like tybalt's solution the most: Re: How to know that a regexp matched, and get its capture groups?

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

      > I'm confused...

      I need to find the first regexp that is stored in @syntax that matches, and discard the rest. When one matches, I need the capture groups, if any.

      But I'll take the other answer, that a regexp returns true even without capture groups. I could not quickly find any mention of return values in either perlre or perlsyn, both rather hefty pages, before asking. Was probably looking in the wrong place anyway... Ah, yes, I found it in perlop now:

      Matching in list context

      If the "/g" option is not used, "m//" in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3...) (Note that here $1 etc. are also set). When there are no parentheses in the pattern, the return value is the list "(1)" for success. With or without parentheses, an empty list is returned upon failure.

      I didn't expect this many answers, to be honest...

        > I didn't expect this many answers, to be honest...

        Because the problem is not easy to grasp; normally one knows beforehand whether captures are expected.

        And the documentation is accurate.

        My last solution here should fix the fake capture issue in a straightforward way, without any performance or version penalty.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

Re: How to know that a regexp matched, and get its capture groups?
by ikegami (Patriarch) on Jan 10, 2023 at 13:56 UTC

    It then occurred to me that the code won't run if the regexp has no capture groups!

    It will run. The match returns 1 in such circumstances.

    $ perl -Mv5.14 -e'if ( my @m = "b" =~ /(a)/ ) { say "@m"; }'

    $ perl -Mv5.14 -e'if ( my @m = "a" =~ /(a)/ ) { say "@m"; }'
    a

    $ perl -Mv5.14 -e'if ( my @m = "b" =~ /a/ ) { say "@m"; }'

    $ perl -Mv5.14 -e'if ( my @m = "a" =~ /a/ ) { say "@m"; }'
    1
