PerlMonks  

Regex to pull out string within parenthesis that could contain parenthesis

by dpelican (Initiate)
on Jul 09, 2018 at 13:45 UTC ( #1218162=perlquestion )
dpelican has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a way to automate comment generation for some code that I'm working on and I'm trying to extract parameters from a function declaration. I came up with the following expression:

 ^(private|public)?\s?(function|report)\s([^()]+)\(([^()]+)?\)(\s(returns)\s\(?([^()]+)\)?)?

The expression worked on almost all functions until the parameters contained parentheses themselves, such as:

function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))

Since the parentheses are important for the variable type they can't be ignored. The same issue occurs with the returns, but it'll be the same fix. What is it that I'm missing to capture those pesky parameters with parentheses?
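To make the failure concrete, here is a small sketch that runs the expression above against two declarations. The first declaration (add_days) is a made-up example with no nested parentheses; the second is the one quoted in the question. Because every character class in the expression is [^()], the parameter capture can never step over the inner "(6)", so the whole match fails rather than returning a partial result.

```perl
use strict;
use warnings;

# The expression from the question, unchanged.
my $re = qr/^(private|public)?\s?(function|report)\s([^()]+)\(([^()]+)?\)(\s(returns)\s\(?([^()]+)\)?)?/;

# Hypothetical flat declaration: no parentheses inside the parameter list.
my $flat = 'function add_days(iv_date date, iv_days integer) returns (date)';

# The declaration from the question, with char(6) nested in the parameters.
my $nested = 'function convert_wa_date_strings(iv_beg string, iv_end string, '
           . 'iv_read_date date, iv_step char(6)) returns (date, date, char(1))';

my @flat_caps = $flat =~ $re;              # captures $1..$7, empty list on failure
my $nested_ok = $nested =~ $re ? 1 : 0;    # 1 if the expression matched at all

print "flat params:  $flat_caps[3]\n";
print "flat returns: $flat_caps[6]\n";
print "nested match: ", ($nested_ok ? 'yes' : 'no'), "\n";
```

The flat declaration yields its parameter list and return type, while the nested one reports no match.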

Thanks!

Replies are listed 'Best First'.
Re: Regex to pull out string within parentheses that could contain parentheses
by hippo (Canon) on Jul 09, 2018 at 14:23 UTC

    Is this the sort of thing you are after? It's a PoC as it stands so feel free to tweak until it delivers what you actually want.

        use strict;
        use warnings;
        use Test::More tests => 6;

        my $text = 'function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))';
        my $re = qr#^(private|public)?\s?(function|report)\s(\w+)\((.+?)\)((?:\s+)(returns)\s\((.+)*?\))?$#;
        ok ($text =~ $re, 'Matched');
        is ($1, undef, '$1 is correct');
        is ($2, 'function', '$2 is correct');
        is ($3, 'convert_wa_date_strings', '$3 is correct');
        is ($4, 'iv_beg string, iv_end string, iv_read_date date, iv_step char(6)', '$4 is correct');
        is ($5, ' returns (date, date, char(1))', '$5 is correct');

Re: Regex to pull out string within parenthesis that could contain parenthesis
by roboticus (Chancellor) on Jul 09, 2018 at 14:28 UTC

    dpelican

    Read perldoc perlre and search for "Recursive subpattern" and you'll find how to handle nested parentheses.
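As a rough illustration of what that section of perlre describes, the sketch below uses a single recursive group to pick out balanced parenthesised chunks. The declaration string and the variable names are made up for the example; only the (?1) recursion construct itself comes from perlre.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns need Perl 5.10 or later

# Group 1 matches one balanced parenthesised chunk: an opening paren, then any
# mix of non-paren runs and nested balanced chunks (the (?1) recursion back
# into group 1), then the closing paren.
my $balanced = qr{ ( \( (?: [^()]++ | (?1) )* \) ) }x;

# Hypothetical declaration with nesting in both parameters and return types.
my $decl = 'function f(iv_step char(6), iv_n integer) returns (date, char(1))';

# With /g in list context, each successive match contributes its group 1.
my @chunks = $decl =~ /$balanced/g;
say for @chunks;
```

One caveat when reusing such a pattern inside a larger regex: (?1) counts groups from the start of the whole expression, so a relative reference such as (?-1) or a named subpattern is usually the safer choice there.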

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Regex to pull out string within parenthesis that could contain parenthesis
by golux (Hermit) on Jul 09, 2018 at 15:45 UTC
    Hi dpelican,

    I think this will do what you need. It uses the recursive subpatterns described in the perlre documentation. By the time I finished getting my example working I saw that roboticus had already mentioned them. (I had never used the recursive regex method, so it was a good learning experience for me).

    Edit:   Fixed some comments (specifically capture group numbering), and captured a little bit more.

    Edit 2: Added output.

    Edit 3: Allow keyword 'report' (somehow missed it the first time).

        #!/usr/bin/perl
        #
        # References:
        #   http://perldoc.perl.org/perlre.html  (See section on 'PARNO')
        ##

        use strict;
        use warnings;
        use feature qw( say );
        use Method::Signatures;

        ##################
        ## Main Program ##
        ##################
        my $str = 'private function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))';
        recursive_function_parsing_regex($str);

        #################
        ## Subroutines ##
        #################
        func recursive_function_parsing_regex($str) {
            my $re = qr{
                (                           # Paren group 1 -- full function
                    (?:
                        (private|public)    # Paren group 2 -- optional 'private' or 'public'
                        \s+)?
                    (function)              # Paren group 3 -- required 'function' keyword
                    \s*                     # Optional space after 'function'
                    (\w+)                   # Paren group 4 -- function name
                    (                       # Paren group 5 -- args in parens
                        \(
                            (               # Paren group 6 -- contents of parens
                                (?:
                                    (?> [^()]+ )    # Non-parens without backtracking
                                |
                                    (?5)            # Recurse to start of paren group 5
                                )*
                            )
                        \)
                    )
                    (?:                     # Optional return value
                        \s+ returns\s*
                        (                   # Paren group 7 -- return args in parens
                            \(
                                (           # Paren group 8 -- return args
                                    (?:
                                        (?> [^()]+ )    # Non-parens without backtracking
                                    |
                                        (?7)            # Recurse to start of paren group 7
                                    )*
                                )
                            \)
                        )
                    )?
                )
            }x;

            if ($str !~ /$re/) {
                say "No match for '$str'";
                return;
            }

            my ($full, $pp, $func, $name, $par, $args, $ret, $rargs) =
                ($1, $2 || "", $3, $4, $5, $6, $7 || "", $8 || "");
            say "Match!";
            say " \$full => '$full'";     # Full expression
            say " \$pp => '$pp'";         # Optional 'private' or 'public' keyword
            say " \$func => '$func'";     # 'function' keyword
            say " \$name => '$name'";     # Function name
            say " \$par => '$par'";       # Func args (in parens)
            say " \$args => '$args'";     # Func args (no parens)
            say " \$ret => '$ret'";       # Optional return args (in parens)
            say " \$rargs => '$rargs'";   # Optional return args (no parens)
        }

    Result:

        Match!
         $full => 'private function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))'
         $pp => 'private'
         $func => 'function'
         $name => 'convert_wa_date_strings'
         $par => '(iv_beg string, iv_end string, iv_read_date date, iv_step char(6))'
         $args => 'iv_beg string, iv_end string, iv_read_date date, iv_step char(6)'
         $ret => '(date, date, char(1))'
         $rargs => 'date, date, char(1)'

    say  substr+lc crypt(qw $i3 SI$),4,5
Re: Regex to pull out string within parenthesis that could contain parenthesis (updated)
by AnomalousMonk (Chancellor) on Jul 09, 2018 at 17:58 UTC

    Here's another, more factored example of the use of recursive subpatterns (introduced with Perl version 5.10):

        c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
        "use 5.010;
         ;;
         my $s = 'function convert(beg string, end string, read_date date, step char(6)) returns (date, date, char(1))';
         ;;
         my $rx_paren = qr{ ( [(] (?: [^()]*+ | (?-1))* [)] ) }xms;
         my $rx_identifier = qr{ \w+ }xms;
         ;;
         my $parsed_ok = my @ra = $s =~ m{
             \A \s*
             (private|public)? \s* (function|report) \s* ($rx_identifier) \s* $rx_paren
             \s* ((returns) \s* $rx_paren)? \s*
             \z
             }xms;
         ;;
         if ($parsed_ok) {
           dd @ra;
           }
         else {
           print 'parse failed';
           }
         "
        (
          undef,
          "function",
          "convert",
          "(beg string, end string, read_date date, step char(6))",
          "returns (date, date, char(1))",
          "returns",
          "(date, date, char(1))",
        )

    Update: The  (private|public)? \s* sub-expression in the above  m// should probably be something like (untested)
        ((?: private | public) \s)? \s*
    because, as written, the whitespace after  private or  public is optional, so a run-together string such as  publicfunction would still match. The access keyword needs some delimiter before the  function or  report keyword that always follows it.
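A quick way to check that suggestion is to compare the two prefixes side by side. The sketch below only tests the part of the pattern up to the function name, and the three input strings are made up for the comparison. Note that with the suggested form, capture group 1 includes the trailing whitespace character; wrapping the keyword as (?: (private|public) \s+ )? instead, as in golux's reply, would keep the capture clean.

```perl
use strict;
use warnings;
use 5.010;

# Prefix of the pattern as posted: whitespace after the access keyword is optional.
my $loose = qr{ \A \s* (private|public)? \s* (function|report) \s* (\w+) }xms;

# Prefix with the suggested change: the access keyword must be followed by whitespace.
my $strict = qr{ \A \s* ((?: private | public) \s)? \s* (function|report) \s* (\w+) }xms;

# Record, per input string, whether each variant matches (1) or not (0).
my %verdict;
for my $decl ('public function convert', 'publicfunction convert', 'function convert') {
    $verdict{$decl} = {
        loose  => ($decl =~ $loose  ? 1 : 0),
        strict => ($decl =~ $strict ? 1 : 0),
    };
    printf "%-26s loose=%d strict=%d\n", "'$decl'", @{ $verdict{$decl} }{qw(loose strict)};
}
```

Only the run-together input is treated differently: the loose prefix accepts it and the strict prefix rejects it.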


    Give a man a fish:  <%-{-{-{-<

      Here's a variation on the above solution, using named recursive subpatterns and named captures.
      Nowadays I write all my non-trivial regexes this way.
          use 5.010;

          my $source = 'function convert(beg string, end string, read_date date, step char(6)) returns (date, date, char(1))';

          my $matched = $source =~ m{
              \A \s*+
              (?<access>  private | public  )?+  \s*+
              (?<keyword> function | report )    \s*+
              (?<name>    (?&identifier)    )    \s*+
              (?<params>  (?&list)          )    \s*+
              (returns \s*+ (?<returns> (?&list) ) )?+  \s*+
              \z

              (?(DEFINE)
                  (?<identifier>  [^\W\d]\w*+ )
                  (?<list>        [(]  [^()]*+  (?: (?&list) [^()]*+ )*+  [)] )
              )
          }xms;

          if ($matched) {
              my %components = %+;
              use Data::Dumper 'Dumper';
              say Dumper \%components;
          }
          else {
              say 'parse failed';
          }
      which outputs:
          $VAR1 = {
              keyword => 'function',
              name    => 'convert',
              params  => '(beg string, end string, read_date date, step char(6))',
              returns => '(date, date, char(1))',
          };

        I had entirely forgotten about named captures and  (?(DEFINE)...) — a much better (regex) approach.


        Give a man a fish:  <%-{-{-{-<

Re: Regex to pull out string within parenthesis that could contain parenthesis
by tobyink (Abbot) on Jul 09, 2018 at 20:06 UTC
Re: Regex to pull out string within parenthesis that could contain parenthesis
by fishy (Pilgrim) on Jul 09, 2018 at 16:03 UTC
Re: Regex to pull out string within parenthesis that could contain parenthesis
by sundialsvc4 (Abbot) on Jul 09, 2018 at 19:55 UTC

    Another possibility to consider is Parse::RecDescent or something along those lines – a true parser. You can write your grammar so that it looks for function declarations and ignores everything else, and now the solution is truly generalized. When, inevitably, a programmer did something that your current crop of regular-expressions didn’t recognize (if you caught it), you would find yourself maybe going through more-and-more calisthenics. Whereas a parser avoids all that.

    I have personal experience with this. In a past life, I concocted a system that parsed a mess of SAS® source-files, Korn(!) shell scripts, and Tivoli job-files, to construct a data-flow picture of what this old application was actually doing. Parse::RecDescent frankly astounded me with its ability to do so much of the job “with speed, grace, and style.” I had prior experience with other parsers based on Bison and Yacc (which are also supported through Perl ...), but this was considerably more flexible, allowing me to write grammars to zero-in on what I needed out of the files without having to create a grammar for all of its irrelevant-to-me twists and turns.

    As the system I was building became more visible, its requirements expanded several times. But, thanks to the parser and my experience with it at that time, I was able to accommodate those new requests without major re-design. That positively sold me on it, hence my suggestion.

      Another possibility to consider is Parse::RecDescent or something along those lines – a true parser. ... When ... your current crop of regular-expressions didn’t recognize [something], you would find yourself maybe going through more-and-more calisthenics. Whereas a parser avoids all that.

      I tend to agree that a real parser is a better, certainly a more general and scalable, solution than a regex approach.

      I have personal experience with this. ... I concocted a system that parsed a mess of [stuff] ... Parse::RecDescent frankly astounded me ... I had prior experience with other parsers ..., but this was considerably more flexible, allowing me to write grammars ...

      The OPed problem definition is pretty limited and seems fairly well defined. Would you care to supply an SASE | SSCCE (update: I don't know why I thought a Self-Addressed Stamped Envelope would be helpful here :) for a parser approach that would address this problem? I think such an example would certainly add to the discussion.
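No parser example appears in the thread. As a very rough stand-in, and explicitly not Parse::RecDescent or anything sundialsvc4 described, the sketch below shows the core idea a hand-written scanner would use: walk the parameter list character by character and track nesting depth, splitting only on commas at depth zero. The helper name split_top_level and the decimal(10,2) parameter are hypothetical, chosen to show a comma inside parentheses.

```perl
use strict;
use warnings;
use 5.010;

# split_top_level is a hypothetical helper: it splits a parameter list on the
# commas that sit outside any parentheses, by tracking nesting depth.
sub split_top_level {
    my ($list) = @_;
    my @items;
    my $depth   = 0;
    my $current = '';

    for my $ch (split //, $list) {
        if    ($ch eq '(') { $depth++ }
        elsif ($ch eq ')') { $depth-- }

        if ($ch eq ',' && $depth == 0) {
            push @items, $current;    # a top-level comma ends the current item
            $current = '';
        }
        else {
            $current .= $ch;          # everything else belongs to the item
        }
    }
    push @items, $current if length $current;

    s/^\s+|\s+$//g for @items;        # trim surrounding whitespace in place
    return @items;
}

# Hypothetical parameter list; the comma inside decimal(10,2) must not split.
my @params = split_top_level('iv_beg string, iv_amt decimal(10,2), iv_step char(6)');
say for @params;
```

This does no error handling (unbalanced parentheses are not reported), which is exactly the kind of bookkeeping a real grammar-based parser would take over.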


      Give a man a fish:  <%-{-{-{-<
