PerlMonks  

Regex to pull out string within parentheses that could contain parentheses

by dpelican (Initiate)
on Jul 09, 2018 at 13:45 UTC ( #1218162=perlquestion )
dpelican has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a way to automate comment generation for some code, and I'm trying to extract the parameters from a function declaration. I came up with the following expression:

 ^(private|public)?\s?(function|report)\s([^()]+)\(([^()]+)?\)(\s(returns)\s\(?([^()]+)\)?)?

The expression worked on almost all functions until the parameters contained parentheses themselves, such as:

function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))

Since the parentheses are important for the variable type they can't be ignored. The same issue occurs with the returns, but it'll be the same fix. What is it that I'm missing to capture those pesky parameters with parentheses?
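The failure described above can be reproduced in a few lines. This is only a sketch: the flat declaration `function add(a int, b int) returns int` is a made-up example, not taken from the real code base. It suggests the cause is the character class `[^()]+`, which by definition cannot cross a parenthesis, so the parameter group stops at `char` and the required `\)` never lines up.

```perl
use strict;
use warnings;
use 5.010;

# The expression from the question, unchanged.
my $re = qr/^(private|public)?\s?(function|report)\s([^()]+)\(([^()]+)?\)(\s(returns)\s\(?([^()]+)\)?)?/;

my %decl = (
    # Hypothetical declaration without nested parentheses.
    flat   => 'function add(a int, b int) returns int',
    # The declaration from the question, with char(6) inside the parameter list.
    nested => 'function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))',
);

my %params;    # label => captured parameter text, or undef when nothing matched
for my $label (sort keys %decl) {
    $params{$label} = $decl{$label} =~ $re ? $4 : undef;
    say "$label: ", defined $params{$label} ? "params = '$params{$label}'" : 'no match';
}
```

Because the expression is anchored with `^`, the nested declaration does not match at all rather than matching partially.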

Thanks!

Replies are listed 'Best First'.
Re: Regex to pull out string within parentheses that could contain parentheses
by hippo (Canon) on Jul 09, 2018 at 14:23 UTC

    Is this the sort of thing you are after? It's a PoC as it stands so feel free to tweak until it delivers what you actually want.

    use strict;
    use warnings;
    use Test::More tests => 6;

    my $text = 'function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))';
    my $re = qr#^(private|public)?\s?(function|report)\s(\w+)\((.+?)\)((?:\s+)(returns)\s\((.+)*?\))?$#;

    ok ($text =~ $re, 'Matched');
    is ($1, undef, '$1 is correct');
    is ($2, 'function', '$2 is correct');
    is ($3, 'convert_wa_date_strings', '$3 is correct');
    is ($4, 'iv_beg string, iv_end string, iv_read_date date, iv_step char(6)', '$4 is correct');
    is ($5, ' returns (date, date, char(1))', '$5 is correct');

Re: Regex to pull out string within parenthesis that could contain parenthesis
by roboticus (Chancellor) on Jul 09, 2018 at 14:28 UTC

    dpelican

    Read perldoc perlre and search for "Recursive subpattern" and you'll find how to handle nested parentheses.
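As a rough illustration of what that section of perlre describes, here is a minimal sketch of a recursive subpattern that matches one balanced parenthesised group. The shortened declaration is made up for the example, and the pattern assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns need Perl 5.10 or later

# One balanced (...) group. (?1) re-enters capture group 1, so the group
# can contain further balanced groups to any depth.
my $balanced = qr{
    (                   # group 1: one balanced parenthesised group
        \(
        (?: [^()]++     # a run of non-parenthesis characters
          | (?1)        # or a nested balanced group
        )*
        \)
    )
}x;

# Shortened, hypothetical declaration in the style of the question.
my $decl = 'function f(iv_step char(6)) returns (date, char(1))';

# In list context with /g, each match contributes its group 1.
my @groups = $decl =~ /$balanced/g;
say for @groups;
```

Note that `(?1)` is an absolute group number, so it only works while this group really is group 1 of the final pattern. When a fragment like this is embedded in a larger regex, a relative reference such as `(?-1)` or a named subpattern is less fragile.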

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Regex to pull out string within parenthesis that could contain parenthesis
by golux (Hermit) on Jul 09, 2018 at 15:45 UTC
    Hi dpelican,

    I think this will do what you need. It uses the recursive subpatterns described in the perlre documentation. By the time I finished getting my example working I saw that roboticus had already mentioned them. (I had never used the recursive regex method, so it was a good learning experience for me).

    Edit:   Fixed some comments (specifically capture group numbering), and captured a little bit more.

    Edit 2: Added output.

    Edit 3: Allow keyword 'report' (somehow missed it the first time).

    #!/usr/bin/perl
    #
    # References:
    #   http://perldoc.perl.org/perlre.html (See section on 'PARNO')
    ##

    use strict;
    use warnings;
    use feature qw( say );
    use Method::Signatures;

    ##################
    ## Main Program ##
    ##################
    my $str = 'private function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))';
    recursive_function_parsing_regex($str);

    #################
    ## Subroutines ##
    #################
    func recursive_function_parsing_regex($str) {
        my $re = qr{
            (                           # Paren group 1 -- full function
                (?:
                    (private|public)    # Paren group 2 -- optional 'private' or 'public'
                    \s+
                )?
                (function|report)       # Paren group 3 -- required 'function' or 'report' keyword
                \s*                     # Optional space after 'function'
                (\w+)                   # Paren group 4 -- function name
                (                       # Paren group 5 -- args in parens
                    \(
                    (                   # Paren group 6 -- contents of parens
                        (?:
                            (?> [^()]+ )    # Non-parens without backtracking
                            |
                            (?5)            # Recurse to start of paren group 5
                        )*
                    )
                    \)
                )
                (?:                     # Optional return value
                    \s+ returns\s*
                    (                   # Paren group 7 -- return args in parens
                        \(
                        (               # Paren group 8 -- return args
                            (?:
                                (?> [^()]+ )    # Non-parens without backtracking
                                |
                                (?7)            # Recurse to start of paren group 7
                            )*
                        )
                        \)
                    )
                )?
            )
        }x;

        if ($str !~ /$re/) {
            say "No match for '$str'";
            return;
        }

        my ($full, $pp, $func, $name, $par, $args, $ret, $rargs) =
            ($1, $2 || "", $3, $4, $5, $6, $7 || "", $8 || "");

        say "Match!";
        say " \$full => '$full'";     # Full expression
        say " \$pp => '$pp'";         # Optional 'private' or 'public' keyword
        say " \$func => '$func'";     # 'function' or 'report' keyword
        say " \$name => '$name'";     # Function name
        say " \$par => '$par'";       # Func args (in parens)
        say " \$args => '$args'";     # Func args (no parens)
        say " \$ret => '$ret'";       # Optional return args (in parens)
        say " \$rargs => '$rargs'";   # Optional return args (no parens)
    }

    Result:

    Match!
     $full => 'private function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))'
     $pp => 'private'
     $func => 'function'
     $name => 'convert_wa_date_strings'
     $par => '(iv_beg string, iv_end string, iv_read_date date, iv_step char(6))'
     $args => 'iv_beg string, iv_end string, iv_read_date date, iv_step char(6)'
     $ret => '(date, date, char(1))'
     $rargs => 'date, date, char(1)'

    say  substr+lc crypt(qw $i3 SI$),4,5
Re: Regex to pull out string within parenthesis that could contain parenthesis (updated)
by AnomalousMonk (Chancellor) on Jul 09, 2018 at 17:58 UTC

    Here's another, more factored example of the use of recursive subpatterns (introduced with Perl version 5.10):

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "use 5.010;
     ;;
     my $s = 'function convert(beg string, end string, read_date date, step char(6)) returns (date, date, char(1))';
     ;;
     my $rx_paren      = qr{ ( [(] (?: [^()]*+ | (?-1))* [)] ) }xms;
     my $rx_identifier = qr{ \w+ }xms;
     ;;
     my $parsed_ok = my @ra = $s =~ m{
         \A \s* (private|public)? \s* (function|report) \s* ($rx_identifier)
         \s* $rx_paren
         \s* ((returns) \s* $rx_paren)?
         \s* \z
         }xms;
     ;;
     if ($parsed_ok) {
       dd @ra;
       }
     else {
       print 'parse failed';
       }
     "
    (
      undef,
      "function",
      "convert",
      "(beg string, end string, read_date date, step char(6))",
      "returns (date, date, char(1))",
      "returns",
      "(date, date, char(1))",
    )

    Update: The (private|public)? \s* sub-expression in the above m// should probably be something like (untested)
        ((?: private | public) \s)? \s*
    because an access keyword such as public consists of the same kind of word characters as the function or report keyword that always follows it, so the two need some delimiter between them; as written, input such as publicfunction would be accepted.
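Since the update above is marked as untested, here is a small sketch that compares the two prefixes. The remainder of the declaration is reduced to a keyword plus a name (`$tail`, a simplification introduced only for this comparison), and the variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;

# Simplified remainder of a declaration: keyword, whitespace, name.
# This is an assumption made to keep the comparison short.
my $tail = qr{ (?: function | report ) \s+ \w+ }xms;

my $loose     = qr{ \A \s* (private|public)? \s* $tail }xms;              # original prefix
my $delimited = qr{ \A \s* ((?: private | public) \s)? \s* $tail }xms;    # suggested prefix

my %accepted;
for my $decl ('public function convert', 'publicfunction convert', 'function convert') {
    $accepted{$decl} = {
        loose     => ($decl =~ $loose     ? 1 : 0),
        delimited => ($decl =~ $delimited ? 1 : 0),
    };
    printf "%-24s loose=%d delimited=%d\n",
        $decl, @{ $accepted{$decl} }{qw(loose delimited)};
}
```

Under these assumptions only the delimited form rejects `publicfunction convert`, while both forms still accept a declaration with no access keyword. The suggested form does capture the trailing whitespace along with the keyword, so the captured value may need trimming.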


    Give a man a fish:  <%-{-{-{-<

      Here's a variation on the above solution, using named recursive subpatterns and named captures.
      Nowadays I write all my non-trivial regexes this way.
      use 5.010;

      my $source = 'function convert(beg string, end string, read_date date, step char(6)) returns (date, date, char(1))';

      my $matched = $source =~ m{
          \A                                          \s*+
          (?<access>  private | public  )?+           \s*+
          (?<keyword> function | report )             \s*+
          (?<name>    (?&identifier)    )             \s*+
          (?<params>  (?&list)          )             \s*+
          (returns \s*+ (?<returns> (?&list) ) )?+    \s*+
          \z

          (?(DEFINE)
              (?<identifier>  [^\W\d]\w*+ )
              (?<list>        [(] [^()]*+ (?: (?&list) [^()]*+ )*+ [)] )
          )
      }xms;

      if ($matched) {
          my %components = %+;
          use Data::Dumper 'Dumper';
          say Dumper \%components;
      }
      else {
          say 'parse failed';
      }
      which outputs:
      $VAR1 = {
          keyword => 'function',
          name    => 'convert',
          params  => '(beg string, end string, read_date date, step char(6))',
          returns => '(date, date, char(1))',
      };

        I had entirely forgotten about named captures and  (?(DEFINE)...) — a much better (regex) approach.


        Give a man a fish:  <%-{-{-{-<

Re: Regex to pull out string within parenthesis that could contain parenthesis
by tobyink (Abbot) on Jul 09, 2018 at 20:06 UTC
Re: Regex to pull out string within parenthesis that could contain parenthesis
by fishy (Pilgrim) on Jul 09, 2018 at 16:03 UTC
Re: Regex to pull out string within parenthesis that could contain parenthesis
by sundialsvc4 (Abbot) on Jul 09, 2018 at 19:55 UTC

    Another possibility to consider is Parse::RecDescent or something along those lines – a true parser.   You can write your grammar so that it looks for function declarations and ignores everything else, and now the solution is truly generalized.   When, inevitably, a programmer did something that your current crop of regular-expressions didn’t recognize (if you caught it), you would find yourself maybe going through more-and-more calisthenics.   Whereas a parser avoids all that.

    I have personal experience with this.   In a past life, I concocted a system that parsed a mess of SAS® source-files, Korn(!) shell scripts, and Tivoli job-files, to construct a data-flow picture of what this old application was actually doing.   Parse::RecDescent frankly astounded me with its ability to do so much of the job “with speed, grace, and style.”   I had prior experience with other parsers based on Bison and Yacc (which are also supported through Perl ...), but this was considerably more flexible, allowing me to write grammars to zero-in on what I needed out of the files without having to create a grammar for all of its irrelevant-to-me twists and turns.

    As the system I was building became more visible, its requirements expanded several times.   But, thanks to the parser and my experience with it at that time, I was able to accommodate those new requests without major re-design.   That positively sold me on it, hence my suggestion.

      Another possibility to consider is Parse::RecDescent or something along those lines – a true parser. ... When ... your current crop of regular-expressions didn’t recognize [something], you would find yourself maybe going through more-and-more calisthenics. Whereas a parser avoids all that.

      I tend to agree that a real parser is a better, certainly a more general and scalable, solution than a regex approach.

      I have personal experience with this. ... I concocted a system that parsed a mess of [stuff] ... Parse::RecDescent frankly astounded me ... I had prior experience with other parsers ..., but this was considerably more flexible, allowing me to write grammars ...

      The OPed problem definition is pretty limited and seems fairly well defined. Would you care to supply an SSCCE (update: I first wrote "SASE" here; I don't know why I thought a Self-Addressed Stamped Envelope would be helpful :) for a parser approach that would address this problem? I think such an example would certainly add to the discussion.
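For readers wondering what a parser approach might look like for the OPed problem, here is a minimal hand-written recursive-descent sketch. It is not Parse::RecDescent and not anyone's code from this thread. The grammar in the comments is guessed from the single example declaration, and every sub and key name is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Assumed grammar (inferred from the one example in the question):
#   decl  := [private|public] (function|report) NAME '(' [param {',' param}] ')'
#            ['returns' ( '(' type {',' type} ')' | type )]
#   param := NAME type
#   type  := NAME [ '(' ... ')' ]        e.g. char(6)
sub parse_decl {
    my ($src) = @_;

    # Tokens are words and the three punctuation marks; anything else is skipped.
    my @tok = $src =~ /\w+|[(),]/g;

    my $peek   = sub { $tok[0] // '' };
    my $next   = sub { shift(@tok) // die "unexpected end of input\n" };
    my $expect = sub {
        my ($want) = @_;
        my $got = $next->();
        $got eq $want or die "expected '$want', got '$got'\n";
        return $got;
    };

    my $type = sub {
        my $t = $next->();
        if ($peek->() eq '(') {              # size suffix such as char(6)
            $t .= $next->();
            $t .= $next->() while $peek->() ne ')';
            $t .= $expect->(')');
        }
        return $t;
    };

    my $list = sub {                         # '(' item {',' item} ')'
        my ($item) = @_;
        my @items;
        $expect->('(');
        until ($peek->() eq ')') {
            push @items, $item->();
            $expect->(',') if $peek->() ne ')';
        }
        $expect->(')');
        return \@items;
    };

    my %decl;
    $decl{access}  = $next->() if $peek->() =~ /\A(?:private|public)\z/;
    $decl{keyword} = $next->();
    $decl{keyword} =~ /\A(?:function|report)\z/ or die "expected function or report\n";
    $decl{name}    = $next->();
    $decl{params}  = $list->(sub {
        my $n  = $next->();
        my $ty = $type->();
        return +{ name => $n, type => $ty };
    });
    if ($peek->() eq 'returns') {
        $next->();
        $decl{returns} = $peek->() eq '(' ? $list->($type) : [ $type->() ];
    }
    die "trailing input: @tok\n" if @tok;
    return \%decl;
}

my $d = parse_decl('function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))');
say "$d->{keyword} $d->{name}";
say "  param $_->{name}: $_->{type}" for @{ $d->{params} };
say "  returns $_" for @{ $d->{returns} };
```

Compared with the regex solutions, the parameters come back as separate name/type pairs, with the commas inside the outer parentheses already resolved. The cost is more code and a grammar that must be kept in step with the real language.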


      Give a man a fish:  <%-{-{-{-<

        However, I do want to point the OP fairly specifically in this fruitful direction. Very carefully explore the possibility of using Parse::RecDescent, letting it drive the show, having been given a grammar that teaches it to “skip over” (without further definition) sections of source-code that are not of interest.

        In several very-key ways, P::RD is not implemented in the same way as are classical parsers.   (Hint:   it is centered on regular expressions at all levels ...)   Plus, it translates your grammar into a Perl source program.   This gives you significant opportunities not found elsewhere.

        “Unfortunately, they fairly-frisked me at the door.”   (It was a [rightfully ...] very security-conscious insurance company.)   So in this case I really don’t have any source code ... ;-) ... that I could specifically offer as an example, and it wouldn’t exactly qualify as a “forum post” even if I did.

        The thing that is both very-unique and very-nice about this particular module is that it actually does all of its magick within the Perl context, generating on-the-fly Perl source-code corresponding to your grammar and then magically executing it.   (Holy Moose, Batman!)   Therefore, you actually can construct grammars that “surf over” blocks of irrelevant source-code in search of the things that matter, without having to write grammars that describe everything in-between.

        I wish that I could feel free to be more specific ... but there are NDAs ...

Re: Regex to pull out string within parenthesis that could contain parenthesis
by sundialsvc4 (Abbot) on Jul 10, 2018 at 03:07 UTC

    When I referred to “NDAs,” I was referring to source-code, written for a previous job, that I either no longer possess or cannot disclose. Whereas in most cases I have been able to finish the job with an archive of code that I created for that job, this is not one of those cases. Therefore, I cannot provide examples ... and, even if I could, they would not exactly be of the sort that I could meaningfully enclose within <code> tags for the edification of the present audience.

    While I acknowledge the possible applicability of Text::Balanced, and do fully understand this module, I personally do not believe that “it is the approach that I would select.”   In my opinion, the most-appropriate description of a problem of this kind is found (only) in a grammar, which expresses the input-text in question as an expression of a programming language and which facilitates the use of a tool that parses the content in these terms.   “Balanced text” is a less-generalized approach which “might work just fine, 95% of the time.”   But, under which, “you might never know of the other 5%.”
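For comparison, this is roughly what the Text::Balanced route mentioned above could look like on the declaration from the question. It is a sketch: it assumes the parameter list starts at the first `(` and that an optional `returns` clause follows immediately. The variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;
use Text::Balanced qw(extract_bracketed);

my $decl = 'function convert_wa_date_strings(iv_beg string, iv_end string, iv_read_date date, iv_step char(6)) returns (date, date, char(1))';

# Split the text at the first '(' with a plain regex.
my ($head, $rest) = $decl =~ /\A([^(]*)(.*)\z/s;

# In list context extract_bracketed returns the extracted text, the
# remainder and the skipped prefix, and leaves its argument untouched.
my ($params, $after) = extract_bracketed($rest, '()');
$params //= '';    # nothing balanced was found

# If a returns clause follows, extract its balanced group the same way.
my $returns = '';
if (length $params and $after =~ s/\A\s*returns\s*//) {
    ($returns) = extract_bracketed($after, '()');
    $returns //= '';
}

say "head:    $head";
say "params:  $params";
say "returns: $returns";
```

This only balances the delimiters. Splitting the parameter list into individual name/type pairs would still need a further step.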

    Now, here is where I am going to retreat: “this-here worked for me, and I think it worked amazingly well, therefore I stuck my head out from above the PerlMonks trenches and dared to suggest it.” But now I do admit that I am beginning to regret that decision. Do you folks really just want to hear yourselves talk? This site used to be a forum.

      Aaaaaaaaaand I'm back to ignoring sundialsvc4.


      Give a man a fish:  <%-{-{-{-<
