Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Help modifying recursive regex

by beeflobill (Initiate)
on May 07, 2013 at 18:41 UTC ( [id://1032522]=perlquestion: print w/replies, xml ) Need Help??

beeflobill has asked for the wisdom of the Perl Monks concerning the following question:

A long time ago I worked really hard and wrote a piece of recursive regex code which I'm trying to modify now. It takes text with nested parentheses and returns a list with the "top" groups of parentheses.

Here is an example which shows what the regex does:
Input: "A (B C D)"
Output: "B C D"

Input: "A (B C D) (E F G)"
Output: "B C D", "E F G"

Input: "A B (C (D E) (F G)) (H I)"
Output: "C (D E) (F G)", "H I"

Even when I wrote the code I knew I would really, really want to keep the data that wasn't in parentheses too, but I couldn't figure out how to do it. Now I'm back on the problem.

Here is an example of what I am trying to get:
Input: "A (B C D)"
Output: "A", "B C D"

Input: "A (B C D) (E F G)"
Output: "A", "B C D", "E F G"

Input: "A B (C (D E) (F G)) (H I)"
Output: "A", "B", "C (D E) (F G)", "H I"

I still don't know how to do that. I've provided a test script demonstrating the subroutine that does what I'm describing. May I supplicate, how might I modify this to do what I need? Or, is there simply a better way to approach this problem? Thanks.

#!/usr/bin/perl -w use strict; sub strip_paren { my $input = $_[0]; my $re; $re = qr/ (?: \( ( (?: [^()]+ | (??{$re}) )+ ) \) ) | \(\) /x; my @output = $input =~ /$re/g; return @output; } sub print_crud { print shift, "\n"; foreach my $line (@_) { print " $line\n"; } print "\n"; } my $string = "A (B C D)"; print_crud($string, strip_paren($string)); $string = "A (B C D) (E F G)"; print_crud($string, strip_paren($string)); $string = "A B (C (D E) (F G)) (H I)"; print_crud($string, strip_paren($string));

Replies are listed 'Best First'.
Re: Help modifying recursive regex
by hdb (Monsignor) on May 07, 2013 at 19:11 UTC

    Check the following Re^2: splitting data. Changing the code slightly will give you trees based on levels of parantheses.

    use strict; use warnings; while(my $input = <DATA>){ print "input: $input\n"; my $level = 0; my $tab = "| "; my %action = ( '(' => sub { print "\n", $tab x ++$level, shift }, ')' => sub { print "\n", $tab x $level--, shift }, 'default' => sub { print shift }, ); ( $action{$_} // $action{'default'} )->($_) for $input =~ /./g; print "\n\n\n"; } __DATA__ A (B C D) A (B C D) (E F G) A B (C (D E) (F G)) (H I)

    with the following output

    input: A (B C D) A | (B C D | ) input: A (B C D) (E F G) A | (B C D | ) | (E F G | ) input: A B (C (D E) (F G)) (H I) A B | (C | | (D E | | ) | | (F G | | ) | ) | (H I | )

    not based on regex though... but if you now replace parantheses on level 1 with quotes and put quotes around words on level 0 you should have what you want. Just change the subs in %action as required.

Re: Help modifying recursive regex
by choroba (Cardinal) on May 07, 2013 at 20:06 UTC
    Not using a regex, but a simple automaton:
    #!/usr/bin/perl use warnings; use strict; use Test::More; sub groups { my $string = shift; my $depth = 0; my @return = (q()); for my $char (split //, $string) { $depth++ if '(' eq $char; $depth-- if ')' eq $char; if ((2 > $depth and qw/) (/[$depth] eq $char) or (' ' eq $char and ! $depth)) { push @return, q(); } else { $return[-1] .= $char; } } s/^ *| *$//g for @return; # trim return [ grep length, @return]; } is_deeply(groups("A (B C D)"), ["A", "B C D"]); is_deeply(groups("A (B C D) (E F G)"), ["A", "B C D", "E F G"]); is_deeply(groups("A B (C (D E) (F G)) (H I)"), ["A", "B", "C (D E) (F G)", "H I"]); is_deeply(groups("A B (C (D E) (F G)) X Y (H I)"), ["A", "B", "C (D E) (F G)", "X", "Y", "H I"]); done_testing();
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Help modifying recursive regex
by gurpreetsingh13 (Scribe) on May 08, 2013 at 05:25 UTC
    Yes, splitting is the simple solution.
    sub get_groups { my @groups = (); my $level = 0; my $group = ""; #Keep on increasing the level once you find the opening brace. #After coming out of loop, level should be again 0 foreach my $val ( split //, shift ) { next if $val =~/\s/; if ( $level == 0 && !($val ~~ [ "(", ")" ]) ) { push @groups, +$val; next; } if ( $level != 0 && !($val ~~ [ "(", ")" ]) ) { $group.=$val; +next; } if ( $val eq "(" && $level==0) { $level++;$group = ""; next; } if ( $val eq "(" && $level!=0) { $level++;$group .= "("; next; + } if ( $val eq ")" && $level==1) {$level--;push @groups, $group; + $group = ""; next; } if ( $val eq ")" && $level!=1) {$level--;$group .= ")"; next; +} } print Dumper(@groups); } ## ---------- end sub get_groups

    Input:

    my $string = "A (B C D)"; get_groups($string); $string = "A (B C D) (E F G)"; get_groups($string); $string = "A B (C (D E) (F G)) (H I)"; $string = "(A B) M N ((C (D E)) (F G)) (Z(H I)) T R"; get_groups($string);

    Output: $VAR1 = 'A'; $VAR2 = 'BCD'; $VAR1 = 'A'; $VAR2 = 'BCD'; $VAR3 = 'EFG'; $VAR1 = 'AB'; $VAR2 = 'M'; $VAR3 = 'N'; $VAR4 = '(C(DE))(FG)'; $VAR5 = 'Z(HI)'; $VAR6 = 'T'; $VAR7 = 'R';

Re: Help modifying recursive regex
by Laurent_R (Canon) on May 08, 2013 at 13:02 UTC

    Thank you, HDB and Choroba, for your elegant and enlightening solutions.

    I had to do something similar a few weeks ago, and my solution was far more complicated, at least 3 times as much code. I wish I had thought about something similar to your proposals (and hope I will next time). I really learned something from your posts.

Re: Help modifying recursive regex
by Krambambuli (Curate) on May 08, 2013 at 16:49 UTC
    Here's a regexp based solution, with minor tweeks on your code:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; sub strip_paren { my $input = $_[0]; our $re; $re = qr/ ( [^()]+ ) (?: \( ( (?: (??{$re}) )+ )+ \) )? /x; my @output = (); @output = $input =~ /$re/g; return @output; } sub print_crud { print "\nLine: ", shift, "\n"; foreach my $line (@_) { next if not defined $line; print " '$line'\n"; } } while (<DATA>) { chomp; print_crud( $_, strip_paren($_)); } exit; __DATA__ A A B A B C (A B) C A (B) C A (B C D) A(B C D) A (B C D) E A (B C D)E A (B C D) (E F G) A (B C D) (E F G) I A B (C (D E) (F G)) (H I) A B (C (D E) (F G)) (H I) J
    I think it comes close to what you wanted.
Re: Help modifying recursive regex
by vagabonding electron (Curate) on May 09, 2013 at 17:20 UTC

    I have tried this with Text::Balanced and it seems to work:

    #!/usr/bin/perl use strict; use warnings; use Text::Balanced ':ALL'; while ( my $text = <DATA> ) { my @data = extract_multiple( $text, [ \&extract_bracketed, qr/\s+/ ] ); @data = grep { !/^$/ } map {s/^\s+//;s/\s+$//;s/^\(//;s/\)$//;$_} @data; print join ( '---', @data ), "\n"; } __END__ A A B A B C (A B) C A (B) C A (B C D) A(B C D) A (B C D) E A (B C D)E A (B C D) (E F G) A (B C D) (E F G) I A B (C (D E) (F G)) (H I) A B (C (D E) (F G)) (H I) J A B (C (D E) (F G)) (H I) J K

    because it prints:

    A A---B A---B---C A B---C A---B---C A---B C D A---B C D A---B C D---E A---B C D---E A---B C D---E F G A---B C D---E F G---I A---B---C (D E) (F G)---H I A---B---C (D E) (F G)---H I---J A---B---C (D E) (F G)---H I---J---K

    There is an issue though (perhaps due to my lack of deep knowledge of the module). If I put a line print  $@->{error}, "\n" if $@; after the first line in the loop it prints Did not find opening bracket after prefix: "\s*" every time. Is this something to worry about or just to take note of?

    Just my 2 cents.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1032522]
Approved by davies
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2024-03-28 09:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found