JeffR100 has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to find a way to use a regular expression to sort through nested parentheses and return everything within them surroundign a perticular element. Basically, I want to extract everything within the minimum matching parentheses surrounding a particular element. For example, I want to be able to input:
((A,B),C,(D,E)) and the letter C and have the program the program return (A,B),C,(D,E)), but if I enter D, I only want (D,E) returned. Is there a simple way to do this? I have looked at the CPAN modules for matching nested parentheses, but none of them are able to look for a specific string within the parentheses.
Thank You for your help.
Jeff
Re: Parsing nested parentheses
by Abigail-II (Bishop) on Nov 18, 2003 at 19:55 UTC
|
Once you know how to write a regex to match a string with
balanced parens, it's not to hard to write a regex to do what you
want.
How does a string with balanced parens look like? It starts
with an opening paren, then "junk" separated by strings with
balanced parens, where "junk" is a string consisting of
something that isn't a paren. Quite like a delimited string
with escaped delimiters actually.
Now, how do we go from string with balanced parens to the
requested string? Simple. We start with an opening paren,
then junk string separated by strings with balanced parens,
then "C", then junk strings separated by strings with
balanced parens, and finally a closing paren.
So much for text. Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
no warnings qw /syntax/;
my $bal;
$bal = qr /[(]
[^()]*
(?:(??{ $bal }) [^()]* )*
[)]/x;
my $re = qr /[(]
[^()]*
(?:(??{ $bal }) [^()]* )*
C
[^()]*
(?:(??{ $bal }) [^()]* )*
[)]/x;
while (<DATA>) {
chomp;
print "$_: ";
print $& if /$re/;
print "\n";
}
__DATA__
((A, B), C, (D, E))
(((A, B), C, (D, E)))
((A, B), D, (C, E))
((A, C), D, (B, E))
(((A, C)), D, (B, E))
((A, (C)), D, (B, E))
A, B, C, D, E
((A, B), F, (D, E))
((A, B), C, (D, E)): ((A, B), C, (D, E))
(((A, B), C, (D, E))): ((A, B), C, (D, E))
((A, B), D, (C, E)): (C, E)
((A, C), D, (B, E)): (A, C)
(((A, C)), D, (B, E)): (A, C)
((A, (C)), D, (B, E)): (C)
A, B, C, D, E:
((A, B), F, (D, E)):
(I cooked up this regex while I was walking home from the
train station. It turned out to work at the first try).
Abigail | [reply] [d/l] |
|
(I cooked up this regex while I was walking home from the train station. It turned out to work at the first try).
Amazing. The most constructive thing I think about on the way home from the train station is if I will be able to get my "welcome home" beer open on the first try.
| [reply] |
Re: Parsing nested parentheses
by hardburn (Abbot) on Nov 18, 2003 at 17:20 UTC
|
AFAIK, CPAN modules for matching balanced text are only going to return what is in the parens, not filter out information inside them. You would use the module to grab the text, then parse it yourself. A few modules I can think of are Text::Balanced and Regexp::Common::balanced.
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
: () { :|:& };:
Note: All code is untested, unless otherwise stated
| [reply] [d/l] |
Re: Parsing nested parentheses
by diotalevi (Canon) on Nov 18, 2003 at 17:17 UTC
|
Simple? Perhaps. For the moment have a gander at Maximal match in a recursive regex which is the bare minimum required if your going to merely capture maximum. My first thought is to follow the $re check with a code block that can check the just-captured result for the contents of your search.
for ( A .. E ) {
print "$_: ";
print '((A, B), C, (D, E))' =~ match( $_ ); # Returns '(D, E)'
print "\n";
}
sub match {
my $search = shift;
my $re;
eval qq[
\$re = qr/
\\( # Opening parenthese
((?: # Capture the contents
[^()]+ # Body of a link
|
(??{\$re}) # Or recurse
)+) # and allow repeats internally
\\) # Closing parenthese
# Test that the just-closed capture block contains $search
+.
(?(?{ -1 != index \$^N, \$search })(?=)|(?!))
/x;
] or die $@;
return $re;
}
| [reply] [d/l] |
|
| [reply] |
|
Oh too bad. It "should". There's also some funny action with stuff being cached. For kicks, try a for ( A .. E ) { ... =~ match( $_ ) } and you'll notice that the expression never shifts off of its initial search.
| [reply] [d/l] |
Re: Parsing nested parentheses
by TheDamian (Vicar) on Nov 18, 2003 at 20:15 UTC
|
This is reasonably straightforward with Regexp::Common, though you do have to "roll-your-own" a little:
use Regexp::Common;
my $balanced = qr/[^()]+|$RE{balanced}{-parens=>'()'}/;
sub extract {
my ($want, $from) = @_;
my $nested = qr/$balanced* \Q$want\E $balanced*/x;
$from =~ m/( \( $nested \) | \Q$want\E )/x;
return $1;
}
my $expr = '(((B,G),A,(A,B)),C,(D,(E))),F';
for ('A'..'H') {
print "$_: ", extract($_,$expr),"\n";
}
Note that I've assumed that, in the edge case where the target is unnested (e.g. 'F' in the above example), the extraction should just return target itself.
Also note the standard regexish "left-most-longest-match" behaviour when the target appears more than once (as do targets 'A' and 'B' in the example). | [reply] [d/l] |
|
I tried this out, but it give a segmentation fault on the line:
$from =~ m/( \( $nested \) | \Q$want\E )/x;
Any ideas?
| [reply] |
|
Any ideas?
Not without a little more information. What version of perl are you using? What version of Regexp::Common?
| [reply] |
Re: Parsing nested parentheses
by fletcher_the_dog (Friar) on Nov 18, 2003 at 19:46 UTC
|
Here is a simple way to do it by keeping track of how deep you are into parenthesis' and what depth your match occured at:
use strict;
my $str = "((A,B),C,(D,E))";
foreach my $m (qw(A B C D E)) {
print $m." => ".FindMinParens($m,$str)."\n";
}
sub FindMinParens{
my $match = shift;
my $string = shift;
my @start_pos;
my $depth = -1;
# keep matching parens or $match
while ($string=~/([()]|$match)/g) {
if ($1 eq "(") {
# record opening paren positions
push @start_pos,pos($string);
}
elsif ($1 eq ")") {
# if we reached the closing parens for the minimum pair
# get the sub string and exit
if ($#start_pos == $depth) {
my $start = $start_pos[-1];
my $len = pos($string) - $start -1;
return substr($string,$start,$len);
}else{
pop @start_pos;
}
}
else {
# store depth of $match
$depth = $#start_pos;
}
}
return "";
}
__OUTPUT__
A => A,B
B => A,B
C => (A,B),C,(D,E)
D => D,E
E => D,E
| [reply] [d/l] |
Re: Parsing nested parentheses
by shotgunefx (Parson) on Nov 18, 2003 at 18:29 UTC
|
This is probably a klunky solution, but it reminded me of a parser I had written.. (Evaluate Expressions.)
Quickly hacking at it and only testing a few simple cases (you've been warned)
You could use the returned structure to search for whatever criteria you desire. I'm sure someone here has a much more elegant solution, but my regex abilities are rather limited...
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $input = '(a,b,(c,d,(e,f,g)))';
print Dumper parse_expression($input);
sub parse_expression {
my $exp = shift;
my @tokens = ();
$exp=~s/\s*([()])\s*/ $1 /go;
# Get tokens
push @tokens, $1 while $exp=~/\G\s*(".*?")/gc or $exp=~/\G\s*('.*?
+')/gc or $exp=~/\G\s*(\S+)/gc;
# Find any parens.
my (@lp,@rp) = ();
for (my $p =0; $p < @tokens; $p++){
if ($tokens[$p] eq '('){
push @lp,$p;
}elsif($tokens[$p] eq ')'){
push @rp,$p;
}
}
if ( @lp != @rp){
warn "Mismatched parens in expression.\n";
return;
}
my @temp = @tokens;
for (my $i=0; $i < @rp; $i++){
my @wanted;
for (my $j = $#lp; $j >= 0 ; $j--){
if ( defined $lp[$j] && $lp[$j] < $rp[$i] ){
(undef,@wanted) = @tokens[ $lp[$j] .. ($rp[$i] - 1 )
+] ;
@tokens[ $lp[$j] .. ($rp[$i]) ] = [ grep {defined $_
+} @wanted];
push @temp, map {split /\s*,\s*/} @wanted;
$lp[$j] = $rp[$i] = undef;
last;
}
}
}
return $tokens[0];
}
__DATA__
# OUTPUT
$VAR1 =
[
'a,b,',
[
'c,d,',
[
'e,f,g'
]
]
];
-Lee
"To be civilized is to deny one's nature."
| [reply] [d/l] |
|
Hi there,
ever thought of using the perl internal functions to import a nested data structure to a LoL? If you do not have to avoid using eval, you could do it real simple like this:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $string = "(a,b,(c,d,(e,f,g)))";
$string =~ s/([^,\(\)]+)/"$1"/gsm; # quote content
$string =~ tr/\(\)/\[\]/; # replace braces
my @LoL; # define list
eval("\@LoL = $test;"); # fill LoL
print Dumper @LoL;
__DATA__
Output:
$VAR1 = [
'a',
'b',
[
'c',
'd',
[
'e',
'f',
'g'
]
]
];
It is not extensively tested, but should work for most items.
- frank | [reply] [d/l] |
Re: Parsing nested parentheses
by Anonymous Monk on Nov 18, 2003 at 17:55 UTC
|
my $expr = '((A,B),C,(D,E))';
print extract('C',$expr),"\n";
print extract('B',$expr),"\n";
print extract('D',$expr),"\n";
sub extract {
my $char = shift;
my ($str,$dup) = (shift) x 2;
1 while $dup =~ s/\([^()$char]*\)/'.' x length $&/e;
$dup =~ m/(\([^()$char]*$char[^()]*\))/;
return substr($str,$-[1],length $1);
}
__END__
# output:
((A,B),C,(D,E))
(A,B)
(D,E)
But you probably want to build a small expression parser.
| [reply] [d/l] |
|
Being an ancient C programmer, I would probably recurse through the string, inspecting a character at a time. A "(" would mean a call to myself with the remainder of the string.
Other characters (except for ")") I would load up into a string, which would be all the stuff within a (). No, better, load them into separate strings, starting a new string after each comma.
A ")" would mean returning, after inspecting the above string(s) for a match. If you find a match, I guess you print out that list, comma delimited just like it was in the original string? Or you could just print the substring (of the original string) just inspected. And yes, you could use a regex for the comparison, but you could just use "eq" to compare the strings, since the commas and parens would already be gone.
To a true Perl programmer, this is probably brute force and ignorance, but I've used this approach before and it can work.
| [reply] |
Re: Parsing nested parentheses
by welchavw (Pilgrim) on Nov 19, 2003 at 17:53 UTC
|
This will not work if all tokens are not the same length, because of the reliance on index, which seems ok given the sample dataset provided. Also, its probably not the best approach, but I believe it works and I had fun producing it.
use strict;
use warnings;
my $val = 'X';
my $min_len_subgroup;
sub chk {
my $subgroup = shift;
if(-1 != index($subgroup, $val)) {
if (!defined $min_len_subgroup || length $subgroup < length $min_l
+en_subgroup) {
$min_len_subgroup = $subgroup;
}
}
}
our $re = qr{
(
\(
(?:
(?> [^()]+ )
|
(??{ $re })
)*
\)
)
(?{ chk($1) })
}x;
my @pats = (
"(1(2(X)))",
"(1(2(X,a))) (3(X,e,f))",
"(1(2(X,b,c))) (3(X,f))",
"(1(2(X,a))) (X,b)",
"(1(2(X,a))) (X,b) X",
"(1(2(X,a))) (X,b) (X)",
);
for my $pat (@pats) {
undef $min_len_subgroup;
if (() = ($pat =~ /$re/g)) {
print "$min_len_subgroup\n";
}
}
,welchavw | [reply] [d/l] [select] |
Re: Parsing nested parentheses
by Anonymous Monk on Nov 21, 2003 at 07:21 UTC
|
my $expr = '((A,B),C,(D,E))';
print extract('C',$expr),"\n";
print extract('B',$expr),"\n";
print extract('D',$expr),"\n";
sub extract {
my $char = shift;
local $_ = shift;
(my $re=$_)=~s/((\()|(\))|.)/$2\Q$1\E$3/gs;
return (sort{length $a <=> length $b}(grep/$char/,eval{/$re/},$@ &
+& die $@))[0];
}
| [reply] [d/l] |
|
|