The question
Ido posed,
A (non) reg-ex question, and particularly the responses from
hv and
tilly where they applied recursive regular expressions to the problem piqued my interest. I decided to have a tinker with the example given in the Camel book, 3rd edition pp. 214 which is used to find balanced bracket pairs in a string. I thought I would try to enhance the regular expression with memory groups to see if I could pull out the brackets. So, you see, the pain I have suffered is self-inflicted:-)
I first created some test data and replicated the example in the book in a simple script to test success of failure of the match. Here is the script
#!/usr/bin/perl
#
use strict;
use warnings;
our @toTest = (
"Cont(ains balanced( nested Br(ack)ets )in t)he text",
"Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
"Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
"No brackets in this string",
"Won)ky br(ackets in) this s(tring",
"More wonky br(ackets in) th)is s(tring",
"Just the one( leading bracket",
"And just th)e one trailing bracket",
"So(me m(ultip)le n(est(s in) thi)s o)ne",
"Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
"Some d((oub)le b)rackets",
"ab(())cde",
"ab(c(d)e",
"ab(c)d)e");
our $rxNest;
$rxNest = qr
{(?x)
\(
(?:
(?>[^()]+)
|
(??{$rxNest})
)*
\)
};
testString($_) for @toTest;
sub testString
{
my $string = shift;
print "\n$string\n";
print " Match ", $string =~ /$rxNest/ ?
"succeeded\n" :
"failed\n";
}
and here is the output
Cont(ains balanced( nested Br(ack)ets )in t)he text
Match succeeded
Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
Match succeeded
Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
Match succeeded
No brackets in this string
Match failed
Won)ky br(ackets in) this s(tring
Match succeeded
More wonky br(ackets in) th)is s(tring
Match succeeded
Just the one( leading bracket
Match failed
And just th)e one trailing bracket
Match failed
So(me m(ultip)le n(est(s in) thi)s o)ne
Match succeeded
Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
Match succeeded
Some d((oub)le b)rackets
Match succeeded
ab(())cde
Match succeeded
ab(c(d)e
Match succeeded
ab(c)d)e
Match succeeded
I had assumed without really analysing the regular expression that the match would fail if there were non-balanced brackets but that was obviously not the case. So that's another thing to try and add, I thought, match must fail if brackets don't balance.
Going first to the extraction of the brackets, I added a memory group
...
our $rxNest;
$rxNest = qr
{(?x)
(
\(
(?:
(?>[^()]+)
|
(??{$rxNest})
)*
\)
)
};
...
in the expectation that I could access all of the memory groups created as the match recursed like this
...
if($string =~ /$rxNest/)
{
no strict 'refs';
print " Match succeeded\n";
my $memNo = 1;
while(defined ${$memNo})
{
print " \$$memNo - ${$memNo}\n";
$memNo ++;
}
}
else
{
print " Match failed\n";
}
...
That did not work. It seems to be a scoping issue with each recursion having it's own $1 with only the outer one visible to the code so I only ever saw everything between the outermost brackets.
As it seemed to do with scoping, I though I would be able to do my own memoizing by executing a bit of code in the regular expression when matching was successful. Like this
...
our @memoList;
our $rxNest;
$rxNest = qr
{(?x)
(
\(
(?:
(?>[^()]+)
|
(??{$rxNest})
)*
\)
)
(?{push @memoList, $+})
...
if($string =~ /$rxNest/)
{
print " Match succeeded\n";
print " $_\n" for @memoList;
}
else
{
print " Match failed\n";
}
...
This worked well, eg
Cont(ains balanced( nested Br(ack)ets )in t)he text
Match succeeded
(ack)
( nested Br(ack)ets )
(ains balanced( nested Br(ack)ets )in t)
but showed up some weirdness when trying to match a string with one too many opening brackets
Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
Match succeeded
(ack)
(ced Br(ack)ets, )
(mbalan(ced Br(ack)ets, )one c)
(ack)
(ced Br(ack)ets, )
(mbalan(ced Br(ack)ets, )one c)
The regular expression seemed to go through the string twice so that I saw each nested pair twice. I imagine this is to do with backtracking too far in some way but I can't work out what is going on. I have tried to use re q(debug); on a shorter string with the same imbalance to spot what is going on but I don't really understand the output produced. Can any more savvy Monks shed any light on what is happening here?
Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. The trouble was, whatever I tried did not seem to work, either failing everything or being too lax again. It was only when I broke the expression into three parts and then amalgamated them that I had success. This works
#!/usr/bin/perl
#
use strict;
use warnings;
our @toTest = (
"Cont(ains balanced( nested Br(ack)ets )in t)he text",
"Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
"Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
"No brackets in this string",
"Won)ky br(ackets in) this s(tring",
"More wonky br(ackets in) th)is s(tring",
"Just the one( leading bracket",
"And just th)e one trailing bracket",
"So(me m(ultip)le n(est(s in) thi)s o)ne",
"Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
"Some d((oub)le b)rackets",
"ab(())cde",
"ab(c(d)e",
"ab(c)d)e");
our @memoList;
our ($rxBefore, $rxNest, $rxAfter, $rxWhole);
$rxBefore = $rxAfter = qr{([^()]*)};
$rxNest = qr
{(?x)
(
\(
(?:
(?>[^()]+)
|
(??{$rxNest})
)*
\)
)
(?{push @memoList, $+})
};
$rxWhole = qr{^$rxBefore$rxNest$rxAfter$};
testString($_) for @toTest;
sub testString
{
my $string = shift;
@memoList = ();
print "\nString: $string\n";
if($string =~ /$rxWhole/)
{
print " Match succeeded\n";
print " ---------------\n";
print " Before brackets:-\n";
print " $1\n";
print " Bracket pairs:-\n";
print " $_\n" for @memoList;
print " After brackets:-\n";
print " $3\n";
}
else
{
print " Match failed\n";
}
}
and produces
String: Cont(ains balanced( nested Br(ack)ets )in t)he text
Match succeeded
---------------
Before brackets:-
Cont
Bracket pairs:-
(ack)
( nested Br(ack)ets )
(ains balanced( nested Br(ack)ets )in t)
After brackets:-
he text
String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
Match failed
String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
Match failed
String: No brackets in this string
Match failed
String: Won)ky br(ackets in) this s(tring
Match failed
String: More wonky br(ackets in) th)is s(tring
Match failed
String: Just the one( leading bracket
Match failed
String: And just th)e one trailing bracket
Match failed
String: So(me m(ultip)le n(est(s in) thi)s o)ne
Match succeeded
---------------
Before brackets:-
So
Bracket pairs:-
(ultip)
(s in)
(est(s in) thi)
(me m(ultip)le n(est(s in) thi)s o)
After brackets:-
ne
String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
Match succeeded
---------------
Before brackets:-
Ther
Bracket pairs:-
(re)
(e)
( mo(re) de(e)p )
(g i)
(mul)
(n(g i)n (mul)ti)
(l)
(ti(n(g i)n (mul)ti)p(l)es)
(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)
After brackets:-
re
String: Some d((oub)le b)rackets
Match succeeded
---------------
Before brackets:-
Some d
Bracket pairs:-
(oub)
((oub)le b)
After brackets:-
rackets
String: ab(())cde
Match succeeded
---------------
Before brackets:-
ab
Bracket pairs:-
()
(())
After brackets:-
cde
String: ab(c(d)e
Match failed
String: ab(c)d)e
Match failed
However, if I put exactly the same regular expression syntax into one expression without breaking it down like this
#!/usr/bin/perl
#
use strict;
use warnings;
our @toTest = (
"Cont(ains balanced( nested Br(ack)ets )in t)he text",
"Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
"Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
"No brackets in this string",
"Won)ky br(ackets in) this s(tring",
"More wonky br(ackets in) th)is s(tring",
"Just the one( leading bracket",
"And just th)e one trailing bracket",
"So(me m(ultip)le n(est(s in) thi)s o)ne",
"Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
"Some d((oub)le b)rackets",
"ab(())cde",
"ab(c(d)e",
"ab(c)d)e");
our @memoList;
our $rxNest;
$rxNest = qr
{(?x)
^
([^()]*)
(
\(
(?:
(?>[^()]+)
|
(??{$rxNest})
)*
\)
)
(?{push @memoList, $+})
([^()]*)
$
};
testString($_) for @toTest;
sub testString
{
my $string = shift;
@memoList = ();
print "\nString: $string\n";
if($string =~ /$rxNest/)
{
print " Match succeeded\n";
print " ---------------\n";
print " Before brackets:-\n";
print " $1\n";
print " Bracket pairs:-\n";
print " $_\n" for @memoList;
print " After brackets:-\n";
print " $3\n";
}
else
{
print " Match failed\n";
}
}
Nothing matches
String: Cont(ains balanced( nested Br(ack)ets )in t)he text
Match failed
String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
Match failed
String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
Match failed
String: No brackets in this string
Match failed
String: Won)ky br(ackets in) this s(tring
Match failed
String: More wonky br(ackets in) th)is s(tring
Match failed
String: Just the one( leading bracket
Match failed
String: And just th)e one trailing bracket
Match failed
String: So(me m(ultip)le n(est(s in) thi)s o)ne
Match failed
String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
Match failed
String: Some d((oub)le b)rackets
Match failed
String: ab(())cde
Match failed
String: ab(c(d)e
Match failed
String: ab(c)d)e
Match failed
I can't see any logical difference between the regular expression that is made up of three components and works and the one that is all in one piece and fails. What is going on?
I'm sorry that this post is so long but I am so puzzled. I must be missing something obvious but I can't see what. Please show me what I have overlooked, I'm sure I'll kick myself.
Cheers,
JohnGG
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.