comment on

The question Ido posed, A (non) reg-ex question, and particularly the responses from hv and tilly where they applied recursive regular expressions to the problem piqued my interest. I decided to have a tinker with the example given in the Camel book, 3rd edition pp. 214 which is used to find balanced bracket pairs in a string. I thought I would try to enhance the regular expression with memory groups to see if I could pull out the brackets. So, you see, the pain I have suffered is self-inflicted:-)

I first created some test data and replicated the example in the book in a simple script to test success of failure of the match. Here is the script

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our $rxNest;
$rxNest = qr
   {(?x)
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
   };

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    print "\n$string\n";
    print "     Match ", $string =~ /$rxNest/ ?
       "succeeded\n" :
       "failed\n";
}
[download]

and here is the output

Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match succeeded

Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match succeeded

No brackets in this string
     Match failed

Won)ky br(ackets in) this s(tring
     Match succeeded

More wonky br(ackets in) th)is s(tring
     Match succeeded

Just the one( leading bracket
     Match failed

And just th)e one trailing bracket
     Match failed

So(me m(ultip)le n(est(s in) thi)s o)ne
     Match succeeded

Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match succeeded

Some d((oub)le b)rackets
     Match succeeded

ab(())cde
     Match succeeded

ab(c(d)e
     Match succeeded

ab(c)d)e
     Match succeeded
[download]

I had assumed without really analysing the regular expression that the match would fail if there were non-balanced brackets but that was obviously not the case. So that's another thing to try and add, I thought, match must fail if brackets don't balance.

Going first to the extraction of the brackets, I added a memory group

...
our $rxNest;
$rxNest = qr
   {(?x)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
   };
...
[download]

in the expectation that I could access all of the memory groups created as the match recursed like this

...
    if($string =~ /$rxNest/)
    {
        no strict 'refs';
        print "     Match succeeded\n";
        my $memNo = 1;
        while(defined ${$memNo})
        {
            print "     \$$memNo - ${$memNo}\n";
            $memNo ++;
        }
    }
    else
    {
        print "     Match failed\n";
    }
...
[download]

That did not work. It seems to be a scoping issue with each recursion having it's own $1 with only the outer one visible to the code so I only ever saw everything between the outermost brackets.

As it seemed to do with scoping, I though I would be able to do my own memoizing by executing a bit of code in the regular expression when matching was successful. Like this

...
our @memoList;
our $rxNest;
$rxNest = qr
   {(?x)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
...
    if($string =~ /$rxNest/)
    {
        print "     Match succeeded\n";
        print "     $_\n" for @memoList;
    }
    else
    {
        print "     Match failed\n";
    }
...
[download]

This worked well, eg

Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded
     (ack)
     ( nested Br(ack)ets )
     (ains balanced( nested Br(ack)ets )in t)
[download]

but showed up some weirdness when trying to match a string with one too many opening brackets

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match succeeded
     (ack)
     (ced Br(ack)ets, )
     (mbalan(ced Br(ack)ets, )one c)
     (ack)
     (ced Br(ack)ets, )
     (mbalan(ced Br(ack)ets, )one c)
[download]

The regular expression seemed to go through the string twice so that I saw each nested pair twice. I imagine this is to do with backtracking too far in some way but I can't work out what is going on. I have tried to use re q(debug); on a shorter string with the same imbalance to spot what is going on but I don't really understand the output produced. Can any more savvy Monks shed any light on what is happening here?

Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. The trouble was, whatever I tried did not seem to work, either failing everything or being too lax again. It was only when I broke the expression into three parts and then amalgamated them that I had success. This works

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our @memoList;

our ($rxBefore, $rxNest, $rxAfter, $rxWhole);
$rxBefore = $rxAfter = qr{([^()]*)};
$rxNest = qr
   {(?x)
       (
       \(
       (?:
       (?>[^()]+)
       |
       (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
   };
$rxWhole = qr{^$rxBefore$rxNest$rxAfter$};

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    @memoList = ();
    print "\nString: $string\n";
    if($string =~ /$rxWhole/)
    {
        print "     Match succeeded\n";
        print "     ---------------\n";
        print "     Before brackets:-\n";
        print "          $1\n";
        print "     Bracket pairs:-\n";
        print "          $_\n" for @memoList;
        print "     After brackets:-\n";
        print "          $3\n";
    }
    else
    {
        print "     Match failed\n";
    }
}
[download]

and produces

String: Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded
     ---------------
     Before brackets:-
          Cont
     Bracket pairs:-
          (ack)
          ( nested Br(ack)ets )
          (ains balanced( nested Br(ack)ets )in t)
     After brackets:-
          he text

String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match failed

String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match failed

String: No brackets in this string
     Match failed

String: Won)ky br(ackets in) this s(tring
     Match failed

String: More wonky br(ackets in) th)is s(tring
     Match failed

String: Just the one( leading bracket
     Match failed

String: And just th)e one trailing bracket
     Match failed

String: So(me m(ultip)le n(est(s in) thi)s o)ne
     Match succeeded
     ---------------
     Before brackets:-
          So
     Bracket pairs:-
          (ultip)
          (s in)
          (est(s in) thi)
          (me m(ultip)le n(est(s in) thi)s o)
     After brackets:-
          ne

String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match succeeded
     ---------------
     Before brackets:-
          Ther
     Bracket pairs:-
          (re)
          (e)
          ( mo(re) de(e)p )
          (g i)
          (mul)

          (n(g i)n (mul)ti)
          (l)
          (ti(n(g i)n (mul)ti)p(l)es)
          (e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)
     After brackets:-
          re

String: Some d((oub)le b)rackets
     Match succeeded
     ---------------
     Before brackets:-
          Some d
     Bracket pairs:-
          (oub)
          ((oub)le b)
     After brackets:-
          rackets

String: ab(())cde
     Match succeeded
     ---------------
     Before brackets:-
          ab
     Bracket pairs:-
          ()
          (())
     After brackets:-
          cde

String: ab(c(d)e
     Match failed

String: ab(c)d)e
     Match failed
[download]

However, if I put exactly the same regular expression syntax into one expression without breaking it down like this

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our @memoList;

our $rxNest;
$rxNest = qr
   {(?x)
       ^
       ([^()]*)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
       ([^()]*)
       $
   };

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    @memoList = ();
    print "\nString: $string\n";
    if($string =~ /$rxNest/)
    {
        print "     Match succeeded\n";
        print "     ---------------\n";
        print "     Before brackets:-\n";
        print "          $1\n";
        print "     Bracket pairs:-\n";
        print "          $_\n" for @memoList;
        print "     After brackets:-\n";
        print "          $3\n";
    }
    else
    {
        print "     Match failed\n";
    }
}
[download]

Nothing matches

String: Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match failed

String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match failed

String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match failed

String: No brackets in this string
     Match failed

String: Won)ky br(ackets in) this s(tring
     Match failed

String: More wonky br(ackets in) th)is s(tring
     Match failed

String: Just the one( leading bracket
     Match failed

String: And just th)e one trailing bracket
     Match failed

String: So(me m(ultip)le n(est(s in) thi)s o)ne
     Match failed

String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match failed

String: Some d((oub)le b)rackets
     Match failed

String: ab(())cde
     Match failed

String: ab(c(d)e
     Match failed

String: ab(c)d)e
     Match failed
[download]

I can't see any logical difference between the regular expression that is made up of three components and works and the one that is all in one piece and fails. What is going on?

I'm sorry that this post is so long but I am so puzzled. I must be missing something obvious but I can't see what. Please show me what I have overlooked, I'm sure I'll kick myself.

Cheers,

JohnGG

In reply to Recursive regular expression weirdness by johngg

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


"be consistent"
	PerlMonks