Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
The question Ido posed, A (non) reg-ex question, and particularly the responses from hv and tilly where they applied recursive regular expressions to the problem piqued my interest. I decided to have a tinker with the example given in the Camel book, 3rd edition pp. 214 which is used to find balanced bracket pairs in a string. I thought I would try to enhance the regular expression with memory groups to see if I could pull out the brackets. So, you see, the pain I have suffered is self-inflicted:-)

I first created some test data and replicated the example in the book in a simple script to test success of failure of the match. Here is the script

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our $rxNest; $rxNest = qr {(?x) \( (?: (?>[^()]+) | (??{$rxNest}) )* \) }; testString($_) for @toTest; sub testString { my $string = shift; print "\n$string\n"; print " Match ", $string =~ /$rxNest/ ? "succeeded\n" : "failed\n"; }

and here is the output

Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match succeeded No brackets in this string Match failed Won)ky br(ackets in) this s(tring Match succeeded More wonky br(ackets in) th)is s(tring Match succeeded Just the one( leading bracket Match failed And just th)e one trailing bracket Match failed So(me m(ultip)le n(est(s in) thi)s o)ne Match succeeded Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match succeeded Some d((oub)le b)rackets Match succeeded ab(())cde Match succeeded ab(c(d)e Match succeeded ab(c)d)e Match succeeded

I had assumed without really analysing the regular expression that the match would fail if there were non-balanced brackets but that was obviously not the case. So that's another thing to try and add, I thought, match must fail if brackets don't balance.

Going first to the extraction of the brackets, I added a memory group

... our $rxNest; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) }; ...

in the expectation that I could access all of the memory groups created as the match recursed like this

... if($string =~ /$rxNest/) { no strict 'refs'; print " Match succeeded\n"; my $memNo = 1; while(defined ${$memNo}) { print " \$$memNo - ${$memNo}\n"; $memNo ++; } } else { print " Match failed\n"; } ...

That did not work. It seems to be a scoping issue with each recursion having it's own $1 with only the outer one visible to the code so I only ever saw everything between the outermost brackets.

As it seemed to do with scoping, I though I would be able to do my own memoizing by executing a bit of code in the regular expression when matching was successful. Like this

... our @memoList; our $rxNest; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) ... if($string =~ /$rxNest/) { print " Match succeeded\n"; print " $_\n" for @memoList; } else { print " Match failed\n"; } ...

This worked well, eg

Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded (ack) ( nested Br(ack)ets ) (ains balanced( nested Br(ack)ets )in t)

but showed up some weirdness when trying to match a string with one too many opening brackets

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c) (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c)

The regular expression seemed to go through the string twice so that I saw each nested pair twice. I imagine this is to do with backtracking too far in some way but I can't work out what is going on. I have tried to use re q(debug); on a shorter string with the same imbalance to spot what is going on but I don't really understand the output produced. Can any more savvy Monks shed any light on what is happening here?

Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. The trouble was, whatever I tried did not seem to work, either failing everything or being too lax again. It was only when I broke the expression into three parts and then amalgamated them that I had success. This works

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our @memoList; our ($rxBefore, $rxNest, $rxAfter, $rxWhole); $rxBefore = $rxAfter = qr{([^()]*)}; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) }; $rxWhole = qr{^$rxBefore$rxNest$rxAfter$}; testString($_) for @toTest; sub testString { my $string = shift; @memoList = (); print "\nString: $string\n"; if($string =~ /$rxWhole/) { print " Match succeeded\n"; print " ---------------\n"; print " Before brackets:-\n"; print " $1\n"; print " Bracket pairs:-\n"; print " $_\n" for @memoList; print " After brackets:-\n"; print " $3\n"; } else { print " Match failed\n"; } }

and produces

String: Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded --------------- Before brackets:- Cont Bracket pairs:- (ack) ( nested Br(ack)ets ) (ains balanced( nested Br(ack)ets )in t) After brackets:- he text String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match failed String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match failed String: No brackets in this string Match failed String: Won)ky br(ackets in) this s(tring Match failed String: More wonky br(ackets in) th)is s(tring Match failed String: Just the one( leading bracket Match failed String: And just th)e one trailing bracket Match failed String: So(me m(ultip)le n(est(s in) thi)s o)ne Match succeeded --------------- Before brackets:- So Bracket pairs:- (ultip) (s in) (est(s in) thi) (me m(ultip)le n(est(s in) thi)s o) After brackets:- ne String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match succeeded --------------- Before brackets:- Ther Bracket pairs:- (re) (e) ( mo(re) de(e)p ) (g i) (mul) (n(g i)n (mul)ti) (l) (ti(n(g i)n (mul)ti)p(l)es) (e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he) After brackets:- re String: Some d((oub)le b)rackets Match succeeded --------------- Before brackets:- Some d Bracket pairs:- (oub) ((oub)le b) After brackets:- rackets String: ab(())cde Match succeeded --------------- Before brackets:- ab Bracket pairs:- () (()) After brackets:- cde String: ab(c(d)e Match failed String: ab(c)d)e Match failed

However, if I put exactly the same regular expression syntax into one expression without breaking it down like this

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our @memoList; our $rxNest; $rxNest = qr {(?x) ^ ([^()]*) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) ([^()]*) $ }; testString($_) for @toTest; sub testString { my $string = shift; @memoList = (); print "\nString: $string\n"; if($string =~ /$rxNest/) { print " Match succeeded\n"; print " ---------------\n"; print " Before brackets:-\n"; print " $1\n"; print " Bracket pairs:-\n"; print " $_\n" for @memoList; print " After brackets:-\n"; print " $3\n"; } else { print " Match failed\n"; } }

Nothing matches

String: Cont(ains balanced( nested Br(ack)ets )in t)he text Match failed String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match failed String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match failed String: No brackets in this string Match failed String: Won)ky br(ackets in) this s(tring Match failed String: More wonky br(ackets in) th)is s(tring Match failed String: Just the one( leading bracket Match failed String: And just th)e one trailing bracket Match failed String: So(me m(ultip)le n(est(s in) thi)s o)ne Match failed String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match failed String: Some d((oub)le b)rackets Match failed String: ab(())cde Match failed String: ab(c(d)e Match failed String: ab(c)d)e Match failed

I can't see any logical difference between the regular expression that is made up of three components and works and the one that is all in one piece and fails. What is going on?

I'm sorry that this post is so long but I am so puzzled. I must be missing something obvious but I can't see what. Please show me what I have overlooked, I'm sure I'll kick myself.

Cheers,

JohnGG


In reply to Recursive regular expression weirdness by johngg

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2024-04-18 01:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found