PerlMonks  

I've rewritten "Can I use Perl regular expressions to match balanced text?" in perlfaq6 for Perl 5.10, which has new, nifty regex features that make the answer "yes, and it's easy". You may not look at the perlfaq that much anymore, so here's the new answer for you to enjoy. :)

I expect that there are many more perlfaq answers which can use some Perl 5.10 features. If you find one, let us know at perlfaq-workers@perl.org. You can supply a patch or just point out the answer.


Your first try should probably be the Text::Balanced module, which has been in the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module can also help by providing canned patterns you can use.
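As a minimal sketch of the module route, the snippet below calls extract_bracketed from Text::Balanced on a shortened sample string. The variable names are only illustrative, and the string is chosen to start with the opening bracket because the function expects the delimiter at the start of the text (after optional whitespace) unless you supply a prefix pattern.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Illustrative input: it begins with the opening delimiter.
my $text = "<another group <nested once <nested twice> > > and that's it.";

# In list context the call returns the balanced substring (delimiters
# included) followed by whatever text remains after it.
my ($extracted, $remainder) = extract_bracketed($text, '<>');

print "Extracted: $extracted\n";
print "Remainder: $remainder\n";
```

Here $extracted should hold the whole outer group, nested brackets included, and $remainder the text that follows it.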

As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences.
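For comparison, here is a rough sketch of the older (??{}) trick. It is not taken from the FAQ answer: the variable name $balanced and the sample string are assumptions, and the pattern simply re-inserts itself at match time wherever a nested group may begin.

```perl
use strict;
use warnings;

# Declare first, assign second, so the code block can refer to the variable.
my $balanced;
$balanced = qr/
    <                      # an opening angle bracket
    (?:
        (?> [^<>]+ )       # non-brackets, in an atomic (no backtracking) group
        |
        (??{ $balanced })   # the same pattern again, evaluated at match time
    )*
    >                      # the matching closing bracket
/x;

my $text = 'I have some <brackets in <nested brackets> > and more';
my ($match) = $text =~ /($balanced)/;

print "Matched: $match\n";   # Matched: <brackets in <nested brackets> >
```

The recursive patterns in Perl 5.10 do the same job without embedding Perl code in the regex.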

Here's an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two "major" groups: a group with one level of nesting and a group with two levels of nesting. There are five total groups in angle brackets:

I have some <brackets in <nested brackets> > and <another group <nested once <nested twice> > > and that's it.

The regular expression to match the balanced text uses two new (to Perl 5.10) regular expression features. These are covered in perlre and this example is a modified version of one in that documentation.

First, adding the new possessive + to any quantifier finds the longest match and does not backtrack. That's important since you want to handle any angle brackets through the recursion, not backtracking. The group [^<>]++ finds one or more characters that are not angle brackets, without backtracking.
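A small illustration of what "does not backtrack" means, using a toy string rather than the bracket example (the variable names are made up for this sketch). The greedy a+ gives back one character so the trailing a can match; the possessive a++ keeps everything it took, so the match fails.

```perl
use 5.010;
use strict;
use warnings;

my $text = 'aaa';

# Greedy: a+ first takes all three characters, then backtracks one
# so that the final "a" in the pattern has something to match.
my $greedy = $text =~ /^a+a$/ ? 1 : 0;

# Possessive: a++ takes all three characters and never gives any back,
# so the final "a" in the pattern can never match.
my $possessive = $text =~ /^a++a$/ ? 1 : 0;

print "greedy: $greedy, possessive: $possessive\n";   # greedy: 1, possessive: 0
```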

Second, the new (?PARNO) refers to the sub-pattern in the particular capture buffer given by PARNO. In the following regex, the first capture buffer finds (and remembers) the balanced text, and you need that same pattern within the first buffer to get past the nested text. That's the recursive part. The (?1) uses the pattern in the outer capture buffer as an independent part of the regex.
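To see the difference between reusing a pattern and reusing matched text, this sketch compares (?1) with the backreference \1 on toy strings (all names here are illustrative). (?1) runs the sub-pattern of buffer 1 again, while \1 demands the exact text that buffer 1 captured.

```perl
use 5.010;
use strict;
use warnings;

# (?1) re-runs the pattern [a-z]{2}, so "cd" is accepted after "ab".
my $pattern_reuse = 'abcd' =~ /^([a-z]{2})(?1)$/ ? 1 : 0;

# \1 requires the captured text "ab" again, so "abcd" is rejected...
my $text_reuse = 'abcd' =~ /^([a-z]{2})\1$/ ? 1 : 0;

# ...while "abab" is accepted.
my $text_repeat = 'abab' =~ /^([a-z]{2})\1$/ ? 1 : 0;

print "(?1) on abcd: $pattern_reuse\n";   # (?1) on abcd: 1
print "\\1 on abcd: $text_reuse\n";       # \1 on abcd: 0
print "\\1 on abab: $text_repeat\n";      # \1 on abab: 1
```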

Putting it all together, you have:

    #!/usr/local/bin/perl5.10.0

    my $string =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my @groups = $string =~ m/
            (                   # start of capture buffer 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # found < or >, so recurse to capture buffer 1
                )*
            >                   # match a closing angle bracket
            )                   # end of capture buffer 1
            /xg;

    $" = "\n\t";
    print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle brackets even if they are in other angle brackets too. Each time you get a balanced match, remove its outer delimiter (that's the one you just matched so don't match it again) and add it to a queue of strings to process. Keep doing that until you get no matches:

    #!/usr/local/bin/perl5.10.0

    my @queue =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my $regex = qr/
            (                   # start of bracket 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # recurse to bracket 1
                )*
            >                   # match a closing angle bracket
            )                   # end of bracket 1
            /x;

    $" = "\n\t";

    while( @queue ) {
        my $string = shift @queue;

        my @groups = $string =~ m/$regex/g;
        print "Found:\n\t@groups\n\n" if @groups;

        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

The output shows all of the groups. The outermost matches show up first and the nested matches show up later:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

    Found:
        <nested brackets>

    Found:
        <nested once <nested twice> >

    Found:
        <nested twice>

--
brian d foy <brian@stonehenge.com>
Subscribe to The Perl Review

In reply to perlfaq6: Can I use Perl regular expressions to match balanced text? by brian_d_foy
