moxliukas has asked for the wisdom of the Perl Monks concerning the following question:


I know this question has been raised before, and I did try Super Search for it, but none of the answers was quite on the spot for me.

The problem is as follows:
I am writing a script that parses XML (and no, using XML::Parser or XML::Simple is not an option, as you need to compile them in order to use them -- otherwise it would be no problem). This XML has some tags that can nest inside each other, namely in this manner:

I need to get the stuff between the balanced <value> pairs. Obviously, the (.*?) approach does not work, as it selects <value>...<value>...</value>. I suspect that one might need to use extended regular expressions for this (backtracking?), but could you point me to the right track? (Or is it a much more difficult problem that cannot be solved using only regexps?)

Thank you in advance

Replies are listed 'Best First'.
•Re: Balanced delimiter parsing
by merlyn (Sage) on Oct 04, 2002 at 14:58 UTC
Re: Balanced delimiter parsing
by Zaxo (Archbishop) on Oct 04, 2002 at 15:05 UTC

    &Text::Balanced::extract_tagged will do what you want. It is a core module.

    After Compline,

Re: Balanced delimiter parsing
by broquaint (Abbot) on Oct 04, 2002 at 14:58 UTC
    If you want to use XML::Parser and its ilk then you should just download a binary version of it and copy the right files to the appropriate places. If, however, you can't find a binary version of XML::Parser and you really are stuck up the proverbial creek, then you might want to check out the extract_tagged function in Text::Balanced. From the docs:
    # Extract the initial substring of $text that is bounded by
    # a C<BEGIN>...C<END> pair. Don't allow nested C<BEGIN> tags
    ($extracted, $remainder) =
        extract_tagged($text,"BEGIN","END",undef,{bad=>["BEGIN"]});
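Applied to the <value> tags from the question, a minimal sketch might look like the following. The sample string in $text is made up for illustration, and the `bad` option from the docs excerpt is left out on the assumption that nested <value> tags should be allowed rather than rejected. Note that extract_tagged only matches at the start of the string (after optional whitespace), so any leading text has to be skipped first.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical sample input: one <value> nested inside another.
my $text = '<value>outer <value>inner</value> tail</value> rest';

# In list context extract_tagged returns, in order:
# the extracted substring (tags included), the remainder, the skipped
# prefix, the opening tag, the text between the tags, the closing tag.
my ($extracted, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($text, '<value>', '</value>');

print "extracted: $extracted\n";
print "contents:  $inner\n";
print "remainder: $remainder\n";
```

The tag arguments are treated as patterns, so tags containing regex metacharacters would need escaping.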



Re: Balanced delimiter parsing
by helgi (Hermit) on Oct 04, 2002 at 16:20 UTC
    Yours is a Frequently Asked Question. The answer to it can be found in the standard Perl documentation by using the command perldoc -q balanced, also known as:

    Can I use Perl regular expressions to match balanced text?

    Although Perl regular expressions are more powerful than "mathematical" regular expressions because they feature conveniences like backreferences ("\1" and its ilk), they still aren't powerful enough--with the possible exception of bizarre and experimental features in the development-track releases of Perl. You still need to use non-regex techniques to parse balanced text, such as the text enclosed between matching parentheses or braces, for example.

    An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like "`" and "'", "{" and "}", or "(" and ")" can be found in +es.gz

    The C::Scan module from CPAN contains such subs for internal use, but they are undocumented.

    Further information can be found with perldoc -q "How do I remove HTML", which, despite the name, is directly applicable to XML.
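The "non-regex techniques" the FAQ refers to usually come down to keeping track of nesting depth while scanning. A minimal sketch, assuming the only tags of interest are literal <value> and </value> with no attributes; the subroutine name outermost_values is hypothetical and not taken from the FAQ:

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of every outermost
# <value>...</value> pair, using a depth counter to track nesting.
sub outermost_values {
    my ($text) = @_;
    my @found;
    my $depth = 0;
    my $start;    # offset just past the outermost opening tag

    # The regex is only a tokenizer here; the balancing is done by $depth.
    while ($text =~ m{(</?value>)}g) {
        my $tag = $1;
        if ($tag eq '<value>') {
            $start = pos($text) if $depth == 0;
            $depth++;
        }
        else {
            die "unbalanced </value>\n" if $depth == 0;
            $depth--;
            if ($depth == 0) {
                my $end = pos($text) - length $tag;
                push @found, substr($text, $start, $end - $start);
            }
        }
    }
    die "unclosed <value>\n" if $depth > 0;
    return @found;
}

print "$_\n"
    for outermost_values('a<value>b<value>c</value>d</value>e<value>f</value>');
```

For the sample string this prints `b<value>c</value>d` and then `f`. Calling the subroutine again on each result would descend one nesting level at a time.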

    Helgi Briem
    helgi AT decode DOT is

Re: Balanced delimiter parsing
by Abigail-II (Bishop) on Oct 07, 2002 at 11:23 UTC
    use Regexp::Common;
    qr !$RE{balanced}{-begin => "<value>"}{-end => "</value>"}{-keep}!;
Re: Balanced delimiter parsing
by I0 (Priest) on Oct 06, 2002 at 06:43 UTC
    Using only regexps
    $_="...<value>...<value>...</value>...</value>...";
    ($re=$_)=~s#((<value>)|(</value>)|.)#${[')','']}[!$3]\Q$1\E${['(','']}[!$2]#gs;
    print join "\n",eval{/$re/};
    die $@ if $@=~/unmatched/i;
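On newer perls there is another regex-only route that postdates this thread: recursive patterns, available from Perl 5.10. The sketch below is an illustration under that assumption, not something proposed by any poster above; the group names pair and body and the sample string are made up.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and named captures need Perl 5.10+

# Hypothetical sample input with one level of nesting.
my $text = '...<value>a<value>b</value>c</value>...';

# (?&pair) recurses into the group named "pair", so a nested
# <value>...</value> is consumed as a unit; any other character is
# accepted only if it does not begin an opening or closing tag.
my $balanced = qr{
    (?<pair>
        <value>
        (?<body>
            (?: (?&pair) | (?!</?value>) . )*
        )
        </value>
    )
}xs;

if ($text =~ $balanced) {
    say $+{body};
}
```

Here $+{body} holds the contents of the outermost pair, `a<value>b</value>c`, because captures set during the recursion are restored when it returns.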