onegative has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Honorable Monks,
I am attempting to use RegEx to match specific XML tags and can't seem to figure out why I can't get the results that I expect.
Here is an example xml doc:

<ROOT hostname="bumblebee" tstamp="2011/09/21 22:24:05">
  <APPLICATION>
    <PORT>7777</PORT>
    <APP_HOME>/extra/localcw/opt/APP/sun4</APP_HOME>
    <VERSION>V36.11.01</VERSION>
    <PERF_HOME>/usr/localcw/opt/APP/Solaris-2-9-sparc-64</PERF_HOME>
    <PERF_VERSION>glanceSunOS 5.9 (Solaris 9) (sparc, 64 Bit) 7.3.00.6059 Jul 19 2006</PERF_VERSION>
    <STAR_VERSION>3.0</STAR_VERSION>
    <DEFAULT_ACCT>root</DEFAULT_ACCT>
    <HISTORY_RETENTION>90</HISTORY_RETENTION>
    <LAST_FILE_DOWN>StAR-201105090928.tar</LAST_FILE_DOWN>
    <LAST_STATUS>No download file found</LAST_STATUS>
    <ACL>
      <ACCOUNT id="f9a64ef61c">
        <MD5>f9a64ef61c</MD5>
        <USERNAME>*</USERNAME>
        <HOST>flower</HOST>
        <PERMISSION>P</PERMISSION>
      </ACCOUNT>
    </ACL>
  </APPLICATION>
</ROOT>

So I have basically 5 different distinct XML tag formats to match against.

<openingTagName attribute="whatever">

But when I try to match only <openingTagName> (a bare opening tag with no attributes, like <APPLICATION> and <ACL> in the XML above) using something like the following, it doesn't match what I expect. Can someone give me a hint or two on how to grab only <APPLICATION> and <ACL>? Once I understand that, it should help me get through the others myself.

foreach my $line (@xml) {
    chomp($line);
    if($line =~ /^\s*<(\w+)>[^\w*|\d*|<]/) {
        print "$1\n";
    }
}
I would have thought the negated character classes would have eliminated any and all values after the initial >, but they don't. I have spent quite a while on different patterns but none work. Any and all suggestions will be greatly appreciated.


Replies are listed 'Best First'.
Re: RegEx Against Arbitrary XML Tags
by GrandFather (Saint) on Oct 19, 2011 at 21:50 UTC

    Almost certainly you don't want to parse XML using hand-rolled code. Instead use one of the many XML parsing modules (XML::Twig is highly recommended). Robustly parsing XML is hard, and you will spend much more time trying to get it right than you will learning to use a module to do the heavy lifting for you. Consider:

    use warnings;
    use strict;
    use XML::Twig;

    my $xml = <<XML;
    <ROOT hostname="bumblebee" tstamp="2011/09/21 22:24:05">
      <APPLICATION>
        <PORT>7777</PORT>
        <APP_HOME>/extra/localcw/opt/APP/sun4</APP_HOME>
        <VERSION>V36.11.01</VERSION>
        <PERF_HOME>/usr/localcw/opt/APP/Solaris-2-9-sparc-64</PERF_HOME>
        <PERF_VERSION>glanceSunOS 5.9 (Solaris 9) (sparc, 64 Bit) Jul 19 2006</PERF_VERSION>
        <STAR_VERSION>3.0</STAR_VERSION>
        <DEFAULT_ACCT>root</DEFAULT_ACCT>
        <HISTORY_RETENTION>90</HISTORY_RETENTION>
        <LAST_FILE_DOWN>StAR-201105090928.tar</LAST_FILE_DOWN>
        <LAST_STATUS>No download file found</LAST_STATUS>
        <ACL>
          <ACCOUNT id="f9a64ef61c">
            <MD5>f9a64ef61c</MD5>
            <USERNAME>*</USERNAME>
            <HOST>flower</HOST>
            <PERMISSION>P</PERMISSION>
          </ACCOUNT>
        </ACL>
      </APPLICATION>
    </ROOT>
    XML

    my $twig = XML::Twig->new(
        twig_roots => {'APPLICATION' => \&doStuff, 'ACL' => \&doStuff}
    );

    $twig->parse($xml);

    sub doStuff {
        my ($t, $elt) = @_;
        print "Found ", $elt->tag(), "\n";
        $t->purge;    # frees the memory
    }


    True laziness is hard work
Re: RegEx Against Arbitrary XML Tags
by anneli (Pilgrim) on Oct 19, 2011 at 23:09 UTC

    GrandFather's answer is certainly the right way to go, but while we're here, let's look at your regular expression:

      if($line =~ /^\s*<(\w+)>[^\w*|\d*|<]/)

    Breaking it down:

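    Anchor at the start of the line, then zero or more whitespace characters (\s*)
    One or more word-characters in <> (so far so good)
    A single character ([..] is a character class) which is not (^) a word character (\w), an asterisk (*), a pipe (|) or a less-than sign (<). Note that \d is included under \w, and that the class must consume exactly one character, so a line that ends right after the > cannot match.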

    I'm not quite sure what you wanted at the end there, but parens ((...|...)) are probably closer than a character class ([]).
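To see the breakdown above in action, here is a minimal sketch that runs the pattern from the question, unchanged, against a few lines taken from the sample XML. The helper name original_match is made up for illustration. Since the class has to consume one character, a bare tag line has nothing left after the > and fails, while a value starting with a character outside the class (such as /) lets the line through.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $original = qr/^\s*<(\w+)>[^\w*|\d*|<]/;

# Hypothetical helper: returns the captured tag name if the original
# pattern matches the line, otherwise undef.
sub original_match {
    my ($line) = @_;
    return $line =~ $original ? $1 : undef;
}

for my $line (
    '<APPLICATION>',                                      # nothing after '>'
    '<PORT>7777</PORT>',                                  # '7' is a word character
    '<USERNAME>*</USERNAME>',                             # '*' is listed in the class
    '<APP_HOME>/extra/localcw/opt/APP/sun4</APP_HOME>',   # '/' is not in the class
) {
    my $tag = original_match($line);
    printf "%-50s %s\n", $line,
        defined $tag ? "matches, \$1 = $tag" : 'no match';
}
```

Only the APP_HOME line matches here, which is the opposite of what the question was after.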

      Yes, my intention has never been to roll my own, and I am using Twig in parts of my code. The main reason I am trying to roll my own is that all the XML differencing engines I have encountered share the same logical premise, and I have looked at a bunch of them, from C-based to Java-based. Unfortunately, none of them provides what I am truly looking for, which is not to cascade changes through sibling elements when an element is deleted.

      What I have found is this: if you have multiple siblings in an element tree, and you delete one element somewhere in the middle and add a new element to the sibling tree at the same time, the differencing engine doesn't merely remove the deleted element and add the new one. Instead, it reports the element below the deleted element as a changed version of the deleted element, and that cascades down the sibling tree until the newly added element shows up as a change to what was previously the last element. This is very difficult to deal with when trying to maintain representations of this data in an RDBMS. Instead of a simple delete-record and add-record, you end up with multiple changes to existing records cascaded down the sibling tree, with nothing ever indicating that an element was deleted and an element was added. Somehow all the differencing engines appear to treat sibling element order as a key aspect of watching for changes, hence my intention to try and roll my own.

      The reason I am trying to understand the RegEx is to be able to detect tag patterns without having to know the contents of the tags. I don't want to write a matching pattern for every possible tag; that is possible, but I want it to function regardless of the tag name.

      As far as the RegEx goes, what I want is a pattern that matches only ^<ANYTHING>$, with no attributes. But when the XML may also contain lines like <ANYTHING port="7777"> or <ANYTHING>someValue</ANYTHING> or <ANYTHING></ANYTHING>, a pattern like /^<(.*)>$/ doesn't just get the first form; it also grabs the second, third and fourth. The RegEx I am trying to understand should grab only the first, <ANYTHING>, and it's become harder than I had imagined.

      So I have tried variations such as:
      if($line =~ /^\s*<(\w+)>[^.+]/)
      if($line =~ /^\s*<(\w+)>[^(\w*|\d*|<*)]/)
      if($line =~ /^\s*<(\w+)>([^\w*]|[^\d*]|[^<*])/)
      I'm just not sure how to overcome this with a RegEx pattern. More complex patterns are easier because you have more items to anchor against, but the simplest tag, <ANYTHING>, is harder than I thought.
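One way to approach this, sketched here under the assumption that each tag sits on its own line as in the sample XML, is to anchor both ends and allow only word characters between < and >. The greedy .* in /^<(.*)>$/ happily swallows spaces, quotes and further angle brackets, which is why it accepts all four shapes; \w+ cannot, and \s*$ insists that nothing but whitespace follows the >. The helper name bare_open_tag is made up for illustration, and \w+ does not cover tag names containing hyphens, dots or namespace colons.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tag name when the whole line is a single
# attribute-free opening tag, otherwise undef.
sub bare_open_tag {
    my ($line) = @_;
    return $line =~ /^\s*<(\w+)>\s*$/ ? $1 : undef;
}

# The four line shapes described in the post above.
for my $line (
    '<ANYTHING>',
    '<ANYTHING port="7777">',
    '<ANYTHING>someValue</ANYTHING>',
    '<ANYTHING></ANYTHING>',
) {
    my $tag = bare_open_tag($line);
    printf "%-32s %s\n", $line, defined $tag ? "bare tag: $tag" : 'skipped';
}
```

This is still line-oriented matching, so it remains fragile against XML that is formatted differently (several tags per line, comments, CDATA); it only illustrates how to pin the pattern down.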

        The catch is that to know one element was deleted and a new one was added you need to know which attribute or subelement is the key. Which is a piece of information the diff doesn't have. For the same reason the diff engine cannot handle reordered elements the way you seem to want.

        I do not think a generic solution is possible and it's hard to give you a specific solution as we do not know your XML.

        Enoch was right!
        Enjoy the last years of Rome.

        Use XML::Twig with a default handler; you don't have to know the names of tags ahead of time.
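As a rough sketch of that idea, XML::Twig accepts the special handler name _all_ (fired for every element) alongside _default_ (fired for elements with no other handler). The example below uses _all_ to collect elements that have no attributes and contain at least one child element, without naming any tag in advance. The function name container_tags and the trimmed-down XML are assumptions made for illustration; the filter itself is only one possible reading of "tags like <APPLICATION> and <ACL>".

```perl
use strict;
use warnings;
use XML::Twig;

# Returns the tags of elements that carry no attributes and contain at least
# one child element, in the order their closing tags are parsed.
sub container_tags {
    my ($xml) = @_;
    my @found;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # '_all_' fires for every element, whatever its name.
            _all_ => sub {
                my ($t, $elt) = @_;
                my $atts = $elt->atts || {};
                return 1 if %$atts;    # skip elements with attributes
                # skip elements holding only text, or nothing at all
                return 1 unless grep { $_->is_elt } $elt->children;
                push @found, $elt->tag;
                return 1;
            },
        },
    );
    $twig->parse($xml);
    return @found;
}

# Trimmed-down version of the sample document, for illustration only.
my $xml = <<'XML';
<ROOT hostname="bumblebee">
  <APPLICATION>
    <PORT>7777</PORT>
    <ACL>
      <ACCOUNT id="f9a64ef61c">
        <HOST>flower</HOST>
      </ACCOUNT>
    </ACL>
  </APPLICATION>
</ROOT>
XML

print "$_\n" for container_tags($xml);
```

Handlers run when an element's closing tag is parsed, so inner elements are reported before the elements that contain them.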

        For arbitrary XML, comparing documents by position, like a regular diff utility does, seems like a reasonable generic approach.

        If your target XML nodes have unique IDs, then it's easy to find changed nodes, since their IDs don't change.
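A minimal sketch of ID-based comparison, assuming each node has already been reduced to an id and a string describing its content (how that reduction is done depends on the XML and is not shown). The function name diff_by_id and all the sample data except the first id are made up. Because nodes are looked up by id and never by position, deleting a node from the middle and adding another is reported as exactly one removal and one addition, with no cascade through the siblings.

```perl
use strict;
use warnings;

# Compares two snapshots, each a hash ref of { id => content_string }.
# Returns a hash ref holding sorted lists of added, removed and changed ids.
sub diff_by_id {
    my ($old, $new) = @_;
    my %diff = (added => [], removed => [], changed => []);
    for my $id (sort keys %$new) {
        if    (!exists $old->{$id})        { push @{ $diff{added} },   $id }
        elsif ($old->{$id} ne $new->{$id}) { push @{ $diff{changed} }, $id }
    }
    for my $id (sort keys %$old) {
        push @{ $diff{removed} }, $id unless exists $new->{$id};
    }
    return \%diff;
}

# Illustrative data: one node removed from the middle, one node added.
my %before = (f9a64ef61c => 'flower', aaa111 => 'tulip', bbb222 => 'rose');
my %after  = (f9a64ef61c => 'flower', bbb222 => 'rose',  ccc333 => 'daisy');

my $d = diff_by_id(\%before, \%after);
print "added:   @{ $d->{added} }\n";
print "removed: @{ $d->{removed} }\n";
print "changed: @{ $d->{changed} }\n";
```

The three lists map directly onto insert, delete and update statements when the data is mirrored in a database table keyed on the same id.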

        See also diff output of XML::SemanticCompare

Re: RegEx Against Arbitrary XML Tags
by JavaFan (Canon) on Oct 20, 2011 at 11:22 UTC
    Can someone give me a hint or two on how to grab only <APPLICATION> and <ACL>
    I'd write that as