http://www.perlmonks.org?node_id=932524


in reply to RegEx Against Arbitrary XML Tags

GrandFather's answer is certainly the right way to go, but while we're here, let's look at your regular expression:

  if($line =~ /^\s*<(\w+)>[^\w*|\d*|<]/)

Breaking it down:

^\s*
Anchor the start, zero or more spaces
<(\w+)>
One or more word-characters in <> (so far so good)
[^\w*|\d*|<]
A single character ([..] is a character class) which is not (^) a word character (\w), an asterisk (*), a pipe (|) or a less-than sign (<). Note that \d is already included under \w, and that * and | are literal characters inside a character class.

I'm not quite sure what you wanted at the end there, but parens ((...|...)) are probably closer than a character class ([]).
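To make the breakdown concrete, here is a small sketch that runs the original pattern against a few sample lines. The sample lines and labels are invented for illustration and are not from your data. The case worth noticing is the line that still carries its newline: "\n" is not in the excluded set, so the character class is satisfied there.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $re = qr/^\s*<(\w+)>[^\w*|\d*|<]/;

# Hypothetical sample lines, chosen only to exercise the trailing character class.
my @samples = (
    [ 'bare tag, chomped'  => '<tag>'            ],  # no character left for the class
    [ 'bare tag + newline' => "<tag>\n"          ],  # "\n" is not excluded
    [ 'tag then text'      => '<tag>value</tag>' ],  # 'v' is a word character
    [ 'tag then close tag' => '<tag></tag>'      ],  # '<' is excluded
    [ 'tag then asterisk'  => '<tag>*'           ],  # '*' is literal in a class
    [ 'tag then pipe'      => '<tag>|'           ],  # so is '|'
    [ 'tag then space'     => '<tag> '           ],  # a space is not excluded
);

my %matched;
for my $sample (@samples) {
    my ($label, $line) = @$sample;
    $matched{$label} = ($line =~ $re) ? 1 : 0;
    printf "%-20s %s\n", $label, $matched{$label} ? 'match' : 'no match';
}
```

So whether the pattern accepts a bare tag depends on whether the line has been chomped, which is probably not what was intended.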

Re^2: RegEx Against Arbitrary XML Tags
by onegative (Scribe) on Oct 20, 2011 at 13:39 UTC
    Yes, my intention was never to roll my own, and I am using Twig in parts of my code. The main reason I am now trying to roll my own is that all the XML differencing engines I have encountered, from C to Java based, share the same logical premise. None of them gives me what I am truly looking for: not cascading changes through sibling elements when an element is deleted.

    What I have found is that if an element tree has multiple siblings and you delete one from somewhere in the middle while adding a new element to the same set of siblings, the differencing engine doesn't merely report one removed element and one added element. Instead it reports the element below the deleted one as a changed version of the deleted one, and that cascades down the siblings until the newly added element shows up as a change to what was previously the last sibling.

    This is very difficult to deal with when trying to maintain representations of this data in an RDBMS. Instead of a simple delete-record and add-record, you end up with multiple changes to existing records cascaded down the siblings, with nothing ever indicating that an element was deleted and an element was added. All the differencing engines appear to treat sibling order as a key aspect of detecting changes, hence my attempt to roll my own.

    The reason I am trying to understand the RegEx is to be able to detect tag patterns without having to know the contents of the tags. I don't want to write a matching pattern for every possible tag; that is possible, but I want it to work regardless of the tag name.

    As for the RegEx, what I want is a pattern that matches only ^<ANYTHING>$, with no attributes. But when the XML may also contain lines like <ANYTHING port="7777">, <ANYTHING>someValue</ANYTHING> or <ANYTHING></ANYTHING>, a pattern like /^<(.*)>$/ doesn't just match the bare tag; it also grabs the other three forms. The RegEx I am after should only match the bare <ANYTHING>, and it's become harder than I imagined.

    So I have tried variations such as:
    if($line =~ /^\s*<(\w+)>[^.+]/)
    if($line =~ /^\s*<(\w+)>[^(\w*|\d*|<*)]/)
    if($line =~ /^\s*<(\w+)>([^\w*]|[^\d*]|[^<*])/)
    Just not sure how to overcome this with a RegEx pattern. More complex patterns are easier because there are more items to anchor against, but the simplest tag, <ANYTHING>, is harder than I thought.
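For reference, one way to express "a lone opening tag and nothing else" is to anchor the pattern at both ends instead of testing the character after the closing bracket. This is a minimal sketch, not a general XML matcher: is_bare_tag is a hypothetical helper name, and \w+ is only an approximation of XML names, since it rejects names containing '-', '.' or ':'.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tag name when $line holds nothing but
# a single opening tag without attributes, otherwise undef.
sub is_bare_tag {
    my ($line) = @_;
    # ^\s*  leading whitespace, <(\w+)> the tag, \s*$ only whitespace after it
    return $line =~ /^\s*<(\w+)>\s*$/ ? $1 : undef;
}

# The four forms mentioned above.
for my $line (
    '<ANYTHING>',
    '<ANYTHING port="7777">',
    '<ANYTHING>someValue</ANYTHING>',
    '<ANYTHING></ANYTHING>',
) {
    my $tag = is_bare_tag($line);
    printf "%-32s %s\n", $line, defined $tag ? "bare tag '$tag'" : 'rejected';
}
```

Because \w cannot match a space, a quote, '/' or '>', the capture cannot run past the first closing bracket the way the greedy (.*) does.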

      The catch is that to know one element was deleted and a new one was added you need to know which attribute or subelement is the key. Which is a piece of information the diff doesn't have. For the same reason the diff engine cannot handle reordered elements the way you seem to want.

      I do not think a generic solution is possible and it's hard to give you a specific solution as we do not know your XML.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

      Use XML::Twig with its default handler; you don't have to know the names of the tags ahead of time.

      For arbitrary XML, comparing nodes by position, like a regular diff utility, seems like a reasonable generic approach.

      If your target XML nodes have unique IDs, then it's easy to find changed nodes, since their IDs don't change.

      See also the diff output of XML::SemanticCompare.
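The unique-ID idea can be sketched in plain Perl. This assumes each element has already been reduced to an ID and some serialized form; diff_by_id and the sample data are hypothetical, and how the IDs are extracted from the real XML is left open.

```perl
use strict;
use warnings;

# Hypothetical helper: both arguments are hash references mapping a unique
# element ID to a serialized form of that element. Sibling order is ignored.
sub diff_by_id {
    my ($old, $new) = @_;
    my %diff = (added => [], deleted => [], changed => []);

    for my $id (sort keys %$old) {
        if (!exists $new->{$id}) {
            push @{ $diff{deleted} }, $id;      # present before, gone now
        }
        elsif ($old->{$id} ne $new->{$id}) {
            push @{ $diff{changed} }, $id;      # same ID, different content
        }
    }
    for my $id (sort keys %$new) {
        push @{ $diff{added} }, $id unless exists $old->{$id};
    }
    return \%diff;
}

# Invented data: sibling 'b' is removed from the middle and 'd' is added.
my %old = (a => '<host id="a"/>', b => '<host id="b"/>', c => '<host id="c"/>');
my %new = (a => '<host id="a"/>', c => '<host id="c"/>', d => '<host id="d"/>');

my $diff = diff_by_id(\%old, \%new);
print "$_: @{ $diff->{$_} }\n" for qw(added deleted changed);
```

Keyed this way, the removal and the addition map to one delete and one insert on the database side, with no cascade through the remaining siblings.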