
On XML parsing

by mirod (Canon)
on Dec 14, 2000 at 15:15 UTC (#46601, perlmeditation)

After reading a recent question, and some older ones too, I thought it would be worth mentioning the basic rule of XML processing: Use a parser!

As I know you won't take my word for it, I will give you just a couple of examples of things that might (that will) go wrong if you use plain regexps:

  • XML comments:

    <tag>value 1</tag> <!-- <tag>value 2</tag> --> <tag>value 3</tag>
    will probably hurt you first, then get you to write quite tricky regexps,

  • entities:

    <tag>value 1</tag> &v2; <tag>value 3</tag>
    what will your regexp do with the &v2; entity? Will it look in the appropriate place (right in the DTD, or in a separate file, maybe remote) to get the entity declaration: <!ENTITY v2 "<tag>value 2</tag>">?

  • CDATA:

    <tag>value 1</tag> <tag><![CDATA[ <tag2>value 2</tag2> ]]></tag> <tag>value 3</tag>
    the data inside the CDATA should be treated literally, there is no tag2 element in the document,

  • namespaces:

    <mynamespace:tag>value 1</mynamespace:tag> <theirnamespace:tag>value 3</theirnamespace:tag>
    the 2 tag elements may or may not refer to the same element, depending on the namespace declarations in the document.
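To make the list above concrete, here is a minimal sketch comparing a naive regexp with a real parser on those same fragments. It is in Python rather than Perl only because Python's standard library ships bindings to expat, the parser that XML::Parser is built on. The `<doc>` wrapper element, the `urn:example:a` namespace URI and the function names are assumptions made for illustration; they are not part of the original examples.

```python
import re
import xml.etree.ElementTree as ET

# The kind of naive pattern the list above warns about.
NAIVE_TAG = re.compile(r"<tag>(.*?)</tag>")


def regex_values(xml_text):
    """What the naive regexp believes the <tag> values are."""
    return NAIVE_TAG.findall(xml_text)


def parser_values(xml_text):
    """Text of every <tag> element, as reported by a real XML parser."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("tag")]


def expanded_names(xml_text):
    """Namespace-qualified names of the root's children, as '{uri}local'."""
    return [child.tag for child in ET.fromstring(xml_text)]


# The fragments from the list, wrapped in a <doc> root so they are well-formed.
COMMENT = ("<doc><tag>value 1</tag> <!-- <tag>value 2</tag> --> "
           "<tag>value 3</tag></doc>")
CDATA = ("<doc><tag>value 1</tag> "
         "<tag><![CDATA[ <tag2>value 2</tag2> ]]></tag> "
         "<tag>value 3</tag></doc>")
# The entity declaration lives in the internal DTD subset here; in real life
# it could just as well sit in a separate (possibly remote) file.
ENTITY_DTD = '<!DOCTYPE doc [<!ENTITY v2 "<tag>value 2</tag>">]>'
ENTITY_BODY = "<doc><tag>value 1</tag> &v2; <tag>value 3</tag></doc>"
# Hypothetical URI: both prefixes are bound to the same namespace name.
NAMESPACES = ('<doc xmlns:mynamespace="urn:example:a" '
              'xmlns:theirnamespace="urn:example:a">'
              "<mynamespace:tag>value 1</mynamespace:tag> "
              "<theirnamespace:tag>value 3</theirnamespace:tag></doc>")

if __name__ == "__main__":
    for label, text in (("comment", COMMENT), ("CDATA", CDATA)):
        print(label, "regexp:", regex_values(text))
        print(label, "parser:", parser_values(text))
    print("entity regexp:", regex_values(ENTITY_BODY))
    print("entity parser:", parser_values(ENTITY_DTD + ENTITY_BODY))
    print("namespaces:", expanded_names(NAMESPACES))
```

The regexp reports the commented-out element as data, returns the raw CDATA wrapper instead of its text, and never sees the element hidden behind the entity. The parser, in this sketch, drops the comment, hands back the CDATA content as plain text, expands the entity, and shows that the two differently prefixed tags resolve to the same qualified name.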

Not to mention the usual kind of problem with evolving XML, when the content of the tag element starts including additional mark-up, when the tag element gets a bunch of attributes, or when tag2 elements start popping up in between tag elements.
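A small sketch of that evolution problem, again in Python (whose standard-library XML parser is built on expat, as XML::Parser is). The "evolved" document below is invented for illustration: it adds attributes, inline `<em>` mark-up and a `<tag2>` element to an otherwise unchanged structure.

```python
import re
import xml.etree.ElementTree as ET

# A pattern written against the original, simple document.
NAIVE_TAG = re.compile(r"<tag>(.*?)</tag>")


def regex_values(xml_text):
    """Values found by the regexp that was tuned for the original document."""
    return NAIVE_TAG.findall(xml_text)


def parser_values(xml_text):
    """Full text content of each <tag> element, ignoring attributes and
    flattening any mark-up nested inside it."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter("tag")]


ORIGINAL = "<doc><tag>value 1</tag><tag>value 3</tag></doc>"

# Hypothetical later version of the same data: attributes on <tag>,
# inline mark-up in its content, and a <tag2> element in between.
EVOLVED = ('<doc><tag id="a">value <em>1</em></tag>'
           "<tag2>new</tag2>"
           '<tag id="b" lang="en">value 3</tag></doc>')

if __name__ == "__main__":
    print("original:", regex_values(ORIGINAL), parser_values(ORIGINAL))
    print("evolved: ", regex_values(EVOLVED), parser_values(EVOLVED))
```

On the original document both approaches agree. On the evolved one the regexp silently finds nothing, while the parser-based code keeps returning the same two values without being touched.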

You might think that you don't care about all of those, your XML is simple and you don't need no stinkin' namespaces. WRONG! You are limiting yourself to a subset of XML, but you are NOT calling it a subset. And either you or (pity them!) the people who will maintain your code won't remember that it is only a subset, and what subset. Plus you might have total control over this pseudo-XML today but tomorrow? Maybe you will receive it from some external source, or you will use an off-the-shelf tool to create it.

Plus those extra features that your lovingly crafted regexps don't grok might come in handy in the future. Will you add them to your software? Will you end up writing your own regexp-based parser? It has been done, by the way; it's just that XML::Parser is faster for non-trivial XML, and I happen to trust James Clark (the author of expat, on which XML::Parser is built) more than myself when it comes to writing a parser.

So please, anytime you want to process XML, especially if the software is going to be used for a while, please,

Use the Parser Luke!

Replies are listed 'Best First'.
Re: On XML parsing
by neophyte (Curate) on Dec 14, 2000 at 15:47 UTC
    Right you are, mirod.
    May I add that the same applies to HTML (remember: HTML 4.0 is followed by XHTML 1.0, which is defined by an XML DTD). So parsing HTML is best left to HTML::Parser or similar parsing or templating modules.
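As a rough illustration of the same point for HTML, here is a sketch that pulls link targets out of a page with an event-driven HTML parser instead of a regexp. It uses Python's standard `html.parser` module as a stand-in for HTML::Parser; the `LinkCollector` class, the `extract_links` helper and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser has already lower-cased tag and attribute names and
        # removed whichever quotes surrounded the attribute value.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical sample: mixed case, mixed quoting, and a commented-out link.
SAMPLE = ("<p><A HREF='one.html'>one</A>\n"
          '<!-- <a href="hidden.html">commented out</a> -->\n'
          '<a class="x" href="two.html">two</a></p>')

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

Upper-case tags, single-quoted attributes, extra attributes and commented-out markup are all handled by the parser, so the code that cares about links never has to think about them.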


      Sounds like sage advice from good monks mirod, neophyte and tilly.

      For the great unwashed masses (like myself - gasp) who still parse with regexen, new PM Tutorials on HTML::Parser and XML::Parser or even Parse::RecDescent would make t'sall good. And probably garner a few ++'s.

      Any takers?
          striving for Perl Adept
          (it's pronounced "why-bick")

        You could start by having a look at the module review for XML::Parser. Parse::RecDescent comes with a huge pod (what else would you expect from Damian anyway; it's even in English, no Latin nor Klingon) and, I think, a tutorial somewhere in the .tar file.

Re (tilly) 1: On XML parsing
by tilly (Archbishop) on Dec 14, 2000 at 18:03 UTC
    And this brings up a point I make from time to time as well. Regular expressions are very well suited to finding patterns in text. They are not suited to general purpose parsing. Code that tries to use them for that is hard to write, hard to read, and hard to trust to have covered everything, and you run the risk of exponential failure conditions.

    They are beautiful for breaking text into tokens, or finding tokens of interest.
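One way to picture that division of labour: let a regexp split the input into tokens, and let a small hand-written recursive-descent parser deal with the nesting. The toy arithmetic grammar below is an assumption chosen for illustration (it is not taken from this thread or from Parse::RecDescent), and the sketch is in Python.

```python
import re

# The regexp only has to recognise one token at a time.
TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(text):
    """Split an arithmetic expression into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    """Recursive-descent evaluation; the nesting is handled by recursion,
    which is exactly the part a single regexp cannot express."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term := factor ('*' factor)*
        value = factor()
        while peek() == "*":
            advance()
            value *= factor()
        return value

    def factor():  # factor := NUMBER | '(' expr ')'
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise ValueError(f"unexpected token {token!r}")

    result = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


if __name__ == "__main__":
    print(tokenize("2 * (3 + 4)"))
    print(evaluate("2 * (3 + (4 - 1))"))
```

The regexp stays short and readable because it never has to know about balanced parentheses or precedence; unbalanced input is rejected by the parser with an error rather than silently mis-matched.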

    A good analogy is that REs are a great text-processing hammer. But often you need a screwdriver, and sometimes you are dealing with something fragile.

    See also related discussion at Why I like functional programming and the CPAN module Parse::RecDescent.
