
XML parsing vs Regular expressions

by karpatov (Beadle)
on Feb 16, 2008 at 21:16 UTC (#668353)

karpatov has asked for the wisdom of the Perl Monks concerning the following question:

With the help of the relevant tutorial I cobbled together a script using XML::Twig to extract 2 values (from disparate branches of each record's tree) from quite big(100 000 records) and quite structured database output. XML is new to me, but I somehow believed that it is not "proper" to use regular expressions for this purpose when there is "structure" and a set of tools for manipulating it. But then I started to think that there is no reason to rule out regular expressions, especially when I need just a few values from each record. Are there general guidelines for when to use REs and when to use an XML parsing tool? And are there known pitfalls of either approach? tx karpatov

Replies are listed 'Best First'.
Re: XML parsing vs Regular expressions
by ajt (Prior) on Feb 16, 2008 at 21:59 UTC

    Many an insane person started out sane, before they tried to use regular expressions on XML. While it starts easy, it very quickly descends into chaos. As a general rule, if you are working with XML, use a module built on a real XML parser of some kind. XML::LibXML can be complicated to learn, but it is very fast and complete. XML::Twig is another fast tool, and it even includes a regular-expression-on-XML tool...

      ajt's right. You really do want to use XML tools for processing XML. The only time you may possibly do better with regexes is when you're writing a one-off script that only parses a very regular short file that you've inspected before running the script, and it generally takes a couple of tries even to get that right.

      In other circumstances, just the fact that a real XML parser will throw a huge tantrum on malformed input will already save you a lot of work. And that's without mentioning some of the really nice interfaces that modules like XML::Twig can provide.
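
To make the "tantrum" point concrete, here is a minimal sketch in Python's standard library (not Perl; the record layout and the function names are made up for illustration). A regex happily returns a value from a record with a mismatched closing tag, while a real parser refuses the input outright:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical record with a mismatched closing tag (not well-formed XML).
BROKEN = "<record><name>alpha</name><value>42</vlaue></record>"


def regex_names(text):
    """Regex extraction: returns whatever matches, broken input or not."""
    return re.findall(r"<name>(.*?)</name>", text)


def parser_accepts(text):
    """Return True if a real XML parser accepts the input, False if it raises."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(regex_names(BROKEN))     # the regex never notices the damage
print(parser_accepts(BROKEN))  # the parser rejects the whole document
```

With the regex, the corruption would only surface later, if at all, as a wrong or missing value somewhere downstream.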

Re: XML parsing vs Regular expressions
by Cody Pendant (Prior) on Feb 17, 2008 at 05:23 UTC
    The reasons are the same as for HTML parsing really. Your regular expression will do what you want it to do, probably, and then you'll come to trust it and it will come back to bite you when you meet an unexpected case.

    Does it:

    • ignore code which is commented out?
    • allow for attribute order changing?
    • cope with the characters < and > appearing inside attributes, or CDATA sections?
    There are probably a hundred more things you'd have to think of to make your regular expression solution bullet-proof, by which time you might as well have written your own XML parser.
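
A rough illustration of the checklist above, sketched in Python's standard library rather than Perl. The document, the element and attribute names, and the pattern are all invented for the example. The naive pattern assumes `id` comes before `price` and that `>` never appears inside an attribute value:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one commented-out item, one with swapped attribute
# order, and one with a literal '>' inside an attribute value.
DOC = """<items>
  <!-- <item id="0" price="1.00"/> -->
  <item id="1" price="2.50"/>
  <item price="3.75" id="2"/>
  <item id="3" price="4.00" note="a > b"/>
</items>"""

# Naive pattern: fixed attribute order, no '>' allowed before the closing '/>'.
NAIVE = re.compile(r'<item id="(\d+)" price="([\d.]+)"[^>]*/>')


def with_regex(text):
    """Return (id, price) pairs found by the naive pattern."""
    return NAIVE.findall(text)


def with_parser(text):
    """Return (id, price) pairs for every real <item> element."""
    root = ET.fromstring(text)
    return [(item.get("id"), item.get("price")) for item in root.iter("item")]


print(with_regex(DOC))   # picks up the commented-out item, misses items 2 and 3
print(with_parser(DOC))  # items 1, 2 and 3, and nothing from the comment
```

Each failure can be patched in the pattern, but every patch moves it one step closer to being a hand-written parser.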

    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
Re: XML parsing vs Regular expressions
by grinder (Bishop) on Feb 17, 2008 at 12:14 UTC

    It depends whether you're giving or taking.

    If you are taking the output from one single program emitting its own brew of XML, you will usually find that it is always emitted in exactly the same way, often pretty-printed with indented nested elements, or hard wrapped against column zero all the way down.

    It is extremely rare (in my experience) to encounter XML emitted by a program that is neatly word-wrapped at or before column 72. After all, that takes a lot more work, and most sane programmers have better things to do with their time. Once you figure out empirically how a given program emits its XML, you can count on it being invariant.

    So, as much as it may shock the purists, you can quite easily get away with picking out what you want from a big XML file with a regexp or two, especially if you don't have to worry about context. By context I mean, for example, extracting the contents of element <HG> if the parent is <BAR>, except when the grandparent is <ZONK>.
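
For comparison, this is roughly what the context-sensitive case looks like with a parser. It is a sketch in Python's standard library, not Perl; the sample document and the function name are hypothetical, and only the element names <HG>, <BAR> and <ZONK> come from the example above:

```python
import xml.etree.ElementTree as ET

# Hypothetical document covering the three cases from the example above.
DOC = """<ROOT>
  <BAR><HG>keep me</HG></BAR>
  <ZONK><BAR><HG>skip me</HG></BAR></ZONK>
  <FOO><HG>wrong parent</HG></FOO>
</ROOT>"""


def hg_under_bar_not_zonk(text):
    """Return the text of each <HG> whose parent is <BAR>, unless the grandparent is <ZONK>."""
    root = ET.fromstring(text)
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    found = []
    for hg in root.iter("HG"):
        p = parent.get(hg)
        gp = parent.get(p) if p is not None else None
        if p is not None and p.tag == "BAR" and (gp is None or gp.tag != "ZONK"):
            found.append(hg.text)
    return found


print(hg_under_bar_not_zonk(DOC))
```

The rule stays a three-line condition here; expressing the same ancestry test in a single regexp is where things tend to get fragile.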

    You just need a good test-suite to cover your a.. code, to ensure that things don't break when the source program is upgraded.

    You cannot adopt this approach when it is you who has written the XML specification and you're dealing with how people give you their information according to your spec. Everyone will do it differently and you will indeed have to parse it. Update: or you're taking the information from a web service and thus don't have any control or forewarning when the originating program may be upgraded.

    That's been my rule in dealing with SGML and XML for over 15 years, and it has served me well so far.

    • another intruder with the mooring in the heart of the Perl

Re: XML parsing vs Regular expressions
by planetscape (Chancellor) on Feb 19, 2008 at 16:37 UTC

    I note that you say:

    from quite big(100 000 records)

    I managed to segfault when using regexes to parse very large HTML files; I am certain you could manage to do the same using regexes to parse very large XML files. ;-)

    In other words, don't. Use a module, such as XML::Twig.
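
For files of that size the usual approach is to stream: handle one record at a time and discard it afterwards, which is what XML::Twig's handlers are meant for. The sketch below shows the same idea in Python's standard library, not Perl. The record layout (`meta/id`, `data/result/score`) and the function name are assumptions made up for the example; the original post does not show its data.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical database dump: two values live in different branches of each record.
SAMPLE = b"""<db>
  <record><meta><id>r1</id></meta><data><result><score>0.9</score></result></data></record>
  <record><meta><id>r2</id></meta><data><result><score>0.4</score></result></data></record>
</db>"""


def extract_pairs(source):
    """Yield (id, score) for each <record>, parsing the input incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            yield elem.findtext("meta/id"), elem.findtext("data/result/score")
            # Release the record's subtree once it has been handled; the
            # emptied <record> shell itself stays attached to the root.
            elem.clear()


pairs = list(extract_pairs(io.BytesIO(SAMPLE)))
print(pairs)
```

In real use `source` would be a filename or an open file instead of an in-memory buffer, so the whole document never has to be held as one string.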


Re: XML parsing vs Regular expressions
by Jenda (Abbot) on Feb 18, 2008 at 14:33 UTC

    I'd definitely recommend going with a proper parser. There are several styles of parsers, good for different types of uses. For what you seem to need in this case, you might like XML::Rules. It's designed to let you select the things you are interested in and tweak the structure of the data as it's extracted from the XML file. You might like the style ... or not. In either case it's good to try different styles.
