http://www.perlmonks.org?node_id=1033324

derekstucki has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing some code that retrieves a shipping rate quote in XML format. All I need from this XML tree (10-20KB in size) is a single value from a node in the middle of it. The two approaches I've considered are parsing the tree with XML::LibXML, or running a very simple regex, along the lines of

/<specificParentNode>.*<nodeINeed>(\d+)<\/nodeINeed>/

Is there any reason why I should use the XML parser rather than the regex approach? My main reason for wanting to use the regex approach is that it is easier to write.

Replies are listed 'Best First'.
Re: XML parsing vs regex
by space_monk (Chaplain) on May 13, 2013 at 18:21 UTC
    Umm, yes: what happens if someone puts a few attributes in your parent node, or in the node you need, and screws up your regex search as a result? Or a space? Regexes for XML will bite you on the bum when you least expect it. Parsing is slower, and can have its own issues, but is generally more predictable.
    # your regex would fail if
    <parentNode id="1234">   # would fail because node now has attributes
    <parentNode >            # just one space is all it takes
    # or this...
    <nodeINeed><!-- regex this comment, sucka! -->12345</nodeINeed>
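To make those failure cases concrete, here is a minimal sketch that runs the pattern from the question against four small documents. The sample documents are invented for illustration; only the element names come from the question, and the pattern is written with m{}-style delimiters plus /s so it compiles and can span lines.

```perl
use strict;
use warnings;

# The pattern from the question. The {} delimiters mean the "/" in the
# closing tag needs no escaping; /s lets "." cross newlines.
my $re = qr{<specificParentNode>.*<nodeINeed>(\d+)</nodeINeed>}s;

# Hypothetical documents: one in the expected shape, then three variants
# that are still well-formed XML carrying the same value.
my %sample = (
    baseline  => '<specificParentNode><nodeINeed>12345</nodeINeed></specificParentNode>',
    attribute => '<specificParentNode id="1234"><nodeINeed>12345</nodeINeed></specificParentNode>',
    space     => '<specificParentNode ><nodeINeed>12345</nodeINeed></specificParentNode>',
    comment   => '<specificParentNode><nodeINeed><!-- hi -->12345</nodeINeed></specificParentNode>',
);

# Record what the regex extracts from each variant (undef means no match).
my %result;
for my $name ( sort keys %sample ) {
    $result{$name} = $sample{$name} =~ $re ? $1 : undef;
    printf "%-9s => %s\n", $name,
        defined $result{$name} ? $result{$name} : 'no match';
}
```

Only the baseline document yields a value; the other three are equivalent as far as an XML parser is concerned, yet the literal pattern no longer matches them.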
    If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)

      What makes you say parsing is slower? I would expect XML::LibXML to be faster than manual file handling + regular expressions. While I have no benchmarks, neither have I made any assertions. :P

        It's an assumption, I grant you, but I think I'm on safe ground in thinking that building a DOM tree out of a document, followed by an XPath search, is very likely to be more time-consuming than a single regex pass. ;-)

        I would be curious to see how close various approaches get though, so if anyone is willing to benchmark, say, XML::LibXML, XML::Twig and a regex, I would like to see the results.
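One way such a comparison could be set up is sketched below with the core Benchmark module. The sample document is synthetic filler of roughly the size mentioned in the question, not a real rate quote, and the XPath expression is an assumption about where the node sits. XML::LibXML is only timed if it is installed; an XML::Twig candidate could be added in the same way. No timings are shown here because they depend on the machine and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical document of roughly 12KB: filler elements around the one we want.
my $filler = join '',
    map { qq{<rate id="$_"><amount>$_</amount></rate>} } 1 .. 150;
my $xml = "<quote>$filler<specificParentNode><nodeINeed>4200</nodeINeed>"
        . "</specificParentNode>$filler</quote>";

# Candidate 1: a single non-greedy regex pass over the string.
sub via_regex {
    return $xml =~ m{<specificParentNode>.*?<nodeINeed>(\d+)</nodeINeed>}s
        ? $1
        : undef;
}

my %candidate = ( regex => \&via_regex );

# Candidate 2: full DOM parse plus XPath lookup. XML::LibXML is not a
# core module, so it is only benchmarked when it can be loaded.
if ( eval { require XML::LibXML; 1 } ) {
    $candidate{libxml} = sub {
        my $doc = XML::LibXML->load_xml( string => $xml );
        return $doc->findvalue('//specificParentNode/nodeINeed');
    };
}

# Run each candidate for about one CPU second and print a comparison table.
cmpthese( -1, \%candidate );
```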

      Well, if the format of the data input changes, then there is a chance that you have to change your program.

      Maybe the XML::LibXML program will have to be changed, or maybe the regex program will have to be changed (or neither, or both). Nobody can know for sure which one; it depends on the nature of the change.

      I know some people will probably shout at me for that, but in such a simple case, I would probably go for a regex. You don't need (and don't want) a cruise missile with an H-bomb in it to kill a mosquito on your arm.

Re: XML parsing vs regex
by mirod (Canon) on May 13, 2013 at 18:52 UTC

    With XML::Twig you don't have to load the entire tree in memory. You could do something like this (untested):

    use XML::Twig;

    my $val;
    XML::Twig->new(
        twig_roots => {
            'specificParentNode/nodeINeed' => sub { $val = $_->text; $_[0]->finish_now; },
        },
    )->parsefile("my.xml");

    Only the nodeINeed element would be in memory, and parsing would stop right after finding the value.

Re: XML parsing vs regex
by LanX (Saint) on May 13, 2013 at 19:30 UTC
    It depends on the complexity and volatility of your XML.

    I myself use regexes for parsing pdftohtml -xml output and have never regretted it (it is a very simple and stable format).

    We don't have the necessary background information to judge.

    Either way, be careful with your regex: at least use a non-greedy quantifier (.*?) and add some plausibility tests on the result. Both would certainly help make your code more robust!
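A minimal sketch of that advice follows. The helper name extract_rate, the sample input, and the two plausibility rules (exactly one match, at most nine digits) are assumptions chosen for illustration, not anything the rate API specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured value, or undef if the
# response does not look the way we expect.
sub extract_rate {
    my ($xml) = @_;

    # Non-greedy .*? stops at the first nodeINeed after the parent tag;
    # /s lets "." cross newlines; /g in list context collects every
    # capture so that unexpected duplicates can be detected.
    my @hits = $xml =~ m{<specificParentNode>.*?<nodeINeed>(\d+)</nodeINeed>}gs;

    return unless @hits == 1;              # expect exactly one match
    return if length( $hits[0] ) > 9;      # assumed upper bound on digits
    return $hits[0];
}

# Usage with an invented multi-line response body.
my $sample = "<quote>\n<specificParentNode>\n<nodeINeed>4200</nodeINeed>\n"
           . "</specificParentNode>\n</quote>";
my $rate = extract_rate($sample);
print defined $rate ? "rate: $rate\n" : "unexpected response format\n";
```

Note that .*? only makes the match shorter; it does not know where specificParentNode ends, so a nodeINeed that appears after the parent's closing tag would still be picked up. That is the kind of case where a parser is the safer tool.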

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: XML parsing vs regex
by Jim (Curate) on May 14, 2013 at 05:53 UTC

    Here are a few rudimentary points that summarize my personal take on the classic XML parser versus regular expressions debate.

    • Perl is a general-purpose scripting language that is especially well-suited for text processing using arbitrarily complex regular expression patterns.
    • XML is plain text. Its inventors chose this simple format intentionally. (At least one of its inventors was a Perl hacker.)
    • All the XML I've ever had to work with has been data-oriented rather than document-oriented. It has been generated by stable software in such a way that its format was uniform, constant and predictable. For the duration of time I've had to work with any particular XML data structure, the format of the XML has never changed.
    • I've mostly ever had to do just two things with XML data using Perl:  make small changes to XML files, or extract small amounts of specific data from them.
    • I know Perl regular expressions well because I use them all the time, for all kinds of applications. I don't know any of the multiple different XML parsing technologies very well (XML::Parser, XML::LibXML, XML::Twig, etc.) because I rarely have to use them.
    • If the XML changes over time, it seems to me most likely to change in ways that would require a Perl script that parses it to be updated regardless of how it's parsing the XML:  either using a proper XML parser such as XML::LibXML or using regular expression patterns.
    • If you need to parse a whole XML data structure into a whole Perl data structure, don't try to write your own XML parser in Perl, silly! That would be senseless and foolhardy.

    Jim

Re: XML parsing vs regex
by vsespb (Chaplain) on May 13, 2013 at 19:17 UTC

    If you will be able to fix the code quickly when the XML changes (spacing, additional attributes, etc.), and such a bug won't cause a disaster, use a regexp.

    In my experience, HTML/XML parsing with regexps can work for years without problems, and can be fixed quickly when a bug appears (if I can easily detect where the bug is).

    If you want something reliable, use XML parsing.

Re: XML parsing vs regex
by Zzenmonk (Sexton) on May 14, 2013 at 13:01 UTC

    I just had the same issue for an application I developed for a client. For lookups of single values I use regexps, since it saves code. For more complex matters I use XML::TreePP, which should also be sufficient for your purpose. Just watch your regexp and the solution should be OK for your case!

    K

    The best medicine against depression is a cold beer!

Re: XML parsing vs regex
by sundialsvc4 (Abbot) on May 13, 2013 at 19:19 UTC

      That Indiana excerpt has the implication that the simple choice is the one to make, whereas personally I think the only way you avoid being poisoned in this case is to choose the complex glitzy method :-P

        Humor was never intended to be taken deeply or to be saturated with implications.   Clearly, when you are dealing with an XML document, you [almost] always want to use a well-proved library to retrieve the elements of interest and serve them to you ... even when a “simpler” solution seems to be suggesting itself.   XML is a devilish beast that is always offering up more committee-bred complications, and ’tis best to let someone else wander into the fray on your behalf.   “Better libxml than me...”   Shortcuts here, well-intentioned though they may be, become a pain in the glutes.

Re: XML parsing vs regex
by derekstucki (Sexton) on May 14, 2013 at 22:51 UTC

    After a careful consideration of the posted answers, I think the regex is the right choice for THIS particular case because:
    The XML in question is retrieved via LWP, so by the time it needs processing, it's in a string, already in memory, so the possible efficiency of the library reading a file is negated.
    The site/API I'm retrieving the XML from is versioned, and I request a response from a particular version as part of my call to it, so the format *shouldn't* ever change.
    So, all the perfectly valid reasons to use the library in general don't apply enough in THIS case to justify the added complexity. Thanks for all the discussion, it helped.