PerlMonks  

xml parsers: do I need one?

by regan (Sexton)
on Aug 28, 2003 at 14:05 UTC ( #287366=perlquestion )
regan has asked for the wisdom of the Perl Monks concerning the following question:

<suckup> oh Holders of great Perl Wisdom </suckup>

I wrote a few weeks ago about a problem with hashes, and mentioned that I was parsing XML with regexes. I got more comments telling me to use a parser than help with the problem, but the problem was solved anyway. I promised to investigate using a parser for my work. I've done a bit of looking and am not sure what to do now. Here's my dilemma:
-I am parsing an XML file with about 9,000 top-level elements.
-The XML file is about 3.3MB long.
-The XML file is generated by another application that I wrote, and I know EXACTLY how the XML will look. I don't need to worry about element ordering, or whether an element exists.
-The XML file is used by major applications where I work, and is not going to change for my new app.
-I tried using the bare-bones code in the XML::Parser tutorial to parse through the file. All it does is look at the element type, discover that the element is not <message>, and return. The parser has been running for more than 10 minutes and shows no signs of stopping.
-When I parsed this with regexes, it took a minute or two. I didn't parse all the element types, but I did go through the whole file.
-The application needs to suck up all the XML, process it, and spit out a huge HTML page. So far, due to speed issues, the regex approach is winning hands down.

I'm not including any code or data, because the solution to this problem probably does not rest solely on code issues, but rather on the tradeoffs between readable code (XML parser) and speed (regexes).
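For reference, the regex approach on a file with a known, fixed format can be as small as the sketch below. The element name <message> comes from the post above; the file name and the assumption that messages never nest or carry attributes are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole file, then pull out every <message> body.
# This only works because the generating application is known
# never to nest <message>, add attributes, or use CDATA.
open my $fh, '<', 'test.xml' or die "Can't open test.xml: $!";
my $xml = do { local $/; <$fh> };

while ( $xml =~ m{<message>(.*?)</message>}sg ) {
    print "$1\n";
}
```

This is exactly the kind of code the replies below are arguing about: fast, simple, and correct only for as long as the input format never changes.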

Re: xml parsers: do I need one?
by Abigail-II (Bishop) on Aug 28, 2003 at 14:12 UTC
    If this solves the problem for you, all the power to you. There is nothing wrong with parsing XML (or HTML) with regexes. However, if you run into problems and come here (or elsewhere) for help, and explain that you are parsing XML with regexes, expect to get flak. First, we can't sniff that your files are special and that you know exactly how the XML looks. Furthermore, if you choose regexes instead of a parser, and are hence reinventing, possibly badly, the wheel, you shouldn't burden forums like this with your problems.

    Abigail

      you should burden forums like this with your problems.

      Is that really what you meant to say? ;)

      --
      3dan
Re: xml parsers: do I need one?
by dragonchild (Archbishop) on Aug 28, 2003 at 14:13 UTC
    If you can mathematically guarantee that a regex will suffice, then use a regex! No one is pointing a gun at your head and forcing you to use a parser. Parsers guarantee that they can read any file in XYZ format. If you have a subset of XYZ, then go ahead and use a regex. TMTOWTDI!

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: xml parsers: do I need one?
by Elian (Parson) on Aug 28, 2003 at 14:18 UTC
    If you've full control of both ends, then no, you don't need a real XML parser.

    But, then again, if you have full control of both ends, do you even need XML in the first place?

      We have full control of both ends. Changing from XML to something different isn't going to happen. We use Java's DOM parser to read the XML file in our main application. It takes about 2-3 minutes to parse everything using Java. It took 35 minutes to parse the XML file with XML::Parser. Is it that much slower, or was I just doing something really wrong?
        It takes about 2-3 minutes to parse everything using java. It took 35 minutes to parse the xml file with the Perl::XML parser

        35 minutes for a 3.3 MB file?? Yes, it sounds as if something is not right there. Or your computer has 24k of memory :)
        But that is difficult to say without the code or the DTD.
        --
        bm

        Sounds fishy. Last year I wrote some code that used XML::Parser to parse and munge XML files of about 20 MB apiece, and on my not-particularly-new notebook it didn't take anywhere near that long. I can't remember if runtime was closer to 30 seconds or 3 minutes, but it certainly wasn't long enough that you'd walk away from your desk and not be finished when you got back.

                $perlmonks{seattlejohn} = 'John Clyman';

Re: xml parsers: do I need one?
by tcf22 (Priest) on Aug 28, 2003 at 14:28 UTC
    I personally prefer XML::Simple for most stuff. However, if the time difference is that great, and you know the exact format of the XML, then you could use a regex; I won't tell the Perl Police. The only real advantages I see for using a parser in your situation are that the parser is already written (since you've already done the regex, this doesn't matter), and that if the format changes in the future, you won't have to re-engineer the regex.
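For anyone who hasn't tried it, XML::Simple reads a whole document into nested hashes and arrays in one call. A minimal sketch, assuming the <message> elements from the original question and a hypothetical file name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Slurp the whole document into a nested Perl data structure.
# ForceArray guarantees <message> always comes back as an array ref,
# even when an element contains only one of them.
my $data = XMLin( 'test.xml', ForceArray => ['message'], KeyAttr => [] );

# Walk whatever structure XMLin produced and print each message.
# (The exact shape of $data depends on the real document layout.)
for my $msg ( @{ $data->{message} || [] } ) {
    print "$msg\n";
}
```

The tradeoff is memory: XML::Simple builds the entire tree at once, so on a 3.3MB file you pay for convenience with RAM and time.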
Re: xml parsers: do I need one?
by zby (Vicar) on Aug 28, 2003 at 14:29 UTC
    I would say that what you are doing is more extracting data from the file than parsing the structure of the XML in it. Parsing would be when the parse tree could be built in many different shapes and you were interested in how it is built.
Re: xml parsers: do I need one?
by mirod (Canon) on Aug 28, 2003 at 14:36 UTC

    Well, if what you want is the collective blessing of the Monastery inhabitants on your devious practices, then I am afraid you can't have that without a proper offering. ;--)

    Seriously, first I am a bit surprised by the time difference. The only benchmark I have seen shows XML::Parser being only 4 times slower than regexps. Maybe XML::LibXML would be faster.

    Then of course, you can always use regexp to parse data. Just do not call it XML. It might indeed be well-formed XML (although as long as you haven't parsed it there is really no telling, the encoding might be all wrong for example), but the problem is that your system does not process XML. It processes a limited subset of it, which follows a format that should be described formally somewhere (even if it is just a list of XML features that are not used). It might actually be a good idea to call that format something like R-XML (Regan's XML) and to write everywhere that that's what your code processes. This way you or someone else who will need to maintain the system won't forget the limitations of the system. You can have a look at On XML parsing BTW to see examples of XML features that you probably don't support.

    That said, to finish on a note that will make you feel good, here is what Tim Bray, one of the creators of XML, has to say:

    That leaves input data munging, which I do a lot of, and a lot of input data these days is XML. Now here's the dirty secret; most of it is machine-generated XML, and in most cases, I use the perl regexp engine to read and process it. I've even gone to the length of writing a prefilter to glue together tags that got split across multiple lines, just so I could do the regexp trick.

    The rest of the rant gives a little context and interesting comments.

    Oh yeah, and I admit to having used regexps too sometimes, oddly enough not for speed purposes, but to use the power of the Perl regexp engine to wrap elements (now you can do this properly in XML::Twig of course ;--).
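For completeness, the streaming style that mirod's XML::Twig offers sits between the regex and full-tree approaches. A hedged sketch (element name from the thread, file name hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# A handler fires for each <message> as soon as it has been fully
# parsed; purge() then frees everything parsed so far, so memory
# stays flat even on large documents.
my $twig = XML::Twig->new(
    twig_handlers => {
        message => sub {
            my ( $t, $elt ) = @_;
            print $elt->text, "\n";
            $t->purge;
        },
    },
);
$twig->parsefile('test.xml');
```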

Re: xml parsers: do I need one?
by bunnyman (Hermit) on Aug 28, 2003 at 14:50 UTC

    The XML file is used by major applications where I work, and is not going to change for my new app.

    Does that mean that they are not using XML parsers either? This is a Bad Thing:

    • These apps may not be following the XML standard exactly right. This would mean that you have files which look like XML but are actually not, so they cannot be parsed by a real XML parser.
    • Or the XML files may follow the standard exactly right, but the input code makes invalid assumptions about the form of the XML. At some point in the future, a programmer who believes she is dealing with an XML file may make a change which produces valid XML data, but it cannot be read by your parsing code.
    • Because of both of the above items, you are not reaping the full benefits of using XML - a standard format for data interchange. Even if all your apps work together to support your own brand of not-exactly-XML-but-close, other apps should be able to create XML that you can read, or read XML that you created.

    If your new app does not use an XML parser either, that only contributes to the existing problem. It will need to be updated if the input file changes in format.

      Sorry, the main app is written in Java, uses the DOM parser, and parses and processes what it needs in a couple of minutes. When I tried what I thought would be a really quick and dirty first pass using XML::Parser, it took 35 minutes!
      I want to do the right thing and use a parser, but yowza! Maybe I should go back and look at my code, and see what I'm doing wrong.
Re: xml parsers: do I need one?
by derby (Abbot) on Aug 28, 2003 at 14:51 UTC
    I really cannot comment on what you need (and neither can any other monk) but I too was an XML naysayer for quite some time. A coworker turned me on to XML::LibXML and then to the power of XPath. I cannot even begin to tell you the power of XPath. This entry from the perl advent calendar gives a real nice intro with links to more on XPath.

    -derby
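As a sketch of the XPath approach derby describes, assuming the <message> elements from the original question and a hypothetical file name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Parse once (in C, via libxml2), then let XPath do the searching.
my $doc = XML::LibXML->new->parse_file('test.xml');

# '//message' finds every <message> element anywhere in the tree.
for my $node ( $doc->findnodes('//message') ) {
    print $node->textContent, "\n";
}
```

The one-line XPath query replaces the hand-rolled element-tracking code that a SAX or Stream-style parser requires.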

      And since it's based on a library written in C, I wouldn't be at all surprised if it beats the Java apps too.

      Makeshifts last the longest.

        Java is only slow at startup. For execution there's not THAT much difference, since HotSpot (optimizing) or even the regular JVM turns the byte code into compiled code.

        ---
        Play that funky music white boy..

Re: xml parsers: do I need one?
by samtregar (Abbot) on Aug 29, 2003 at 02:29 UTC
    My first reaction was that you must be using XML::Parser incorrectly. To test my assumption I created a 3.3MB XML file with 9000 elements containing random <message> elements. Then I created a regex script (regex.pl) and an XML::Parser script (parser.pl) which both pull out all the <message> contents. I ran them head to head and made sure they both produced the same output:

    [sam@localhost xml_test]$ time ./parser.pl > parser.out
    real    0m1.827s
    user    0m1.730s
    sys     0m0.040s
    [sam@localhost xml_test]$ time ./regex.pl > regex.out
    real    0m0.200s
    user    0m0.160s
    sys     0m0.040s
    [sam@localhost xml_test]$ diff parser.out regex.out

    So I'm seeing a simple regex beating a simple XML::Parser implementation by around 9x. Given that your regex takes three minutes, an XML::Parser script taking 35 minutes is within a similar multiple. When you consider that XML::Parser is making multiple Perl sub calls on each element it encounters, I guess it makes sense. But, still, ouch!

    Of course, if it were my job on the line I'd still use an XML parser. I've been bitten by changing specifications and funky data too many times to take the easy way out in the parser. In fact, these days I parse my XML twice - first with Xerces/C++ for schema validation and second with XML::Simple for actual usage. Better safe than sorry!

    -sam


    For the record, here's my test setup. First, the data generator:

    #!/usr/bin/perl -w
    print '<?xml version="1.0" encoding="UTF-8" ?>', "\n";
    print "<test>\n";
    for (0 .. 9000) {
        my $word = get_word();
        print "<$word>\n";
        if (rand(10) > 3) {
            for (0 .. rand(5)) {
                my $msg = get_words(30);
                print "\t<message>$msg</message>\n";
            }
        }
        print "</$word>\n";
    }
    print "</test>";

    BEGIN {
        my @words;
        open(WORDS, "/usr/dict/words")
          or open(WORDS, "/usr/share/dict/words")
          or die "Can't open /usr/dict/words or /usr/share/dict/words: $!";
        while (<WORDS>) {
            chomp;
            push @words, $_ if /^\w+$/;
        }
        srand(time ^ $$);

        # get a random word
        sub get_word {
            return lc $words[int(rand(scalar(@words)))];
        }

        # get $num random words, joined by $sep, defaulting to " "
        sub get_words {
            my ($num, $sep) = @_;
            $sep = " " unless defined $sep and length $sep;
            return join($sep, map { get_word() } (0 .. (int(rand($num)) + 1)));
        }
    }

    The regex parser:

    #!/usr/bin/perl -w
    open(FILE, 'test.xml') or die $!;
    my $xml = join('', <FILE>);
    while ($xml =~ m!<message>([\w\s]+)</message>!g) {
        print $1, "\n";
    }

    And the XML::Parser script:

    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;

    my $in_msg = 0;
    my $p = new XML::Parser(Style => 'Stream', Pkg => 'main');
    $p->parsefile('test.xml');

    sub StartTag { $in_msg++ if $_ eq '<message>'; }
    sub Text     { print $_ if $in_msg; }
    sub EndTag   {
        if ($_ eq '</message>') {
            $in_msg--;
            print "\n";
        }
    }
      I did some thinking, and decided:
      -speed was NOT essential, as this could really be done via a cron job that runs once in a while
      -The xml should not change (we've all heard that one!), but we both know that it will
      -xml parsing is far more flexible than regex
      -regex is far faster in this case
      -xml parsing is both cooler, and the Right Thing to do

      I will use a parser. I don't know which one, or how to pick it, but since I have experience using Java's DOM parser, I'll probably stick with XML::DOM.
        since I have experience using java's DOM parser, I'll probably stick with xml::dom

        Please DON'T!

        XML::DOM is an old and barely maintained module. XML::LibXML implements the DOM, plus XPath, which makes it a LOT easier (and safer) to use. It drives me crazy to see the number of people still choosing XML::DOM now, when better alternatives are clearly available.

        <rant mode="on">The fact that a module has the name of a standard does not mean a thing. More exactly, it just means that the author was the first one to write a module implementing some of the standard. It does not mean that it is a good module, that you should use it, or that there isn't a better module available. Use your brain! </rant>

      Nice benchmark. I wouldn't use XML::Parser's Stream style, but it's probably because I am not very familiar with it.

      I expanded this benchmark slightly, creating a somewhat more complicated document, still around 3MB and 10K elements, and ran a bunch of modules on it. The results are quite surprising, actually:

      10160 elements generated - (63 top level - 1721 to extract)
      bench_regexp             : 0:00.16 real 0.14  0.03 s
      bench_libxml             : 0:00.44 real 0.39  0.05 s
      bench_parser             : 0:00.88 real 0.83  0.01 s
      bench_parser_stream      : 0:01.15 real 1.10  0.06 s
      bench_twig               : 0:01.84 real 1.81  0.03 s
      bench_sax_base_libxml    : 0:03.29 real 3.25  0.05 s
      bench_sax_libxml         : 0:03.32 real 3.31  0.03 s
      bench_sax_expat          : 0:03.21 real 3.11  0.03 s
      bench_dom                : 0:04.51 real 4.41  0.03 s
      libxslt                  : 0:01.48 real 1.46  0.02 s
      xml_grep                 : 0:02.07 real 2.02  0.03 s
      

      I am very surprised by how slow the XML::SAX examples are (hence I wrote one using SAX::Base and one not using it). I did not expect this, and I will try to figure out what the problem is. If you look at the code, I really don't think I am using the PurePerl parser; I took great care to create the parser myself. That's odd.

      Code and everything to run it is at http://xmltwig.com/article/simple_benchmark/.

      Get simple_benchmark.tar.gz

      tar zxvf simple_benchmark.tar.gz
      cd simple_benchmark
      perl run_all

      Note that the xml_grep version only works with the latest, greatest release of the tool, available somewhere else on the same site (with the development version of XML::Twig).

        Wow, and I thought I was going overboard! Very interesting. I'd like to see how Xerces/C++ fares too, but I still can't build XML::Xerces. I've been using the DOMCount example program to do XML Schema validation...

        -sam

XML::LibXML vs XML::Parser and friends
by merlyn (Sage) on Aug 29, 2003 at 13:57 UTC
    Adding my own experience to what's already been said here...

    XML::Parser-based solutions (including XML::Simple) basically use C code to recognize tokens, but then have to build Perl data structures for the entire data tree, even for the parts that aren't being used.

    XML::LibXML (if you can get it built and working, because it can be a bit finicky) builds the DOM as a C-side data structure. It's fast. Very fast. From the Perl side, you then ask for precisely the parts of the tree you want (with either DOM or XPath syntax), and only then are the heavyweight Perl objects created for those particular elements.

    I've also seen benchmarks showing that libxml2 (on which XML::LibXML is based) is far faster at just recognizing tokens than expat (on which XML::Parser is based). This also helps with speed.

    I've pretty much abandoned any use of expat-based solutions now. XML::LibXML is it.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: xml parsers: do I need one?
by Anonymous Monk on Aug 29, 2003 at 15:27 UTC
    Heretically, you may wish to use XSLT for this task. That's what it was written for.

      XSLT is an ideal tool for pre-digesting large pieces of XML in order to work on them further using any of the perl mods mentioned above. This technique is useful if the original XML contains a large amount of information that is surplus to requirements.

      In the original example, the 9000 tags may contain a large number of sub tags, attributes, text etc. which are not required. The XML can be processed into either a simpler form of XML that contains only the required data or even a flat file format that can be parsed line by line using standard techniques.

      Pre-processing the original XML using an XSLT engine such as Xalan (either directly or via XML::Xalan) is only going to be worthwhile if the source XML is large and contains a high proportion of non-essential information.

      Inman

        {My apologies for not having been logged in earlier when I suggested XSLT}

        As you said: "source XML is large" and contains "non-essential information". His example case was "large" (3.3MB) and searching solely for tags of type <message>. That is a very simple XSLT to output as HTML (fragmentary example, please don't carp about the syntax):

        <ul>
          <xsl:for-each select="message">
            <li><xsl:value-of select="current()" /></li>
          </xsl:for-each>
        </ul>
        XSLT can convert XML directly into XML, HTML, or even perl:
        @messages = (
          <xsl:for-each select="message">
            "<xsl:value-of select="current()" />",
          </xsl:for-each>
        );

        I wouldn't want to comment further without a better understanding of the actual "processing" to be done, but the simple example presented in the question is practically a textbook case for XSLT.

        Whatever. Use the tools you like.
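Driving an XSLT transform like the fragments above from Perl can be sketched with XML::LibXSLT. The file names here are hypothetical, and the stylesheet is assumed to be a complete, well-formed XSLT document:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Compile the stylesheet once; the transform itself runs in C.
my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet(
    XML::LibXML->new->parse_file('messages.xsl')
);

# Apply it to the source document and print the result.
my $result = $stylesheet->transform(
    XML::LibXML->new->parse_file('test.xml')
);
print $stylesheet->output_string($result);
```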

Front-paged by broquaint