http://www.perlmonks.org?node_id=893895


in reply to parsing XML fragments (xml log files) with XML::Parser

My first reaction would be to try parse_balanced_chunk from XML::LibXML::Parser to create XML::LibXML::DocumentFragments.
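For reference, a minimal sketch of that approach, assuming XML::LibXML is installed. The sample log content and the `entry` element name are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical log content: several top-level elements and no single root.
my $chunk = '<entry id="1">started</entry><entry id="2">stopped</entry>';

my $parser = XML::LibXML->new;

# parse_balanced_chunk returns an XML::LibXML::DocumentFragment.
my $fragment = $parser->parse_balanced_chunk($chunk);

# Keep only element children (this skips any whitespace text nodes).
my @entries = grep { $_->isa('XML::LibXML::Element') } $fragment->childNodes;

for my $entry (@entries) {
    printf "%s: %s\n", $entry->getAttribute('id'), $entry->textContent;
}
```

The fragment behaves like any other node, so its children can be walked or appended to a document later.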

Re^2: parsing XML fragments (xml log files) with... a regex
by tye (Sage) on Mar 18, 2011 at 05:05 UTC

    Well, since you bring up your "first reaction"... My first reaction would be to roll my own XML parser in about 30 minutes using some simple regexes (combined into a single, easily understood regex the last time I did this). That takes less time than finding a decent XML module that can parse partial XML, much less also getting it installed, much much less figuring out how to use it.

    Adjusting the small block of code to suit your needs and situation becomes trivial compared to getting something as complex (and rigid) as an XML parsing module to bend so. For this occasion I had no use for empty tags so the code ignores them. Fill in what you want to do with them if anything.

    Naturally, I had no use for the nearly completely useless feature of CDATA so I didn't even worry about the regex to parse that junk. If you need it (the OP doesn't appear to), adding that feature is 5, maybe 10 minutes' work.
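As a rough idea of what that addition could look like, here is a hedged sketch of a CDATA pattern written in the same `/x` style as the parser's other pattern strings. The `$cdata` name and the sample string are assumptions, and the pattern is shown standalone rather than wired into the main loop.

```perl
use strict;
use warnings;

# Hypothetical extra pattern string; $1 captures the raw section content.
my $cdata = '<!\[CDATA\[ ( .*? ) \]\]>';

my $str = 'before<![CDATA[ 1 < 2 && "raw" ]]>after';

my @sections;
# /s lets '.' cross newlines; /x ignores the layout spaces in $cdata.
while ( $str =~ m{ $cdata }xsg ) {
    push @sections, $1;
}

print "CDATA content: '$_'\n" for @sections;
```

To use it in the parser it would become one more alternative ahead of the tag pattern, with the capture numbers that follow it shifted accordingly.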

    Actually, I started out trying to use some XML module that had gotten decent reviews somewhere. I had it all working on the sample data and then when I finished the "download the data" part, the XML part suddenly just stopped working. It told me that there was no 'foo' tag despite '<foo ...' being clearly there and that being recognized as a 'foo' tag previously. Eventually I figured out that XML namespaces were to blame and after too much time trying to even find any documentation on such things in relation to the module, I decided to write a regex so I could have something working that day.

    Took less time to write the regex and get it working than it had taken me to get the module working on the test data. And the resulting code is just tons easier to make adjustments to.

    sub ParseXmlString
    {
        my( $str )= @_;
        my $name= '(?:\w+:)?\w+';
        my $value= q< (?: '[^']+' | "[^"]+" ) >;
        my $s= '\s';
        my $attrib= "$name $s* = $s* $value";
        my $decl= "< $s* [?] $s* $name (?: $s+ $attrib )* $s* [?] $s* >";
        my $tag= "< $s* (/?) $s* ($name) (?: $s+ $attrib )* $s* (/?) $s* >";
        my $data= '(?: [^<>&]+ | &\#?\w+; )+';
        my $hv= {};
        my @stack;
        while(  $str =~ m{ \G(?:
            ( $decl )   # $1 <?xml ...?>
        |   ( $data )   # $2 encoded text
        |   ( $tag )    # $3 <...>, $4 '/' or '', $5 tag name, $6 '/' or ''
        |   ( . )       # $7 we failed
        ) }xgc  ) {
            if(  $1  ) {
                $hv->{'.header'}= $1;
            } elsif(  defined $2  ) {
                my $text= $2;
                if(  $text =~ /\S/  ) {
                    s-&lt;-<-g, s-&quot;-"-g, s-&gt;->-g, s-&apos;-'-g, s-&amp;-&-g
                        for $text;
                    push @{ $hv->{'.data'} }, $text;
                }
            } elsif(  $4  ) {
                $hv= pop @stack;
            } elsif(  $6  ) {
                # We currently just ignore empty tags
            } elsif(  $3  ) {
                my $new= {};
                push @{ $hv->{$5} }, $new;
                push @stack, $hv;
                $hv= $new;
            } elsif(  defined $7  ) {
                my $beg= pos($str);
                my $len= 20;
                $beg -= $len/2;
                if(  $beg < 0  ) {
                    $len += $beg;
                    $beg= 0;
                }
                die "XML failed to parse byte ", pos($str), " ($7), near '",
                    substr( $str, $beg, $len ), "'.\n";
            } else {
                die "Impossible!";
            }
        }
        if(  @stack  ) {
            die "Unclosed XML tags";
        }
        return $hv;
    }

    - tye        

      It told me that there was no 'foo' tag despite '<foo ...' being clearly there and that being recognized as a 'foo' tag previously. Eventually I figured out that XML namespaces were to blame

      I find it very unfortunate that XPath requires you to specify the namespace. I wish libxml had an option to configure what it meant to have no prefix in an XPath node test:

      • Missing prefix = Match the null namespace. (Standard)
      • Missing prefix = Match some previously defined default namespace.
      • Missing prefix = Match any namespace.
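The standard behaviour in that list, and two common workarounds, can be sketched with XML::LibXML::XPathContext. This assumes XML::LibXML is installed; the document, the namespace URI and the `x` prefix are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical document whose elements live in a default namespace.
my $doc = XML::LibXML->load_xml(
    string => '<foo xmlns="http://example.com/ns"><bar>1</bar></foo>',
);

my $xpc = XML::LibXML::XPathContext->new($doc);

# No prefix means the null namespace, so this finds nothing.
my @unprefixed = $xpc->findnodes('//bar');

# Workaround 1: register a prefix of your own for the namespace URI.
$xpc->registerNs( x => 'http://example.com/ns' );
my @prefixed = $xpc->findnodes('//x:bar');

# Workaround 2: ignore namespaces by testing only the local name.
my @any_ns = $xpc->findnodes('//*[local-name() = "bar"]');

printf "unprefixed=%d prefixed=%d any_ns=%d\n",
    scalar @unprefixed, scalar @prefixed, scalar @any_ns;
```

Neither workaround changes what a missing prefix means. They only make the other two behaviours reachable with more typing.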

      For those interested, it can't handle

      • Numerical entities (decimal and hex).*
      • External entities (e.g. HTML's &eacute;).*
      • Character decoding.**
      • UTF-16, UTF-32, UCS-2, UCS-4.**
      • CDATA.
      • Namespace prefixes. (They're included as part of the name.)***
      • Comments.
      • Identification of an element's namespace.***
      • XML validation (i.e. it allows some malformed XML).
      • (more? this wasn't a thorough analysis)

      Up to you to decide if it fits your needs or not.

      * — A post-processor could fix this if no entities were processed at all.

      ** — A pre-processor such as the following would fix this:

      use Encode qw( decode );   # needed for decode() and the Encode::* constants

      sub _predecode {
          my $enc;
          if    ( $_[0] =~ /^\xEF\xBB\xBF/ ) { $enc = 'UTF-8';    }
          elsif ( $_[0] =~ /^\xFF\xFE/     ) { $enc = 'UTF-16le'; }
          elsif ( $_[0] =~ /^\xFE\xFF/     ) { $enc = 'UTF-16be'; }
          elsif ( substr($_[0], 0, 100) =~ /^[^>]* encoding="([^"]+)"/ ) { $enc = $1; }
          else                               { $enc = 'UTF-8';    }
          return decode($enc, $_[0], Encode::FB_CROAK | Encode::LEAVE_SRC);
      }

      *** — A post-processor could fix this, but one wasn't supplied.

      Update: Added pre-processor I had previously coded.

        It parses numerical entities. Decoding them wasn't required and would only require one regex be added. It handles namespace prefixes exactly as I wanted it to (they're included in the tag name). It is trivial to make it handle them differently (which is the point). I won't go into "validation" here, it being a subject worthy of a lengthy write-up.
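A hedged sketch of what that one added regex might look like. The `decode_numeric_entities` helper is a hypothetical name, not part of the posted parser. Decimal and hex forms are handled in a single pass so that decoded output is never scanned a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: decode &#NNN; and &#xHHH; references in one pass.
sub decode_numeric_entities {
    my( $text )= @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr( defined $1 ? hex $1 : $2 )/ge;
    return $text;
}

# Named entities such as &amp; are deliberately left alone here.
my $decoded = decode_numeric_entities('&#65;&#x42;&amp;');
print "$decoded\n";
```

In the parser this substitution would sit alongside the named-entity substitutions applied to each text chunk.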

        A pre/post processor could fix this...

        Wow. You are really stuck in thinking in terms of an XML-parsing module. There is no need to do anything in a pre-/post-processor -- which is part of the whole point of the exercise.

        For example, supporting comments is 2 minutes' work and easily fits into the existing structure. The few items that rise to the level of being interesting to implement are the things that I've never actually seen used in any XML. So it shouldn't be surprising that I didn't bother to implement them in the code that implemented just what I needed for one project.
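One way the comment support could look, as a sketch only. The `$comment` pattern string is an assumption written in the same `/x` style as the parser's other pattern strings, and it is demonstrated standalone. It refuses `--` inside a comment, as XML requires.

```perl
use strict;
use warnings;

# Hypothetical pattern string: '<!--', then text with no '--' in it, then '-->'.
my $comment = '<!-- ( (?: [^-]+ | -(?!-) )* ) -->';

my $str = '<a><!-- first --><b>text</b><!-- second - one --></a>';

# Collect the comment bodies.
my @comments;
while ( $str =~ m{ $comment }xg ) {
    push @comments, $1;
}

# Or simply drop the comments before further processing.
( my $stripped = $str ) =~ s{ $comment }{}xg;

print "comment: '$_'\n" for @comments;
print "$stripped\n";
```

Inside the parser it would be one more alternative in the main regex, with a branch that either stores or ignores the captured text.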

        - tye        

      my $data= '(?: [^<>&]+ | &\#?\w+; )+';
      should be
      my $data= '(?: [^<&]+ | &\#?\w+; )+';

      XML allows for unescaped ">"
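A small check of the difference between the two patterns, using a made-up text chunk that contains a bare `>`. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The pattern as posted, and the variant proposed in this reply.
my $strict  = '(?: [^<>&]+ | &\#?\w+; )+';
my $lenient = '(?: [^<&]+ | &\#?\w+; )+';

# Hypothetical character data with an unescaped '>'.
my $text = 'if a > b then &amp; so on';

my $strict_ok  = $text =~ m{ \A $strict  \z }x ? 1 : 0;
my $lenient_ok = $text =~ m{ \A $lenient \z }x ? 1 : 0;

print "strict accepts: $strict_ok, lenient accepts: $lenient_ok\n";
```

With the posted pattern, the parser's catch-all branch would report a parse failure at the `>`. The variant accepts the whole chunk as text.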

        Yeah, XML got that wrong. I've never seen real XML that takes advantage of that and I always write my parsers to reject it, so I'll know if it ever happens. So far, my tiny universe of implementers of XML generators are smarter than the standard's authors on this point. :)

        - tye        

      Tye, thanks for sharing, but that approach is wrong on so many levels all I can say is: Good luck with that!
Re^2: parsing XML fragments (xml log files) with XML::Parser
by kgoess (Beadle) on Mar 18, 2011 at 16:32 UTC
    Thanks, parse_balanced_chunk was what I was looking for. But since I already wrote the stream parser handlers, I'm going to go with the suggestion to use a top-level parser that wraps it in a root element.
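A minimal sketch of the wrap-in-a-root-element idea, assuming XML::Parser is installed, the fragments fit in one string, and they carry no XML declaration of their own. The `log` wrapper name, the sample entries and the handler are hypothetical, not the code discussed in this thread.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical log content: well-formed fragments with no root element.
my $fragments = qq{<entry level="info">started</entry>\n}
              . qq{<entry level="warn">disk low</entry>\n};

my @levels;
my $parser = XML::Parser->new(
    Handlers => {
        # Existing stream handlers can be reused unchanged; they just
        # see one extra enclosing element (the synthetic root).
        Start => sub {
            my( $expat, $element, %attr ) = @_;
            push @levels, $attr{level} if $element eq 'entry';
        },
    },
);

# Wrapping the fragments makes the input a single well-formed document.
$parser->parse("<log>$fragments</log>");

print "@levels\n";
```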