PerlMonks  

Re^2: parsing XML fragments (xml log files) with... a regex

by tye (Cardinal)
on Mar 18, 2011 at 05:05 UTC (#893908)


in reply to Re: parsing XML fragments (xml log files) with XML::Parser
in thread parsing XML fragments (xml log files) with XML::Parser

Well, since you bring up your "first reaction"... My first reaction would be to roll my own XML parser in about 30 minutes using some simple regexes (combined into a single, easily understood regex the last time I did this). That takes less time than finding a decent XML module that can parse partial XML, much less also getting it installed, much much less figuring out how to use it.

Adjusting the small block of code to suit your needs and situation becomes trivial compared to getting something as complex (and rigid) as an XML parsing module to bend so. For this occasion I had no use for empty tags so the code ignores them. Fill in what you want to do with them if anything.

Naturally, I had no use for the nearly completely useless feature of CDATA so I didn't even worry about the regex to parse that junk. If you need it (the OP doesn't appear to), adding that feature is 5, maybe 10 minutes' work.
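As a rough illustration of the CDATA addition mentioned above (this is not tye's code; the `$cdata` name and the sample string are made up for the sketch), one extra pattern in the same /x style might look like this:

```perl
use strict;
use warnings;

# Hypothetical pattern for a CDATA section, written for use under /x.
# The inner (?s: ... ) lets "." match newlines without needing /s outside.
my $cdata = '<!\[CDATA\[ ( (?s: .*? ) ) \]\]>';

my $str = 'before<![CDATA[ 1 < 2 && "raw" ]]>after';

my @sections;
while ( $str =~ m{ $cdata }xg ) {
    push @sections, $1;    # raw text; no entity decoding applies inside CDATA
}

print "CDATA: [$_]\n" for @sections;
```

In ParseXmlString this would presumably become one more alternative in the big m{...}; note that any capture group added ahead of existing ones renumbers the later `$N` variables.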

Actually, I started out trying to use some XML module that had gotten decent reviews somewhere. I had it all working on the sample data and then when I finished the "download the data" part, the XML part suddenly just stopped working. It told me that there was no 'foo' tag despite '<foo ...' being clearly there and that being recognized as a 'foo' tag previously. Eventually I figured out that XML namespaces were to blame and after too much time trying to even find any documentation on such things in relation to the module, I decided to write a regex so I could have something working that day.

Took less time to write the regex and get it working than it had taken me to get the module working on the test data. And the resulting code is just tons easier to make adjustments to.

    sub ParseXmlString {
        my( $str )= @_;
        my $name= '(?:\w+:)?\w+';
        my $value= q< (?: '[^']+' | "[^"]+" ) >;
        my $s= '\s';
        my $attrib= "$name $s* = $s* $value";
        my $decl= "< $s* [?] $s* $name (?: $s+ $attrib )* $s* [?] $s* >";
        my $tag= "< $s* (/?) $s* ($name) (?: $s+ $attrib )* $s* (/?) $s* >";
        my $data= '(?: [^<>&]+ | &\#?\w+; )+';
        my $hv= {};
        my @stack;
        while( $str =~ m{
            \G(?:
                ( $decl )   # $1 <?xml ...?>
            |   ( $data )   # $2 encoded text
            |   ( $tag )    # $3 <...>, $4 '/' or '', $5 tag name, $6 '/' or ''
            |   ( . )       # $7 we failed
            )
        }xgc ) {
            if( $1 ) {
                $hv->{'.header'}= $1;
            } elsif( defined $2 ) {
                my $text= $2;
                if( $text =~ /\S/ ) {
                    s-&lt;-<-g, s-&quot;-"-g, s-&gt;->-g,
                    s-&apos;-'-g, s-&amp;-&-g
                        for $text;
                    push @{ $hv->{'.data'} }, $text;
                }
            } elsif( $4 ) {
                $hv= pop @stack;
            } elsif( $6 ) {
                # We currently just ignore empty tags
            } elsif( $3 ) {
                my $new= {};
                push @{ $hv->{$5} }, $new;
                push @stack, $hv;
                $hv= $new;
            } elsif( defined $7 ) {
                my $beg= pos($str);
                my $len= 20;
                $beg -= $len/2;
                if( $beg < 0 ) {
                    $len += $beg;
                    $beg= 0;
                }
                die "XML failed to parse byte ", pos($str), " ($7), near '",
                    substr( $str, $beg, $len ), "'.\n";
            } else {
                die "Impossible!";
            }
        }
        if( @stack ) {
            die "Unclosed XML tags";
        }
        return $hv;
    }

- tye        


Re^3: parsing XML fragments (xml log files) with... a regex
by kgoess (Beadle) on Mar 18, 2011 at 16:34 UTC
    Tye, thanks for sharing, but that approach is wrong on so many levels all I can say is: Good luck with that!
Re^3: parsing XML fragments (xml log files) with... a regex
by ikegami (Pope) on Mar 18, 2011 at 17:28 UTC

    It told me that there was no 'foo' tag despite '<foo ...' being clearly there and that being recognized as a 'foo' tag previously. Eventually I figured out that XML namespaces were to blame

    I find it very unfortunate that XPath requires you to specify the namespace. I wish libxml had an option to configure what it meant to have no prefix in an XPath node test:

    • Missing prefix = Match the null namespace. (Standard)
    • Missing prefix = Match some previously defined default namespace.
    • Missing prefix = Match any namespace.
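For readers who have not hit this yet, here is a small sketch of the standard behaviour and the two usual workarounds with XML::LibXML. The document, the namespace URI and the `n` prefix are invented for the example; it assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up document: <foo> lives in a default namespace.
my $xml = '<root xmlns="http://example.com/ns"><foo>bar</foo></root>';
my $doc = XML::LibXML->load_xml( string => $xml );

# Standard behaviour: an unprefixed name test matches the null namespace only.
my @plain = $doc->findnodes('//foo');

# Workaround 1: bind a prefix of our own choosing to the namespace URI.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( n => 'http://example.com/ns' );
my @prefixed = $xpc->findnodes('//n:foo');

# Workaround 2: ignore namespaces entirely by testing the local name.
my @any_ns = $doc->findnodes('//*[local-name() = "foo"]');

printf "plain: %d, prefixed: %d, any namespace: %d\n",
    scalar @plain, scalar @prefixed, scalar @any_ns;
```

The prefix used in the XPath does not have to match the one (if any) used in the document; only the URI matters.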
Re^3: parsing XML fragments (xml log files) with... a regex
by ikegami (Pope) on Mar 18, 2011 at 17:43 UTC

    For those interested, it can't handle

    • Numerical entities (decimal and hex).*
    • External entities (e.g. HTML's &eacute;).*
    • Character decoding.**
    • UTF-16, UTF-32, UCS-2, UCS-4.**
    • CDATA.
    • Namespace prefixes. (They're included as part of the name.)***
    • Comments.
    • Identification of an element's namespace.***
    • XML validation (i.e. it allows some malformed XML).
    • (more? this wasn't a thorough analysis)

    Up to you to decide if it fits your needs or not.

    * — A post-processor could fix this if no entities were processed at all.

    ** — A pre-processor such as the following would fix this:

        use Encode qw( decode );

        sub _predecode {
            my $enc;
            if    ( $_[0] =~ /^\xEF\xBB\xBF/ ) { $enc = 'UTF-8'; }
            elsif ( $_[0] =~ /^\xFF\xFE/ )     { $enc = 'UTF-16le'; }
            elsif ( $_[0] =~ /^\xFE\xFF/ )     { $enc = 'UTF-16be'; }
            elsif ( substr($_[0], 0, 100) =~ /^[^>]* encoding="([^"]+)"/ ) { $enc = $1; }
            else                               { $enc = 'UTF-8'; }
            return decode($enc, $_[0], Encode::FB_CROAK | Encode::LEAVE_SRC);
        }

    *** — A post-processor could fix this, but one wasn't supplied.

    Update: Added pre-processor I had previously coded.

      It parses numerical entities. Decoding them wasn't required and would only require one regex be added. It handles namespace prefixes exactly as I wanted it to (they're included in the tag name). It is trivial to make it handle them differently (which is the point). I won't go into "validation" here, it being a subject worthy of a lengthy write-up.
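A minimal sketch of what decoding numeric entities could look like (two substitutions here, one per notation, for readability). The helper name `decode_numeric_entities` is hypothetical and not part of the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn numeric character references into characters.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;    # hexadecimal, e.g. &#x41;
    $text =~ s/&#([0-9]+);/chr($1)/ge;                # decimal, e.g. &#65;
    return $text;
}

print decode_numeric_entities('&#65;&#x42;C &amp; &#100;'), "\n";
```

If something like this were folded into ParseXmlString, it would need to run before the `&amp;` substitution, so that input such as `&amp;#65;` stays a literal `&#65;` rather than becoming `A`.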

      A pre/post processor could fix this...

      Wow. You are really stuck in thinking in terms of an XML-parsing module. There is no need to do anything in a pre-/post-processor -- which is part of the whole point of the exercise.

      For example, supporting comments is 2 minutes' work and easily fits into the existing structure. The few items that rise to the level of being interesting to implement are the things that I've never actually seen used in any XML. So it shouldn't be surprising that I didn't bother to implement them in the code that implemented just what I needed for one project.
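To give an idea of the size of the comment change, a possible pattern is sketched here. The `$comment` name and the sample input are assumptions of this sketch, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical comment pattern, written for use under /x.
# XML does not allow "--" inside a comment, so the body stops at the first "--".
my $comment = '<!-- (?: (?!--) [\s\S] )* -->';

my $str = "<a><!-- note:\n ignore me -->text</a>";

# Here comments are simply stripped; a parser could equally skip or record them.
( my $stripped = $str ) =~ s{ $comment }{}xg;

print "$stripped\n";
```

In the tokenizer it would presumably be one more alternative tried before `$tag`, with an empty branch in the if/elsif chain if comments are just to be ignored.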

      - tye        

        It handles namespace prefixes exactly as I wanted it to

        So? All I did was identify the features others might have to add to suit their needs.

        It would be totally useless to me, for example, since the prefix isn't uniform across the documents I deal with. In fact, I've never encountered a situation where it was better to keep the prefixes.
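One way the prefix could be separated from the local name, as a sketch only: a variation on the thread's `$name` pattern with two capture groups. The `split_tag_name` helper and the sample tag names are invented here.

```perl
use strict;
use warnings;

# Variation on the thread's $name pattern that captures the prefix and the
# local part separately (the capture groups are an assumption of this sketch).
my $name = '(?: (\w+) : )? (\w+)';

# Hypothetical helper: returns ( prefix-or-undef, local name ), or nothing.
sub split_tag_name {
    my ($tag) = @_;
    return $tag =~ m{ \A $name \z }x ? ( $1, $2 ) : ();
}

for my $tag ( 'log:entry', 'x:entry', 'entry' ) {
    my ( $prefix, $local ) = split_tag_name($tag);
    printf "%-10s prefix=%-4s local=%s\n",
        $tag, defined $prefix ? $prefix : '-', $local;
}
```

Dropping the prefix like this only ignores namespaces; mapping a prefix to its namespace would additionally require tracking the xmlns declarations in scope.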

        For example, supporting comments is 2 minutes' work

        I'm well aware that the changes are easy.

        There is no need to do anything in a pre-/post-processor

        Duh. There are many ways of changing it. I just suggested one.

        So it shouldn't be surprising that I didn't bother to implement them in the code that implemented just what I needed for one project.

        It's not. Why are you so defensive when I tell other people what they might need to adjust to suit their needs?

Re^3: parsing XML fragments (xml log files) with... a regex
by ikegami (Pope) on Mar 23, 2011 at 18:01 UTC
    my $data= '(?: [^<>&]+ | &\#?\w+; )+';
    should be
    my $data= '(?: [^<&]+ | &\#?\w+; )+';

    XML allows for unescaped ">"
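A quick way to see the difference between the two patterns, using a made-up sample string:

```perl
use strict;
use warnings;

# The two character-data patterns from the thread, both meant for use under /x.
my $strict  = '(?: [^<>&]+ | &\#?\w+; )+';    # rejects a bare ">"
my $lenient = '(?: [^<&]+  | &\#?\w+; )+';    # accepts it, as XML does

my $text = 'a > b &amp; c';

for my $case ( [ strict => $strict ], [ lenient => $lenient ] ) {
    my ( $label, $re ) = @$case;
    my $ok = $text =~ m{ \A $re \z }x ? 'matches' : 'does not match';
    print "$label: $ok\n";
}
```

With the first pattern, ParseXmlString would fall through to its "we failed" branch at the `>`; with the second, the whole string is taken as text.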

      Yeah, XML got that wrong. I've never seen real XML that takes advantage of that and I always write my parsers to reject it, so I'll know if it ever happens. So far, my tiny universe of implementers of XML generators are smarter than the standard's authors on this point. :)

      - tye        
