in reply to Re^2: parsing XML fragments (xml log files) with... a regex
in thread parsing XML fragments (xml log files) with XML::Parser
For those interested, the regex approach can't handle:
- Numerical entities (decimal and hex).*
- External entities (e.g. HTML's &eacute;).*
- Character decoding.**
- UTF-16, UTF-32, UCS-2, UCS-4.**
- CDATA.
- Namespace prefixes. (They're included as part of the name.)***
- Comments.
- Identification of an element's namespace.***
- XML validation (i.e. it allows some malformed XML).
- (more? this wasn't a thorough analysis)
Up to you to decide if it fits your needs or not.
* — A post-processor could fix this if no entities were processed at all.
** — A pre-processor such as the following would fix this:
use Encode qw(decode);

# Guess the encoding from a BOM or the XML declaration, then decode.
sub _predecode {
    my $enc;
    if    ( $_[0] =~ /^\xEF\xBB\xBF/ ) { $enc = 'UTF-8'; }
    elsif ( $_[0] =~ /^\xFF\xFE/ )     { $enc = 'UTF-16le'; }
    elsif ( $_[0] =~ /^\xFE\xFF/ )     { $enc = 'UTF-16be'; }
    elsif ( substr($_[0], 0, 100) =~ /^[^>]* encoding="([^"]+)"/ ) { $enc = $1; }
    else                               { $enc = 'UTF-8'; }    # XML's default
    # Die on malformed input; leave the caller's buffer untouched.
    return decode($enc, $_[0], Encode::FB_CROAK | Encode::LEAVE_SRC);
}
*** — A post-processor could fix this, but one wasn't supplied.
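For what it's worth, here is a rough sketch of what such a namespace post-processor might look like. It assumes the regex hands you each element's qualified name plus a hash of its raw attributes, which may not match what it actually yields. The names ns_scope and resolve_element are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers; the input shape (qualified name + attribute hash)
# is an assumption about what the regex produces.

# Build an element's in-scope prefix map from its parent's map and its
# own xmlns / xmlns:prefix attributes.
sub ns_scope {
    my ($parent_scope, $attrs) = @_;
    my %scope = %$parent_scope;
    for my $name (keys %$attrs) {
        if ($name eq 'xmlns') {
            $scope{''} = $attrs->{$name};    # default namespace
        }
        elsif ($name =~ /^xmlns:(.+)$/) {
            $scope{$1} = $attrs->{$name};    # prefixed namespace
        }
    }
    return \%scope;
}

# Split "prefix:local" and look the prefix up in the scope.
# Returns (namespace URI or undef, local name).
sub resolve_element {
    my ($scope, $qname) = @_;
    my ($prefix, $local) =
        $qname =~ /^([^:]+):(.+)$/ ? ($1, $2) : ('', $qname);
    return ($scope->{$prefix}, $local);
}

# Usage: compute the scope while descending, then resolve each name.
my $scope = ns_scope({}, { 'xmlns:log' => 'urn:example:log' });
my ($ns, $local) = resolve_element($scope, 'log:entry');
```

You'd call ns_scope once per element while walking down the tree and pass the result to the children, so inner declarations shadow outer ones. Attribute names need slightly different handling, since unprefixed attributes don't pick up the default namespace.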
Update: Added pre-processor I had previously coded.
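As for the first footnote (*), a post-processor for entities could look something like the sketch below. It assumes the regex leaves every entity untouched in the extracted text. The name decode_entities_sketch is hypothetical, and only numeric references and the five predefined XML entities are handled; anything else (such as &eacute;) is left as-is because resolving it needs the DTD.

```perl
use strict;
use warnings;

# The five entities every XML processor must know without a DTD.
my %predefined = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

# Hypothetical helper: decode numeric character references and the
# predefined entities. Unknown named entities are returned unchanged.
sub decode_entities_sketch {
    my ($text) = @_;
    # One pass only, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z_][\w.-]*));}{
        defined $1                ? chr(hex $1)
      : defined $2                ? chr($2)
      : exists $predefined{$3}    ? $predefined{$3}
      :                             "&$3;"
    }ge;
    return $text;
}

# Usage: run it over text and attribute values after extraction.
my $decoded = decode_entities_sketch('caf&#233; &#xE9; &amp;lt; &eacute;');
```

Doing it in a single substitution matters: decoding &amp; first and the rest afterwards would turn an escaped "&amp;lt;" into a literal "<".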