XML::Twig doctype and entity handling
by AZed (Monk)
on Sep 07, 2008 at 20:51 UTC
AZed has asked for the
wisdom of the Perl Monks concerning the following question:
I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)
The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser will choke on in my current test data.
An example of the type of non-HTML I'm dealing with:
Could someone please provide me with an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?) My current workaround is to slurp the input file and use a regular expression to split the metadata out into a temporary file with a proper XML declaration and doctype declaration prepended. That is something of an evil hack, given that I have to read the results back into XML::Twig anyway, and these files can be several megabytes in size, which makes slurping a very costly technique.
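For readers following along, the prepending step of that workaround might look roughly like the sketch below. It is not the code from the post: the root element name `metadata`, the sample fragment, the helper name `wrap_with_entities`, and the three-entry entity table are all assumptions made for illustration. It relies only on the fact that an XML parser accepts entities declared in a DOCTYPE internal subset.

```perl
use strict;
use warnings;

# Hypothetical subset of the XHTML Latin-1 entities (name => code point).
# A real converter would need the full list; this is only for illustration.
my %entity_codepoint = (
    copy => 169,
    nbsp => 160,
    reg  => 174,
);

# Prepend an XML declaration and a DOCTYPE whose internal subset declares
# each entity above as a numeric character reference, then append the
# metadata fragment unchanged. No encoding is declared, so a parser will
# assume UTF-8; that is an assumption about the input, not a known fact.
sub wrap_with_entities {
    my ($root_name, $fragment) = @_;
    my $declarations = join '',
        map { '  <!ENTITY ' . $_ . ' "&#' . $entity_codepoint{$_} . ';">' . "\n" }
        sort keys %entity_codepoint;
    return qq{<?xml version="1.0"?>\n}
         . '<!DOCTYPE ' . $root_name . " [\n"
         . $declarations
         . "]>\n"
         . $fragment;
}

# 'metadata' and the fragment below are made-up sample values.
my $xml = wrap_with_entities(
    'metadata',
    '<metadata><rights>&copy; 2008 Example</rights></metadata>',
);
print $xml, "\n";
```

The resulting string could then be handed to a parser in one piece instead of going through a temporary file. How XML::Twig reports the declared entities afterwards is not something this sketch checks.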
The XML::Twig documentation implies that subelement extraction of this nature should be fairly low-cost, so I'm hoping that someone can work this out. If that isn't possible, I may simply have to experiment with using sysread to chop the file up into manageable chunks -- the metadata is always at the beginning of the file, and in theory should never exceed 10k in size.
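The chunked-read idea can be sketched with core Perl alone. This is a minimal illustration, not the converter's code: the tag name `metadata`, the helper name `read_leading_block`, and the sample file contents are assumptions, and the 10 KiB limit simply mirrors the "10k" figure above. It uses the buffered `read` rather than `sysread`, since `read` keeps reading until it has the requested length or reaches end of file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read at most $limit bytes from the start of $path and return the first
# complete <$tag>...</$tag> block found in that chunk, or undef if there
# is none. The rest of the file is never loaded into memory.
sub read_leading_block {
    my ($path, $tag, $limit) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $chunk = '';
    my $got = read $fh, $chunk, $limit;
    die "Cannot read $path: $!" unless defined $got;
    close $fh;
    return $chunk =~ m{(<\Q$tag\E\b.*?</\Q$tag\E>)}s ? $1 : undef;
}

# Build a throwaway sample file: a small metadata block followed by a
# large body, standing in for a multi-megabyte ebook.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} '<html><head><metadata><title>Sample</title></metadata></head>',
                '<body>', 'x' x 50_000, '</body></html>';
close $tmp_fh;

my $block = read_leading_block($tmp_path, 'metadata', 10 * 1024);
print defined $block ? "$block\n" : "no metadata block in first 10 KiB\n";
```

If the metadata ever runs past the limit, the function returns undef rather than a truncated block, so a caller could retry with a larger limit.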
My last attempt at getting XML::Twig to read this looks like this:
It dies at the parsefile command with:
undefined entity at line 1, column 306, byte 306 at /usr/lib/perl5/XML/Parser.pm line 187
Byte 306 is the first &copy;. This is despite 'copy' being present in the entity list and showing up when printing $mobihtmltwig->entity_names.
Thanks for any help.