|laziness, impatience, and hubris|
IO::Zlib with XML::Twig/XML::Parser::Expatby athomason (Curate)
|on Aug 22, 2002 at 20:19 UTC||Need Help??|
athomason has asked for the wisdom of the Perl Monks concerning the following question:
I'm currently handling some XML documents that are too large to process in memory or to store on disk permanently. However, they fit well enough when gzip'ed (at about 15:1 compression). But when it comes time to process them, I would like to avoid first decompressing them fully. I thought I could solve the problem using IO::Zlib, which provides an interface much like IO::Handle; this would allow me to keep only portions of the decompressed text in memory at a time. And of course, XML::Twig is great for managing big XML documents (thanks mirod!), but it doesn't natively handle gzip'ed XML. But since XML::Parser::Expat and by extension, XML::Twig can take an IO::Handle as a document source, I thought I could string the two together. However, IO::Zlib doesn't actually inherit from IO::Handle, and XML::Parser::Expat demands that UNIVERSAL::isa($arg, 'IO::Handle') be true before it will treat the argument as a handle. I figured a simple workaround like this would work:
which would allow me to replace my IO::Zlib objects with IO::Handle::Zlib's transparently. However, when I try this out, I come across the following error, courtesy of expat:
Now that's odd, since the decompressed file ends at line 7212, and is only 780487 bytes long. One might think the file is being decompressed past the original size of the document, but inserting print DUMP <$gz>; gives a file that is identical to the original (i.e., the angle-bracket read gives a file that is also 7212 lines and 780487 bytes long). So clearly, whatever the XS part of XML::Parser::Expat is doing with the IO::Handle is not what the angle-brackets are doing. And expat itself is working, since replacing
eliminates the error.
Has anyone used IO::Zlib like this before? Is my IO::Handle::Zlib wrapper bogus? Anybody know how XS modules do IO::Handle reads, and why this doesn't work?