http://www.perlmonks.org?node_id=192164

athomason has asked for the wisdom of the Perl Monks concerning the following question:

'ello folks,

I'm currently handling some XML documents that are too large to process in memory or to store on disk permanently. However, they fit well enough when gzip'ed (at about 15:1 compression). But when it comes time to process them, I would like to avoid first decompressing them fully.

I thought I could solve the problem using IO::Zlib, which provides an interface much like IO::Handle; this would allow me to keep only portions of the decompressed text in memory at a time. And of course, XML::Twig is great for managing big XML documents (thanks mirod!), but it doesn't natively handle gzip'ed XML. Since XML::Parser::Expat (and by extension, XML::Twig) can take an IO::Handle as a document source, I thought I could string the two together. However, IO::Zlib doesn't actually inherit from IO::Handle, and XML::Parser::Expat demands that UNIVERSAL::isa($arg, 'IO::Handle') be true before it will treat the argument as a handle. I figured a simple workaround like this would work:

    package IO::Handle::Zlib;
    use vars qw/ @ISA /;
    @ISA = qw/ IO::Zlib IO::Handle /;
which would allow me to replace my IO::Zlib objects with IO::Handle::Zlib's transparently. However, when I try this out, I come across the following error, courtesy of expat:
not well-formed (invalid token) at line 7213, column 3, byte 780490 at /path/to/perl/lib/5.6.1/IP27-irix/XML/Parser.pm line 185
Now that's odd, since the decompressed file ends at line 7212, and is only 780487 bytes long. One might think the file is being decompressed past the original size of the document, but inserting print DUMP <$gz>; gives a file that is identical to the original (i.e., the angle-bracket read gives a file that is also 7212 lines and 780487 bytes long). So clearly, whatever the XS part of XML::Parser::Expat is doing with the IO::Handle is not what the angle-brackets are doing. And expat itself is working, since replacing
    my $reader = new IO::Handle::Zlib;
    $reader->open( $compressed_filename, "rb" )
        or croak "could not open $compressed_filename: $!";
with
    my $reader = new IO::File;
    $reader->open( $uncompressed_filename, "r" );
eliminates the error.
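
One way to narrow this down is to compare what the handle's read() method does with what the angle-bracket (readline) path does. The sketch below is only a diagnostic aid. It assumes that a parser reading from a handle calls read() in fixed-size chunks, which has not been confirmed here against the XML::Parser::Expat source. The helper name trace_reads, the 4096-byte chunk size, and the throwaway test data are all made up for illustration. For each call it records the return value of read() next to the length of the buffer afterwards, so any disagreement between the two (for instance on the last call at end of file) becomes visible.

```perl
use strict;
use warnings;
use IO::Zlib;
use File::Temp qw(tempfile);

# Call read() on an IO::Zlib handle in fixed-size chunks and record, for
# every call, the return value and the length of the buffer afterwards.
# A mismatch between the two would matter to any caller that trusts the
# buffer contents rather than the return value.
sub trace_reads {
    my ($path, $chunk_size) = @_;
    my $gz = IO::Zlib->new($path, "rb")
        or die "could not open $path: $!";
    my ($buf, @trace);
    while (1) {
        my $n = $gz->read($buf, $chunk_size);
        die "read error on $path" unless defined $n;
        push @trace, [ $n, defined $buf ? length $buf : 0 ];
        last if $n == 0;    # read() returns 0 at end of file
    }
    $gz->close;
    return \@trace;
}

# Demo on a throwaway file (hypothetical data, not the real document).
sub demo {
    my (undef, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
    my $out = IO::Zlib->new($path, "wb")
        or die "could not write $path: $!";
    $out->print("<doc>\n", ("<item/>\n") x 1000, "</doc>\n");
    $out->close;

    for my $call (@{ trace_reads($path, 4096) }) {
        printf "read returned %d, buffer holds %d bytes\n", @$call;
    }
}

demo() unless caller;
```

Pointing trace_reads at the real compressed file and summing the first column should give the decompressed size (780487 here); a final line where the buffer length is not zero would suggest stale data is being left in the buffer at end of file.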

Has anyone used IO::Zlib like this before? Is my IO::Handle::Zlib wrapper bogus? Anybody know how XS modules do IO::Handle reads, and why this doesn't work?

Thanks,

--athomason