in reply to Re^3: Out of memory with XML::Parser
in thread Out of memory with XML::Parser

OK, I've done that for the element that's being problematic. It seems to just keep firing the character data sub over and over, and the expected and original strings always match (and always appear to be Base64-encoded data). The data is PCDATA, not CDATA. Should I try using CDATA?

Re^5: Out of memory with XML::Parser
by mirod (Canon) on Sep 14, 2005 at 21:45 UTC

    You don't need a CDATA section, as Base64 encoding does not use '<' or '&'.

    Darn! I don't know what to say, especially as I am not able to reproduce the bug. What versions of perl, XML::Parser and expat are you using? On which OS?

    The code below works just fine for me (I actually get a single call for each long element).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $size= 500000;
    my @base64_chars=('a'..'z','A'..'Z','0'..'9','+','/','=');
    my $string= join( '', map { $base64_chars[rand(@base64_chars)] }(1..$size));

    my $doc= qq{<doc>
      <elt>foo</elt>
      <long1>$string</long1>
      <long2>$string</long2>
      <elt>bar</elt>
    </doc>};

    my $p= XML::Parser->new( Handlers => { Char => \&char, });
    $p->parse( $doc);
    exit;

    sub char
      { my( $expat, $char)= @_;
        print "in ", $expat->current_element, " - ",
              "length char: ", length( $char), " - ",
              "length recognized: ", length( $expat->recognized_string), " - ",
              "length original: ", length( $expat->original_string), "\n",
              ;
      }

    I also tried getting the data from a file, and including "\n" to get several calls to the Char handler, and it all worked nicely:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;
    use Fatal qw(open);

    my $line_size= 100000;
    my $nb_lines= 5;
    my @base64_chars=('a'..'z','A'..'Z','0'..'9','+','/','=');
    my $line= join( '', map { $base64_chars[rand(@base64_chars)] }(1..$line_size));
    $line.= "\n";
    my $string= $line x $nb_lines;

    my( $long1_length, $long2_length);

    my $doc= qq{<doc><elt>foo</elt><long1>$string</long1><long2>$string</long2><elt>bar</elt></doc>};

    open( my $xml, '>', "$0.xml");
    print {$xml} $doc;
    close $xml;

    my $p= XML::Parser->new( Handlers => { Char => \&char, });
    $p->parsefile( "$0.xml");

    print "long1: $long1_length\n";
    print "long2: $long2_length\n";
    exit;

    sub char
      { my( $expat, $char)= @_;
        print "in ", $expat->current_element, " - ",
              "length char: ", length( $char), " - ",
              "length recognized: ", length( $expat->recognized_string), " - ",
              "length original: ", length( $expat->original_string), "\n",
              ;
        if( $expat->in_element( 'long1')) { $long1_length+= length( $char); }
        if( $expat->in_element( 'long2')) { $long2_length+= length( $char); }
      }

      I figured it out! After appending the characters from XML::Parser to my string I now undef the expat character variable. Suddenly the whole script moves way faster and uses less memory. This is the new character data handling routine:
      Char => sub {
          my $expat = shift;
          my $chars = shift;
          $cbuffer = $cbuffer . $chars;
          undef $chars;
      }

        Do you get the same effect if you replace the undef $chars; with just... 1? It could be something like what's described in the XML::Twig FAQ: a handler returns the value of its last statement to the calling routine, so if that last statement evaluates to a huge string, the string gets copied. Ending the handler with something small avoids the copy, which would explain the effect you got.

        It looks like there was no bug in XML::Parser then, you just actually ran out of memory.
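
        To illustrate the point about the handler's return value, here is a minimal sketch in plain Perl, without XML::Parser. The two handler names and the sample strings are made up for the example, and whether the returned string is actually copied by the caller in your setup is only my guess, not something this snippet proves. It just shows that a sub ending in the concatenation hands back the whole buffer, while one ending in a bare return hands back nothing.

```perl
use strict;
use warnings;

my $cbuffer = '';

# Hypothetical handler: the assignment is the last statement,
# so its value (the entire buffer so far) becomes the return value.
sub char_returns_buffer {
    my ( $expat, $chars ) = @_;
    $cbuffer = $cbuffer . $chars;
}

# Hypothetical handler: same accumulation, but a bare return
# means the caller gets undef in scalar context.
sub char_returns_nothing {
    my ( $expat, $chars ) = @_;
    $cbuffer .= $chars;
    return;
}

# The first argument stands in for the expat object, unused here.
my $r1 = char_returns_buffer( undef, 'QUJD' );
my $r2 = char_returns_nothing( undef, 'REVG' );

print "first handler returned ", length($r1), " chars\n";
print "second handler returned ", ( defined $r2 ? 'a value' : 'nothing' ), "\n";
print "buffer now holds ", length($cbuffer), " chars\n";
```

        In the real handler, ending with return; (or 1;, or the undef $chars; you already have) should all have the same effect on what is passed back.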

      I'm using Debian Testing with XML::Parser version 2.34 (Debian revision 4). I'll play with this more tonight...
      Sorry, I made a mistake - the element it's choking on contains 900 kilobytes of data (I had the tracing statements in the wrong section). Since I know the size of the undecoded data in advance, though (it's one of the element's attributes), is there any way I can get Perl to preallocate a scalar to that size + padding for the Base64 encoding? Also, when it's processing the huge file, strace shows massive amounts of mremap calls:
      mremap(0x40617000, 827392, 827392, MREMAP_MAYMOVE) = 0x40617000
      mremap(0x404bd000, 827392, 827392, MREMAP_MAYMOVE) = 0x404bd000
      mremap(0x40617000, 827392, 827392, MREMAP_MAYMOVE) = 0x40617000
      mremap(0x404bd000, 827392, 827392, MREMAP_MAYMOVE) = 0x404bd000
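
      For the "size + padding" arithmetic, this is a rough sketch of what I have in mind. The helper name base64_length and the 900 KB figure are placeholders (the real number would come from the size attribute), and it assumes the Base64 text has no line breaks, which may not be true for this file. Filling the scalar once and then emptying it is a common Perl trick for preallocating; as far as I know perl keeps the allocation afterwards, but that is an implementation detail I have not verified here.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: length of unbroken Base64 text for $n raw bytes.
# Every 3 input bytes become 4 output characters, rounded up with '=' padding.
sub base64_length {
    my ($n) = @_;
    return 4 * int( ( $n + 2 ) / 3 );
}

my $raw_size = 900 * 1024;    # assumed; stands in for the element's size attribute
my $expected = base64_length($raw_size);

# Grow the buffer to its final size once, then empty it before appending.
# Assumption: perl retains the allocated space after the reset.
my $cbuffer = "\0" x $expected;
$cbuffer = '';

# Cross-check the prediction against a real encoding (no line breaks).
my $check = length encode_base64( 'x' x $raw_size, '' );
print "predicted $expected chars, encoder produced $check\n";
```

      If the data is wrapped at 76 characters per line, the newlines would need to be added on top of that figure.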