PerlMonks  

XML::Twig too many children?

by Anonymous Monk
on Feb 21, 2012 at 19:31 UTC (#955372)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been having trouble with a script that terminates unexpectedly and silently. In the script I'm parsing an XML tree using XML::Twig, extracting information and purging as it parses, with the use of handlers (so it uses only a couple of MB of memory):

    my $twig = XML::Twig->new(
        keep_encoding => 1,
        twig_handlers => {'lemma' => \&ProcessLemma}
    );

    sub ProcessLemma {
        my ($XmlTwig, $XmlLemma) = @_;
        # extract information here...
        $XmlLemma->purge;
        return 1;
    }

So far so good, but on a particular XML file, it would suddenly quit, without any error message. Now I've stripped the file that causes the problem down to the minimum that makes the script terminate. It contains a node with about 4800 children.

Is there a maximum number of children that Twig can handle? Is this why my script dies? (And is there a way to fix this?)

Thank you in advance for your help!

Re: XML::Twig too many children?
by GrandFather (Cardinal) on Feb 21, 2012 at 20:03 UTC

    The problem is not quite as simple as you describe, as the following code demonstrates:

        use warnings;
        use strict;
        use XML::Twig;

        my $children = 10000;
        my $found;
        my $xmlStr = <<XML;
        <XML>
        @{["<lemma>1</lemma>\n" x $children]}
        </XML>
        XML

        my $twig = XML::Twig->new(
            keep_encoding => 1,
            twig_handlers => {'lemma' => \&ProcessLemma}
        );
        $twig->parse($xmlStr);
        print "Expected $children, found $found\n";

        sub ProcessLemma {
            my ($XmlTwig, $XmlLemma) = @_;
            ++$found;
            $XmlLemma->purge;
            return 1;
        }

    Prints:

    Expected 10000, found 10000

    Perhaps you can use a generated XML test case like the one shown above to reproduce your issue?

    True laziness is hard work

      Your example works fine for me too (except that it won't swallow the <<XML; ... XML construction, so I rewrote it).

      However, it breaks if I have just one <lemma>-tag with many children. I find that the breaking point is at 4696/4697.

          use warnings;
          use strict;
          use XML::Twig;

          my $children = 4697;
          my $found;
          my $xmlStr = '<XML><lemma>'.join("\n",@{['<line>1</line>' x $children]}).'</lemma></XML>';

          my $twig = XML::Twig->new(
              keep_encoding => 1,
              twig_handlers => {'lemma' => \&ProcessLemma}
          );
          $twig->parse($xmlStr);
          print "Expected one, found $found\n";

          sub ProcessLemma {
              my ($XmlTwig, $XmlLemma) = @_;
              ++$found;
              $XmlLemma->purge;
              return 1;
          }

      Btw, I don't know if it matters, but I'm using Win32, ActivePerl 5.14.2.

        For me, it works up to 20_140 (linux, i686, Perl 5.14.2). For 20_142, it usually dies of SIGSEGV, but sometimes still works.

        5.14.2 (i686-linux-thread-multi), XML::Twig 3.39, XML::Parser 2.41.

        Died of a segfault with a sufficiently large number.

        Stack trace:

            Program received signal SIGSEGV, Segmentation fault.
            0x0807632d in Perl_call_sv ()
            (gdb) bt
            #0  0x0807632d in Perl_call_sv ()
            #1  0x080e959d in Perl_sv_clear ()
            #2  0x080e9c8a in Perl_sv_free2 ()
            #3  0x080d7644 in Perl_hv_free_ent ()
            #4  0x080d8bf3 in S_hfreeentries ()
            #5  0x080db12e in Perl_hv_undef_flags ()
            #6  0x080e97cb in Perl_sv_clear ()
            #7  0x080e9c8a in Perl_sv_free2 ()
            #8  0x080d7644 in Perl_hv_free_ent ()
            #9  0x080d8bf3 in S_hfreeentries ()
            #10 0x080db12e in Perl_hv_undef_flags ()
            #11 0x080e97cb in Perl_sv_clear ()
            #12 0x080e9c8a in Perl_sv_free2 ()
            ...
            #87303 0x080d7644 in Perl_hv_free_ent ()
            #87304 0x080d8bf3 in S_hfreeentries ()
            #87305 0x080db12e in Perl_hv_undef_flags ()
            #87306 0x080e97cb in Perl_sv_clear ()
            #87307 0x080e9c8a in Perl_sv_free2 ()
            #87308 0x080d7644 in Perl_hv_free_ent ()
            #87309 0x080d8bf3 in S_hfreeentries ()
            #87310 0x080db12e in Perl_hv_undef_flags ()
            #87311 0x080e97cb in Perl_sv_clear ()
            #87312 0x080e9c8a in Perl_sv_free2 ()
            #87313 0x080d7644 in Perl_hv_free_ent ()
            #87314 0x080d8bf3 in S_hfreeentries ()
            #87315 0x080db12e in Perl_hv_undef_flags ()
            #87316 0x080e97cb in Perl_sv_clear ()
            #87317 0x080e9c8a in Perl_sv_free2 ()
            #87318 0x08111ef1 in Perl_leave_scope ()
            #87319 0x081120bc in Perl_pop_scope ()
            #87320 0x0811dd60 in Perl_pp_return ()
            #87321 0x080dd748 in Perl_runops_standard ()
            #87322 0x08076475 in Perl_call_sv ()
            #87323 0xb7ac2148 in endElement () from /home/eric/usr/perlbrew/perls/5.14.2t/lib/site_perl/5.14.2/i686-linux-thread-multi/auto/XML/Parser/Expat/Expat.so
            #87324 0xb7a93a55 in ?? () from /usr/lib/../lib/libexpat.so.1
            #87325 0xb7a948a1 in ?? () from /usr/lib/../lib/libexpat.so.1
            #87326 0xb7a95db1 in ?? () from /usr/lib/../lib/libexpat.so.1
            #87327 0xb7a9696a in ?? () from /usr/lib/../lib/libexpat.so.1
            #87328 0xb7a8d64c in XML_ParseBuffer () from /usr/lib/../lib/libexpat.so.1
            #87329 0xb7a8eab5 in XML_Parse () from /usr/lib/../lib/libexpat.so.1
            #87330 0xb7ab6a78 in XS_XML__Parser__Expat_ParseString () from /home/eric/usr/perlbrew/perls/5.14.2t/lib/site_perl/5.14.2/i686-linux-thread-multi/auto/XML/Parser/Expat/Expat.so
            #87331 0x080df181 in Perl_pp_entersub ()
            #87332 0x080dd748 in Perl_runops_standard ()
            #87333 0x080770ea in perl_run ()
            #87334 0x0805fe3d in main ()

        First guess, a stack overflow from an endless(?) recursive loop. [Upd: It could be a stack overflow, but it's not from endless recursion. The pattern is clearly broken at the top. ]

        The odd thing is that the loop is in perl's code.

        Same with an older version of Perl: 5.10.1 (i686-linux-thread-multi), XML::Twig 3.39, XML::Parser 2.41.

        I'll install a debug build of Perl and see if I hit an assert.

Re: XML::Twig too many children?
by ikegami (Pope) on Feb 21, 2012 at 23:37 UTC

    it would suddenly quit, without any error message

    What exit code? It could have died from a signal.

      The exit code is apparently 65280.
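
      For reference, a small sketch of how a raw status like that splits into its parts. It assumes the 65280 is the raw value of $? after system() or a similar call (the post does not say how it was obtained); decode_status is a hypothetical helper name, not something from the thread.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: split a raw wait status (as found in $?)
      # into exit code, signal number and core-dump flag.
      sub decode_status {
          my ($status) = @_;
          return {
              exit_code => $status >> 8,              # high byte
              signal    => $status & 127,             # low 7 bits
              core_dump => ($status & 128) ? 1 : 0,   # bit 7
          };
      }

      my $info = decode_status(65280);
      printf "exit code %d, signal %d, core dump %d\n",
          @{$info}{qw(exit_code signal core_dump)};
      ```

      Under that assumption, 65280 is 255 * 256, so it decodes to exit code 255 with no signal number recorded in the low bits.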
Re: XML::Twig too many children?
by mirod (Canon) on Feb 22, 2012 at 08:54 UTC

    Good job everybody, but you did not go far enough ;--) It's not a bug in XML::Parser, I believe it's a bug in Perl: if you use weaken a few thousand times it will segfault.

    The code below shows the bug:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Scalar::Util 'weaken';

        # the number of iterations that causes a segmentation fault varies
        # on my machine: 5.14.2 18700, 5.12.4 20147. At these thresholds
        # the bug shows up most of the time but not always
        my $ITER= $ARGV[0] || 18700;

        my $head= {};
        my $tail= $head;

        foreach (1..$ITER) {
            my $new_tail= { p => $tail };
            weaken( $new_tail->{p});
            $tail->{n}= $new_tail;
            $tail= $new_tail;
        }
        print "done\n";

    The good news is that the bug is fixed in blead and in recent 5.15.*. I don't know which version exactly, but I know it's fixed in 5.15.7 and in 5.15.8.

    So, if possible, you should use perlbrew, install 5.15.8 and get the development version of XML::Twig from xmltwig.org since XML::Twig 3.39 produces warnings in 5.15.8. Once you've re-installed all of the modules you use, your script should then work properly. BTW the development version of XML::Twig passes all the tests, so it is safe to use.
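
    If upgrading is not an option, one possible workaround for a chain like the one in the test script is to unlink it node by node before the head goes out of scope, so that no single free has to descend through the whole chain. This is only a sketch: it assumes the crash comes from the deep recursive freeing that the stack trace above suggests, and it has not been tried against XML::Twig's own internal structures. build_chain and free_chain are hypothetical names.

    ```perl
    use strict;
    use warnings;
    use Scalar::Util 'weaken';

    # Build the same structure as the test script: strong {n} links
    # forward, weakened {p} links backward.
    sub build_chain {
        my ($n) = @_;
        my $head = {};
        my $tail = $head;
        for (1 .. $n) {
            my $new_tail = { p => $tail };
            weaken($new_tail->{p});
            $tail->{n} = $new_tail;
            $tail = $new_tail;
        }
        return $head;
    }

    # Walk the chain and detach each {n} link before the node holding
    # it is released, so every node is freed on its own.
    # Returns the number of nodes visited.
    sub free_chain {
        my ($node) = @_;
        my $count = 0;
        while ($node) {
            my $next = delete $node->{n};
            $count++;
            $node = $next;
        }
        return $count;
    }

    my $head = build_chain(18700);
    print "unlinked ", free_chain($head), " nodes\n";
    ```

    The count includes the head, so a chain built with N iterations reports N + 1 nodes.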

    Let me know if you have any more problems.

      I see no leak here on windows, perl 5.14.1 mingw/gccversion='4.5.2' ( full perl -V here )

          $ perl leak.weaken.twig.pl
          done
          $ perl leak.weaken.twig.pl 28700
          done
          $ perl leak.weaken.twig.pl 48700
          done
          $ perl leak.weaken.twig.pl 148700
          done
          $ perl leak.weaken.twig.pl 1348700
          done

Node Type: perlquestion [id://955372]
Approved by marto
Front-paged by Corion