PerlMonks  

Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")

by mldvx4 (Pilgrim)
on Sep 19, 2022 at 16:06 UTC (#11146972)

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I'm able to process WordPress' inexcusably messy output using HTML::TreeBuilder::XPath. However, when I try to call the delete method on the object I get warnings or errors. I would like the problem to go away, but I have access neither to the site producing the document nor to WordPress upstream, which is where the fault actually lies. Anyway, when I call delete after otherwise successful processing of the document, I get the following error:

Deep recursion on subroutine "HTML::Element::delete" at /usr/share/perl5/HTML/Element.pm line 567.
Deep recursion on subroutine "HTML::Element::delete_content" at /usr/share/perl5/HTML/Element.pm line 580.

Can I just use undef instead of calling the delete method? Or is there a better approach? I use HTML::TreeBuilder::XPath extensively in the real script, but maybe I could or should do some pre-processing with a different parser, though I'd rather not.

Here is a stripped down example of the problem:

#!/usr/bin/perl use HTML::TreeBuilder::XPath; use strict; use warnings; my $ent = HTML::TreeBuilder::XPath->new; $ent->parse_file(\*DATA); $ent->delete; exit(0); __DATA__ <html> <head> <title>foo bar</title> </head> <body> foo <br /> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <strong>bar</strong> <br /> <center>(baz)</center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> 
</center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </body> </html>

Replies are listed 'Best First'.
Re: Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")
by hippo (Bishop) on Sep 19, 2022 at 17:00 UTC
    I get the following error

    It isn't an error, it's a warning. You can disable it if you like with no warnings 'recursion'; but be sure to comment why you are doing that and do it in the smallest lexical scope you can manage.


    🦛

      Thanks. I've tried adding the no warnings 'recursion'; to both the example script above and to the real script (there in the smallest lexical scope available). It does not suppress the warnings in either case.

      I wonder if there would be a way to simply collect and not print the warnings, perhaps with an eval. However, the attempt below still prints the same warnings as the original example script above.

      #!/usr/bin/perl use HTML::TreeBuilder::XPath -weak; use strict; use warnings; my $ent = HTML::TreeBuilder::XPath->new; $ent->parse_file(\*DATA); eval { no warnings 'recursion'; $ent->delete; }; if ($@) { print "FOO\n"; } exit(0); __DATA__ <html> <head> <title>foo bar</title> </head> <body> foo <br /> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <center> <strong>bar</strong> <br /> <center>(baz)</center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> 
</center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </center> </body> </html>

        Yeah, warnings are lexically scoped, so turning them off in one place only suppresses them if that's where they are generated.
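
As a rough illustration of that scoping rule, here is a minimal sketch. The subs noisy and quiet are hypothetical stand-ins for library code such as HTML::Element::delete, not anything taken from the module. It assumes the 'recursion' check is made against the warnings in effect where the recursive call is compiled, which is what the behaviour reported above suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for library code: compiled with the
# 'recursion' warning enabled, like a module that says "use warnings".
sub noisy {
    my ($n) = @_;
    noisy($n - 1) if $n > 0;
    return;
}

# Same routine, but the pragma covers the recursive call itself.
sub quiet {
    no warnings 'recursion';
    my ($n) = @_;
    quiet($n - 1) if $n > 0;
    return;
}

my @caught;
{
    # Collect warnings instead of printing them, for the demo only.
    local $SIG{__WARN__} = sub { push @caught, $_[0] };

    {
        no warnings 'recursion';   # affects only code compiled in this block
        noisy(200);                # the deep calls happen inside noisy()
    }
    quiet(200);
}

my $noisy_warnings = grep { /Deep recursion on subroutine "main::noisy"/ } @caught;
my $quiet_warnings = grep { /Deep recursion on subroutine "main::quiet"/ } @caught;
print "noisy: $noisy_warnings, quiet: $quiet_warnings\n";
```

The pragma around the call to noisy() has no effect on the recursive calls made inside noisy(), which is the situation the script is in with respect to HTML::Element.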

        In this case you need to get a bit more invasive: catch all warnings for the duration of the call, and rethrow all but the one you want to avoid:

        {
            local $SIG{__WARN__} = sub {
                warn @_ unless $_[0] =~ /^Deep recursion/;
            };
            $ent->delete;
        }

        Note that it is safe to warn inside the warnings handler - the handler is suppressed while it is being called.
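
For anyone who wants to try the handler without HTML::TreeBuilder installed, here is a self-contained sketch of the same idea. The sub deep_call is a hypothetical stand-in for $ent->delete, and the @seen and @rethrown arrays exist only so the effect can be inspected afterwards.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a deeply recursive library call.
sub deep_call {
    my ($n) = @_;
    deep_call($n - 1) if $n > 0;
    return;
}

my (@seen, @rethrown);
{
    local $SIG{__WARN__} = sub {
        push @seen, $_[0];
        return if $_[0] =~ /^Deep recursion/;   # swallow only this warning
        push @rethrown, $_[0];
        warn @_;                                # everything else still reaches STDERR
    };

    deep_call(200);
    warn "something else went wrong\n";
}

printf "handler saw %d warning(s), passed on %d\n",
    scalar @seen, scalar @rethrown;
```

Because the handler is installed with local, the previous behaviour comes back as soon as the enclosing block ends.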

Re: Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")
by GrandFather (Saint) on Sep 19, 2022 at 21:17 UTC

      Thanks. I had seen that and, with your encouragement, figured out how to read the extra output produced by adding -d:Confess to the shebang line.

      #!/usr/bin/perl -d:Confess

      In that way I can at least see which line in my script the problem comes from. That in turn helps figure out which data is at fault.

Re: Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")
by GrandFather (Saint) on Sep 20, 2022 at 03:38 UTC

    I thought I'd have a play with your issue so I downloaded your code and ran it. No warnings! So I wrote the following:

    use strict;
    use warnings;

    print "$^V\n";
    recurse(20000);
    exit;

    sub recurse {
        my ($count) = @_;
        recurse($count - 1) if $count;
    }

    which prints

    v5.32.1

    No warnings! My guess is that the warning dropped out of Perl at some point, but I can't find anything in perldelta to indicate it's been "fixed". Maybe Strawberry Perl has its recursion limit warning set to some really large value? I do get an "Out of memory!" error if I set the recursion depth to 700,000.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      I cannot reproduce your results with 5.32.0, 5.34.0 or 5.36.0; I get a "deep recursion" warning in every case. Do you have environment variables or an init script that could be suppressing it?

      v5.32.0
      Deep recursion on subroutine "main::recurse" at testfile line 11.
      v5.34.0
      Deep recursion on subroutine "main::recurse" at testfile line 11.
      v5.36.0
      Deep recursion on subroutine "main::recurse" at testfile line 11.

      I don't have 5.32.1 handy, but I'd be astonished if this had been broken.

        I can't think of anything such as an environment variable or init script that may be coming into play. I'm running this on a fairly clean 64-bit Strawberry Perl without any environment tweaks explicitly made by me. I'll try again on my home machine.

        Update: Same result at home. Same Perl version, but Windows 11 rather than Windows 10.

        Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")
by kcott (Archbishop) on Sep 20, 2022 at 07:57 UTC

    G'day mldvx4,

    Look at perldiag: Deep recursion on subroutine "%s". Notice the value of 100 there.

    You have 102 levels of <center>...</center> nesting. Try reducing that to below 100 and see if there's any difference. Keep reducing until the warning stops: the recursion:nesting ratio may not be 1:1. You also have a couple of other levels with <html>...</html> and <body>...</body>.
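
To get a feel for where the warning kicks in without editing the HTML, a small experiment along these lines may help. It is only a sketch: nest and warnings_for are hypothetical helper names, and the two depths are simply picked to sit clearly on either side of the 100 mentioned in perldiag.

```perl
use strict;
use warnings;

# Recurse so that exactly $levels calls of nest() are active at once.
sub nest {
    my ($levels) = @_;
    nest($levels - 1) if $levels > 1;
    return;
}

# Count the "Deep recursion" warnings raised for a given nesting depth.
sub warnings_for {
    my ($levels) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ if $_[0] =~ /^Deep recursion/ };
    nest($levels);
    return $count;
}

for my $levels (50, 150) {
    printf "%3d levels: %d warning(s)\n", $levels, warnings_for($levels);
}
```

Narrowing the two numbers towards each other shows the exact threshold on a given perl build.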

    If you really do need what you've shown, you'll have to recompile Perl with PERL_SUB_DEPTH_WARN set to an appropriate number. You might want to look at Perlbrew to compile a separate installation for this task.

    See also: "Deep Recursion Limit".

    — Ken

Re: Unnesting deeply nested HTML elements (Deep recursion on subroutine "HTML::Element::delete")
by Anonymous Monk on Sep 19, 2022 at 21:41 UTC
    calling the delete method?

    Perhaps you don't need to? Based on line numbers in diagnostic messages, you have modern HTML::Element, which made calling delete superfluous. Maybe just check that's the case with "use HTML::TreeBuilder::XPath -weak;" once, to be sure, but new behaviour is default w/o explicit import.

      Thanks. I see the same warnings whether I have "use HTML::TreeBuilder::XPath -weak;" or plain old "use HTML::TreeBuilder::XPath;" there at the beginning. What would have been the expected difference? Or, in other words, how would I know for sure whether I can omit the deletion?

        I think anon is telling you that you do not need to use delete at all with "modern" HTML::Element. See delete. OTOH even if you don't explicitly use delete perhaps HTML::Element will (edit: see Edit2 below), when you undef an element. And thus you will still get the warnings.

        Generally, having "deep recursion" warnings is not harmful at all because you may well have a structure which is more than 100 deep. And that's fine (until your memory is exhausted). However, the real problem is whether WordPress managed to produce some HTML whose parsing somehow results in cyclical paths. Then you may get infinite recursion and that's real bad. I would investigate that before suppressing the warnings.

        bw, bliako

        Edit: by delete I mean HTML::Element::delete()

        Edit2: with weak references ON, as anon mentioned, it's the Perl interpreter/garbage collector that does the cleaning up as soon as the parent object goes out of scope or is set to undef. I am trying not to give the impression that delete will be called internally with the "modern" regime.
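
The general mechanism can be sketched with core Perl only. The Node class below is a hypothetical toy, not HTML::Element, and it merely assumes a tree where children hold a weakened reference to their parent, as the weak-reference mode is described above.

```perl
use strict;
use warnings;
use Scalar::Util ();

{
    # Hypothetical toy element class, for illustration only.
    package Node;
    our $destroyed = 0;

    sub new {
        my ($class, $parent) = @_;
        my $self = bless { parent => $parent, content => [] }, $class;
        if ($parent) {
            Scalar::Util::weaken($self->{parent});   # child -> parent link is weak
            push @{ $parent->{content} }, $self;      # parent -> child link is strong
        }
        return $self;
    }

    sub DESTROY { $destroyed++ }   # count objects as perl frees them
}

my $depth = 150;   # a chain deeper than the nesting in the question
{
    my $root = Node->new;
    my $node = $root;
    $node = Node->new($node) for 2 .. $depth;
    # No delete call: $root and $node simply go out of scope here.
}

print "freed $Node::destroyed of $depth nodes\n";
```

Without the weaken call, the parent and child references would form cycles and nothing would be freed when the variables go out of scope, which is the problem an explicit delete method traditionally solved.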
