
XML::Twig parse html trouble

by remiah (Hermit)
on Feb 20, 2014 at 14:50 UTC (#1075584=perlquestion)
remiah has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks.
Today I found that XML::Twig, used as an HTML parser, fails on content that includes character entity references such as &amp;.

For example, this causes a parse error:

#!perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;

__DATA__
<html>
<body>
<div>M&amp;M</div>
</body>
</html>

And this parses fine and prints the content:

#!perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;

__DATA__
<html>
<body>
<div>M&amp;amp;M</div>
</body>
</html>

I assume this is a known problem, but I couldn't find a good workaround.
I hit this with Strawberry Perl 5.16.1.

Replies are listed 'Best First'.
Re: XML::Twig parse html trouble
by roboticus (Chancellor) on Feb 20, 2014 at 18:48 UTC


    I tried your code using XML::Twig v3.39, perl v5.14.2 and it looks like it's working properly. For the data section, I used:

    <html><body> <div>M&M</div> <div>M&amp;M</div> <div>M&amp;amp;M</div> </body></html>

    and the output is:

    <html> <head></head> <body> <div>M&amp;M</div> <div>M&amp;M</div> <div>M&amp;amp;M</div> </body> </html>

    Update: In light of the previous reply, I'm using HTML::TreeBuilder v4.2


    When your only tool is a hammer, all problems look like your thumb.

      Thanks for the reply, roboticus!

      I forgot to report the module versions and the error message.

      not well-formed (invalid token) at line 2, column 8, byte 34 at C:/stra
      rl/vendor/lib/XML/ line 187.
       at line 6.
      >perl -MXML::Twig -e "print $XML::Twig::VERSION;"
      >perl -MHTML::TreeBuilder -e "print $HTML::TreeBuilder::VERSION;"
      I got the same error with your test data.
      This is very strange to me, because I have used XML::Twig as an HTML parser several times and never got an error like this. I can't believe there were no "character entity references" in the input back then...

      I would like to try older available versions of the modules.
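
The version details mentioned above can be collected in one run. This is a minimal sketch that uses only core Perl; `module_version` is a hypothetical helper name, and the module list is simply the modules named in this thread.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# 'not installed' if it cannot be loaded, or 'unknown' if it loads
# but declares no version.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::Twig -> XML/Twig.pm
    return 'not installed' unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

# Modules involved in parse_html, as named in this thread.
for my $module (qw(XML::Twig HTML::TreeBuilder HTML::Tidy)) {
    printf "%-18s %s\n", $module, module_version($module);
}
```

Wrapping `require` in `eval` keeps a missing module from aborting the report, which is useful when comparing two machines that disagree about the parse result.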

Re: XML::Twig parse html trouble
by remiah (Hermit) on Feb 20, 2014 at 18:40 UTC

    I found a workaround and am reporting it here.

    XML::Twig uses HTML::TreeBuilder for parsing HTML by default. In my case, installing HTML::Tidy and enabling the "use_tidy" option solved my problem.
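
The workaround might look roughly like the sketch below. It assumes HTML::Tidy is installed alongside XML::Twig; `html_to_xml` is a hypothetical helper name introduced only for illustration, and the test document is the failing one from the original question.

```perl
#!perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: convert an HTML string to indented XML text.
# Assumes HTML::Tidy is installed; use_tidy tells XML::Twig to use it
# instead of HTML::TreeBuilder for the HTML-to-XML conversion.
sub html_to_xml {
    my ($html) = @_;
    my $twig = XML::Twig->new(
        pretty_print => 'indented',
        use_tidy     => 1,
    );
    $twig->parse_html($html);
    return $twig->sprint;
}

# The document that failed to parse in the original question.
my $html = <<'HTML';
<html>
<body>
<div>M&amp;M</div>
</body>
</html>
HTML

print html_to_xml($html);
```

The only change from the failing script is the `use_tidy` option; the exact markup that HTML::Tidy emits around the `div` may differ from what HTML::TreeBuilder produces.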

Re: XML::Twig parse html trouble ( Bug #86633 )
by Anonymous Monk on Feb 21, 2014 at 03:38 UTC

      I looked at this RT ticket and it is the same problem.

      As one person there says, "since HTML::Tidy is a bit more of pain to install (or was at least)": I also had problems installing HTML::Tidy and had to force the installation...

      I really hope this bug gets fixed.
      Thanks for the information.
