http://www.perlmonks.org?node_id=1075584

remiah has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks.
Today I found that XML::Twig, used as an HTML parser, fails on content that includes &.
For example, this causes a parse error:
#!perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;
__DATA__
<html>
<body>
<div>M&amp;M</div>
</body>
</html>
And this parses fine and prints the contents:
#!perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;
__DATA__
<html>
<body>
<div>M&amp;amp;M</div>
</body>
</html>
I assumed this was a known problem, but I couldn't find a good workaround.
I found this with Strawberry Perl 5.16.1.

Replies are listed 'Best First'.
Re: XML::Twig parse html trouble
by roboticus (Chancellor) on Feb 20, 2014 at 18:48 UTC

    remiah:

    I tried your code using XML::Twig v3.39, perl v5.14.2 and it looks like it's working properly. For the data section, I used:

    <html><body>
    <div>M&M</div>
    <div>M&amp;M</div>
    <div>M&amp;amp;M</div>
    </body></html>

    and the output is:

    <html>
      <head></head>
      <body>
        <div>M&amp;M</div>
        <div>M&amp;M</div>
        <div>M&amp;amp;M</div>
      </body>
    </html>

    Update: In light of the previous reply, I'm using HTML::TreeBuilder v4.2

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks for the reply, roboticus!

      I forgot to report the module versions and the error message.

      >perl twigtest1.pl
      
      not well-formed (invalid token) at line 2, column 8, byte 34 at C:/stra
      rl/vendor/lib/XML/Parser.pm line 187.
       at twigtest1.pl line 6.
      
      >perl -MXML::Twig -e "print $XML::Twig::VERSION;"
      3.44
      >perl -MHTML::TreeBuilder -e "print $HTML::TreeBuilder::VERSION;"
      5.03
      
      I got the same error with your test data.
      This is very strange to me, because I have used XML::Twig as an HTML parser several times before and never got an error like this. I can't believe that none of that HTML contained "character entity references"...

      I would like to try older versions of the modules, if they are available.
      Regards.
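For comparing results between installations, a small script along these lines might help. It is only a sketch: it assumes XML::Twig and HTML::TreeBuilder are both installed, and the `$report` variable and the inline test string are illustrative, not from the posts above. It wraps parse_html in eval so that the parser's error is captured instead of killing the script, and reports the module versions next to the result.

```perl
#!perl
use strict;
use warnings;
use XML::Twig;
use HTML::TreeBuilder;

# Illustrative test input: the same kind of content as in the original question.
my $html = '<html><body><div>M&amp;M</div></body></html>';

my $twig = XML::Twig->new(pretty_print => 'indented');

# The parser dies on a parse error, so trap it with eval
# and save $@ immediately, before anything else can reset it.
my $ok  = eval { $twig->parse_html($html); 1 };
my $err = $ok ? '' : $@;

# One line that can be pasted into a bug report or a reply.
my $report = sprintf "XML::Twig %s, HTML::TreeBuilder %s: %s",
    XML::Twig->VERSION,
    HTML::TreeBuilder->VERSION,
    $ok ? 'parsed OK' : "parse failed: $err";

print "$report\n";
```

Running the same script on two machines shows right away whether a difference in behaviour comes with a difference in module versions.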

Re: XML::Twig parse html trouble
by remiah (Hermit) on Feb 20, 2014 at 18:40 UTC

    I found a workaround and am reporting it here.

    XML::Twig uses HTML::TreeBuilder to parse HTML by default. In my case, installing HTML::Tidy and enabling the "use_tidy" option solved my problem.
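A minimal sketch of that workaround, assuming HTML::Tidy is installed alongside XML::Twig. The heredoc input and the `$xml` variable are just for illustration. The only change from the failing script is the use_tidy option passed to the constructor.

```perl
#!perl
use strict;
use warnings;
use XML::Twig;

# use_tidy => 1 asks XML::Twig to convert the HTML with HTML::Tidy
# instead of the default HTML::TreeBuilder.
my $twig = XML::Twig->new(
    pretty_print => 'indented',
    use_tidy     => 1,
);

# Illustrative input: the content that triggered the parse error.
my $html = <<'HTML';
<html>
<body>
<div>M&amp;M</div>
</body>
</html>
HTML

# sprint returns the document as a string instead of printing it directly.
my $xml = $twig->parse_html($html)->sprint;
print $xml;
```

Whether this avoids the error may depend on the installed versions of HTML::Tidy and the underlying tidy library, so treat it as something to try rather than a guaranteed fix.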

Re: XML::Twig parse html trouble ( Bug #86633 )
by Anonymous Monk on Feb 21, 2014 at 03:38 UTC

      I looked at this RT ticket and found that it describes the same problem.

      As one commenter there says, "since HTML::Tidy is a bit more of pain to install (or was at least)". I also had problems installing HTML::Tidy and had to force the installation...

      I really hope this bug gets fixed.
      Thanks for the information.