PerlMonks  

XML::Twig parse html trouble

by remiah (Hermit)
on Feb 20, 2014 at 14:50 UTC (#1075584)
remiah has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks.
Today I found that XML::Twig, used as an HTML parser, fails on content that includes &.
For example, this causes a parse error:
#!perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;
__DATA__
<html>
<body>
<div>M&amp;M</div>
</body>
</html>
And this parses fine and prints the contents:
#!perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse_html( join('', <DATA>) )->print;
__DATA__
<html>
<body>
<div>M&amp;amp;M</div>
</body>
</html>
I assumed this was a known problem, but I couldn't find a good workaround.
I found this with Strawberry Perl 5.16.1.

Re: XML::Twig parse html trouble
by remiah (Hermit) on Feb 20, 2014 at 18:40 UTC

    I found a workaround and am reporting it here.

    XML::Twig uses HTML::TreeBuilder to parse HTML by default. In my case, installing HTML::Tidy and enabling the "use_tidy" option solved the problem.
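
A minimal sketch of the workaround described above, assuming both XML::Twig and HTML::Tidy are installed. The helper name html_to_xml_with_tidy is hypothetical; the only change from the failing script is the use_tidy option, which asks XML::Twig to convert the HTML with HTML::Tidy instead of HTML::TreeBuilder.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: convert HTML to indented XML via XML::Twig + HTML::Tidy.
# Returns ($xml, $error); $xml is undef if a module is missing or parsing fails.
sub html_to_xml_with_tidy {
    my ($html) = @_;
    my $xml = eval {
        require XML::Twig;
        my $twig = XML::Twig->new(
            pretty_print => 'indented',
            use_tidy     => 1,    # use HTML::Tidy rather than HTML::TreeBuilder
        );
        $twig->parse_html($html);
        $twig->sprint;            # serialize the whole document to a string
    };
    return ($xml, $@);
}

my ($xml, $error) = html_to_xml_with_tidy(
    '<html><body><div>M&amp;M</div></body></html>'
);
print defined $xml ? $xml : "conversion failed: $error";
```

Wrapping the call in eval means a missing HTML::Tidy shows up as an error message rather than killing the script, which is handy when the module is hard to install.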

Re: XML::Twig parse html trouble
by roboticus (Chancellor) on Feb 20, 2014 at 18:48 UTC

    remiah:

    I tried your code using XML::Twig v3.39, perl v5.14.2 and it looks like it's working properly. For the data section, I used:

    <html><body> <div>M&M</div> <div>M&amp;M</div> <div>M&amp;amp;M</div> </body></html>

    and the output is:

    <html> <head></head> <body> <div>M&amp;M</div> <div>M&amp;M</div> <div>M&amp;amp;M</div> </body> </html>

    Update: In light of the previous reply, I'm using HTML::TreeBuilder v4.2

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.
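
One way to see which of the three variants triggers the error is to parse each one separately under eval, so that a failure on one input does not stop the others. This is only a sketch, assuming XML::Twig is installed; try_parse_html is a hypothetical helper name, and the results will depend on the installed XML::Twig and HTML::TreeBuilder versions.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: parse one HTML string with XML::Twig's default converter.
# Returns ($xml, $error); $xml is undef when parsing dies.
sub try_parse_html {
    my ($html) = @_;
    my $xml = eval {
        require XML::Twig;
        XML::Twig->new->parse_html($html)->sprint;
    };
    return ($xml, $@);
}

for my $text ('M&M', 'M&amp;M', 'M&amp;amp;M') {
    my ($xml, $error) = try_parse_html(
        "<html><body><div>$text</div></body></html>"
    );
    if (defined $xml) {
        print "ok:     $text\n";
    }
    else {
        $error =~ s/\s+\z//;    # trim the trailing newline from the die message
        print "failed: $text ($error)\n";
    }
}
```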

      Thanks for the reply, roboticus!

      I forgot to report the module versions and the error message.

      >perl twigtest1.pl
      
      not well-formed (invalid token) at line 2, column 8, byte 34 at C:/stra
      rl/vendor/lib/XML/Parser.pm line 187.
       at twigtest1.pl line 6.
      
      >perl -MXML::Twig -e "print $XML::Twig::VERSION;"
      3.44
      >perl -MHTML::TreeBuilder -e "print $HTML::TreeBuilder::VERSION;"
      5.03
      
      I get the same error with your test data.
      This is very strange to me, because I have used XML::Twig as an HTML parser several times without ever getting an error like this. I can't believe there were no character entity references in the HTML I parsed back then...

      I would like to try whichever older versions of the modules are available.
      Regards.
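
Since the behaviour seems to depend on module versions, it can help to collect them all in one go rather than with separate one-liners. A small sketch: module_version is a hypothetical helper, and the list of modules is an assumption about which ones are relevant here.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: return a module's installed version,
# or undef if the module cannot be loaded (or has no $VERSION).
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::Twig -> XML/Twig.pm
    return scalar eval { require $file; $module->VERSION };
}

for my $module (qw(XML::Twig XML::Parser HTML::TreeBuilder HTML::Tidy)) {
    my $version = module_version($module);
    printf "%-18s %s\n", $module, defined $version ? $version : 'not available';
}
printf "%-18s %vd\n", 'perl', $^V;
```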

Re: XML::Twig parse html trouble ( Bug #86633 )
by Anonymous Monk on Feb 21, 2014 at 03:38 UTC

      I looked at this RT ticket and it describes the same problem.

      As one commenter there says, "since HTML::Tidy is a bit more of pain to install (or was at least)": I also had problems installing HTML::Tidy and had to force the installation...

      I really hope this bug gets fixed.
      Thanks for the information.
