http://www.perlmonks.org?node_id=846175

Purdy has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) in order to make some modifications to its structure. The HTML is served encoded in UTF-8 and has lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.

However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:

Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé

It gets translated to:

Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé

I've tried to understand UTF-8 and encoding and tried several variations, but none of them leave the characters alone. Basically, I'd like to parse the code, make my alterations and then output it without it trying to encode the UTF-8 characters. This is the code I'm trying to use and, by my understanding of the docs, it should not try to encode the characters:

my $root = HTML::TreeBuilder->new();
$root->utf8_mode(1);
$root->attr_encoded(0);
$root->parse( $html );

That doesn't seem to work, though -- what am I missing?

Thanks!

(1): UTF-8 HTML example

Re: Parsing UTF-8 HTML w/ HTML::Parser
by Your Mother (Archbishop) on Jun 23, 2010 at 21:46 UTC

    You're double encoding it as utf8. Here's a working snippet to play with-

    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use WWW::Mechanize;
    use Encode;

    my $mech = WWW::Mechanize->new( agent => "iEatYourFaceBot/666" );
    $mech->get("http://www.businesswire.com/portal/site/qsr/permalink/?ndmViewId=news_view&newsId=20100622005402");

    my $html  = HTML::TreeBuilder->new_from_content($mech->content);
    my $title = $html->look_down( sub { $_[0]->tag() eq 'title' } );
    print encode_utf8($title->as_text), $/;

    __DATA__
    yields: Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago
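For anyone who wants to see what "double encoding" looks like without fetching a page, here is a minimal sketch using only the core Encode module. The sample string is an assumption made for illustration (the UTF-8 bytes for "McCafé"); it is not taken from the page above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "McCafé" as it might arrive over the wire: UTF-8 bytes, not yet decoded.
# (Illustrative input, assumed for this sketch.)
my $bytes = "McCaf\xC3\xA9";

# Correct round trip: decode once on input, encode once on output.
my $chars = decode('UTF-8', $bytes);
my $good  = encode('UTF-8', $chars);

# Double encoding: the undecoded bytes are treated as Latin-1 characters
# and each high byte is encoded a second time.
my $bad = encode('UTF-8', $bytes);

printf "input bytes:   %d\n", length $bytes;   # 7
printf "decoded chars: %d\n", length $chars;   # 6
printf "encoded once:  %d\n", length $good;    # 7
printf "encoded twice: %d\n", length $bad;     # 9
```

The two extra bytes in the last line are what show up on screen as stray "Ã"-style characters.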

      Thanks, Your Mother! :)

      I used your explanation to get my code fixed and it worked on my development box, but when I rolled the code out to production, it still double-encoded it. For simplicity's sake, I took the example code you provided and ran it on both servers and I get different results. Both servers have the same version of Perl (5.8.8) and the same versions of WWW::Mech (1.22), HTML::TreeBuilder (3.23), HTML::Parser (3.65) and Encode (2.12).

      Development Server:

      $ perl /tmp/test.pl
      Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago

      Production Server:

      $ perl /tmp/test.pl
      Chicagoland and Northwest Indiana McDonaldâs® Offer a Free Taste of McCafĂ© at the Taste of Chicago

      What am I missing?

      Thanks!

        What am I missing?

        Do you use the open pragma? Is PERLIO set in the environment?...
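One way to compare the two boxes is to print the I/O settings each one actually runs with. The following is a rough diagnostic sketch, not a definitive checklist; `io_report` is a hypothetical helper name, and the set of things it inspects is an assumption about what is likely to differ.

```perl
use strict;
use warnings;

# Collect I/O-related settings that can differ between two machines.
# io_report is a made-up helper for this sketch.
sub io_report {
    return (
        # PerlIO layers currently pushed on STDOUT, e.g. "unix perlio"
        stdout_layers => join(' ', PerlIO::get_layers(*STDOUT)),
        # Environment variables that change default layers / Unicode handling
        PERLIO        => defined $ENV{PERLIO}       ? $ENV{PERLIO}       : '(unset)',
        PERL_UNICODE  => defined $ENV{PERL_UNICODE} ? $ENV{PERL_UNICODE} : '(unset)',
        # Numeric flags set by -C or PERL_UNICODE
        unicode_flags => ${^UNICODE},
    );
}

my %report = io_report();
print "$_: $report{$_}\n" for sort keys %report;
```

Run it on both servers and diff the output; an extra `utf8` or `encoding(...)` layer on one side would explain output that is encoded once more than expected.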

      Or
      print encode_utf8( $mech->title ),"\n";

      If it helps, as soon as I do the $mech->get on the production box, I get a warning message. If I run it from the debugger, I see this:

      Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93.
      at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93
      HTML::TreeBuilder::new_from_content('HTML::TreeBuilder', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN...') called at /tmp/update_wire.pl line 133
        Sounds like you didn't decode the HTML before passing it to HTML::TreeBuilder, so decode it. If this were LWP (which WWW::Mechanize derives from), I'd say use ->decoded_content instead of ->content.
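To illustrate why that warning matters, here is a small sketch using only Encode. It imitates what happens when undecoded UTF-8 bytes end up in the same string as a character the parser produced from an entity. The input bytes and the choice of `&trade;` are assumptions made for the example; no HTML::TreeBuilder calls are involved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: UTF-8 bytes for "McCafé", as fetched but not decoded.
my $raw_bytes = "McCaf\xC3\xA9";

# What an entity such as "&trade;" becomes once a parser decodes it:
# the single character U+2122.
my $entity = "\x{2122}";

# Undecoded text joined with a decoded entity: Perl treats each raw byte
# as a Latin-1 character, so the "é" is now two wrong characters.
my $mixed = $raw_bytes . $entity;

# Decoding first keeps "é" as one character.
my $clean = decode('UTF-8', $raw_bytes) . $entity;

printf "mixed: %d chars, %d bytes when encoded\n",
    length $mixed, length encode('UTF-8', $mixed);   # 8 chars, 12 bytes
printf "clean: %d chars, %d bytes when encoded\n",
    length $clean, length encode('UTF-8', $clean);   # 7 chars, 10 bytes
```

The "mixed" string is the garbage the warning refers to; decoding the input before parsing avoids it.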
Re: Parsing UTF-8 HTML w/ HTML::Parser
by ikegami (Patriarch) on Jun 23, 2010 at 23:02 UTC

    If you use $root->utf8_mode(0); (the default) and you pass decoded text to parse, you'll get decoded text from HTML::Element. When outputting the new HTML, encode it as normal.

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # my $decoded_html = $http_response->decoded_content();
    my $decoded_html = <<"__EOI__";
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Foo</title>
    </head>
    <body>\xC9ric</body>
    </html>
    __EOI__

    my $t = HTML::TreeBuilder->new();
    $t->parse($decoded_html);
    $t->eof();

    my $val = ( $t->content_list() )[1]->as_text();

    binmode STDOUT, ":encoding(UTF-8)";
    print(<<"__EOI__");
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Foo</title>
    </head>
    <body>Extracted $val</body>
    </html>
    __EOI__