Parsing UTF-8 HTML w/ HTML::Parser

Purdy has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) in order to modify its structure. The HTML is served encoded in UTF-8 and has lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.

However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:

Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé

It gets translated to:

Chicagoland and Northwest Indiana McDonaldâ€™s^Â® Offer a Free Taste of McCafÃ©

I've tried to understand UTF-8 and encoding and tried several variations, but they don't seem to leave it alone. Basically, I'd like to parse the code, do my alterations and then output it without it trying to encode the UTF-8 characters. This is the code I'm trying to use and with my understanding of the docs, it should not try to encode the characters:

my $root = HTML::TreeBuilder->new();
$root->utf8_mode(1);
$root->attr_encoded(0);
$root->parse( $html );

That doesn't seem to work, though -- what am I missing?

Thanks!

(1): UTF-8 HTML example


Replies are listed 'Best First'.
Re: Parsing UTF-8 HTML w/ HTML::Parser by Your Mother (Archbishop) on Jun 23, 2010 at 21:46 UTC
You're double encoding it as UTF-8. Here's a working snippet to play with:

use warnings;
use strict;
use HTML::TreeBuilder;
use WWW::Mechanize;
use Encode;

my $mech = WWW::Mechanize->new( agent => "iEatYourFaceBot/666" );
$mech->get("http://www.businesswire.com/portal/site/qsr/permalink/?ndmViewId=news_view&newsId=20100622005402");

my $html = HTML::TreeBuilder->new_from_content($mech->content);
my $title = $html->look_down( sub{ $_[0]->tag() eq 'title' } );
print encode_utf8($title->as_text), $/;

__DATA__
yields:
Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago
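For readers unsure what "double encoding" means at the byte level, here is a minimal sketch using only the core Encode module. The sample string and variable names are illustrative assumptions, not taken from the page being parsed. It shows that encoding an already-encoded string expands each UTF-8 byte again, which is what produces sequences like "Ã©".

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A decoded (character) string: the last character is U+00E9 (e-acute).
my $chars = "caf\x{e9}";

# Encoding once gives the correct UTF-8 bytes: 63 61 66 C3 A9.
my $once = encode_utf8($chars);

# Encoding again treats each byte as a Latin-1 character, so C3 and A9
# are each expanded to two bytes. Viewed as UTF-8, that is the mojibake.
my $twice = encode_utf8($once);

print "once:  ", unpack('H*', $once),  "\n";
print "twice: ", unpack('H*', $twice), "\n";
```

The fix is to make sure the string is encoded exactly once on the way out, wherever that happens.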
Re^2: Parsing UTF-8 HTML w/ HTML::Parser by Purdy (Hermit) on Jun 24, 2010 at 18:27 UTC
Thanks, Your Mother! :) I used your explanation to get my code fixed and it worked on my development box, but when I rolled the code out to production, it still double-encoded it.

For simplicity's sake, I took the example code you provided and ran it on both servers and I get different results. Both servers have the same version of Perl (5.8.8) and the same versions of WWW::Mech (1.22), HTML::TreeBuilder (3.23), HTML::Parser (3.65) and Encode (2.12).

Development Server:

$ perl /tmp/test.pl
Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago

Production Server:

$ perl /tmp/test.pl
Chicagoland and Northwest Indiana McDonaldâsÂ® Offer a Free Taste of McCafÃ© at the Taste of Chicago

What am I missing? Thanks!
Re^3: Parsing UTF-8 HTML w/ HTML::Parser by Anonymous Monk on Jun 25, 2010 at 02:40 UTC
"What am I missing?"

Do you use open? Is PERLIO set?...
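One way an I/O layer could explain identical code behaving differently on two machines is sketched below. This is only a hypothesis about the production box, not a diagnosis. The sketch writes the same explicitly encoded string to two in-memory handles (the handle and variable names are illustrative): one raw, and one carrying an encoding layer of the kind the `open` pragma can install.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $text = "caf\x{e9}";    # decoded character string

# Two in-memory "files": one without an encoding layer, one with.
my ($raw, $layered) = ('', '');
open my $fh_raw,  '>:raw',             \$raw     or die "open: $!";
open my $fh_utf8, '>:encoding(UTF-8)', \$layered or die "open: $!";

print {$fh_raw}  encode_utf8($text);   # encoded once, written as-is
print {$fh_utf8} encode_utf8($text);   # encoded here, then again by the layer
close $fh_raw;
close $fh_utf8;

print "raw handle:     ", unpack('H*', $raw),     "\n";
print "layered handle: ", unpack('H*', $layered), "\n";
```

If STDOUT on one server already has such a layer, calling `encode_utf8` before printing would double encode there and nowhere else.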
Re^2: Parsing UTF-8 HTML w/ HTML::Parser by Anonymous Monk on Jun 24, 2010 at 00:15 UTC
Or `print encode_utf8( $mech->title ),"\n";`
Re^2: Parsing UTF-8 HTML w/ HTML::Parser by Purdy (Hermit) on Jun 24, 2010 at 18:49 UTC
If it helps, as soon as I do the `$mech->get` on the production box, I get a warning message. If I run it from the debugger, I see this:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93.
 at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93
	HTML::TreeBuilder::new_from_content('HTML::TreeBuilder', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN...') called at /tmp/update_wire.pl line 133
Re^3: Parsing UTF-8 HTML w/ HTML::Parser by ikegami (Patriarch) on Jun 24, 2010 at 18:57 UTC
Sounds like you didn't decode the HTML before passing it to HTML::TreeBuilder, so decode it. If this was LWP (from which WWW::Mechanize derives), I'd say use `->decoded_content` instead of `->content`.
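A minimal sketch of "decode it", assuming the raw response body is known to be UTF-8. The byte string below stands in for what `->content` might return and is made up for illustration. For a UTF-8 page, `->decoded_content` should hand back an equivalent character string, and that is what gets passed to HTML::TreeBuilder.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive over HTTP: "McCafe" with an e-acute,
# which UTF-8 stores as the two bytes C3 A9.
my $bytes = "McCaf\xC3\xA9";

# Decode once at the input boundary. FB_CROAK dies on malformed input;
# LEAVE_SRC stops decode() from modifying $bytes in place.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

print length($bytes), " bytes\n";        # 7
print length($chars), " characters\n";   # 6

# $chars is the decoded text to give to the parser.
```

Decoding at the point of input and encoding at the point of output keeps everything in between working on characters rather than bytes.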
Re^4: Parsing UTF-8 HTML w/ HTML::Parser by Purdy (Hermit) on Jun 24, 2010 at 20:22 UTC
Re: Parsing UTF-8 HTML w/ HTML::Parser by ikegami (Patriarch) on Jun 23, 2010 at 23:02 UTC
If you use `$root->utf8_mode(0);` (the default) and you pass decoded text to `parse`, you'll get decoded text from HTML::Element. When outputting the new HTML, encode it as normal.

use strict;
use warnings;
use HTML::TreeBuilder;

# my $decoded_html = $http_response->decoded_content();
my $decoded_html = <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Foo</title>
</head>
<body>\xC9ric</body>
</html>
__EOI__

my $t = HTML::TreeBuilder->new();
$t->parse($decoded_html);
$t->eof();

my $val = ( $t->content_list() )[1]->as_text();

binmode STDOUT, ":encoding(UTF-8)";
print(<<"__EOI__");
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Foo</title>
</head>
<body>Extracted $val</body>
</html>
__EOI__
