perlquestion
Purdy
<p>I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) so that I can modify its structure. The HTML is served as UTF-8 and contains lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.</p>
<p>However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:</p>
<p><b>Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé</b></p>
<p>It gets translated to:</p>
<p><b>Chicagoland and Northwest Indiana McDonald’s<sup>®</sup> Offer a Free Taste of McCafé</b></p>
<p>I've tried to understand UTF-8 and encoding and tried several variations, but none of them leave the characters alone. Basically, I'd like to parse the code, make my alterations, and then output it without the UTF-8 characters being re-encoded. This is the code I'm trying to use; as I understand the docs, it should not try to encode the characters:</p>
<code>
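use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new();
$root->utf8_mode(1);     # input is raw, undecoded UTF-8 octets
$root->attr_encoded(0);  # decode entities in attribute values (the HTML::Parser default)
$root->parse( $html );   # $html holds the page source
$root->eof();            # signal end of input so the tree is completed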
</code>
<p>That doesn't seem to work, though -- what am I missing?</p>
<p><b>Thanks!</b></p>
<p><small>(1): [http://www.businesswire.com/portal/site/qsr/permalink/?ndmViewId=news_view&newsId=20100622005402&newsLang=en|UTF-8 HTML example]</small></p>