Parsing UTF-8 HTML w/ HTML::Parser
by Purdy (Hermit) on Jun 23, 2010 at 21:02 UTC
Purdy has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) in order to make some modifications to its structure. The HTML is served encoded in UTF-8 and contains lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.
However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:
Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé
It gets translated to:
Chicagoland and Northwest Indiana McDonaldâsÂ® Offer a Free Taste of McCafÃ©
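That garbling looks like what you get when UTF-8 bytes are treated as Latin-1 characters and then encoded to UTF-8 a second time. Here is a minimal sketch of that effect using only the core Encode module; the variable names and the sample string are made up for illustration, and this is not necessarily what happens inside the parsing code in question:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample input as raw UTF-8 bytes, the way it arrives from a file or socket.
# "\xC3\xA9" is the two-byte UTF-8 encoding of e-acute.
my $bytes = "McCaf\xC3\xA9";

# Mistake: the undecoded bytes are treated as characters and encoded again.
# Each byte becomes its own character: \xC3 -> \xC3\x83, \xA9 -> \xC2\xA9.
my $double = encode('UTF-8', $bytes);

# Alternative: decode once on input, work with characters, encode once on output.
my $chars = decode('UTF-8', $bytes);   # six characters, the last is U+00E9
my $out   = encode('UTF-8', $chars);   # same bytes as the original input
```

Viewed as UTF-8, `$double` displays with the same two-character garbage in place of the accented e as the headline above, while `$out` matches the input byte for byte. HTML::Parser also documents a `utf8_mode` option for parsing undecoded UTF-8; whether that or an explicit decode/encode pair is the right fit depends on the rest of the code.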
I've tried to understand UTF-8 and encoding and tried several variations, but none of them leaves the characters alone. Basically, I'd like to parse the code, make my alterations, and then output it without anything trying to re-encode the UTF-8 characters. This is the code I'm trying to use; by my understanding of the docs, it should not try to encode the characters:
That doesn't seem to work, though -- what am I missing?
(1): UTF-8 HTML example