Purdy has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) so that I can modify its structure. The HTML is served encoded in UTF-8 and has lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.
However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:
Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé
It gets translated to:
Chicagoland and Northwest Indiana McDonaldâs® Offer a Free Taste of McCafé
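The "â" in the garbled headline is consistent with UTF-8 bytes being read one byte per character: the right single quote (U+2019) is the three bytes E2 80 99 in UTF-8, and byte E2 is "â" in Latin-1. A minimal sketch using only the core Encode module shows the difference; the variable names are illustrative and not taken from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "McDonald’s" (U+2019 is the three bytes E2 80 99).
my $bytes = "McDonald\xE2\x80\x99s";

# Misreading: treat every byte as a separate Latin-1 character.
my $misread = decode('ISO-8859-1', $bytes);

# Intended reading: decode the byte sequence as UTF-8.
my $decoded = decode('UTF-8', $bytes);

# Character 8 is the one right after "McDonald".
printf "misread: %d chars, char 8 is U+%04X\n",
    length($misread), ord(substr($misread, 8, 1));
printf "decoded: %d chars, char 8 is U+%04X\n",
    length($decoded), ord(substr($decoded, 8, 1));
```

The misread string is two characters longer, and its character 8 is U+00E2 ("â") rather than U+2019.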
I've tried to understand UTF-8 and encoding and have tried several variations, but none of them leaves the characters alone. Basically, I'd like to parse the code, make my alterations, and then output it without the UTF-8 characters being re-encoded. This is the code I'm trying to use; by my understanding of the docs, it should not try to encode the characters:
    my $root = HTML::TreeBuilder->new();
    $root->utf8_mode(1);
    $root->attr_encoded(0);
    $root->parse( $html );
That doesn't seem to work, though -- what am I missing?
Thanks!
(1): UTF-8 HTML example
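The reply bodies are not reproduced here, so the following is not necessarily what the thread recommended. It is a minimal sketch of one common approach, assuming the input arrives as raw UTF-8 bytes: decode to Perl characters before parsing, leave utf8_mode off, ask as_HTML to escape only `<`, `>` and `&`, and encode again on output. The sample input string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::TreeBuilder;

# Illustrative input: raw UTF-8 bytes, as they might arrive from the server.
my $bytes = "<p>McDonald\xE2\x80\x99s McCaf\xC3\xA9</p>";

# Decode once at the boundary so the parser sees characters, not bytes.
my $html = decode('UTF-8', $bytes);

my $root = HTML::TreeBuilder->new;   # utf8_mode is left at its default (off)
$root->parse($html);
$root->eof;                          # signal end of input

# ... modify the tree here ...

# Escape only <, > and &; other characters are emitted as-is.
my $out = $root->as_HTML('<>&');
$root->delete;

# Encode once on the way out.
print encode('UTF-8', $out);
```

The design choice is to keep all encoding and decoding at the edges of the program, so that the tree and the serializer only ever handle decoded characters.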
Replies are listed 'Best First'.
Re: Parsing UTF-8 HTML w/ HTML::Parser
by Your Mother (Archbishop) on Jun 23, 2010 at 21:46 UTC
by Purdy (Hermit) on Jun 24, 2010 at 18:27 UTC
by Anonymous Monk on Jun 25, 2010 at 02:40 UTC
by Anonymous Monk on Jun 24, 2010 at 00:15 UTC
by Purdy (Hermit) on Jun 24, 2010 at 18:49 UTC
by ikegami (Patriarch) on Jun 24, 2010 at 18:57 UTC
by Purdy (Hermit) on Jun 24, 2010 at 20:22 UTC
Re: Parsing UTF-8 HTML w/ HTML::Parser
by ikegami (Patriarch) on Jun 23, 2010 at 23:02 UTC