PerlMonks  

Parsing UTF-8 HTML w/ HTML::Parser

by Purdy (Hermit)
on Jun 23, 2010 at 21:02 UTC (#846175)

Purdy has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use HTML::TreeBuilder to parse some complex HTML(1) in order to modify its structure. The HTML is served encoded in UTF-8 and has lots of non-ASCII characters, such as em dashes, trademark symbols, smart quotes, etc.

However, every time I parse the data, the resulting HTML code has encoded the entities incorrectly. Just to pick a piece of the headline:

Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé

It gets translated to:

Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé

I've tried to understand UTF-8 and encoding and tried several variations, but none of them leaves the characters alone. Basically, I'd like to parse the code, do my alterations and then output it without it trying to encode the UTF-8 characters. This is the code I'm trying to use, and as I understand the docs, it should not try to encode the characters:

    my $root = HTML::TreeBuilder->new();
    $root->utf8_mode(1);
    $root->attr_encoded(0);
    $root->parse( $html );

That doesn't seem to work, though -- what am I missing?

Thanks!

(1): UTF-8 HTML example

Replies are listed 'Best First'.
Re: Parsing UTF-8 HTML w/ HTML::Parser
by Your Mother (Bishop) on Jun 23, 2010 at 21:46 UTC

    You're double encoding it as utf8. Here's a working snippet to play with-

    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use WWW::Mechanize;
    use Encode;

    my $mech = WWW::Mechanize->new( agent => "iEatYourFaceBot/666" );
    $mech->get("http://www.businesswire.com/portal/site/qsr/permalink/?ndmViewId=news_view&newsId=20100622005402");

    my $html  = HTML::TreeBuilder->new_from_content($mech->content);
    my $title = $html->look_down( sub{ $_[0]->tag() eq 'title' } );
    print encode_utf8($title->as_text), $/;

    __DATA__
    yields:
    Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago
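For anyone wondering what "double encoding" looks like at the byte level, here is a minimal sketch that needs only the core Encode module and no network access. The sample string and variable names are made up for illustration; it is not the code from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "McCafé" as a decoded character string: 6 characters.
my $chars = "McCaf\x{E9}";

# Encoding once gives the UTF-8 bytes: the é becomes 0xC3 0xA9 (7 bytes).
my $once = encode('UTF-8', $chars);

# Encoding those bytes again treats 0xC3 and 0xA9 as two separate
# characters and encodes each one, so the é now takes four bytes.
my $twice = encode('UTF-8', $once);

printf "chars: %d  once: %d bytes  twice: %d bytes\n",
    length($chars), length($once), length($twice);

# Decoding the double-encoded bytes once gives back the single-encoded
# bytes, not the original characters.
my $undone = decode('UTF-8', $twice);
print "one decode undoes one layer of encoding\n" if $undone eq $once;
```

The point is that each string should be decoded exactly once on the way in and encoded exactly once on the way out. Any extra encode step in between produces the longer, garbled byte sequence.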

      Thanks, Your Mother! :)

      I used your explanation to get my code fixed and it worked on my development box, but when I rolled the code out to production, it still double-encoded it. For simplicity's sake, I took the example code you provided and ran it on both servers and I get different results. Both servers have the same version of Perl (5.8.8) and the same versions of WWW::Mechanize (1.22), HTML::TreeBuilder (3.23), HTML::Parser (3.65) and Encode (2.12).

      Development Server:

      $ perl /tmp/test.pl
      Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago

      Production Server:

      $ perl /tmp/test.pl
      Chicagoland and Northwest Indiana McDonaldâs® Offer a Free Taste of McCafĂ© at the Taste of Chicago

      What am I missing?

      Thanks!

        What am I missing?

        Do you use the open pragma, or set PERLIO?...
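One way to follow up on that question is to run the same small diagnostic on both servers and compare the results. This is only a sketch: it reports two environment variables that influence Perl's default I/O layers and the layers actually applied to STDOUT, and it assumes nothing about the machines in this thread.

```perl
use strict;
use warnings;

# Environment variables that can change Perl's default I/O layers.
for my $name (qw(PERLIO PERL_UNICODE)) {
    printf "%-12s = %s\n", $name,
        defined $ENV{$name} ? $ENV{$name} : '(unset)';
}

# Layers currently applied to STDOUT on the output side.
my @layers = PerlIO::get_layers(\*STDOUT, output => 1);
print "STDOUT layers: @layers\n";

# ${^UNICODE} reflects the -C switch or the PERL_UNICODE setting.
print "\${^UNICODE} = ${^UNICODE}\n";
```

If the two boxes print different layers (for example, one shows a utf8 or encoding layer and the other does not), that would explain why identical code produces different bytes.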

      Or
      print encode_utf8( $mech->title ),"\n";

      If it helps, as soon as I do the $mech->get on the production box, I get a warning message. If I run it from the debugger, I see this:

      Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93.
      at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93
      HTML::TreeBuilder::new_from_content('HTML::TreeBuilder', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN...') called at /tmp/update_wire.pl line 133
        Sounds like you didn't decode the HTML before passing it to HTML::TreeBuilder, so decode it. If this were plain LWP (from which WWW::Mechanize derives), I'd say use ->decoded_content instead of ->content.
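To illustrate "decode it" without depending on the network, here is a small sketch using only the core Encode module. The byte string is a hypothetical stand-in for a raw response body. With LWP, ->decoded_content performs this step for you, using the charset declared by the response.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the raw bytes of an HTTP response body:
# "McCafé" encoded as UTF-8, so the é is the two bytes 0xC3 0xA9.
my $raw_bytes = "<title>McCaf\xC3\xA9</title>";

# Decode once, at the input boundary, into a character string.
# FB_CROAK makes malformed input die instead of being silently replaced.
# With a check value set, Encode may modify its input, so pass a copy.
my $copy = $raw_bytes;
my $html = decode('UTF-8', $copy, Encode::FB_CROAK);

printf "raw: %d bytes, decoded: %d characters\n",
    length($raw_bytes), length($html);
```

The decoded string in $html is what should be handed to the parser. The two bytes of the é have become the single character U+00E9, which is why the decoded length is one less than the raw length.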
Re: Parsing UTF-8 HTML w/ HTML::Parser
by ikegami (Pope) on Jun 23, 2010 at 23:02 UTC

    If you use $root->utf8_mode(0); (the default) and you pass decoded text to parse, you'll get decoded text from HTML::Element. When outputting the new HTML, encode it as normal.

    use strict;
    use warnings;

    use HTML::TreeBuilder;

    # my $decoded_html = $http_response->decoded_content();
    my $decoded_html = <<"__EOI__";
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Foo</title>
    </head>
    <body>\xC9ric</body>
    </html>
    __EOI__

    my $t = HTML::TreeBuilder->new();
    $t->parse($decoded_html);
    $t->eof();

    my $val = ( $t->content_list() )[1]->as_text();

    binmode STDOUT, ":encoding(UTF-8)";
    print(<<"__EOI__");
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Foo</title>
    </head>
    <body>Extracted $val</body>
    </html>
    __EOI__
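The "encode it as normal" step can be checked in isolation. The sketch below is an illustration rather than part of the reply above: it writes a decoded string through an :encoding(UTF-8) layer into an in-memory buffer, so the resulting bytes can be inspected without a terminal getting in the way. The buffer and handle names are made up.

```perl
use strict;
use warnings;

# A decoded character string, such as as_text() returns when the parser
# was fed decoded input. \x{C9} is the character É.
my $val = "\x{C9}ric";

# Write through an :encoding(UTF-8) layer into an in-memory buffer so
# the bytes that would reach the outside world can be examined.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} "Extracted $val";
close $out or die "close failed: $!";

# É is one character going in and two bytes (0xC3 0x89) coming out.
printf "%d bytes: %s\n", length($buffer),
    join ' ', map { sprintf '%02X', ord } split //, $buffer;
```

The same layer on STDOUT (via binmode, as in the reply) does the encoding once, at the output boundary, so no explicit encode_utf8 call is needed on top of it.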
