PerlMonks  

Re: Parsing UTF-8 HTML w/ HTML::Parser

by Your Mother (Bishop)
on Jun 23, 2010 at 21:46 UTC ( #846184=note )


in reply to Parsing UTF-8 HTML w/ HTML::Parser

You're double-encoding it as UTF-8. Here's a working snippet to play with:

use warnings;
use strict;
use HTML::TreeBuilder;
use WWW::Mechanize;
use Encode;

my $mech = WWW::Mechanize->new( agent => "iEatYourFaceBot/666" );
$mech->get("http://www.businesswire.com/portal/site/qsr/permalink/?ndmViewId=news_view&newsId=20100622005402");

my $html  = HTML::TreeBuilder->new_from_content($mech->content);
my $title = $html->look_down( sub { $_[0]->tag() eq 'title' } );
print encode_utf8($title->as_text), $/;

__DATA__
yields:
Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago
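For anyone who wants to see the double encoding itself without hitting the network, here is a minimal sketch using only the core Encode module. The sample string and the hex_of helper are made up for illustration; they are not from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: show a string as space-separated hex bytes.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Raw UTF-8 bytes for "McCafé", as they might arrive off the wire.
my $bytes = "McCaf\xC3\xA9";

# Decode once: a character string whose last character is U+00E9.
my $chars = decode_utf8($bytes);

# Encode the character string once: back to the original bytes.
my $once = encode_utf8($chars);

# Encode the still-undecoded bytes: each byte is treated as a Latin-1
# character and encoded again, which is the double encoding.
my $twice = encode_utf8($bytes);

printf "chars: %d\n", length $chars;
printf "once:  %s\n", hex_of($once);
printf "twice: %s\n", hex_of($twice);
```

The "once" line ends in C3 A9, while the "twice" line ends in C3 83 C2 A9, which is what shows up as garbage on a UTF-8 terminal.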

Replies are listed 'Best First'.
Re^2: Parsing UTF-8 HTML w/ HTML::Parser
by Purdy (Hermit) on Jun 24, 2010 at 18:27 UTC

    Thanks, Your Mother! :)

    I used your explanation to get my code fixed and it worked on my development box, but when I rolled the code out to production, it still double-encoded it. For simplicity's sake, I took the example code you provided and ran it on both servers and I get different results. Both servers have the same version of Perl (5.8.8) and the same versions of WWW::Mech (1.22), HTML::TreeBuilder (3.23), HTML::Parser (3.65) and Encode (2.12).

    Development Server:

    $ perl /tmp/test.pl
    Chicagoland and Northwest Indiana McDonald’s® Offer a Free Taste of McCafé at the Taste of Chicago

    Production Server:

    $ perl /tmp/test.pl
    Chicagoland and Northwest Indiana McDonaldâs® Offer a Free Taste of McCafĂ© at the Taste of Chicago

    What am I missing?

    Thanks!

      What am I missing?

      Do you use the open pragma? Is PERLIO set?...
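To show why that question matters: if an output handle already has a UTF-8 layer on it (which the open pragma or a PERLIO setting can arrange), printing an already-encoded string through it encodes a second time. The sketch below uses in-memory filehandles to compare the two cases. Whether this is what differs between the two servers is only a guess; the thread does not confirm it.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# A decoded character string ("McCafé", 6 characters).
my $title = decode_utf8("McCaf\xC3\xA9");

# Handle with no encoding layer: encoding by hand is correct here.
my $raw = '';
open my $plain, '>', \$raw or die "open: $!";
print {$plain} encode_utf8($title);
close $plain;

# Handle with a UTF-8 layer already applied: encoding by hand as well
# means the text gets encoded twice.
my $layered = '';
open my $utf8, '>:encoding(UTF-8)', \$layered or die "open: $!";
print {$utf8} encode_utf8($title);
close $utf8;

printf "no layer:   %d bytes\n", length $raw;
printf "with layer: %d bytes\n", length $layered;
```

With a layer in place, the usual fix is to print the character string directly and drop the explicit encode_utf8.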

Re^2: Parsing UTF-8 HTML w/ HTML::Parser
by Anonymous Monk on Jun 24, 2010 at 00:15 UTC
    Or
    print encode_utf8( $mech->title ),"\n";
Re^2: Parsing UTF-8 HTML w/ HTML::Parser
by Purdy (Hermit) on Jun 24, 2010 at 18:49 UTC

    If it helps, as soon as I do the $mech->get on the production box, I get a warning message. If I run it from the debugger, I see this:

    Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93.
    at /usr/local/share/perl/5.8.8/HTML/TreeBuilder.pm line 93
    HTML::TreeBuilder::new_from_content('HTML::TreeBuilder', '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN...') called at /tmp/update_wire.pl line 133
      Sounds like you didn't decode the HTML before passing it to HTML::TreeBuilder, so decode it. If this were plain LWP (from which WWW::Mechanize derives), I'd say use ->decoded_content instead of ->content.

        That helps, but I still don't understand what's going on. At this point, I have:

        my $root = HTML::TreeBuilder->new_from_content( $mech->response->decoded_content );

        And that's working on the production box ... now I just want to back away slowly and hope it doesn't go all screwy on me again. ;)
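As a rough picture of what the "undecoded UTF-8 will give garbage when decoding entities" warning is about, here is a simplified model using only core Perl. It is not how HTML::Parser works internally; it just shows what happens when undecoded bytes end up in the same string as a real character, such as one produced from an entity like &rsquo;. The sample text is invented.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Undecoded UTF-8 bytes for "Café", as a raw ->content might return them.
my $bytes = "Caf\xC3\xA9";

# What decoding an entity such as &rsquo; produces: a real character.
my $rsquo = chr(0x2019);

# Joining bytes with a wide character makes Perl treat each byte as a
# Latin-1 character, so the two bytes of "é" become two characters.
my $mixed = $bytes . $rsquo;

# Decoding first keeps "é" as the single character it should be.
my $decoded = decode_utf8($bytes) . $rsquo;

printf "undecoded + entity: %d characters, %d bytes when encoded\n",
    length $mixed, length encode_utf8($mixed);
printf "decoded + entity:   %d characters, %d bytes when encoded\n",
    length $decoded, length encode_utf8($decoded);
```

The undecoded version comes out one character and two bytes longer, which is the garbage the warning refers to. Passing decoded_content to the parser avoids the mix in the first place.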
