http://www.perlmonks.org?node_id=1006825

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am using Win32::IE::Mechanize to access a web page that is encoded in UTF-8.


However, when I try to access data in the DOM that includes Unicode characters, these are returned as question marks (hex 3F).


Any help would be very much appreciated. Sample code is below:


use strict;
use warnings;
use File::BOM;
use Win32::IE::Mechanize;
use Time::HiRes qw( usleep gettimeofday tv_interval stat );
use utf8;

# create Win32::IE::Mechanize object
my $mech = Win32::IE::Mechanize->new(visible => 1);

# open the URL
$mech->get('http://kr.yahoo.com/');
sleep (10);

# get the DOM document
my $doc = $mech->{agent}->Document;

# get the webpage title
my $title = $doc->title;

# create a utf-8 text file
open DEBUGFILE, ">:via(File::BOM):encoding(UTF-8)", "debug.txt" or die $!;

# write the title to file
print DEBUGFILE "Title:" . $title . "\n";

# write the title length to the file
print DEBUGFILE "Title Length:" . length ($title) . "\n";

# write the hex byte string of the title to the file
print DEBUGFILE "Title Hex Byte String:" . unpack("H48", $title) . "\n";

Code output is:


Title:??! ???
Title Length:7
Title Hex Byte String:3f3f21203f3f3f

Replies are listed 'Best First'.
Re: How to handle UTF-8 content with Win32::IE::Mechanize
by Anonymous Monk on Dec 03, 2012 at 10:07 UTC

    Because I know that Win32::IE::Mechanize uses Win32::OLE to drive IE automation,

    I searched for "IEautomation utf8"

    and found Win32::Watir, which mentions a Win32::OLE code page setting you can change. A look at its source reveals:

    Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);

    So you should set that in your program if you want UTF-8 strings from OLE.
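To see why that option matters: Win32::OLE's default code page is CP_ACP, the system ANSI code page, and converting Hangul to a code page that lacks it yields '?' (0x3F). Below is a minimal sketch of that effect using the core Encode module rather than OLE itself. It assumes a Western ANSI code page such as cp1252 (the thread does not say which one the asker's machine uses), and the helper name to_code_page is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of Win32::OLE): convert a character string
# to bytes in the given encoding. Encode's default behaviour replaces
# characters the target encoding cannot represent with '?'.
sub to_code_page {
    my ($encoding, $chars) = @_;
    return encode($encoding, $chars);
}

# The title from the thread, written with escapes so the encoding of this
# source file does not matter.
my $title = "\x{C57C}\x{D6C4}! \x{CF54}\x{B9AC}\x{C544}";

# cp1252 stands in for the ANSI code page (an assumption): Hangul is lost.
my $ansi = to_code_page('cp1252', $title);
print "cp1252 hex: ", unpack('H*', $ansi), "\n";

# UTF-8 can represent every character, so decoding restores the title.
my $utf8 = to_code_page('UTF-8', $title);
print "UTF-8 round trip intact: ",
    (decode('UTF-8', $utf8) eq $title ? "yes" : "no"), "\n";
```

The first line should show the same 3f3f21203f3f3f the asker saw. In the original script, the Win32::OLE->Option(...) call would presumably go before the Win32::IE::Mechanize object is created, so that every string coming back from IE is converted with the UTF-8 code page.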

Re: How to handle UTF-8 content with Win32::IE::Mechanize
by 2teez (Vicar) on Dec 03, 2012 at 11:25 UTC

    Why not use the open pragma (use open qw(:std :utf8);) or binmode (binmode FILEHANDLE, LAYER)? For example:

    open DEBUGFILE, '>', "debug.txt" or die $!;
    binmode DEBUGFILE, ":encoding(UTF-8)";

    I don't have Win32::IE::Mechanize installed, so I ran the code from your post with WWW::Mechanize instead, like so:
    use WWW::Mechanize;
    ...
    # get the DOM document
    #my $doc = $mech->{agent}->Document;   # commented out
    # get the webpage title
    my $title = $mech->title;              # used instead
    ...
    # create a utf-8 text file
    open DEBUGFILE, '>', "debug.txt" or die $!;
    binmode DEBUGFILE, ":encoding(UTF-8)";
    ...
    OUTPUT:
    Title:야후! 코리아
    Title Length:7
    Title Hex Byte String:c74c120245ca44
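Note that unpack with an H template works on byte-sized values, so applied to a decoded character string it cannot show the UTF-8 encoding of characters above 0xFF. Here is a minimal sketch of dumping the real UTF-8 bytes by encoding first. The name utf8_hex is a hypothetical helper, and the title is hard-coded from the output above rather than fetched from the page.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of a character string.
sub utf8_hex {
    my ($chars) = @_;
    return unpack('H*', encode('UTF-8', $chars));
}

# The page title, written with escapes so the source encoding does not matter.
my $title = "\x{C57C}\x{D6C4}! \x{CF54}\x{B9AC}\x{C544}";

print "Title Length (characters): ", length($title), "\n";
print "Title Length (UTF-8 bytes): ", length(encode('UTF-8', $title)), "\n";
print "Title UTF-8 Hex: ", utf8_hex($title), "\n";
```

Seven characters become 17 bytes here: five Hangul syllables at three bytes each, plus the '!' and the space.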
    Hope this helps

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me

        ... And suggesting an alternative would not be wrong either.
        Moreover, Win32::IE::Mechanize is just like "the mech", but with IE as the user agent. Period...
