PerlMonks  

www:mechanize mangles unicode

by red0hat (Initiate)
on Apr 28, 2010 at 20:37 UTC ( [id://837386]=perlquestion )

red0hat has asked for the wisdom of the Perl Monks concerning the following question:

I've got a problem that is melting my brain a little.

I want to read something from a database, then post it in an HTML form. It worked fine until a user decided to throw in some accented characters. Each accented character is being posted to the website as 2 characters.

example: Château becomes ChÃ¢teau.

The website uses iso-8859-15. So, I'm inclined to believe this is an encoding issue. The rest of the code is working as expected.

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows Mozilla');

$title ="Château";

$result = $mech->get($WineURL.$ID);
die "GET failed\n" unless $result->is_success;

$mech->field('frmFieldName[title]', $title);
$result = $mech->submit;
print $mech->value('frmFieldName[title]');

Output = ChÃ¢teau

Thanks in advance,

ChrisP.

Replies are listed 'Best First'.
Re: www:mechanize mangles unicode
by ikegami (Patriarch) on Apr 28, 2010 at 22:14 UTC

    Let's start with a server-side script

    #!/usr/bin/perl

    use strict;
    use warnings;

    use CGI;
    use Encode         qw( decode );
    use HTML::Entities qw( encode_entities );

    my $cgi = CGI->new();
    my $val = $cgi->param('key');
    use Devel::Peek; Dump($val);
    $val = decode('iso-8859-15', $val) if defined($val);

    print $cgi->header('text/html; charset=iso-8859-15');
    binmode STDOUT, ':encoding(iso-8859-15)';

    my $val_initializer = (
        defined($val)
            ? sprintf(' value="%s"', encode_entities($val, '<>&"'))
            : ''
    );

    print(<<"__EOI__");
    <title>Test</title>
    <form method="POST">
    <input type="text" name="key"$val_initializer>
    <input type="submit">
    </form>
    __EOI__

    Let's make sure it works:

    $ perl -e'print <<"__EOI__";
    POST /zzz.cgi HTTP/1.0
    Host: www.example.com
    Content-Length: 11

    key=Ch\xE2teau
    __EOI__
    ' | nc www.example.com 80 | od -c
    00000 H T T P / 1 . 1 2 0 0 O K \r
    00020 \n D a t e : W e d , 2 8 A
    00040 p r 2 0 1 0 2 2 : 1 0 : 1 4
    00060 G M T \r \n S e r v e r : A p
    00100 a c h e \r \n V a r y : A c c e
    00120 p t - E n c o d i n g \r \n C o n
    00140 t e n t - L e n g t h : 1 1 8
    00160 \r \n C o n n e c t i o n : c l
    00200 o s e \r \n C o n t e n t - T y p
    00220 e : t e x t / h t m l ; c h
    00240 a r s e t = i s o - 8 8 5 9 - 1
    00260 5 \r \n \r \n < t i t l e > T e s t
    00300 < / t i t l e > \n < f o r m m
    00320 e t h o d = " P O S T " > \n < i
    00340 n p u t t y p e = " t e x t "
    00360 n a m e = " k e y " v a l u
    00400 e = " C h 342 t e a u " > \n < i n
    00420 p u t t y p e = " s u b m i t
    00440 " > \n < / f o r m > \n
    00453

    Yup. Now let's test WWW::Mechanize.

    use strict;
    use warnings;
    use open ':std', ':locale';

    use charnames ':full';

    use Encode         qw( encode );
    use WWW::Mechanize qw( );

    # Avoiding script encoding issues.
    my $val = "Ch\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}teau";

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.server.com/zzz.cgi');
    $mech->field('key', $val);
    $mech->submit();

    #print($mech->value('key'), "\n");
    use Devel::Peek qw( Dump );
    Dump($mech->value('key'));

    Hum, I get:

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x11572c0 "Ch\303\203\302\242teau"\0 [UTF8 "Ch\x{c3}\x{a2}teau"]
      CUR = 10
      LEN = 16

    But I expect:

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x115bcf0 "Ch\303\242teau"\0 [UTF8 "Ch\x{e2}teau"]
      CUR = 8
      LEN = 16

    or the equivalent

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      PV = 0x115bcf0 "Ch\342teau"\0
      CUR = 7
      LEN = 16

    Some debugging shows the server side is receiving the following:

    "Ch\303\242teau"

    That's the UTF-8 encoding of the value, so the problem is getting the right data to the server. Ok, fine, maybe WWW::Mechanize stupidly sends the internal storage data of the string. The solution would be to encode the inputs yourself as follows:

    #$mech->field('key', $val);
    $mech->field('key', encode('iso-8859-15', $val));
    $mech->submit();

    But even with the change, the client side script is still sending the following to the server:

    "Ch\303\242teau"

    That's the UTF-8 encoding of the result of encode('iso-8859-15', $val). Does WWW::Mechanize assume the server expects UTF-8 rather than the page's encoding?
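
For anyone wanting to reproduce the octets seen so far without a server, here is a minimal sketch using only the core Encode module. It rests on an assumption about the round trip that the thread does not confirm step by step: the text is encoded to UTF-8 on the way out, and the server then decodes those octets as iso-8859-15. `hex_octets` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $val = "Ch\x{E2}teau";    # decoded text, 7 characters

# What an iso-8859-15 server wants to receive.
my $latin9 = encode('iso-8859-15', $val);

# What was actually sent: the UTF-8 encoding of the same text.
my $utf8 = encode('UTF-8', $val);

# Assumption: the server decodes those octets as iso-8859-15, so each
# UTF-8 octet becomes a character of its own.
my $mangled = decode('iso-8859-15', $utf8);

# UTF-8 form of the mangled text, comparable to the PV shown by Devel::Peek.
my $double = encode('UTF-8', $mangled);

printf "%-12s %s\n", 'iso-8859-15:', hex_octets($latin9);
printf "%-12s %s\n", 'UTF-8:',       hex_octets($utf8);
printf "%-12s %s\n", 'mangled:',     hex_octets($double);
```

The last line should correspond to the "Ch\303\203\302\242teau" in the unexpected dump, written in hex instead of octal.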

    It's all I have time for right now.

    • WWW-Mechanize-1.62
    • libwww-perl-5.834

      Found the bug.

      For starters, everything works fine if the server sends

      <form method="POST" accept-charset="iso-8859-15">

      HTML::Form (used by WWW::Mechanize) processes that attribute and generates the correct form data. The bug is that WWW::Mechanize doesn't inform HTML::Form of the page's charset, leaving HTML::Form with no idea what to do when accept-charset is missing. (It defaults to using UTF-8.)
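
The effect of accept-charset can be checked offline with HTML::Form alone. This is only a sketch: the HTML snippet, the base URL and the `body_for` helper are made up for illustration, and it uses the `accept_charset` setter of HTML::Form in place of the HTML attribute.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical page: a POST form with no accept-charset attribute.
my $html = <<'__EOI__';
<form method="POST" action="/zzz.cgi">
<input type="text" name="key">
<input type="submit">
</form>
__EOI__

# Hypothetical helper: build the request body for one field value,
# optionally overriding the form's accept_charset first.
sub body_for {
    my ($charset) = @_;
    my $form = HTML::Form->parse($html, 'http://www.example.com/');
    $form->accept_charset($charset) if defined $charset;
    $form->value('key', "Ch\x{E2}teau");
    return $form->make_request->content;
}

print "default:     ", body_for(),              "\n";
print "iso-8859-15: ", body_for('iso-8859-15'), "\n";
```

In a WWW::Mechanize script the same setter should be reachable through `$mech->current_form->accept_charset('iso-8859-15')`, called after `get` and before `field` and `submit`, since `current_form` returns the HTML::Form object.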

      Some may not consider this a bug since the spec simply recommends the behaviour, but it's what other browsers do.

        Wow. Thanks.

        Now I'm searching for how to tell HTML::Form which character set to use from the client side.

      I have encountered the same problem of form parameters being converted to UTF-8 when using WWW::Mechanize to access a web site.

      My application automates access to a dictionary web site in order to build a grammar database. The initial web page has one HTML form. My application inserts a starting head word in that form. The response web page has a second form that contains the next head word in the dictionary. By making that form the current form, a loop is established that retrieves N consecutive words from the dictionary.

      The trouble is that Perl converts accented characters (the 5 vowels á í é ó ú, uppercase and lowercase) into UTF-8, and the dictionary misreads all words containing these characters.

      Looking at the HTML, I find the settings:

      'default_charset' => 'windows-1252',
      'enctype' => 'application/x-www-form-urlencoded',
      'accept_charset' => 'UNKNOWN',

      I presume that I need to change this to 'accept_charset' => 'UTF-8'. However, I do not see any method in Mechanize that will allow me to do this. Is it possible to use HTML::Form with Mechanize to do this?

      I would appreciate any help from members of the forum. Thank you.

      use strict;
      use warnings;
      use WWW::Mechanize;
      use Encode qw(encode decode);        # Tried encode, decode
                                           # without success

      # Create a new browser
      my $browser = WWW::Mechanize->new(autocheck => 1 );

      # Tell it to get the main page
      $browser->get("http://193.1.97.44/focloir/");

      # Okay, fill in the form with the first word to look up
      $browser->form_number(1);            # Select first as active form
      $browser->field("WORD", "acht");     # Next word in dict is achtú

      # Get a consecutive sequence of words, one word per web request
      for (my $i=1; $i<=2; $i++) {
          $browser->dump_forms();          # i=1 WORD parameter is acht
                                           #     Hex: [61 63 68 74]
                                           # i=2 WORD parameter is achtú
                                           #     Hex: [61 63 68 74 FA]
          $browser->click();               # Make the Web request
          print $browser->content;         # i=1 Word found
                                           # i=2 Message: could not find -
                                           #     acht&#195;&#186;
                                           #     Hex: [61 63 68 74 c3 ba]
                                           #     which is achtú in UTF-8
          sleep (1);                       # Just in case we get into
                                           # trouble with the web server
          #
          # Pick the second form. It should have the next head word
          # already filled in
          # NOTE: application code does not access any parameters on
          # this form
          $browser->form_number(2);        # Select second form as
                                           # active form
      }

        I have resolved the issue. The answer was already supplied in an earlier posting from ikegami. I did not understand how to call the HTML::Form method accept_charset.

        Below is the modified code that now handles accented input correctly.

        I really appreciate all the wisdom lurking in this forum.

        use strict;
        use warnings;
        use WWW::Mechanize;

        # Create a new browser
        my $browser = WWW::Mechanize->new(autocheck => 1 );

        # Tell it to get the main page
        $browser->get("http://193.1.97.44/focloir/");

        # Okay, fill in the form with the first word to look up
        $browser->form_number(1);            # Select first as active form
        #
        # This is the patch to specify input character set
        $browser->form_number(2)->accept_charset ("iso-8859-15");
        $browser->field("WORD", "acht");     # Next word in dict is achtú

        # Get a consecutive sequence of words, one word per web request
        for (my $i=1; $i<=2; $i++) {
            $browser->dump_forms();          # i=1 WORD parameter is acht
                                             #     Hex: [61 63 68 74]
                                             # i=2 WORD parameter is achtú
                                             #     Hex: [61 63 68 74 FA]
            $browser->click();               # Make the Web request
            print $browser->content;         # i=1 Word found
                                             # i=2 Message: could not find -
                                             #     acht&#195;&#186;
                                             #     Hex: [61 63 68 74 c3 ba]
                                             #     which is achtú in UTF-8
            sleep (1);                       # Just in case we get into
                                             # trouble with the web server
            #
            # Pick the second form. It should have the next head word
            # already filled in
            # NOTE: application code does not access any parameters on
            # this form
            $browser->form_number(2);        # Select second form as
                                             # active form
            #
            # This is the patch to specify input character set
            $browser->form_number(2)->accept_charset ("iso-8859-15");
        }

      That is far better written than I could produce.

      Eventually, I got to much the same place. Perhaps there is something fishy happening in HTML::Form?

      btw

      perl 5.10.1

      WWW::Mechanize 1.62

Re: www:mechanize mangles unicode
by Corion (Patriarch) on Apr 28, 2010 at 20:52 UTC

    Are you sure the content you're sending is in the proper encoding? You will need to be sending your data in iso-8859-15 too. Compare what your browser sends against what WWW::Mechanize sends, also what it receives back.

      The headers claim:

      Accept-Charset: ISO-8859-1,utf-8

      and the data that is being sent is "Château". Of course, what is reading the log might be making it pretty, again.

      Thanks.

        Yes, when dealing with encoding problems, you will need to make sure that all components show you the real thing. Look at the hexdumps of the parts and check that they show the octets that correspond to the respective encoding.

        and the data that is being sent is "Château".

        But what is "Château"? How can you be sure of that? Well, use a hex dumper for that, for example vim's xxd:

        $ echo -n Château |xxd
        0000000: 4368 c3a2 7465 6175  Ch..teau

        What you specifically need, then, is to dump your log file:

        $ grep 'teau\b' /path/to/log |xxd |less
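
Once the octets are in hand, the same question can be asked from Perl. The following is a rough sketch with the core Encode module; `is_valid_utf8` is a hypothetical helper, and the two sample strings stand in for whatever octets the log actually contains.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: returns 1 if the octets decode cleanly as UTF-8,
# 0 otherwise. LEAVE_SRC keeps decode() from modifying its argument.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Sample octets: the UTF-8 and the single-byte form of the same word.
for my $octets ("Ch\xC3\xA2teau", "Ch\xE2teau") {
    my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $octets;
    printf "%-24s %s\n", $hex,
        is_valid_utf8($octets) ? 'valid UTF-8' : 'not UTF-8 (single-byte encoding?)';
}
```

Plain ASCII passes this check under either encoding, so it only says something about lines that contain accented characters.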

        --
         David Serrano
         (Please treat my english text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).

Re: www:mechanize mangles unicode
by gnork (Scribe) on Nov 04, 2011 at 10:04 UTC
    Wow, I've been pulling my hair out over that for about a week now. Once again PerlMonks saves the day, thanks for that.

    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
    errors->(c)