www:mechanize mangles unicode

by red0hat (Initiate)
on Apr 28, 2010 at 20:37 UTC (#837386=perlquestion)
red0hat has asked for the wisdom of the Perl Monks concerning the following question:

I've got a problem that is melting my brain a little.

I want to read something from a database, then post it in an HTML form. It worked fine until a user decided to throw in some accented characters. Each accented character is being posted to the website as 2 characters.

Example: Château becomes ChÃ¢teau.

The website uses iso-8859-15. So, I'm inclined to believe this is an encoding issue. The rest of the code is working as expected.

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows Mozilla');

$title = "Château";

$result = $mech->get($WineURL.$ID);
die "GET failed\n" unless $result->is_success;

$mech->field('frmFieldName[title]', $title);
$result = $mech->submit;

print $mech->value('frmFieldName[title]');

Output = ChÃ¢teau

Thanks in advance,

ChrisP.

Re: www:mechanize mangles unicode
by Anonymous Monk on Apr 28, 2010 at 20:47 UTC
    a poor workman blames his tools

      Thanks for the input, but it isn't terribly helpful.

      If you've got a better tool to suggest, I'd be happy to hear it.

        You say www::mechanize mangles unicode and then you fail to demonstrate www::mechanize mangling unicode, and now you want better tools?
Re: www:mechanize mangles unicode
by Corion (Pope) on Apr 28, 2010 at 20:52 UTC

    Are you sure the content you're sending is in the proper encoding? You will need to be sending your data in iso-8859-15 too. Compare what your browser sends against what WWW::Mechanize sends, also what it receives back.

      The headers claim:

      Accept-Charset: ISO-8859-1,utf-8

      and the data that is being sent is "Château". Of course, what is reading the log might be making it pretty, again.

      Thanks.

        Yes, when dealing with encoding problems, you will need to make sure that all components show you the real thing. Look at the hexdumps of the parts and check that they show the octets that correspond to the respective encoding.

        and the data that is being sent is "Château".

        But what is "Château"? How could you be sure of that? Well, use a hex dumper for that, for example vim's xxd:

        $ echo -n Château |xxd
        0000000: 4368 c3a2 7465 6175  Ch..teau

        What you specifically need, then, is to dump your log file:

        $ grep 'teau\b' /path/to/log |xxd |less

        --
         David Serrano
         (Please treat my english text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).
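
The same octet-level check can be done from Perl itself. This is a minimal sketch using only the core Encode module; the helper name hexdump is made up for illustration. It prints the octets of "Château" in each of the two encodings under discussion, so they can be compared with what shows up in a log or on the wire.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: render a string of octets as space-separated hex.
sub hexdump { join ' ', map { sprintf '%02x', ord } split //, shift }

# Written with an escape so the script's own encoding does not matter.
my $text = "Ch\x{E2}teau";

my $latin9 = hexdump(encode('iso-8859-15', $text));
my $utf8   = hexdump(encode('UTF-8',       $text));

print "iso-8859-15: $latin9\n";   # 43 68 e2 74 65 61 75
print "UTF-8:       $utf8\n";     # 43 68 c3 a2 74 65 61 75
```

A single e2 octet means the data is in iso-8859-15 (or iso-8859-1); the pair c3 a2 means it is UTF-8.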

Re: www:mechanize mangles unicode
by ikegami (Pope) on Apr 28, 2010 at 22:14 UTC

    Let's start with a server-side script

    #!/usr/bin/perl
    use strict;
    use warnings;

    use CGI;
    use Encode         qw( decode );
    use HTML::Entities qw( encode_entities );

    my $cgi = CGI->new();
    my $val = $cgi->param('key');

    use Devel::Peek;
    Dump($val);

    $val = decode('iso-8859-15', $val) if defined($val);

    print $cgi->header('text/html; charset=iso-8859-15');
    binmode STDOUT, ':encoding(iso-8859-15)';

    my $val_initializer = (
        defined($val)
            ? sprintf(' value="%s"', encode_entities($val, '<>&"'))
            : ''
    );

    print(<<"__EOI__");
    <title>Test</title>
    <form method="POST">
    <input type="text" name="key"$val_initializer>
    <input type="submit">
    </form>
    __EOI__

    Let's make sure it works:

    $ perl -e'print <<"__EOI__";
    POST /zzz.cgi HTTP/1.0
    Host: www.example.com
    Content-Length: 11

    key=Ch\xE2teau
    __EOI__
    ' | nc www.example.com 80 | od -c
    00000   H   T   T   P   /   1   .   1       2   0   0       O   K  \r
    00020  \n   D   a   t   e   :       W   e   d   ,       2   8       A
    00040   p   r       2   0   1   0       2   2   :   1   0   :   1   4
    00060       G   M   T  \r  \n   S   e   r   v   e   r   :       A   p
    00100   a   c   h   e  \r  \n   V   a   r   y   :       A   c   c   e
    00120   p   t   -   E   n   c   o   d   i   n   g  \r  \n   C   o   n
    00140   t   e   n   t   -   L   e   n   g   t   h   :       1   1   8
    00160  \r  \n   C   o   n   n   e   c   t   i   o   n   :       c   l
    00200   o   s   e  \r  \n   C   o   n   t   e   n   t   -   T   y   p
    00220   e   :       t   e   x   t   /   h   t   m   l   ;       c   h
    00240   a   r   s   e   t   =   i   s   o   -   8   8   5   9   -   1
    00260   5  \r  \n  \r  \n   <   t   i   t   l   e   >   T   e   s   t
    00300   <   /   t   i   t   l   e   >  \n   <   f   o   r   m       m
    00320   e   t   h   o   d   =   "   P   O   S   T   "   >  \n   <   i
    00340   n   p   u   t       t   y   p   e   =   "   t   e   x   t   "
    00360       n   a   m   e   =   "   k   e   y   "       v   a   l   u
    00400   e   =   "   C   h 342   t   e   a   u   "   >  \n   <   i   n
    00420   p   u   t       t   y   p   e   =   "   s   u   b   m   i   t
    00440   "   >  \n   <   /   f   o   r   m   >  \n
    00453

    Yup. Now let's test WWW::Mechanize.

    use strict;
    use warnings;
    use open ':std', ':locale';
    use charnames ':full';

    use Encode         qw( encode );
    use WWW::Mechanize qw( );

    # Avoiding script encoding issues.
    my $val = "Ch\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}teau";

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.server.com/zzz.cgi');
    $mech->field('key', $val);
    $mech->submit();

    #print($mech->value('key'), "\n");
    use Devel::Peek qw( Dump );
    Dump($mech->value('key'));

    Hum, I get:

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x11572c0 "Ch\303\203\302\242teau"\0 [UTF8 "Ch\x{c3}\x{a2}teau"]
      CUR = 10
      LEN = 16

    But I expect:

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x115bcf0 "Ch\303\242teau"\0 [UTF8 "Ch\x{e2}teau"]
      CUR = 8
      LEN = 16

    or the equivalent

    SV = PV(0x1167c20) at 0x11d05c0
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      PV = 0x115bcf0 "Ch\342teau"\0
      CUR = 7
      LEN = 16
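
Why the last two dumps count as "equivalent" may not be obvious. The sketch below uses only built-in utf8:: functions, and its variable names are illustrative. It shows that the same 7-character string can be stored either as 7 octets with the UTF8 flag off or as 8 octets with the flag on, and that Perl treats both as equal.

```perl
use strict;
use warnings;

my $bytes_form = "Ch\xE2teau";   # stored as 7 octets, UTF8 flag off
my $wide_form  = $bytes_form;
utf8::upgrade($wide_form);       # same characters, stored as UTF-8, flag on

my $same   = $bytes_form eq $wide_form ? 1 : 0;
my $length = length $wide_form;  # counts characters, not storage octets
my $flag   = utf8::is_utf8($wide_form) ? 1 : 0;

print "equal=$same length=$length utf8_flag=$flag\n";   # equal=1 length=7 utf8_flag=1
```

The first dump is a different string altogether: 8 characters, with "Ã" and "¢" where the single "â" should be.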

    Some debugging shows the server side is receiving the following:

    "Ch\303\242teau"

    That's the UTF-8 encoding of the value, so the problem is getting the right data to the server. Ok, fine, maybe WWW::Mechanize stupidly sends the internal storage data of the string. The solution would be to encode the inputs yourself as follows:

    #$mech->field('key', $val);
    $mech->field('key', encode('iso-8859-15', $val));
    $mech->submit();

    But even with the change, the client side script is still sending the following to the server:

    "Ch\303\242teau"

    That's the UTF-8 encoding of the result of encode('iso-8859-15', $val). Does WWW::Mechanize assume the server expects UTF-8 rather than the page's encoding?
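
Both observations can be reproduced with the core Encode module alone, without any HTTP. This is only a sketch of the effect, not of what WWW::Mechanize does internally, and the variable names are made up. It shows that pre-encoding the value makes no difference once something UTF-8-encodes it again, and that an iso-8859-15 server then decodes those octets into two characters.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $val = "Ch\x{E2}teau";                        # 7 characters

# Pre-encoding yields the octets "Ch\xE2teau". Anything that later
# UTF-8-encodes them treats \xE2 as the character U+00E2 again.
my $pre    = encode('iso-8859-15', $val);
my $wire_a = encode('UTF-8', $val);              # no pre-encoding
my $wire_b = encode('UTF-8', $pre);              # with pre-encoding

my $identical = $wire_a eq $wire_b ? 1 : 0;

# What a server decoding the request as iso-8859-15 ends up with.
my $seen = decode('iso-8859-15', $wire_a);

print "identical on the wire: $identical\n";                           # 1
printf "wire octets: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $wire_a;           # 43 68 C3 A2 74 65 61 75
printf "server sees %d characters: %s\n", length $seen,
    join ' ', map { sprintf 'U+%04X', ord } split //, $seen;           # 8 characters
```

U+00C3 and U+00A2 are "Ã" and "¢", which matches the "Ch\x{c3}\x{a2}teau" in the first dump above.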

    It's all I have time for right now.

    • WWW-Mechanize-1.62
    • libwww-perl-5.834

      Found the bug.

      For starters, everything works fine if the server sends

      <form method="POST" accept-charset="iso-8859-15">

      HTML::Form (used by WWW::Mechanize) processes that attribute and generates the correct form data. The bug is that WWW::Mechanize doesn't inform HTML::Form of the page's charset, leaving HTML::Form with no idea what to do when accept-charset is missing. (It defaults to using UTF-8.)

      Some may not consider this a bug since the spec simply recommends the behaviour, but it's what other browsers do.

        Wow. Thanks.

        Now I'm searching for how to tell HTML::Form which character set to use from the client side.
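
One possible client-side workaround is to set the form's accept-charset yourself before submitting. This is a sketch that assumes the installed HTML::Form provides the accept_charset accessor and accepts a base option to parse(). It parses a local copy of the test form rather than fetching it, so no server is needed, and the URL is a placeholder.

```perl
use strict;
use warnings;
use HTML::Form;

# Local copy of the test form, which has no accept-charset attribute.
my $html = <<'__EOI__';
<title>Test</title>
<form method="POST">
<input type="text" name="key">
<input type="submit">
</form>
__EOI__

my $val = "Ch\x{E2}teau";

# In scalar context parse() returns the first form on the page.
my $form = HTML::Form->parse($html, base => 'http://www.example.com/zzz.cgi');
$form->value('key', $val);

# Request body as generated with the defaults.
my $default = $form->click->content;

# Request body after telling the form which charset the server expects.
$form->accept_charset('iso-8859-15');
my $forced = $form->click->content;

print "default: $default\n";
print "forced:  $forced\n";
```

With WWW::Mechanize, the equivalent step (untested here) would be $mech->current_form->accept_charset('iso-8859-15') between get() and submit(). The forced body should carry the single octet %E2 instead of the %C3%A2 pair.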

      That is far better written than I could produce.

      Eventually, I got to much the same place. Perhaps there is something fishy happening in HTML::Form?

      btw

      perl 5.10.1

      WWW::Mechanize 1.62

Re: www:mechanize mangles unicode
by gnork (Scribe) on Nov 04, 2011 at 10:04 UTC
    Wow, I've been pulling my hair about that for about a week now. Once again perlmonks saves the day, thanks for that.

    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
    errors->(c)
