Re^2: question on encoding

in reply to Re: question on encoding
in thread question on encoding

Thanks for the reply and explanation. The insertion works now.

A minor problem: I want to distinguish between French and English words, and only do the decode/encode operation when the word is French. But the regex /%[0-9A-Fa-f]{2}/ doesn't catch them.

        if ( /%[0-9A-Fa-f]{2}/ ) {
            # 1.
            # my $escaped = uri_unescape( $_ ); same effect as the RE but slower
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
            # 2.
            my $s = decode_utf8( $_ );
            # 3.
            $s = encode("iso-8859-1", $s);
            push @new_words, $s;
        } else {
            push @new_words, $_;
        }

In the case of 'énfasis', I peeked at the URL submission and at the data that arrived in the Perl program. They are different: it is %C3%A9nfasis during submission, but it becomes ĂŠnfasis after I grab the value through CGI.pm's param method.

For now, I am taking out the if .. else part and doing encode/decode_utf8 on every word I receive, which doesn't feel like a good solution.
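
A small sketch of what the observation above suggests is going on, assuming param() has already undone the percent-escapes so that the value holds the raw UTF-8 bytes C3 A9 rather than the literal text %C3%A9 (the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Assumption: this byte string stands in for what param() returns
# after a submission of "%C3%A9nfasis" (escapes already decoded).
my $value = "\xC3\xA9nfasis";

# No literal percent-escape is left, so the original test cannot match.
my $has_escapes = ( $value =~ /%[0-9A-Fa-f]{2}/ ) ? 1 : 0;

# The raw UTF-8 bytes all have the 8th bit set, so this test does match.
my $has_high_bytes = ( $value =~ /[\x80-\xff]/ ) ? 1 : 0;

print "percent-escapes found: $has_escapes\n";
print "high-bit bytes found:  $has_high_bytes\n";
```

If that assumption holds, testing for high-bit bytes is a more reliable trigger for the conversion than testing for percent-escapes.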

Re^3: question on encoding by graff (Chancellor) on Jan 24, 2007 at 23:40 UTC
"I want to distinguish between French and English words, and only do the decode/encode operation when the word is French."

You want something like this, then:

        s/%([0-9a-f]{2})/chr(hex($1))/egi;
        if ( /[\x80-\xff]/ ) {
            push @new_words, encode( "iso-8859-1", decode_utf8( $_ ));
        } else {
            push @new_words, $_;
        }

The point there is that you only need to do the encoding conversion if the string happens to contain any bytes with the 8th bit set (i.e. bytes in the numeric range 128-255).

Update: be aware that for this sort of approach, if the input data happen to contain any characters that are not in the iso-8859-1 table (e.g. certain "smart quote" characters, or Greek or Russian or ...), you'll get "?" instead of the intended characters as a result of the encode("iso-8859-1", ...) call. That's just a limitation you have to live with if you have to stick with that old "legacy" iso-8859 encoding.
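
For anyone who wants to try this outside the CGI script, here is a self-contained sketch of the same idea. The helper name to_latin1 and the sample words are invented for illustration, and the input is assumed to be either plain ASCII or valid UTF-8 bytes (possibly still percent-escaped):

```perl
use strict;
use warnings;
use Encode qw(encode decode_utf8);

# Hypothetical helper: returns an iso-8859-1 byte string for one form value.
sub to_latin1 {
    my ($word) = @_;

    # Undo any percent-escapes that are still present in the value.
    $word =~ s/%([0-9a-f]{2})/chr(hex($1))/egi;

    # Pure ASCII (no byte with the 8th bit set) needs no conversion.
    return $word unless $word =~ /[\x80-\xff]/;

    # Interpret the bytes as UTF-8, then re-encode as iso-8859-1.
    return encode( "iso-8859-1", decode_utf8($word) );
}

my @new_words = map { to_latin1($_) } ( "hello", "%C3%A9nfasis", "\xC3\xA9nfasis" );

# Show each result as hex so the single E9 byte is visible.
print unpack( "H*", $_ ), "\n" for @new_words;

# Characters outside iso-8859-1 (here, curly quotes) come back as "?".
my $quoted = encode( "iso-8859-1", "\x{201C}hi\x{201D}" );
print "$quoted\n";
```

The last two lines illustrate the limitation mentioned in the update: with Encode's default settings, an unmappable character is silently replaced rather than reported.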
Re^3: question on encoding by Anonymous Monk on Jan 25, 2007 at 05:14 UTC
From the HTML 4.01 Specification:

"The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data."
