PerlMonks  

Bypass utf-8 encoding/decoding?

by chayyoo (Novice)
on Nov 30, 2017 at 16:09 UTC ( [id://1204608]=perlquestion )

chayyoo has asked for the wisdom of the Perl Monks concerning the following question:

I am trying my first steps in C using Inline::C. I have written a couple of string handling functions that work quite well. They work on UTF-8 strings. What really bothers me is the required transcoding when passing the strings in and out of Perl, e.g.

$perl_string = decode("utf8", my_c_function(encode("utf8",$_)));

Is there an elegant way around this encode/decode thing? (running Perl 5.10.1)

I understand that Perl's internal encoding is UTF-8, but that it automatically transcodes to Latin-1 on input and output, which is what causes my grief. Is the Perl interpreter smart enough to bypass this encoding and decoding when it encounters the line above? I should hope so.

Replies are listed 'Best First'.
Re: Bypass utf-8 encoding/decoding?
by ikegami (Patriarch) on Nov 30, 2017 at 20:21 UTC
    Replace
    SV* my_c_function(SV* sv) {
        STRLEN len;
        const char* s = SvPVbyte(sv, len);
        ...
        return newSV(...);
    }

    with

    SV* my_c_function(SV* sv) {
        STRLEN len;
        const char* s = SvPVutf8(sv, len);
        ...
        return newSVpvn_utf8(..., 1);
    }

    Example:

    use strict;
    use warnings;
    use feature qw( say );

    use open ":std", ":encoding(UTF-8)";

    use Inline C => <<'__EOS__';

    static const char hex_syms[16] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'
    };

    /* Writes 3*n bytes to dst: "XX." per input byte, then replaces the last '.' with NUL. */
    static void my_c_function(char* dst, const char* src, STRLEN n) {
        while (n--) {
            *(dst++) = hex_syms[ ((unsigned char)*src) >> 4 ];
            *(dst++) = hex_syms[ *src & 0xF ];
            *(dst++) = '.';
            ++src;
        }

        dst[-1] = 0;
    }

    SV* my_xs_function(SV* buf_sv) {
        STRLEN buf_len;
        const char* buf = SvPVutf8(buf_sv, buf_len);

        if (buf_len == 0)
            return newSVpvs("");

        {
            STRLEN hex_len = buf_len * 3 - 1;
            char* hex;
            SV* hex_sv;

            Newx(hex, hex_len + 1, char);  /* +1: my_c_function also writes the trailing NUL */
            my_c_function(hex, buf, buf_len);
            hex_sv = newSVpvn_utf8(hex, hex_len, 1);
            Safefree(hex);
            return hex_sv;
        }
    }

    __EOS__

    my $s = "\x{C9}ric";
    utf8::downgrade( my $d = $s );  # Let's test with both
    utf8::upgrade(   my $u = $s );  # string storage formats.

    say $u eq $d ? "Same" : "Different";

    for my $s ($d, $u) {
        say "UCP: ", sprintf("%vX", $s);     # C9.72.69.63
        say "UTF-8: ", my_xs_function($s);   # C3.89.72.69.63
    }

    Optimized: (Avoids creating two buffers and copying one into the other. Also protects against memory leaks from long jumps in the C code by mortalizing the allocated memory sooner.)

    SV* my_xs_function(SV* buf_sv) {
        STRLEN buf_len;
        const char* buf = SvPVutf8(buf_sv, buf_len);

        if (buf_len == 0)
            return newSVpvs("");

        {
            STRLEN hex_len = buf_len * 3 - 1;
            SV* hex_sv = sv_2mortal(newSV(hex_len));  /* newSV reserves an extra byte for the NUL */
            SvPOK_on(hex_sv);
            SvCUR_set(hex_sv, hex_len);
            SvUTF8_on(hex_sv);
            my_c_function(SvPVX(hex_sv), buf, buf_len);
            return SvREFCNT_inc(hex_sv);  /* the XS glue mortalizes the returned SV* again */
        }
    }
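
For anyone who wants to check the expected bytes without building the Inline::C module, here is a minimal standalone C sketch of the same hex-dump idea. The name hex_dump_bytes and its return value are made up for this illustration; they are not part of the Perl API or of the code above. The caller is assumed to supply a buffer of at least 3 * n bytes (one byte when n is 0), because the loop writes three bytes per input byte and the last of them becomes the NUL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): write the n bytes of src into dst
 * as dot-separated uppercase hex pairs and NUL-terminate the result.
 * dst must have room for 3 * n bytes, or 1 byte when n is 0.
 * Returns the number of characters written, not counting the NUL. */
size_t hex_dump_bytes(char *dst, const char *src, size_t n)
{
    static const char hex_syms[] = "0123456789ABCDEF";
    size_t i;

    assert(dst != NULL && src != NULL);

    if (n == 0) {
        dst[0] = '\0';
        return 0;
    }

    for (i = 0; i < n; ++i) {
        /* Go through unsigned char so bytes >= 0x80 index the table safely. */
        unsigned char c = (unsigned char)src[i];
        *dst++ = hex_syms[c >> 4];
        *dst++ = hex_syms[c & 0xF];
        *dst++ = '.';
    }
    dst[-1] = '\0';   /* the final '.' becomes the terminator */

    return 3 * n - 1;
}
```

Called on the five UTF-8 bytes of "Éric" (C3 89 72 69 63), it fills the buffer with C3.89.72.69.63, the same string the "UTF-8:" line of the example prints.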

      But isn't use utf8; mandatory before calling the open pragma?

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

        No. use utf8; tells Perl that the source code contains UTF-8 characters (and you can use them in identifiers, too). It's in no way related to how external data are encoded/decoded.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        use utf8; simply tells Perl that your source code is encoded using UTF-8 (as opposed to ASCII). It would have no effect in my program.

        Furthermore, the subs provided by the module are always loaded, so I don't need to explicitly load the module (use utf8 ();) to access them.

Re: Bypass utf-8 encoding/decoding?
by Laurent_R (Canon) on Nov 30, 2017 at 16:31 UTC
    Not sure if this answers your question, but you can specify UTF-8 encoding for both your input and output.

    For example:

    open my $fh, "<:encoding(UTF-8)", $filename or die "Could not open $filename $!";
Re: Bypass utf-8 encoding/decoding?
by karlgoethebier (Abbot) on Nov 30, 2017 at 19:06 UTC
Re: Bypass utf-8 encoding/decoding?
by Anonymous Monk on Nov 30, 2017 at 16:24 UTC
    What sort of (special ...) characters are present in the strings that you are handling now? Is the native encoding of those strings, as they are being handled within the Perl application, UTF-8 or something else?

      They are mainly accented characters like ą, é or ö. They are read from a UTF-8 encoded file using "<:encoding(UTF-8)", which converts them (if I understand it right) to "internal Perl format", presumably also UTF-8. However, if I don't perform the encode("utf8",$_) first, they arrive in my C function as Latin-1, not UTF-8. The result is also written to a UTF-8 file using ">:encoding(UTF-8)", but if I don't perform the decode("utf8",..) on the string leaving my function, I get "double UTF8" encoded strings! So either:

      • Perl's internal string coding is Latin-1, not UTF-8 (at least for the characters I'm currently dealing with), or
      • Perl's internal string coding is UTF-8, but is converted to Latin-1 when passed to my C-function, and converted back from Latin-1 when reading the result back.

      Or is there another explanation? Anyway, regardless of what characters I'm dealing with, is there a way to bypass all this recoding in an elegant and reliable way? Or is the Perl interpreter smart enough to do all the bypassing by itself, so that I incur no speed penalty?
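
One way to see where doubly encoded output like that can come from, as a minimal C sketch: if a buffer that already holds UTF-8 bytes is treated as one-character-per-byte (Latin-1) data and encoded to UTF-8 again, every byte above 0x7F turns into two bytes. The helper name latin1_to_utf8 is hypothetical and written for this illustration only; it is not how Perl implements the conversion internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): treat each of the n bytes in src
 * as a Latin-1 code point (0..255) and encode it as UTF-8 into dst.
 * dst must have room for up to 2 * n bytes; the output is not NUL-terminated.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i, out = 0;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < n; ++i) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII is encoded as itself. */
            dst[out++] = c;
        } else {
            /* 0x80..0xFF needs a two-byte sequence: 110000xx 10xxxxxx. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Encoding the single Latin-1 byte E9 (é) gives C3 A9. Feeding those two bytes through the same function again gives C3 83 C2 A9, which displays as "Ã©".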
