PerlMonks  

Re^3: Encoding of emoji character

by soonix (Canon)
on Jun 20, 2022 at 11:58 UTC ( [id://11144876]=note )


in reply to Re^2: Encoding of emoji character
in thread Encoding of emoji character

That "wide character" might not be the smiley itself, but one (actually, three) of the bytes it is encoded with.
"F0.9F.98.80" is what sprintf( "%vX", $text) would output for e.g. $text = "\N{LATIN SMALL LETTER ETH}\x9F\x98\x80"; but for $text = "\N{GRINNING FACE}" it should instead show "1F600".

a) What if, instead of (or in addition to) sprintf( "%vX", $text), you try
{ use Data::Dumper; use charnames ':full'; use feature 'say'; for my $c ( split //, $text ) { say Dumper $c, ord $c, charnames::viacode( ord $c ); } }
b) You could feed "Test \N{GRINNING FACE}" to your test program (for Perl older than 5.16, you need an explicit use charnames ':full'; for the \N escape to work).

I suspect that your console output accidentally uses the same (wrong) encoding as the database input, so it looks right…

Replies are listed 'Best First'.
Re^4: Encoding of emoji character
by choroba (Cardinal) on Jun 20, 2022 at 13:04 UTC
    > for Perl older than 5.16, you need an explicit use charnames; for the \N escape to work

    Thanks.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^4: Encoding of emoji character
by dcunningham (Sexton) on Jun 20, 2022 at 21:27 UTC

    Thanks for that. If I use $text = "Test \N{GRINNING FACE}" in my program then the websocket client displays the emoji correctly. I modified your sample code a little and it gave the following output.

    With the $text = "Test \N{GRINNING FACE}" for the emoji it gives:

    $VAR1 = "\x{1f600}"; 128512 GRINNING FACE

    Using the string read from MySQL the emoji gives:

    240 LATIN SMALL LETTER ETH
    $VAR1 = '�'; 159 APPLICATION PROGRAM COMMAND
    $VAR1 = '�'; 152 START OF STRING
    $VAR1 = '�'; 128 PADDING CHARACTER

    Changing the database to UTF-8 will be difficult as it's not entirely under our control. Do you think the table being latin1 is the problem?

      I do think the table being latin1 is part of the problem. On the other hand:
      • The application that fills the table seems to use a reasonable encoding (UTF-8).
      • If you change the table to something unicodey, that application most probably will NOT automagically insert a unicode character instead of the current 4 bytes.
      Probably the easier solution is to check for bytes between 0x80 and 0x9F (because ISO 8859-1, the "official" Latin1, assigns no printable characters there). If they are not used otherwise in your variant of Latin1, it might be feasible to try Encode::decode. What happens if you insert something like
      { use Encode qw(decode :fallbacks); $text = decode('UTF-8', $text, FB_WARN); }
      after reading $text from the database?
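
      The check described above could be sketched roughly as follows. The helper name maybe_decode_utf8 is made up for this example, and the whole thing assumes that characters in the range 0x80 .. 0x9F never occur legitimately in your data. It uses FB_CROAK inside an eval instead of FB_WARN, so that a failed decode simply falls back to the unchanged string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: reinterpret $text as UTF-8 bytes only if it looks
# like undecoded UTF-8, otherwise hand it back untouched.
sub maybe_decode_utf8 {
    my ($text) = @_;

    # decode() expects bytes, so skip strings with code points above 0xFF.
    return $text if $text =~ /[^\x00-\xFF]/;

    # No characters in 0x80 .. 0x9F: nothing suspicious, leave it alone.
    return $text unless $text =~ /[\x80-\x9F]/;

    # FB_CROAK dies on malformed UTF-8; LEAVE_SRC keeps $text unmodified.
    my $decoded = eval {
        decode('UTF-8', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $decoded ? $decoded : $text;
}
```

      You would then call $text = maybe_decode_utf8($text); right after reading $text from the database.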
        Thank you for the suggestion, but it still died with the "Wide character" error using that. The suggestion further down seems to have resolved it though.

      Interesting. None of these characters have code points above 255, and yet you sometimes get the error in decode($text).

      You said the table encoding is latin-1. My current guess is, you get your information decoded as if it was latin1. Most of it looks like bytes, but occasionally, latin-1 text decodes to wide characters and blows up decode (which only expects bytes). What if you encode $text back to latin-1 to get bytes, then decode those as UTF-8? This transformation seems to be reversible as long as all bytes round-trip, that is, MySQL's interpretation of "latin-1" is the same as Perl's and has a meaning for all 256 possible byte values.
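
      Here is a small self-contained sketch of that round trip. It only simulates the suspected mishap (UTF-8 bytes being read as Latin-1 text) with Perl's own iso-8859-1, so it assumes MySQL's idea of latin1 matches Perl's; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $original = "Test \N{GRINNING FACE}";

# Simulate the suspected mishap: the UTF-8 bytes are read back
# as if each byte were one Latin-1 character.
my $utf8_bytes = encode('UTF-8', $original);
my $mangled    = decode('iso-8859-1', $utf8_bytes);

# Proposed repair: back to bytes via Latin-1, then decode as UTF-8.
my $bytes    = encode('iso-8859-1', $mangled);
my $repaired = decode('UTF-8', $bytes);

printf "mangled:  %vX\n", $mangled;
printf "repaired: %vX\n", $repaired;
print $repaired eq $original ? "round trip ok\n" : "round trip failed\n";
```

      If the string coming out of your database looks like $mangled here, the same two lines should repair it.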

        I think the "wide character" doesn't simply mean "above 255(0xFF)", but "not in the character set". Latin1 defines characters in the ranges 0x00 .. 0x7E and 0xA0 .. 0xFF. Those within 0x7F .. 0x9F are "not within one of the defined ranges", thus "out of range" = "wide character".
        Thank you, this appears to have fixed it! Using the following code the emoji is passed and displayed correctly on the websocket client.
        use Encode qw(encode decode); $text = encode( 'iso-8859-1', $text ); $text = decode( 'UTF-8', $text ); $conn->send_utf8( $text );
