PerlMonks
Yet another Encoding issue...

by Bod (Parson)
on Jun 01, 2024 at 19:34 UTC ( [id://11159743] )

Bod has asked for the wisdom of the Perl Monks concerning the following question:

I'm using AI::Chat to create an AI-powered Turkish practice chat. The first part is for the AI to analyse the Turkish supplied by the user (me) and check it for errors. Because Turkish uses some non-Latin characters in its alphabet, this has created another character encoding issue for me. To eliminate the OpenAI API and AI::Chat as the cause, I have created this test script that demonstrates the issue... (no apologies for inline CSS marto - this is a quick and dirty test script!)

    #!/usr/bin/perl

    use CGI::Carp qw(fatalsToBrowser);
    use lib "$ENV{'DOCUMENT_ROOT'}/cgi-bin";
    use JSON;
    use utf8;
    use incl::HTMLtest;
    use AI::Chat;
    use strict;
    use warnings;

    if ($data{'userChat'}) {
        my $reply = {};
        $reply->{'response'} = $data{'userChat'};
        print "Content-type: application/json\n\n";
        print encode_json $reply;
        exit;
    }

    print<<"END_HTML";
    Content-type: text/html; charset=UTF-8

    <html>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <head>
    <script>
      function sendChat() {
        if (document.getElementById('userChat').innerText.length > 2) {
          fetch('?userChat=' + encodeURIComponent(document.getElementById('userChat').innerText))
            .then((resp) => resp.json())
            .then((json) => {
              document.getElementById('chatBox').innerHTML += '<div class="textResponse">' + json.response + '</div>';
              document.getElementById('userChat').innerText = '';
            });
        }
      }
    </script>
    </head>
    <body>
    <div id="chatBox" style="border:solid thin blue;min-height:100px"></div>
    <div id="userChat" contenteditable="true" style="border:solid thin grey"></div>
    <input type="button" value="send" onClick="sendChat();">
    </body>
    </html>
    END_HTML

The incl::HTML module (here renamed to incl::HTMLtest) takes the URL query string and splits it into key/value pairs, which it puts into %data.

In this minimalistic script, text is entered into <div id="userChat"> and sent back to the Perl script when the button is clicked. This uses the fetch API. The content is in $data{'userChat'} which is just sent back as a very simple JSON object to be written into <div id="chatBox">.

This works as expected until we introduce non-Latin characters - for example "café", which gets displayed as "café".

I've captured the query string before decoding and it is "userChat=caf%C3%A9"

It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...

The code that does the decoding in incl::HTML looks like this. I cannot recall where it came from but it has been working for many, many years and has definitely handled Turkish characters in the past under Perl v5.16.3. I wonder if it is failing after the change to Perl v5.36.0

    my @pairs = split /&/, $query_string;
    foreach my $p (@pairs) {
        $p =~ tr/+/ /;
        $p =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        my ($key, $val) = split /=/, $p, 2;
        $data{$key} = $val;
    }
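For comparison, here is a minimal, self-contained sketch of the fix the thread eventually arrives at (the variable names mirror the post, but this is not the incl::HTML module itself): percent-decoding produces raw UTF-8 bytes, so those bytes still need to be decoded into Perl characters before use.

```perl
#!/usr/bin/perl
# Sketch only: percent-decode to bytes, then decode bytes -> characters.
use strict;
use warnings;
use Encode qw(decode);

my %data;
my $query_string = 'userChat=caf%C3%A9';   # what the browser actually sends

for my $p (split /&/, $query_string) {
    $p =~ tr/+/ /;
    $p =~ s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;   # yields bytes, not characters
    my ($key, $val) = split /=/, $p, 2;
    $data{$key} = decode('UTF-8', $val // '');          # bytes -> characters
}

print length($data{userChat}), "\n";   # 4 - c, a, f, é as characters
```

Without the decode() call, $data{userChat} holds the five octets c, a, f, 0xC3, 0xA9, which is exactly what re-surfaces as "café" when printed back into a page.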

I am beginning to think that I will never understand this mysterious world of character encodings...then I remember that for many, many years references, especially hashrefs were a total mystery to me and now I use them without having to think too hard about it. This is in no small part thanks to the Monastery and I'm hoping a similar magical revelation might be bestowed on me for character encoding! Everything was so much easier when all we had was ASCII!

Replies are listed 'Best First'.
Re: Yet another Encoding issue...
by Danny (Chaplain) on Jun 01, 2024 at 20:17 UTC
       It seems very strange to me that we start off with four characters in "café" and seem to get to five with "caf%C3%A9" which gets decoded as five characters...

    %C3%A9 is the URL percent-encoding of the UTF-8 bytes for é

    perl -we '$_="caf%C3%A9"; s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg; print "$_\n"'
    outputs café for me.
      outputs café for me

      Thanks...I've just tried that over SSH on the server and I get the correct output.

    So, I suppose that means the code that decodes the URI-encoded characters is working and I need to look somewhere else!

      Any suggestions why it would work at the command line but not when sent to a browser?

        It looks like your é is the result of printing the encoded utf8 of é. You needed to print the decoded value. For example:
        perl -we 'use Encode; $c = encode("UTF-8", "é"); $dc = decode("UTF-8", $c); print "\$c = $c \$dc = $dc\n"'
        outputs: $c = é $dc = é
        I was curious how UTF-8 converts a sequence of bytes to a code point that isn't obviously related to the values of those bytes. With the help of UTF-8#Examples, here is how %C3%A9 (é) is converted to the code point 233.

        The bits of %C3 and %A9 are 11000011 and 10101001 (195 and 169). The leading bits of the first byte tell how many bytes are used for this character: here the prefix 110 means two bytes are used (1110 would mean 3 bytes, etc.). For a two-byte encoding, the last 5 bits of the first byte supply the high-order bits of the code point (00011). The leading 10 of the second byte marks it as a continuation byte; its remaining 6 bits (101001) supply the rest of the code point. So we end up with 00011 101001; printf "%s\n", 0b00011101001 gives 233.
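That walk-through can be reproduced in a few lines of Perl - a sketch applying the bit masks described above to the two raw bytes:

```perl
#!/usr/bin/perl
# Recover code point 233 (é) from the percent-encoded bytes %C3 %A9.
use strict;
use warnings;

my ($b1, $b2) = (0xC3, 0xA9);      # 11000011 10101001
my $high = $b1 & 0b0001_1111;      # last 5 bits of the lead byte      -> 00011
my $low  = $b2 & 0b0011_1111;      # last 6 bits of the continuation   -> 101001
my $cp   = ($high << 6) | $low;    # 00011 101001

print "$cp\n";                     # 233
```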
Re: Yet another Encoding issue...
by etj (Priest) on Jun 02, 2024 at 14:27 UTC
    Glad this is solved! A great way to think/talk about this issue that helped me understand it is to differentiate between "characters" and "bytes". Then:
    $bytes = encode($characters);
    $characters = decode($bytes);
    where "characters" are the conceptual thing to be represented, and "bytes" is the means the computer (and hard disk etc) use to do the representing.
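Made concrete with the core Encode module (whose real encode/decode functions also take an encoding name as their first argument), the distinction looks like this - note the two different lengths for the same word:

```perl
#!/usr/bin/perl
# "characters" vs "bytes" with the core Encode module.
use strict;
use warnings;
use Encode qw(encode decode);

my $characters = "caf\x{e9}";                  # café - 4 characters
my $bytes      = encode('UTF-8', $characters); # 5 octets: 63 61 66 C3 A9

print length($characters), "\n";   # 4 - counting characters
print length($bytes), "\n";        # 5 - counting bytes
print decode('UTF-8', $bytes) eq $characters ? "round-trip ok\n" : "mismatch\n";
```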
Solved... (was: Re: Yet another Encoding issue...)
by Bod (Parson) on Jun 02, 2024 at 12:52 UTC

    I've solved the two issues...I'm putting the solutions here for the benefit of anyone who comes this way again with a similar issue.

    Decoding the user input has been solved with Encode as pointed out by Danny in Re^3: Yet another Encoding issue...

    The apparent issue with the output from AI::Chat was that it was being fed the wrong encoding. Using the decode method from Encode at the point the chat history is pulled out of the database helped. But the problem reappeared as the chat went on. So I looked more closely at the MariaDB database encoding.

    The table that stores the chat history was encoded as utf8. I changed it to utf8mb4 and suddenly all the encoding issues seem to have gone away 😊

      It looks like mysql utf8 (an alias for utf8mb3) uses up to 3 bytes per character, while utf8mb4 uses up to 4. It might be an interesting exercise to figure out which characters were not fitting into 3 bytes: utf8mb3 can only store code points from 0 to 65535 (the Basic Multilingual Plane). I guess you could look for ord($char) > 65535.
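A sketch of that suggested check (the function name is made up for illustration): every standard Turkish letter sits comfortably inside the BMP, but an emoji like the 😊 in the post above needs 4 UTF-8 bytes and so overflows utf8mb3.

```perl
#!/usr/bin/perl
# Flag characters whose code point is above U+FFFF, i.e. ones that
# need 4 UTF-8 bytes and so won't fit MySQL/MariaDB's 3-byte "utf8".
use strict;
use warnings;

sub needs_utf8mb4 {
    my ($string) = @_;
    return grep { ord($_) > 0xFFFF } split //, $string;
}

print needs_utf8mb4("caf\x{e9}")    ? "yes" : "no", "\n";   # no  - é fits in 2 bytes
print needs_utf8mb4("ok \x{1F60A}") ? "yes" : "no", "\n";   # yes - 😊 needs 4 bytes
```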

        The standard Turkish characters that are not in the English alphabet are Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü

        It's strangely interesting that the AI generates more spurious characters the more incorrectly encoded characters are fed to it. I wonder if it tries to guess the encoding and gets confused.

        When I click 'preview' here in PM, the text is partly converted to HTML entities - that's probably the characters that were causing the issue.

        Ç ç &#286; &#287; &#304; &#305; Ö ö &#350; &#351; Ü ü
