PerlMonks  

My UTF-8 text isn't surviving I/O as expected

by ibm1620 (Hermit)
on Nov 23, 2024 at 20:14 UTC ( [id://11162851]=perlquestion )

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I decided to join the 21st century and learn how to work with Unicode and UTF-8 encoding. I've included here a test Perl program that both contains literal UTF-8 text, and writes it into a sqlite DB, reads it back in, and prints it to STDOUT.

The advice given by brian d foy (https://stackoverflow.com/a/47946606/522385) and others has been to include the following two pragmas:

use utf8;                             # expect UTF-8 text in this source code
use open qw(:std :encoding(UTF-8));   # do UTF-8 encoding on I/O and STD*
But the following code produces garbage output on STDOUT unless I comment out both pragmas!

The garbage I sometimes see is: Åke Lindström. If I paste and pipe that into `hexdump` I get

00000000  c3 83 c2 85 6b 65 20 4c  69 6e 64 73 74 72 c3 83  |....ke Lindstr..|
00000010  c2 b6 6d                                          |..m|
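
The bytes in that hexdump are consistent with the text having been UTF-8 encoded twice. The following is a minimal sketch of that reading, not a diagnosis of where the second encoding happens in the program; `hex_bytes` is a hypothetical helper introduced only for this illustration, and the string is written with escapes so the example does not depend on the file's own encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: U+00C5, "ke Lindstr", U+00F6, "m" (escapes keep this file ASCII-only).
my $chars = "\x{c5}ke Lindstr\x{f6}m";

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack('(H2)*', $bytes);
}

my $once  = encode('UTF-8', $chars);   # one encoding step: the intended bytes
my $twice = encode('UTF-8', $once);    # the bytes are treated as characters and encoded again

print hex_bytes($once),  "\n";   # c3 85 6b 65 20 4c 69 6e 64 73 74 72 c3 b6 6d
print hex_bytes($twice), "\n";   # c3 83 c2 85 6b 65 20 4c 69 6e 64 73 74 72 c3 83 c2 b6 6d
```

The first line matches the hex string quoted further down in this post, and the second matches the hexdump above.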

Other possibly-relevant info: I'm on MacOS Sequoia. I'm working in iTerm2.app, and I get the same behavior in Terminal.app.

I've read perlunitut and https://perldoc.perl.org/open, probably not enough times.

(Please note: I'm having trouble using UTF-8 text in this post, so unfortunately it's not going to look right. The text I'm trying to use, shown as a hex string, is "c3856b65204c696e64737472c3b66d". I fervently hope that, even without working code, someone can identify what I'm doing wrong.)

#!/usr/bin/env perl
use v5.40;

# brian d foy recommends using these settings
# (https://stackoverflow.com/a/47946606/522385):
# (1) recognize UTF-8 in this source code:
use utf8;
# (2) do the right things for writing and reading UTF-8, including to STD*:
use open qw(:std :encoding(UTF-8));

my $utf8_text1 = "Åke Lindström"; # contains UTF8 chars:

say "A variable set to a UTF8 literal within perl program";
show ($utf8_text1);

use DBI;
my $dbh = DBI->connect(
    "dbi:SQLite:dbname=:memory:",
    "", "",
    { RaiseError => 1, AutoCommit => 1 }
);
$dbh->do('CREATE TABLE names (name_id CHAR PRIMARY KEY, name CHAR)');
$dbh->do(qq{INSERT INTO names VALUES("nm0512537", "$utf8_text1")});

my $aoa_ref = $dbh->selectall_arrayref(
    q{SELECT name FROM names WHERE name_id="nm0512537"}
);
say "\nUTF-8 text stored in, and retrieved from, sqlite DB:";
show($aoa_ref->[0][0]);

sub show($str) {
    say "Binary: ", join ' ', (unpack "H*", $str) =~ m/../g ;
    say "Text>STDOUT: $str";
}

Replies are listed 'Best First'.
Re: My UTF-8 text isn't surviving I/O as expected
by choroba (Cardinal) on Nov 23, 2024 at 20:40 UTC
    You have correctly prepared the code to handle UTF-8 in the source code and input and output operations. What's missing is doing the same for the communication with the database.

    By default, DBD::SQLite uses a setting which is wrong (see the documentation for details). To fix it, only slight changes are needed:

    use DBD::SQLite::Constants ':dbd_sqlite_string_mode';

    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=:memory:",
        "", "",
        {RaiseError => 1, AutoCommit => 1,
         sqlite_string_mode => DBD_SQLITE_STRING_MODE_UNICODE_STRICT }
    );

    PerlMonks is very old and its <code> sections can't handle Unicode. Either use <pre> instead, or replace the Unicode characters in the source code with their names:

    my $utf8_text1 = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}ke Lindstr\N{LATIN SMALL LETTER O WITH DIAERESIS}m";

    BTW, get into the habit of using placeholders to insert values to prevent SQL injection:

    my $insert = $dbh->prepare('INSERT INTO names VALUES(?, ?)');
    $insert->execute('nm0512537', $utf8_text1);

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      THANK YOU!! That cleared up a lot. I didn't even suspect that the culprit was sqlite.

      I still have trouble reading UTF-8 from command line arguments. I assume this is not a Perl issue; any suggestions how to fix?

      #!/usr/bin/env perl
      use v5.40;
      use utf8;
      use open qw(:std :encoding(UTF-8));
      
      my $utf8_text1 = shift;
      say "A variable set from argument on command line";
      show ($utf8_text1);
      
      my $utf8_text2 = 'Åke Lindström';
      say "A variable set to UTF8 literal";
      show($utf8_text2);
      
      chomp (my $utf8_text3 = <>);
      say "A variable set by reading from STDIN";
      show($utf8_text3);
      
      sub show($str) {
          say "Binary:      ", join ' ', (unpack "H*", $str) =~ m/../g ;
          say "Text>STDOUT: $str";
      }
      
      Output:
      $ echo "Åke Lindström" | u 'Åke Lindström'
      A variable set from argument on command line
      Binary:      c3 85 6b 65 20 4c 69 6e 64 73 74 72 c3 b6 6d
      Text>STDOUT: Åke Lindström
      A variable set to UTF8 literal
      Binary:      c5 6b 65 20 4c 69 6e 64 73 74 72 f6 6d
      Text>STDOUT: Åke Lindström
      A variable set by reading from STDIN
      Binary:      c5 6b 65 20 4c 69 6e 64 73 74 72 f6 6d
      Text>STDOUT: Åke Lindström
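
      In the output above, the command-line argument still shows the raw UTF-8 bytes (c3 85 ...), while the literal and the STDIN line show decoded characters (c5 ...). The `open` pragma sets layers on filehandles and does not touch `@ARGV`, so the arguments need their own decoding step. A minimal sketch, assuming the shell passes arguments as UTF-8 (which depends on the terminal's locale); `decode_args` is a hypothetical helper, and the argument is simulated with byte escapes so the example runs without a command line.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw argument bytes into Perl character strings.
# Assumption: the arguments are UTF-8 encoded.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

# Simulate the bytes the shell would hand over for the name used in this thread.
my $raw_arg = "\xc3\x85ke Lindstr\xc3\xb6m";
my ($name)  = decode_args($raw_arg);

printf "%d bytes in, %d characters out\n", length($raw_arg), length($name);  # 15 bytes in, 13 characters out
printf "first code point: U+%04X\n", ord($name);                              # first code point: U+00C5
```

      In a real script this would be `@ARGV = decode_args(@ARGV);` near the top. perlrun also documents the `-CA` switch and the `PERL_UNICODE` environment variable as ways to have Perl treat `@ARGV` as UTF-8.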
      
Re: My UTF-8 text isn't surviving I/O as expected
by cavac (Parson) on Nov 25, 2024 at 13:47 UTC
      For me, ⁵ works without problems. Isn't it a browser/font problem?

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Reading Tom Christiansen's sobering post about Unicode was enough to discourage me from trying to become proficient with Unicode. I'm retired so I get to do that :-)

        On the surface, yes, it looks bad. But from my experience, you can cover nearly all cases (like 99.5% or so) by following some simple rules, no matter the encoding:

        • Convert all incoming data to Perl's internal representation (utf8::decode or similar)
        • Convert all outgoing data to the correct encoding (utf8 or similar)
        • Unless you really have to verify very specific things in text, just treat it like a random binary blob.
        • 0 + $var works for converting text to numeric values.
        • If you do any type of string comparison in your code, always normalize both sides using Unicode::Normalize and always stick to the same normalization form.
        • Don't assume that any other text encoding standard is saner. Or even a global standard.
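
        The decode, encode, and normalize rules above can be sketched as follows. This is an illustration under assumptions, not a complete recipe: the input is taken to be UTF-8, NFC is picked arbitrarily as the one normalization form, and `same_text` is a hypothetical helper. Only core modules (Encode and Unicode::Normalize) are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compare two character strings after normalizing both to NFC.
sub same_text {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

# Incoming bytes, decoded on the way in.
# First form: A-with-ring as one precomposed code point (U+00C5).
my $precomposed = decode('UTF-8', "\xc3\x85ke");
# Second form: plain "A" followed by COMBINING RING ABOVE (U+030A).
my $decomposed  = decode('UTF-8', "A\xcc\x8ake");

print "raw eq:        ", ($precomposed eq $decomposed ? "same" : "different"), "\n";           # different
print "normalized eq: ", (same_text($precomposed, $decomposed) ? "same" : "different"), "\n";  # same

# Outgoing data, encoded explicitly on the way out.
my $out_bytes = encode('UTF-8', NFC($decomposed));
print "bytes out: ", join(' ', unpack('(H2)*', $out_bytes)), "\n";                             # c3 85 6b 65
```

        The two inputs look identical on screen but differ as code point sequences, which is why the plain `eq` fails and the normalized comparison succeeds.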

        The basic ugliness of Unicode (or other text encodings) stems not from their engineers but from the basic fact that human language is a complicated mess. And written language is still a somewhat new concept in human evolution and we are still trying to figure out the finer details. At least with Unicode, you don't have to constantly switch schemes depending on who is using your software.

