ait has asked for the wisdom of the Perl Monks concerning the following question:
Pardon my ignorance on some of these internals, but I am going crazy transferring data through 3 different databases in different charsets through different layers (ssh, file transfers, direct SQL), etc.
Context: Trying to figure out why PHP::Serialization reports 30 as the string length of the 26 char string. When I serialize to the PHP Array I get this:
s:30:"Triple “S” Industrial Corp"
So I am trying to figure out whether the bug is in PHP::Serialization or somewhere else in this crazy three-system interface. The PHP on the target server is 7.2.10, so I am assuming it supports these UTF characters without issue. But what seems strange to me is that Perl and PHP would both internally report 30 as the character length? So before I dive into that module's code to understand what it's doing, I first want to understand how Perl stores this string internally.
So given this string: Triple “S” Industrial Corp (note funky quotes), this is the Dump:
SV = PV(0x5584829062e0) at 0x558482ad2ee0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0
CUR = 30
LEN = 56
COW_REFCNT = 0
What are the characters \342\200\234 (the left funky quote)?
How would I manually decode them if I wanted to ? (i.e. is this a utf8 sequence? how do I know what they mean?)
Is this why CUR reports 30 "perl characters" instead of 26 actual characters?
Re: How to interpret characters in Devel::Peek CUR
by haukex (Archbishop) on Jun 09, 2020 at 07:58 UTC
Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The Unicode character U+201C LEFT DOUBLE QUOTATION MARK (“) is encoded in UTF-8 as the bytes e2 80 9c (\342\200\234), and the Unicode character U+201D RIGHT DOUBLE QUOTATION MARK (”) is encoded in UTF-8 as the bytes e2 80 9d (\342\200\235).
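To decode such a sequence by hand, the bit layout of a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) can be unpacked with a few masks and shifts. The following is only a minimal sketch for illustration; the variable names are made up here, and real code should use utf8::decode or Encode instead:

```perl
use strict;
use warnings;

# The three bytes of the left quote, written with the same octal
# escapes that Devel::Peek prints.
my @bytes = map { ord } split //, "\342\200\234";

# 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx.
# Strip the marker bits and glue the payload bits together.
my $code_point = (($bytes[0] & 0x0F) << 12)
               | (($bytes[1] & 0x3F) << 6)
               |  ($bytes[2] & 0x3F);

printf "bytes:      %s\n", join ' ', map { sprintf '%02x', $_ } @bytes;  # e2 80 9c
printf "code point: U+%04X\n", $code_point;                              # U+201C
```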
One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "UTF8" flag, which means that this string is bytes, and yes, that's why you're getting a length of 30. (Update: It is important to note, however, that testing a string's UTF8 flag for anything other than debugging is a code smell - your code should normally rely on the fact that you're getting strings in the correct format.)
You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module (note that you don't have to put use utf8; in your code to load it; use utf8 means "this Perl source file is encoded in UTF-8", which may or may not be what you want). You can use utf8::decode($string); to decode the string you have, and then you'll see this output:
SV = PV(0x5584829062e0) at 0x558482ad2ee0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
CUR = 30
LEN = 32
And length will now report 26. The UTF8 flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.
Note that it's usually best to decode data as it's coming into Perl (e.g. specifying an open mode of '<:encoding(UTF-8)') and encode it as it leaves. Having to do this manually in your code sometimes means that the source where you're getting the data may be buggy with regard to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP serialize docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then serialize and PHP::Serialization simply pass those bytes through; this means you'd have to know which encoding was used to store the Unicode data into the PHP string to correctly decode it, in the case that it's not always UTF-8.
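As a rough illustration of "decode on the way in, encode on the way out", here is a small sketch that uses in-memory filehandles in place of real files. The variable names are invented for this example, and the input is assumed to be valid UTF-8:

```perl
use strict;
use warnings;

# Raw UTF-8 bytes, as they might arrive from a file or socket.
my $bytes = "Triple \342\200\234S\342\200\235 Industrial Corp\n";

# Decode while reading: the layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;
print length($line), "\n";        # 26 (characters)

# Encode while writing: the layer turns characters back into bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open for write: $!";
print {$out} $line;
close $out;
print length($out_bytes), "\n";   # 30 (bytes)
```

With real files the only change would be passing a filename instead of a scalar reference to open.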
As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a use 5.030; at the top of the file to enable all of its features.
Thanks a lot for this detailed answer!
I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:
the fact that they're stored internally as UTF-8 should be considered an implementation detail
I think I found what seems to be the root cause of the issue:
We are pulling data from an SQL Server database that is encoded in CP-1252 and we are using the DBI with the MS ODBC Driver for Linux version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF-8 flag is not set (even though some records actually contain UTF-8 characters).
When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end too, which doesn't seem to affect the utf8 flag but it does mess up our trimming (the SQL Server char strings are padded with whitespace).
Data from SQL Server
SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp \0"\0
CUR = 51
LEN = 53
COW_REFCNT = 1
Data after being Stored in Postgres (and retrieved)
SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp "]
CUR = 61
LEN = 63
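The double encoding visible in this dump can be reproduced in isolation. The sketch below assumes the mechanism is what it appears to be: the undecoded UTF-8 bytes are treated as three separate characters (U+00E2, U+0080, U+009C) and each one is encoded to UTF-8 again. The variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw( encode );

# The UTF-8 bytes of the left quote, NOT decoded: Perl sees three
# characters with ordinals 0xE2, 0x80 and 0x9C.
my $utf8_bytes = "\342\200\234";

# Encoding those three "characters" again produces two bytes each.
my $double = encode('UTF-8', $utf8_bytes);

print join('', map { sprintf '\\%03o', ord } split //, $double), "\n";
# \303\242\302\200\302\234
```

The printed escapes match the sequence in the PV line above, which is consistent with the string being encoded a second time on its way into PostgreSQL.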
Using utf8::decode on the string before storing into Postgres actually solves the issue. So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?
As I previously commented here, much depends on the way you set up the connection, and there is still room to play with server-side encodings:
I recently worked from perl on Linux with a MS SQL server database, and got the best results with FreeTDS:
my $dbh = DBI->connect ("dbi:ODBC:mssql_freetds", $username, $password, \%dbi_attributes);
$ cat ~/.odbc.ini
[mssql_freetds]
Description = My MS SQL database
Driver = FreeTDS
TDS version = 7.2
Trace = No
Server = mysql.server.local
Port = 1433
Database = DatabaseName
User = UserName
Password = PassWord
Client Charset = UTF-8
The biggest difference between FreeTDS and the MS ODBC driver is the return type of UUID fields. Also, the MS ODBC driver does not allow nested queries, whereas the FreeTDS driver does. So I used the ODBC driver to make a CSV dump of the database and the FreeTDS driver to actually work with the database.
For ODBC I did
my $dbh = DBI->connect ("dbi:ODBC:mssql_odbc", $username, $password, \%dbi_attributes);
$ cat ~/.odbc.ini
[mssql_odbc]
Description = My MS SQL database
Driver = ODBC Driver 17 for SQL Server
Server = mysql.server.local
Database = DatabaseName
User = UserName
Password = PassWord
Also make sure you put the fully qualified hostname in the server name. localhost will not work.
Enjoy, Have FUN! H.Merijn
So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?
It's too bad that the server is out of your control, since that seems to be the source of the problem. But anyway, yes, I think fixing the issue as early as possible - as you pull the data off the server - is the "best" (relatively) way to go about it. Two things to keep in mind: make sure that all the data really is UTF-8, and check the return value of utf8::decode(), because if that fails, then there's definitely something wrong with the encoding. Also be aware that false positives (e.g. data that is actually CP-1252 but also happens to decode as valid UTF-8) are possible, though somewhat unlikely.
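One possible shape for such a check is sketched below. The helper name decode_field is hypothetical, and both the CP-1252 fallback and the trimming of trailing padding and NUL bytes are assumptions based on the description of the data earlier in this thread, not a tested recipe for your setup:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode one raw field fetched from the server.
sub decode_field {
    my ($raw) = @_;
    return $raw unless defined $raw;

    my $text = $raw;
    # utf8::decode works in place and returns false if the bytes are
    # not well-formed UTF-8; in that case fall back to CP-1252 (assumption).
    $text = decode('cp1252', $raw) unless utf8::decode($text);

    # Strip the CHAR column padding and any stray trailing NUL bytes.
    $text =~ s/[\s\0]+\z//;
    return $text;
}

my $field = decode_field("Triple \342\200\234S\342\200\235 Industrial Corp    \0");
print length($field), "\n";   # 26
```

Logging the records that take the fallback branch would make it easier to see how much of the data is not UTF-8 after all.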
Re: How to interpret characters in Devel::Peek CUR
by kcott (Archbishop) on Jun 09, 2020 at 05:37 UTC
G'day ait,
The characters, “ and ”, are U+201C and U+201D.
The numbers \342\200\234 and \342\200\235 are the octal values of the bytes that make up those characters.
You can break those characters into their constituent bytes and check the octal values like this:
$ perl -C -E '
    my $x = "\x{201c}S\x{201d}";
    say $x;
    {
        use bytes;
        printf "%vo\n", $x;
    }
'
“S”
342.200.234.123.342.200.235
See also: the bytes pragma, noting the emboldened warning; and the vector flag information in sprintf.
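Given that warning about the bytes pragma, the same octal values can also be obtained by encoding the string explicitly and formatting the resulting bytes. This is just an alternative sketch of the same check, with a variable name chosen for the example:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $x = "\x{201c}S\x{201d}";

# Encode the characters to UTF-8 bytes, then print each byte's
# ordinal in octal, joined with dots (the "v" flag of sprintf).
my $octal = sprintf '%vo', encode('UTF-8', $x);
print "$octal\n";   # 342.200.234.123.342.200.235
```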
$ perl -MData::Peek -wE'say $^O;DPeek ("\x{201c}"); DPeek ("\x{201d}")'
os390
PV("\312\101\160"\0) [UTF8 "\x{201c}"]
PV("\312\101\161"\0) [UTF8 "\x{201d}"]
Enjoy, Have FUN! H.Merijn
Re: How to interpret characters in Devel::Peek CUR
by ikegami (Patriarch) on Jun 09, 2020 at 17:07 UTC
So given this string: Triple “S” Industrial Corp (note funky quotes)
More precisely, you have this text encoded using UTF-8.
What are the characters \342\200\234 (the left funky quote)
Octal escape sequences that produce the bytes that form the encoding of «“» using UTF-8.
use feature qw( say );
use Encode qw( encode );
say
encode("UTF-8", "\N{LEFT DOUBLE QUOTATION MARK}")
eq
"\342\200\234";
# Output: 1
How would I manually decode them if I wanted to ?
You could use
utf8::decode($s);
If this string was constructed from a string literal, then you should have used the following to tell Perl the source was encoded using UTF-8 instead of ASCII:
use utf8;
If this is read from a file, an encoding layer would do this automatically for you. You can set this up using
use open ':std', ':encoding(UTF-8)';
Is this why CUR reports 30 "perl characters" instead of 26 actual characters?
The string has 30 characters, not 26. You can verify this using length. If you were to decode those 30 bytes, you would get 26 Unicode Code Points, but that would be a different string, and length would return 26.
use feature qw( say );
use Encode qw( decode );
no utf8;
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8); # 30 chars
my $ucp = decode("UTF-8", $utf8);
say length($ucp); # 26 chars
That said, CUR indicates the number of bytes of the string buffer that are being used, not the number of characters in the string. They just happen to be the same for your string.
use feature qw( say );
use Encode qw( decode );
use Devel::Peek qw( Dump );
no utf8;
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8); # 30 chars
Dump($utf8); # CUR = 30
my $ucp = decode("UTF-8", $utf8);
say length($ucp); # 26 chars
Dump($ucp); # CUR = 30
Because we called length before Dump, you'll see the PERL_MAGIC_utf8 (w) magic was added to cache the length (MG_LEN = 26).