PerlMonks  

Re: How to avoid decoding string to utf-8.

by Anonymous Monk
on Oct 12, 2020 at 16:52 UTC (#11122749)


in reply to How to avoid decoding string to utf-8.

Hi haj, ikegami, choroba, Corion, and the other monks who have replied to this thread.
haj, ikegami: the code below worked for me, so the regex approach seems to do the job. Thank you.

use Encode qw(decode encode);

my $utf8_decodable_regex = qr/
    [\xC0-\xDF][\x80-\xBF]    |  # 2-byte UTF-8 sequence
    [\xE0-\xEF][\x80-\xBF]{2} |  # 3-byte UTF-8 sequence
    [\xF0-\xF4][\x80-\xBF]{3}    # 4-byte UTF-8 sequence
/x;

$testStr = decode('utf-8', $testStr);                           # bytes to characters
$testStr =~ s/($utf8_decodable_regex)/decode('utf-8',$1)/gex;   # undo a second level of encoding
$testStr = encode('utf-8', $testStr);                           # characters back to bytes

Although it works, it would be great if you could explain why it works.
Cheers!

Re^2: How to avoid decoding string to utf-8.
by haj (Curate) on Oct 12, 2020 at 18:36 UTC

    Since you still haven't revealed what you did to control encoding at the database or web level, I can only guess.

    • It looks like the database content is hosed and contains strings in different encodings. You can't reliably SELECT records from these data.
    • It seems that you either did not tell your database driver to handle UTF-8, or you neglected to decode content from your web form and wrote doubly encoded data to your database. In both cases, you need that first step of decoding after reading from the database.
    • The regular expression takes care of data that was inserted with a second level of UTF-8 encoding. Whenever the substitution succeeds, you have found bad data in your database, inserted either by your new code or by the legacy application. You can capture the return value of the substitution to check whether a substitution took place, and so identify data that needs to be fixed in your database. With ikegami's suggestion to use utf8::decode you can achieve the same goal: a true return value from utf8::decode indicates broken data.
    • The final encoding step is required if you print a web response with a charset of UTF-8 without specifying an I/O layer for that encoding. Again, without knowing what your code does, I can't say for sure.
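
    The points above can be sketched in a few lines. This is only an illustration under assumptions: the sample text, the variable names, and the helpers repair and looks_double_encoded are made up for the example and are not taken from the original poster's application. It uses a narrowed four-byte lead range (\xF0-\xF4) and an I/O layer on STDOUT in place of the final encode step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# I/O layer: character strings are encoded on output, so no manual
# encode() call is needed before print.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample text: "cafe" with e-acute, a space, and a euro sign.
my $text  = "caf\x{E9} \x{20AC}";
my $once  = encode('UTF-8', $text);    # stored correctly
my $twice = encode('UTF-8', $once);    # stored with a second level of encoding

# Characters in the Latin-1 range that together look like one UTF-8 sequence.
my $utf8_like = qr/
      [\xC0-\xDF][\x80-\xBF]
    | [\xE0-\xEF][\x80-\xBF]{2}
    | [\xF0-\xF4][\x80-\xBF]{3}
/x;

# Returns the repaired character string and the number of substitutions.
# A non-zero count marks a record that was doubly encoded.
sub repair {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    my $count = ($chars =~ s/($utf8_like)/decode('UTF-8', "$1")/ge) || 0;
    return ($chars, $count);
}

# Alternative check with utf8::decode. Pure ASCII also decodes
# successfully, so only strings with non-ASCII characters are tested.
sub looks_double_encoded {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return ($chars =~ /[^\x00-\x7F]/ && utf8::decode($chars)) ? 1 : 0;
}

for my $case (['once', $once], ['twice', $twice]) {
    my ($label, $bytes) = @$case;
    my ($chars, $count) = repair($bytes);
    print "$label: '$chars', substitutions: $count\n";
}
```

    One caveat with this kind of heuristic: correctly stored text that happens to contain such a character pair (for example U+00C3 followed by U+00A9) is indistinguishable from doubly encoded data and would be rewritten as well.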

    Finally, since that code only seems to work, be sure to write a test suite with Unicode data, preferably including strings with characters that cannot be encoded in one byte. Also check the contents of your database with some "non-Perl" code, like the psql command line tool for PostgreSQL or whatever your database engine provides. Without that, your database operations will always be guesswork, and the next migration will most likely go wrong as well.
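
    A test suite along those lines might start like the sketch below. It assumes the repair logic lives in a function, here given the hypothetical name clean_text, and it feeds that function both correctly encoded and doubly encoded bytes for characters that need one, two, three, and four bytes in UTF-8. It does not touch a database; a real suite would also round-trip the samples through the actual storage layer.

```perl
use strict;
use warnings;
use Test::More;
use Encode qw(decode encode);

# Hypothetical stand-in for the code under test: decode bytes read from
# storage, then undo a second level of UTF-8 encoding where present.
sub clean_text {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    $chars =~ s/( [\xC0-\xDF][\x80-\xBF]
                | [\xE0-\xEF][\x80-\xBF]{2}
                | [\xF0-\xF4][\x80-\xBF]{3} )/decode('UTF-8', "$1")/gex;
    return $chars;
}

# Sample strings, chosen so that each UTF-8 sequence length is covered.
my %samples = (
    ascii  => 'plain ASCII',      # 1 byte per character
    latin1 => "caf\x{E9}",        # e-acute: 2 bytes
    bmp    => "\x{20AC}100",      # euro sign: 3 bytes
    astral => "\x{1F600}",        # emoji: 4 bytes
);

for my $name (sort keys %samples) {
    my $text  = $samples{$name};
    my $bytes = encode('UTF-8', $text);

    is(clean_text($bytes), $text,
        "$name: correctly encoded data is unchanged");
    is(clean_text(encode('UTF-8', $bytes)), $text,
        "$name: doubly encoded data is repaired");
}

done_testing();
```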
