http://www.perlmonks.org?node_id=11122737


in reply to Re^3: How to avoid decoding string to utf-8.
in thread How to avoid decoding string to utf-8.

Hi haj, ikegami. Thank you for the replies.

I tried the regex provided. Unfortunately, it does not seem to work, and I still get the same result.

Please note that I am seeing this result in a web application.

Below is what I have tried:

my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF]    | # 2 bytes unicode char
                              [\xE0-\xEF][\x80-\xBF]{2} | # 3 bytes unicode char
                              [\xF0-\xFF][\x80-\xBF]{3}/x;
$testStr =~ s/($utf8_decodable_regex)/decode('UTF-8',$1,Encode::FB_CROAK | Encode::LEAVE_SRC)/gex;
#$testStr = decode('utf-8',$testStr) if $testStr=~/$utf8_decodable_regex/;
Any breakthrough would be appreciated, while I am trying to get around this issue.

Thank you for the efforts.

Re^5: How to avoid decoding string to utf-8.
by Corion (Patriarch) on Oct 12, 2020 at 09:19 UTC

    If the data comes from a web application, consider that, at least for form submissions, the browser sends you the character encoding in a header. If the web application sends the data via JavaScript, tell the web developers that they need to make sure their data is always UTF-8.

      Hi Corion,

      The charset seems to be set to utf-8:

      <META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=utf-8">
      Thank you.

        That is the charset of the web page, not the charset that the browser sends you when it submits form data.

        Usually, setting the charset of the web page should ensure UTF-8, but it seems that this depends on the browser. I still recommend looking at the headers of the incoming request.
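
A minimal sketch of how the incoming request header could be inspected, assuming the code runs under plain CGI, where the request's Content-Type is exposed as `$ENV{CONTENT_TYPE}`. The helper name `charset_from_content_type` is hypothetical. The header may carry no charset parameter at all, in which case the helper returns undef.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type
# header value. Returns the lower-cased charset, or undef if none is present.
sub charset_from_content_type {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return lc($1);
    }
    return undef;
}

# Assumption: plain CGI, so the request header arrives in the environment.
my $charset = charset_from_content_type($ENV{CONTENT_TYPE});
warn 'Request charset: ', $charset // '(not sent)', "\n";
```

Logging this value for a few good and bad submissions shows whether the browser declares an encoding at all, and whether it is the one the application expects.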

Re^5: How to avoid decoding string to utf-8.
by haj (Vicar) on Oct 12, 2020 at 09:54 UTC

    I'm sorry, but it is unclear to me what "seeing this result on web application" actually means. Where do the data come from? Is your Perl code running as part of the web application, or did you write a web client and are trying to decode a response? How did you build $teststr, and how is it different from the example in my code? Where did you insert the code we suggested?

    In particular, my code example does not return anything, so I can't connect to "returning the same result". Without context, I can't offer any more.

      Hi Haj, Thank you for the reply.

      1. The data comes from the database; it is the same data that you see in the web application.
      2. Yes, the Perl code is running as part of the web application.
      3. $testStr comes from the database, where it was inserted when a form was submitted from the application itself. The issue occurs when this string is displayed in the web application.

      As I said earlier, I have strings with mixed encodings: different strings are encoded differently, because the application was upgraded from a legacy application.

      Thank you.

        So you appear to have strings with different encodings in your database. That's really bad, because you won't get correct results from database queries until you get this fixed.

        I have difficulty understanding why the regular expression does not change the result, unless you are several levels of encoding away from the truth. This can happen if, during the upgrade, someone fiddled with the encoding until the result "looked right" in the browser, in which case what you actually have now is just a cancellation of errors. Encoding matters in the transfer from the browser form to the web application, when writing from the web application to the database, and in the opposite direction when reading from the database and when sending the data to your browser. Please tell us how you control encoding in these four places.
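
To illustrate what "several levels of encoding away" can look like, here is a small sketch using only the core Encode module. The sample string and variable names are made up for the example; it does not model any particular database driver or web framework.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text = "caf\x{E9}";    # character string "café", 4 characters

# Correct pipeline: encode once when writing, decode once when reading.
my $stored = encode('UTF-8', $text);    # 5 bytes
my $read   = decode('UTF-8', $stored, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Faulty pipeline: one extra encode on the way in ...
my $stored_twice = encode('UTF-8', $stored);    # 7 bytes
# ... is not undone by a single decode on the way out.
my $read_once = decode('UTF-8', $stored_twice, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "correct round trip restores the text: %s\n", $read eq $text      ? 'yes' : 'no';
printf "single decode of double-encoded data: %s\n", $read_once eq $text ? 'yes' : 'no';
```

In the faulty case, a single decode only gets back to the UTF-8 bytes of the text, not to the text itself. If a second error elsewhere in the chain happens to compensate, the page can still look right, which is the cancellation of errors described above.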

        For obtaining some data for debugging, please print the data - good and bad - like this (also suggested by ikegami earlier in this thread):

        printf("%vX", $testStr);
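
As a rough guide to reading such dumps, the sketch below prints the same character in three states. The character é (U+00E9) is just an example; the dumps for your data will differ.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{E9}";                   # one decoded character: é (U+00E9)
my $utf8   = encode('UTF-8', $chars);    # its UTF-8 bytes
my $double = encode('UTF-8', $utf8);     # UTF-8 encoding applied twice

printf "decoded:        %vX\n", $chars;     # E9
printf "UTF-8 bytes:    %vX\n", $utf8;      # C3.A9
printf "double-encoded: %vX\n", $double;    # C3.83.C2.A9
```

A dump resembling the third line, with C3.83 or C3.82/C2 pairs where a single accented character is expected, would suggest that the string has been encoded more than once.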