Mess with UTF-8, utf8 and raw encoding on live working platform

AlfaProject has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks :)
I have a problem with encoding on my web platform.
This app uses flat files as its database.
When I started developing it, I used this code to open files:
open FH,">$path";
After a few weeks we started to get traffic from other countries, with names in other languages.
I got broken characters with the existing code, so I changed it to this, only on the handles opened for writing (>):
open FH , ">:utf8",$file_path;
It worked like that for a few months.
Most of our traffic comes from the 'Facebook login' button, which uses a Perl API to get the data.
I noticed that on POST forms Perl gets different data, with broken characters.
I have already spent a few days reading articles about encodings in Perl, and I still don't get how to work with them the right way.
For now I have changed all the input and output filehandles to this, because I read in this article that it is the strict UTF-8: http://perlgeek.de/en/article/encodings-and-unicode
open FH, "<:encoding(UTF-8)", "$dir/private_data";
After that I got problems with POST data and with data that comes from Facebook. After many tries I arrived at code that works well. The problem is that I don't understand why it works. Here is the code:

    if($params{FORM_POST}){$_=decode 'UTF-8',$_ for(@$ref_userdata);}
    if($params{FROM_FACEBOOK}){$_=encode 'UTF-8',$_ for(@$ref_userdata);}
    open WH,">:encoding(UTF-8)","$dir/private_data" or die $!."$dir/private_data";
    print WH "$_|" for(@$ref_userdata);
    close WH;

Now I see that the old database, written the regular way and with the old :utf8 layer, shows broken characters.
When I run cat private_data in PuTTY on a file written with the new :encoding(UTF-8), I also get wrong characters that look like this: |||а�аИб�аОб�аЛаАаВаА|аЅаАаНаЕаВаА|female|03-08-1982|
I feel like I'm going around in a circle of searching that will never end.
I don't really know what to ask, I just need help with this.
Thanks a lot!

Re: Mess with UTF-8, utf8 and raw encoding on live working platform by moritz (Cardinal) on Jun 02, 2011 at 12:41 UTC
So, let's get this straight. You fixed the way you read and write data from and to files, and now the old data that was written in the broken way doesn't work anymore. So you need three steps:

1. Identify which data was written in the old, broken way.
2. Find out in what way that data is broken.
3. Fix the data.

For identifying which data is broken, I hope you have some timestamps along with your data, or that you write the rows sequentially, so that you can easily see which rows are "old" and which are "new".

For identifying in which way the data is wrongly encoded, I recommend looking at the data with `hexdump -C yourfilename`; it gives more reliable information than perl (at least if you don't grok how perl does what it does).

If you have some samples of the broken data and of how it should look, you can also try Encode::Repair. If that module doesn't work for you, I'll be happy to improve it, if you provide enough information.

Perl 6 - second systems done right
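One common way for old data to be broken is double encoding: bytes that were already UTF-8 get written through an encoding layer a second time. Whether that is what happened here is an assumption that the hexdump should confirm first. Under that assumption, a minimal repair sketch using only the core Encode module could look like this (`repair_double_utf8` is a hypothetical helper name, not part of any module):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a byte string and returns the repaired text
# string, or undef if the bytes do not look double-encoded.
sub repair_double_utf8 {
    my ($bytes) = @_;

    # First pass: the bytes must be valid UTF-8 at all.
    my $once = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return unless defined $once;

    # Double-encoded data decodes to characters in the range 0-255 only,
    # and must contain at least one non-ASCII character.
    return if $once =~ /[^\x00-\xFF]/;
    return unless $once =~ /[\x80-\xFF]/;

    # Turn those characters back into bytes and try to decode once more.
    my $inner = encode('ISO-8859-1', $once);
    my $twice = eval { decode('UTF-8', $inner, Encode::FB_CROAK()) };
    return $twice;    # undef if the second decode failed
}

# Example: "Müller" encoded twice by mistake.
my $good   = encode('UTF-8', "M\x{FC}ller");
my $double = encode('UTF-8', decode('ISO-8859-1', $good));
my $fixed  = repair_double_utf8($double);
print defined $fixed ? "repaired\n" : "not double-encoded\n";
```

The helper refuses to touch data whose second decode fails, so correctly encoded rows usually pass through as undef and can be left alone. Correct text that happens to survive both decodes is rare but possible, so it is worth checking a sample by hand before rewriting any file.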
Re^2: Mess with UTF-8, utf8 and raw encoding on live working platform by AlfaProject (Beadle) on Jun 02, 2011 at 14:03 UTC
Thanks for the answer, but before this I need to understand when I need to write :encoding(UTF-8) and when not. For example, I wrote a script that takes info from all users and puts it in an index. It reads with `open FH , "<:encoding(UTF-8)","$dir/private_data";`, processes the data, and writes with `open FH ,">:encoding(UTF-8)","$path_index";`. After that another web script reads the data from this index file with `open FH ,"<:encoding(UTF-8)","$path_index";` and prints all the data into an HTML page. All the German characters with umlauts and other special characters are printed as a question mark inside a polygon. But if I use the regular way, `open FH ,"<","$path_index";`, all the characters are shown as they should be. So I don't really understand why this happens.
Re^3: Mess with UTF-8, utf8 and raw encoding on live working platform by moritz (Cardinal) on Jun 02, 2011 at 14:51 UTC
There are two types of string variables in Perl. One type contains text, the other contains bytes. Files contain bytes. If you want to write a text string to a file, the string needs to be converted from text to bytes. That's what the `>:encoding(UTF-8)` does. OTOH if you want to read data from a file, without the `:encoding(UTF-8)` the string will contain bytes, and with it the string contains text.

Sad thing is, you can't reliably see from looking at a string if it's text or bytes. And if you mix the two up, you will see some broken output.

So if you use text strings internally in your program, you need the `:encoding(UTF-8)` both for reading and writing files, and you need to decode all other byte strings that come into your program (for example with %ENV or @ARGV). OTOH some modules already decode strings for you (for example XML and JSON parsers), so you must be aware which module does that.
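To make the text/bytes distinction concrete, here is a minimal sketch using the core Encode module. The sample name is made up for illustration; `decode` and `encode` do by hand what the `:encoding(UTF-8)` layer does on read and write.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# What a plain "<" read returns: the UTF-8 bytes of "Müller".
my $bytes = "M\xC3\xBCller";

# What a "<:encoding(UTF-8)" read returns: a text string.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
# prints "bytes: 7, characters: 6"

# Writing text through ">:encoding(UTF-8)" is the reverse step.
my $back = encode('UTF-8', $text);
print "round trip ok\n" if $back eq $bytes;

# Mixing the two up: encoding a string that already holds bytes treats
# each byte as a character and encodes it again.
my $double = encode('UTF-8', $bytes);
printf "double-encoded length: %d\n", length($double);
# prints "double-encoded length: 9"
```

The last step is the kind of mistake that goes unnoticed inside the program and only shows up as broken characters in the file or on the page.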
Re^4: Mess with UTF-8, utf8 and raw encoding on live working platform by AlfaProject (Beadle) on Jun 05, 2011 at 07:15 UTC
Re^3: Mess with UTF-8, utf8 and raw encoding on live working platform by mje (Curate) on Jun 02, 2011 at 14:35 UTC
I don't think anyone can tell you precisely when you need to add an encoding layer on files opened for reading, because it depends on whether the data in the input file is UTF-8 encoded (in your case) or not. If private_data is UTF-8 encoded, then you need to add the UTF-8 encoding layer to decode it on input. If path_index is intended to be a UTF-8 encoded file, you need to encode the data when writing it. When you say "prints all the data into HTML page", are you running some sort of CGI in web server code? If you are, you need to declare the encoding in the HTML and probably in the Content-Type header too.
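A possible explanation for the question marks, offered as a guess rather than a diagnosis: the index is decoded into text on input, but STDOUT has no encoding layer, so characters below 256 go out as single Latin-1 bytes that a browser expecting UTF-8 cannot read. A minimal CGI-style sketch that keeps output and declared charset in agreement follows. The helper name `print_page` and the sample names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: prints a minimal HTML page listing the given names
# (text strings) to the given filehandle. Real data should also be
# HTML-escaped, which is left out here for brevity.
sub print_page {
    my ($out, @names) = @_;
    print {$out} "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    print {$out} qq{<!DOCTYPE html>\n<html><head><meta charset="UTF-8"></head><body>\n};
    print {$out} "<p>$_</p>\n" for @names;
    print {$out} "</body></html>\n";
}

# Encode text strings to UTF-8 bytes on the way out, matching the header.
binmode STDOUT, ':encoding(UTF-8)';
print_page(\*STDOUT, "M\x{FC}ller", "\x{410}\x{43D}\x{43D}\x{430}");
```

With the layer on STDOUT, data read through `<:encoding(UTF-8)` can be printed directly; without it, reading the file with a plain `<` only appears to work because the bytes pass through the program untouched.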

