isync has asked for the wisdom of the Perl Monks concerning the following question:
Hi there! (running perl 5.8.7)
I am going through the tedious work of making a script Unicode- and UTF-8-aware. Now that I finally understood the difference between Unicode and UTF-8, I thought that, to make a script really multi-language aware, I needed to process all regexes etc. in Perl's "internal format" - wrong I was!
This is my procedure pipeline:
1. read a string from variously encoded sources --> decode it properly to get "perl's internal format"
2. do various things with the textual data
3. re-encode it to utf8 (effectively a transport/storage format) and write it to disk (in binmode).
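For illustration, the three steps above could be sketched roughly like this. It is a minimal sketch, assuming a Latin-1 source; the sample bytes and the variable names are made up for the example and are not taken from the actual script, and step 2 is only a placeholder (a character count):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "Grüße" as they would arrive from a
# Latin-1 (ISO-8859-1) encoded source. This sample is an assumption.
my $raw_bytes = "Gr\xFC\xDFe";

# Step 1: decode bytes in a known encoding into a Perl character string.
my $text = decode('iso-8859-1', $raw_bytes);

# Step 2: placeholder for "various things"; here just count characters.
my $char_count = length $text;

# Step 3: re-encode to UTF-8 bytes for storage or transport. These bytes
# are what would be printed to a filehandle that is in binmode.
my $utf8_bytes = encode('UTF-8', $text);
my $byte_count = length $utf8_bytes;

print "$char_count characters, $byte_count bytes\n";   # 5 characters, 7 bytes
```

The two counts differ because ü and ß are one character each in the decoded string but two bytes each once encoded as UTF-8.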
But then, surprise surprise on step 2!
I had the following regex:
    $internal_format_string =~ s/\n//g;
and it removed some letters, spaces and a lot more! Then my thought was it has to do with the string being in "internal format". So I tried:
    require Encode;
    my $string_in_utf8 = Encode::encode_utf8($internal_format_string);
    $string_in_utf8 =~ s/\n//g;
and it worked again! So it seems Perl requires my string to be in UTF-8, at least to recognize the special \n newline char. But doesn't this prevent me from properly handling the broad range of Unicode characters in regexes other than the one removing the \n char? So I tried to get back to full Unicode processing in my regexes:
    $internal_format_string =~ s/\x{0A}//g;
Which failed (might be because I am using the wrong syntax for the hex operation) (or is the string not in hex but in Unicode? \u{000A} failed as well..)
So what should I do?
Should I use regexes on scalars containing Unicode/"internal format" data, or on scalars containing UTF-8 encoded data?
Should my "script-internal standard" be decoded Unicode or UTF-8 encoded Unicode?
(To make it all worse, the perlfaq says the "internal format" is UTF-8 encoded Unicode, but that I should forget about that - now SHOULD I?)
Replies are listed 'Best First'.
Re: The unicode / utf8 struggle, part 2: regexes
by Joost (Canon) on May 17, 2007 at 12:03 UTC
Re: The unicode / utf8 struggle, part 2: regexes
by graff (Chancellor) on May 17, 2007 at 14:20 UTC
Re: The unicode / utf8 struggle, part 2: regexes
by isync (Hermit) on May 17, 2007 at 15:40 UTC
by graff (Chancellor) on May 17, 2007 at 18:59 UTC
Re: The unicode / utf8 struggle, part 2: regexes
by mattr (Curate) on May 22, 2007 at 09:41 UTC
Re: The unicode / utf8 struggle, part 2: regexes
by Juerd (Abbot) on Jun 13, 2007 at 19:22 UTC