note
vsespb
Well, actually, I recently wrote a not-so-small application (20k lines) that allows Unicode everywhere and handles it correctly.<br><br>
But it does not need 90% of the things you listed.<br><br>
Probably because my application does not try to analyze text data (it only stores, converts, compares, and re-encodes it), it needs neither sort nor fc-aware comparison.<br><br>
Your case is probably something that analyzes text (right now I can only imagine something related to natural language processing, a word processor, or maybe a dictionary).<br><br>
So I think different applications need different levels of Unicode support.<br><br>
Below are some cases where the policies you listed can be wrong:<br>
<blockquote>
lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
</blockquote>
<blockquote>
Something like a-z should often be \p{Ll} or \p{lower}
</blockquote>
If you write, say, code that has to parse HTTP headers (no, that is not reinventing the wheel like writing an HTTP library;
it can be a proxy server or a REST library), then "cmp" and "a-z" would be the correct choice, and fc() and \p{lower} can introduce bugs (say, with "ß" vs "ss").<br>
<br>
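To make the header case concrete, here is a minimal sketch, assuming the header text has already been decoded to characters. The helpers ascii_lc and header_name_eq are hypothetical names made up for this illustration, not part of any HTTP module.<br>

```perl
use v5.16;      # enables strict, fc and unicode_strings
use warnings;

# Hypothetical helper: lowercase ASCII letters only, leave everything else alone.
sub ascii_lc { my ($s) = @_; $s =~ tr/A-Z/a-z/; return $s }

# Hypothetical helper: ASCII case-insensitive comparison for protocol tokens.
sub header_name_eq { return ascii_lc($_[0]) eq ascii_lc($_[1]) ? 1 : 0 }

# "Keep-Alive" where the first letter is U+212A KELVIN SIGN, not ASCII "K".
my $spoofed = "\x{212A}eep-Alive";

my $ascii_match = header_name_eq('Keep-Alive', 'keep-alive');       # same token
my $ascii_spoof = header_name_eq($spoofed, 'keep-alive');           # different token
my $fc_spoof    = fc($spoofed) eq fc('keep-alive') ? 1 : 0;         # fc folds U+212A to "k"
my $fc_sharp_s  = fc("stra\x{DF}e") eq fc('strasse') ? 1 : 0;       # fc folds U+00DF to "ss"

# Character classes: U+00DF is a lowercase letter, but it is not in a-z.
my $in_a_z = "\x{DF}" =~ /\A[a-z]\z/   ? 1 : 0;
my $in_Ll  = "\x{DF}" =~ /\A\p{Ll}\z/  ? 1 : 0;

say "ascii_match=$ascii_match ascii_spoof=$ascii_spoof";
say "fc_spoof=$fc_spoof fc_sharp_s=$fc_sharp_s";
say "in_a_z=$in_a_z in_Ll=$in_Ll";
```

With the ASCII-only comparison the spoofed name stays a different token; with fc() it silently becomes equal to the real one, which is the kind of bug meant above.<br>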
Other examples are unit tests, where you usually deal with predefined data sets, internal program metadata that is always plain ASCII,
comparison of MD5/SHA hex values, etc.<br>
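A small sketch of the hex digest case, using the core Digest::MD5 module. The helpers is_md5_hex and same_digest are hypothetical names for this example only. The data is ASCII by construction, so lc() and an explicit 0-9a-f class are enough.<br>

```perl
use v5.16;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: accept exactly 32 ASCII hex digits.
sub is_md5_hex { return $_[0] =~ /\A[0-9a-fA-F]{32}\z/ ? 1 : 0 }

# Hypothetical helper: hex digests are plain ASCII, so lc() is sufficient here.
sub same_digest { return lc($_[0]) eq lc($_[1]) ? 1 : 0 }

my $ours   = md5_hex('');   # md5_hex returns lowercase hex
my $theirs = uc $ours;      # e.g. a tool that prints uppercase hex

my $valid = is_md5_hex($theirs);
my $same  = same_digest($ours, $theirs);

# \d is broader than 0-9: it also matches U+0663 ARABIC-INDIC DIGIT THREE.
my $d_matches_arabic     = "\x{663}" =~ /\A\d\z/     ? 1 : 0;
my $ascii_matches_arabic = "\x{663}" =~ /\A[0-9]\z/  ? 1 : 0;

say "valid=$valid same=$same";
say "d_matches_arabic=$d_matches_arabic ascii_matches_arabic=$ascii_matches_arabic";
```

Here the narrow ASCII class is the stricter and therefore safer choice, since anything outside 0-9a-f is simply invalid input.<br>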
<blockquote>
Opening a text file without stating its encoding somewhere or other is a recipe for failure.
</blockquote>
Unless it's a binary file.<br>
<br><br>
<blockquote>
@lines = do { local $/; split /\R/, <INPUT> };
</blockquote>
Hm. I don't think it is correct to treat something like U+2028 as a line separator for files.<br><br>
You need code like this if you read from a <i>text file</i>. A text file is something separated by LF or CRLF; other combinations are not portable.<br><br>
If you are writing a word processor that should handle U+2028, you should not mix this with system file I/O; instead, introduce your own logic for
splitting data into "lines" and paragraphs.<br><br>
I don't see where it could be correct to mix "lines" from your word processor logic with the lines of a text file on disk (or a socket).<br>
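A minimal sketch of the difference, using an in-memory string instead of a real file (the sample data is made up for illustration). \R also matches U+2028, so it splits inside what a file-level reader would consider one line, while /\r?\n/ only honours LF and CRLF.<br>

```perl
use v5.16;
use warnings;

# Three file-level lines; the second one contains U+2028 LINE SEPARATOR.
my $data = "first\r\nsecond with \x{2028} inside\nthird\n";

# \R treats CRLF, LF and U+2028 (among others) as line breaks.
my @by_R = split /\R/, $data;

# File-level splitting: only LF and CRLF end a line.
my @by_eol = split /\r?\n/, $data;

say "by_R: ",   scalar @by_R;
say "by_eol: ", scalar @by_eol;
```

An application that cares about U+2028 could first split on /\r?\n/ at the I/O layer and then apply its own paragraph or line logic to each element.<br>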
1057058
1057169