PerlMonks  

Re^3: Where are the Perl::Critic Unicode policies for the Camel?

by vsespb (Hermit)
on Oct 06, 2013 at 19:35 UTC ( #1057187=note )


in reply to Re^2: Where are the Perl::Critic Unicode policies for the Camel?
in thread Where are the Perl::Critic policies for the Camel?

Well, actually, I recently wrote a fairly large application (20k lines) that allows Unicode everywhere and handles it correctly.

But it does not need 90% of the things you listed.

That is probably because my application does not try to analyze text data (it only stores, converts, compares, and re-encodes it), so it needs neither sort nor fc-aware comparison.

Your case is probably something that analyzes text (right now I can only imagine something related to natural language processing, a word processor, or maybe a dictionary).

So I think different applications need different levels of Unicode support.

Below are some cases where the policies you listed can be wrong in some circumstances:

lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
Something like a-z should often be \p{Ll} or \p{lower}
If you write, say, code that has to parse HTTP headers (no, that's not reinventing the wheel like yet another HTTP library; it could be a proxy server or a REST library), then "cmp" and "a-z" would be the correct choice, and fc() or \p{lower} can introduce bugs (say, with "ß" vs "ss").

Other examples are unit tests, where you usually deal with predefined data sets, internal program metadata that is always plain ASCII, comparison of MD5/SHA hex values, etc.
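
To make the "ß" vs "ss" point concrete, here is a minimal sketch, assuming Perl 5.16 or later (needed for fc). The helper name ascii_ieq is made up for illustration; it folds only A-Z, which is the kind of comparison protocol tokens such as HTTP header names call for.

```perl
use v5.16;      # enables strict, fc() and unicode_strings
use warnings;

# Hypothetical helper (not from any library): case-insensitive comparison
# that folds only the ASCII letters A-Z and leaves everything else alone.
sub ascii_ieq {
    my ($x, $y) = @_;
    tr/A-Z/a-z/ for $x, $y;    # $x and $y are copies, callers are unaffected
    return $x eq $y;
}

my $strasse = "stra\N{U+DF}e";    # U+00DF is the German sharp s

# Full Unicode case folding makes the sharp s equal to "ss" ...
say "fc:    ", (fc($strasse) eq fc("STRASSE") ? "equal" : "different");

# ... while the ASCII-only comparison keeps the two strings distinct,
say "ascii: ", (ascii_ieq($strasse, "STRASSE") ? "equal" : "different");

# and still matches header names that differ only in ASCII case.
say "ascii: ", (ascii_ieq("Content-Type", "content-type") ? "equal" : "different");
```

Which of the two behaviours is a bug depends entirely on whether the data is natural-language text or a protocol token.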
Opening a text file without stating its encoding somewhere or other is a recipe for failure.
Unless it's a binary file.
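
A small sketch of the two cases, assuming a UTF-8 text file. The temporary file and variable names are only for illustration: the text handle states its encoding in the open mode, while the binary handle uses :raw and gets undecoded bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 text file; the encoding is stated in the open mode.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "na\N{U+EF}ve\n";    # U+00EF is "i" with diaeresis
close $out or die "Cannot close $path: $!";

# Text read: decode explicitly, so length() counts characters.
open(my $text, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $line = <$text>;
close $text;

# Binary read: no decoding layer, so length() counts raw bytes.
open(my $bin, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$bin> };
close $bin;

printf "characters: %d, bytes: %d\n", length($line), length($bytes);
```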


@lines = do { local $/; split /\R/, <INPUT> };
Hm. I don't think it's correct to treat something like U+2028 as a line separator for files.

You need code like this if you read from a text file. A text file is something separated by LF or CRLF; other combinations are not portable.

If you are writing a word processor that should handle U+2028, you should not mix this with system file I/O; instead, introduce your own logic for splitting data into "lines" and paragraphs.

I don't see where it can be correct to mix "lines" from your word processor logic with lines of a text file on disk (or from a socket).
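
To illustrate the difference, here is a small sketch on an in-memory string (the sample data is invented). Splitting on \R treats U+2028 as a line break, while splitting on LF or CRLF only leaves it inside the line as ordinary data.

```perl
use strict;
use warnings;

# Sample data: an LF-terminated line, a CRLF-terminated line, and a third
# line that contains U+2028 (LINE SEPARATOR) in the middle.
my $data = "one\ntwo\r\nthree\N{U+2028}still three\n";

# \R matches any Unicode linebreak sequence, including U+2028.
my @by_R = split /\R/, $data;

# File-level line splitting: only LF and CRLF end a line.
my @by_eol = split /\r?\n/, $data;

printf "split on \\R: %d lines; split on LF/CRLF: %d lines\n",
    scalar @by_R, scalar @by_eol;
```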


Re^4: Where are the Perl::Critic Unicode policies for the Camel?
by tchrist (Pilgrim) on Oct 06, 2013 at 21:53 UTC
    Yes, you’re 100% right about all those things. Thanks for pointing them out, too.

    My context was in the processing of text files, normally NLP type stuff but sometimes CSV files in this or that encoding.

    I nevertheless think there are a lot of mistakes made, and that opening a text file without specifying its encoding is a big problem.

    I wonder what if anything can reasonably be done about that though.

    --tom

      What about writing a wrapper library over text file operations? That way you can enforce an encoding specification, and maybe even prohibit foreach (<INPUT>) by providing your own iterator function.
      The same can probably be done for some common text operations.
      And probably some typical regexps and character constants can be moved out, and some wrappers can be written over regexps (i.e. functions that create a regexp at runtime).
      Cases where you care about "\X" vs "." are probably limited (splitting text, determining visible length, maybe something else) and can be moved out to a library too.
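
A rough sketch of what such a wrapper could look like. All three function names (open_text, line_iterator, grapheme_count) are hypothetical and not taken from any existing module; the sketch only assumes core Perl and File::Temp for the demo file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical wrapper: refuses to open a text file without an encoding.
sub open_text {
    my ($mode, $path, %opt) = @_;
    my $enc = $opt{encoding}
        or die "open_text: an 'encoding' argument is required\n";
    die "open_text: mode must be '<' or '>'\n"
        unless $mode eq '<' || $mode eq '>';
    open(my $fh, "$mode:encoding($enc)", $path)
        or die "open_text: cannot open $path: $!\n";
    return $fh;
}

# Hypothetical iterator: a closure that yields one line per call with the
# trailing LF or CRLF removed, and undef at end of file.
sub line_iterator {
    my ($fh) = @_;
    return sub {
        my $line = <$fh>;
        return unless defined $line;
        $line =~ s/\r?\n\z//;
        return $line;
    };
}

# Hypothetical helper: visible length counted in grapheme clusters (\X)
# instead of code points.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

# Demo: write a two-line file, then read it back through the iterator.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
my $out = open_text('>', $path, encoding => 'UTF-8');
print {$out} "caf\N{U+E9}\r\nplain\n";
close $out or die "Cannot close $path: $!";

my $next = line_iterator(open_text('<', $path, encoding => 'UTF-8'));
my @lines;
while (defined(my $line = $next->())) {
    push @lines, $line;
}
printf "read %d lines\n", scalar @lines;

# "e" followed by COMBINING ACUTE ACCENT: two code points, one grapheme.
my $e_acute = "e\N{U+301}";
printf "length: %d, graphemes: %d\n", length($e_acute), grapheme_count($e_acute);
```

The point of the design is that the encoding decision and the line-splitting rule live in one place, so application code cannot forget either.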
