Re^2: Where are the Perl::Critic Unicode policies for the Camel?

by tchrist (Pilgrim)
on Oct 06, 2013 at 17:21 UTC


in reply to Re: Where are the Perl::Critic policies for the Camel?
in thread Where are the Perl::Critic policies for the Camel?

vsespb kindly wrote:

Hm, do you think it's possible to implement any useful set of Unicode-related checks in a static code analyzer? AFAIK, perlcritic only checks one file, so it's impossible to say where data comes from and whether it is text data or not. It's impossible to say whether we have encoding layers on our file handles or not.

I agree that static analysis has its limitations. But there are still things that can be done, I think.

These are all wrong, or at least have a code smell to them:

  • lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
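    For example, a minimal sketch of the difference (fc needs v5.16, or Unicode::CaseFold on older perls):

        use v5.16;    # for fc()
        use utf8;
        my ($x, $y) = ("STRASSE", "straße");
        print "fc: equal\n" if fc($x) eq fc($y);   # true:  fc("ß") is "ss"
        print "lc: equal\n" if lc($x) eq lc($y);   # false: lc("ß") is still "ß"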

  • Something like a-z should often be \p{Ll} or \p{lower} (which mean different things!). Same with A-Z.
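    A quick sketch of the gap:

        use utf8;
        my $m1 = "é" =~ /[a-z]/;    # false: ASCII range only
        my $m2 = "é" =~ /\p{Ll}/;   # true:  é is GC=Ll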

  • I would really like to see use warnings FATAL => "utf8".

  • I would like to see the charnames versions used instead of magic numbers, so for example \N{EM DASH} over \x{2014}.

  • The POSIX character classes are suspect. Thanks to Karl's fine work, we do studiously follow UTS #18's recommendations on compat properties here, but things like [:punct:] versus \p{punct} are a problem, because POSIX mixes symbols and puncts, and Unicode doesn't.
    $ unichars -g '/[[:punct:]]/ != /\p{punct}/'
     U+0024 $ GC=Sc DOLLAR SIGN
     U+002B + GC=Sm PLUS SIGN
     U+003C < GC=Sm LESS-THAN SIGN
     U+003D = GC=Sm EQUALS SIGN
     U+003E > GC=Sm GREATER-THAN SIGN
     U+005E ^ GC=Sk CIRCUMFLEX ACCENT
     U+0060 ` GC=Sk GRAVE ACCENT
     U+007C | GC=Sm VERTICAL LINE
     U+007E ~ GC=Sm TILDE

  • We should probably consider warning about using block properties instead of script properties.

  • There are all kinds of version-dependent Unicode character properties that nothing ever warns you about, and this drives me crazy. For example, you cannot use \p{space} or \p{Horiz_Space} or \h or \R in v5.8.8.

  • At one point there was a shift in user-defined properties being restricted to names of the form InFoo or IsFoo, which is not noted anywhere, although I hope this will get better soon.

  • Should there be a policy that advises double-barrelled property names for the ones that apply? I find that might help people who are confused about the script versus block properties. For example, \p{Latin} actually means \p{Script=Latin}, not any of \p{Block=Basic_Latin}, \p{Block=Latin_1}, \p{Block=Latin_1_Supplement}, \p{Block=Latin_Extended_A}, \p{Block=Latin_Extended_Additional}, \p{Block=Latin_Extended_B}, \p{Block=Latin_Extended_C}, or \p{Block=Latin_Extended_D}.
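    A small sketch of the distinction, using U+0141 (Ł), which is script Latin but block Latin Extended-A:

        use utf8;
        my $ok1 = "Ł" =~ /\p{Latin}/;              # true:  Script=Latin
        my $ok2 = "Ł" =~ /\p{Block=Basic_Latin}/;  # false: wrong block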

  • Similarly, I wonder about the GC categories. I just love \pL (and unlove the gratuitously embraced \p{L}), but when people write that or something like \p{Ll}, should there be a way to suggest that it would be more clearly written as \p{GC=L}, \p{GC=Ll}, \p{General_Category=L}, \p{General_Category=Ll}, \p{General_Category=Letter}, or \p{General_Category=Lowercase_Letter}?

  • And even more importantly, shouldn't those be \p{Lowercase} or \p{lower}, since you otherwise miss the 159 lowercase code points that are not \p{GC=Ll}?
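    One such code point, as a sketch: U+02B0 is Lowercase but its GC is Lm, not Ll.

        use charnames ':full';
        my $h = "\N{MODIFIER LETTER SMALL H}";          # U+02B0
        my $p1 = $h =~ /\p{Lowercase}/;   # true:  the Lowercase property
        my $p2 = $h =~ /\p{GC=Ll}/;       # false: GC is Lm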

  • The approach of \PM\pM* instead of \X does not work.
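    A minimal sketch of where it breaks: CRLF is a single extended grapheme cluster, but \n is not a mark.

        my $m1 = "\r\n" =~ /\A \X      \z/x;   # true:  one grapheme
        my $m2 = "\r\n" =~ /\A \PM\pM* \z/x;   # false: \pM* can't absorb the \n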

  • Just as we tell people that /./ instead of /(?s:.)/ may sometimes get them into trouble, something that mentions that sometimes you really want /\X/ instead of /./ would be good. I just don't know when to tell them that in a general rather than a specific case. I mean, I can look at it and know, but a program? Dunno.

  • Not using the \X-aware versions of substr, index, rindex, length can be a problem in some programs.
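    A sketch of the difference for length:

        my $s  = "e\x{301}";            # "é" as e + COMBINING ACUTE ACCENT
        my $cp = length $s;             # 2 code points
        my $g  = () = $s =~ /\X/g;      # 1 grapheme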

  • I have no earthly idea what to do about the maxim to NFD on the way in and NFC on the way out. At the least, people need to understand that unless they normalize their two variables, not even string comparisons between them will work. Well, barring Unicode::Collate, which is awfully heavy-weight but actually works. And don't forget Unicode::Collate::Locale, either.
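    A minimal sketch of why unnormalized comparison fails (Unicode::Normalize is core):

        use Unicode::Normalize qw(NFD);
        my $one = "\x{E9}";       # é, precomposed
        my $two = "e\x{301}";     # e + COMBINING ACUTE ACCENT
        print $one eq $two           ? "eq\n" : "ne\n";   # ne
        print NFD($one) eq NFD($two) ? "eq\n" : "ne\n";   # eq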

  • I can get rather nervous about seeing literal \n, \r, \f, \r\n in code. That really should be using \R, but that only works in a regex.
    $ unichars '/ ^ \R $ /x'
     U+000A -- GC=Cc LINE FEED (LF)
     U+000B -- GC=Cc LINE TABULATION
     U+000C -- GC=Cc FORM FEED (FF)
     U+000D -- GC=Cc CARRIAGE RETURN (CR)
     U+0085 -- GC=Cc NEXT LINE (NEL)
     U+2028 -- GC=Zl LINE SEPARATOR
     U+2029 -- GC=Zp PARAGRAPH SEPARATOR
    Also, do note that "\r\n" =~ / \A \R \Z /x is also true, so, like \X, \R can be more than one code point in length.

    Until Karl gets support for $/ = q(\R) into readline and chomp in a future release, we're stuck with this ugliness:

    @lines = do { local $/; split /\R/, <INPUT> };
    which is a big problem, because it just doesn't scale to super-big files or to interactive handles like two-way socket communications. And lord knows what $\ = q(\R) would ever mean for the output record separator. :)

    I do have an inspiration for a PerlIO::via::Unicode_Linebreak I/O layer to address all this, but haven't thought it through.
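    In the meantime, a rough sketch of a chunked reader that splits on \R without slurping; read_lines_R is a hypothetical helper (nothing by that name on CPAN), and it assumes $fh already has an :encoding layer so read() hands back whole characters:

        sub read_lines_R {
            my ($fh, $cb) = @_;
            my $buf = '';
            while (read($fh, my $chunk, 64 * 1024)) {
                $buf .= $chunk;
                # Hold back a trailing CR: its LF may arrive in the next chunk.
                my $tail = $buf =~ s/\r\z// ? "\r" : '';
                $cb->($1) while $buf =~ s/\A(.*?)\R//s;
                $buf .= $tail;
            }
            $cb->($1) while $buf =~ s/\A(.*?)\R//s;  # drain remaining breaks
            $cb->($buf) if length $buf;              # final unterminated line
        }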

  • Getting things like sorting right is hard. Under some operating systems, your cmp operator (and thus default sort, etc.) actually does work correctly provided that all these apply:
    1. You have done a POSIX::setlocale(LC_ALL, "en_US.utf8") somewhere in your program.
    2. Your string-comparison operators are within the lexical scope of a use locale pragma.
    3. (Maybe; haven't checked.) Your internal Latin-1 data is in its 2-byte UTF-8 representation, not the 1-byte Perl cheat/shortcut.
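    When you can't count on all three of those, Unicode::Collate gives portable UCA sorting; a minimal sketch (@words is whatever list you need ordered):

        use Unicode::Collate;
        use Unicode::Collate::Locale;
        my @sorted = Unicode::Collate->new->sort(@words);
        my @de     = Unicode::Collate::Locale->new(locale => 'de__phonebook')
                                             ->sort(@words);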

  • I really, really want to do something about encodings. Opening a text file without stating its encoding somewhere or other is a recipe for failure. The problem is that it can be stated in so many places: a PERL_UNICODE environment variable, a use open pragma, a middle argument to open, or binmode, just to name a few. There are other ways, too.
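    A sketch of the usual spellings ($path and the handle names are placeholders):

        open my $in, '<:encoding(UTF-8)', $path or die "$path: $!";  # per-handle
        use open qw(:std :encoding(UTF-8));                          # lexical default
        binmode STDOUT, ':encoding(UTF-8)';                          # retrofit a handle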

So even though there is a lot that cannot be done about Unicode, there is also a lot that could be done with Unicode.

I really do think there could be some more P::C policies in these areas.


Re^3: Where are the Perl::Critic Unicode policies for the Camel?
by vsespb (Hermit) on Oct 06, 2013 at 19:35 UTC
    Well, actually I recently had the experience of writing a not-so-small application (20k lines) which allows Unicode everywhere and handles it correctly.

    But it does not need 90% of the things that you listed.

    Probably because my application does not try to analyze text data (it only stores, converts, compares, and re-encodes it), it needs neither sorting nor fc-aware comparison.

    Your case is probably something that analyzes text (right now I can only imagine something related to natural language processing, a word processor, or maybe a dictionary).

    So I think different applications need different levels of Unicode support.

    Below are some cases where the policies you listed can be wrong in some circumstances:
    lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
    Something like a-z should often be \p{Ll} or \p{lower}
    If you write, say, code that has to parse HTTP headers (no, that's not reinventing the wheel like another HTTP library; it could be a proxy server or a REST library), then cmp and a-z would be the correct choice, and fc() and \p{lower} can introduce bugs (say, with "ß" vs "ss").

    Other examples can be unit tests, where you usually deal with pre-defined data sets, or internal program metadata that is always plain ASCII, or comparison of MD5/SHA hex values, etc.
    Opening a text file without stating its encoding somewhere or other is a recipe for failure.
    Unless it's a binary file.
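    For a binary file the explicit counterpart is the :raw layer; a sketch ($file is a placeholder):

        open my $fh, '<:raw', $file or die "$file: $!";   # bytes, no encoding layer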


    @lines = do { local $/; split /\R/, <INPUT> };
    Hm. I think it's not correct to use something like U+2028 as a line separator in files.

    You need code like this if you read from a text file. A text file is something separated by LF or CRLF; other combinations are not portable.

    If you are writing a word processor that should handle U+2028, you should not mix this with system file I/O; instead, introduce your own logic for splitting data into "lines" and paragraphs.

    I don't see where it can be correct to mix "lines" from your word-processor logic with lines of a text file on disk (or a socket).
      Yes, you're 100% right about all those things. Thanks for pointing them out, too.

      My context was in the processing of text files, normally NLP type stuff but sometimes CSV files in this or that encoding.

      I nevertheless think there are a lot of mistakes made, and that opening a text file without specifying its encoding is a big problem.

      I wonder what, if anything, can reasonably be done about that, though.

      --tom

        What about writing a wrapper library over text-file operations? That way you can enforce encoding specification, and maybe even prohibit foreach (<INPUT>) by providing your own iterator function.
        The same can probably be done for some common text operations.
        And probably some typical regexes and character constants can be factored out, and some wrappers can be written over regexes (i.e., functions that create a regex at runtime).
        Cases where you care about \X vs . are probably limited (splitting text, determining visible length, maybe something else) and can be moved out to a library too, as in the sketch below.
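        A rough sketch of what such a wrapper might look like (TextFile and open_text are hypothetical names, just to show the shape):

            package TextFile;
            use strict;
            use warnings;

            # Refuse to open a text file unless the caller names its encoding.
            sub open_text {
                my ($path, $enc) = @_;
                die "encoding required for $path" unless defined $enc;
                open my $fh, "<:encoding($enc)", $path or die "$path: $!";
                return $fh;
            }

            1;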
