Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
vsespb kindly wrote:
Hm, do you think it's possible to implement any useful set of checks related to Unicode in static code analyzer? Afaik, perlcritic only checks one file, so it's impossible to say where data come from and if this text data or no. It's impossible to say if we have encoding layers for our file handles or no.
I agree that static analysis has its limitations. But there are still things that can be done, I think.

These are all wrong, or at least have a code smell to them:

  • lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.

  • Something like a-z should often be \p{Ll} or \p{lower} ó which mean different things!. Same with A-Z.

  • I would really like to see use warnings FATAL => "utf8".

  • I would like to see the charnames versions used instead of magic numbers, so for example \N{EM DASH} over \x{2014}.

  • The POSIX character classes are suspect. Thanks to Karlís fine work, we do studiously follow UTS #18ís recommendations on compat properties here, but things like :punct: versus \p{punct} are a problem, because POSIX mixes symbols and puncts, and Unicode doesnít.
    $ unichars -g '/[[:punct:]]/ != /\p{punct}/' U+0024 &#8237; $ GC=Sc DOLLAR SIGN U+002B &#8237; + GC=Sm PLUS SIGN U+003C &#8237; < GC=Sm LESS-THAN SIGN U+003D &#8237; = GC=Sm EQUALS SIGN U+003E &#8237; > GC=Sm GREATER-THAN SIGN U+005E &#8237; ^ GC=Sk CIRCUMFLEX ACCENT U+0060 &#8237; ` GC=Sk GRAVE ACCENT U+007C &#8237; | GC=Sm VERTICAL LINE U+007E &#8237; ~ GC=Sm TILDE

  • We should probably consider warning about using block properties instead of script properties.

  • There are all kinds of version-dependent Unicode character properties that nothing ever warns you about, and this drives me crazy. For example, you cannot use \p{space} or \p{Horiz_Space} or \h or \R in v5.8.8.

  • There was a shift in the allowable user-defined properties being restrict to InFoo or IsFoo at one point, which is not noted. Although I hope this will get better soon.

  • Should there be a policy that advises double-barrelled property names for the ones that apply? I find that that might help people who are confused about the script versus block properties. For example, \p{Latin} actually means \p{Script=Latin}, not any of \p{Block=Basic_Latin}, \p{Block=Latin_1}, \p{Block=Latin_1_Supplement}, \p{Block=Latin_Extended_A}, \p{Block=Latin_Extended_Additional}, \p{Block=Latin_Extended_B}, \p{Block=Latin_Extended_C}, or \p{Block=Latin_Extended_D}.

  • Similarly, I wonder about the GC categories. I just love \pL ó and unlove the gratuitously embraced \p{L} ó but when people write that or something like \p{Ll}, should there be a way to suggest that it would be more clearly written \p{GC=L}, \p{GC=Ll}, \p{General_Category=L}, \p{General_Category=Ll}, \p{General_Category=Letter}, or \p{General_Category=Lowercase_Letter}?

  • And even more importantly, shouldn't those be \p{Lowercase} or \p{lower}, since you otherwise miss the 159 lowercase code points that are not \p{GC=Ll}?

  • The approach of \PM\pM* instead of \X does not work.

  • Just as we tell people that /./ instead of /(?s:.)/ may sometimes get them into trouble, something that mentions that sometimes you really want /\X/ instead of /./ would be good. I just donít know when to tell them that in a general not a specific case. I mean, I can look at it and know, but a program? Dunno.

  • Not using the \X-aware versions of substr, index, rindex, length can be a problem in some programs.

  • I have no earthly idea what to do about the maxim to NFD on the way in and NFC on the way out. At the least, people need to understand that unless they normalize their two variables, not even string comparisons between them will work. Well, barring Unicode::Collate, which is awfully heavy-weight but actually works. And donít forget Unicode::Collate::Locale, either.

  • I can get rather nervous about seeing literal \n, \r, \f, \r\n in code. That really should be using \R but that only works in a regex.
    $ unichars '/ ^ \R $ /x' U+000A -- GC=Cc LINE FEED (LF) U+000B -- GC=Cc LINE TABULATION U+000C -- GC=Cc FORM FEED (FF) U+000D -- GC=Cc CARRIAGE RETURN (CR) U+0085 -- GC=Cc NEXT LINE (NEL) U+2028 -- GC=Zl LINE SEPARATOR U+2029 -- GC=Zp PARAGRAPH SEPARATOR
    Also, do note that "\r\n" = / \A \R \Z /x is also true, so like \X, \R can be more than one code point in length.

    Until Karl gets some support for $/ = q(\R) and readline and chomp into a future release, weíre stuck with this ugliness:

    @lines = do { local $/; split /\R/, <INPUT> };
    which is a big problem because it just doesnít scale to superbig files or to interactive handles like two-way socket communications. And lord knows what $\ = q(\R) would ever mean for the output record separator. :)

    I do have an inspiration for a PerlIO::via::Unicode_Linebreak I/O layer to address all this, but havenít thought it through.

  • Getting things like sorting right is hard. Under some operating systems, your cmp operator (and thus default sort, etc.) actually does work correctly provided that all these apply:
    1. You have done a POSIX::setlocale(LC_ALL, "en_US.utf8") somewhere in your program.
    2. Your string-comparison operators are within the lexical scope of a use locale pragma.
    3. (Maybe; havenít checked.)Your internal Latin1 data is in its 2-byte UTF8 representation not the 1-byte Perl cheat/short-cut.

  • I really, really want to do something about encodings. Opening a text file without stating its encoding somewhere or other is a recipe for failure. The problem is that it can be stated in so many placed: a PERL_UNICODE environment variable, a use open pragma, a middle argument to open, or binmode ó just to name a few. There are other ways, too.

So even though there is a lot that cannot be done about Unicode, there is also a lot that could be done with Unicode.

I really do think there could be some more P::C policies in these areas.chomp

In reply to Re^2: Where are the Perl::Critic Unicode policies for the Camel? by tchrist
in thread Where are the Perl::Critic policies for the Camel? by tchrist

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (2)
    As of 2018-02-18 01:56 GMT
    Find Nodes?
      Voting Booth?
      When it is dark outside I am happiest to see ...

      Results (250 votes). Check out past polls.