
vsespb kindly wrote:

Hm, do you think it's possible to implement any useful set of checks related to Unicode in static code analyzer? Afaik, perlcritic only checks one file, so it's impossible to say where data come from and if this text data or no. It's impossible to say if we have encoding layers for our file handles or no.

I agree that static analysis has its limitations. But there are still things that can be done, I think.

These are all wrong, or at least have a code smell to them:

lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
Something like a-z should often be \p{Ll} or \p{lower}, which mean different things! Same with A-Z.
I would really like to see use warnings FATAL => "utf8".
I would like to see the charnames versions used instead of magic numbers, so for example \N{EM DASH} over \x{2014}.
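
A minimal sketch of the fc and charnames points, assuming perl v5.16 or later (which provides fc and loads charnames automatically for \N{...}). The variable names and the sample word are made up for illustration.

```perl
use v5.16;                       # enables strict and the fc feature
use warnings FATAL => 'utf8';

# Hypothetical sample data: the same word in two casings.
my $upper = "STRASSE";
my $lower = "stra\N{LATIN SMALL LETTER SHARP S}e";

# lc leaves the sharp s alone, so this comparison fails.
my $lc_equal = lc($upper) eq lc($lower) ? 1 : 0;

# fc folds the sharp s to "ss", so this comparison succeeds.
my $fc_equal = fc($upper) eq fc($lower) ? 1 : 0;

# The named character and the magic number are the same string.
my $same_dash = "\N{EM DASH}" eq "\x{2014}" ? 1 : 0;

say "lc: $lc_equal, fc: $fc_equal, dash: $same_dash";
```

A static check could plausibly flag the lc/uc pattern on both sides of a comparison operator without knowing anything about where the data came from.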

The POSIX character classes are suspect. Thanks to Karl’s fine work, we do studiously follow UTS #18’s recommendations on compat properties here, but things like :punct: versus \p{punct} are a problem, because POSIX mixes symbols and puncts, and Unicode doesn’t.

```
$ unichars -g '/[[:punct:]]/ != /\p{punct}/'
U+0024 $  GC=Sc DOLLAR SIGN
U+002B +  GC=Sm PLUS SIGN
U+003C <  GC=Sm LESS-THAN SIGN
U+003D =  GC=Sm EQUALS SIGN
U+003E >  GC=Sm GREATER-THAN SIGN
U+005E ^  GC=Sk CIRCUMFLEX ACCENT
U+0060 `  GC=Sk GRAVE ACCENT
U+007C |  GC=Sm VERTICAL LINE
U+007E ~  GC=Sm TILDE
```
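
The same mismatch can be checked for a single character without unichars. This is only a small sketch, and the variable names are invented for the example.

```perl
use strict;
use warnings;

# DOLLAR SIGN is a symbol (GC=Sc) in Unicode, not punctuation.
my $char = '$';

# The POSIX class lumps the ASCII symbols in with punctuation.
my $posix_punct = $char =~ /[[:punct:]]/ ? 1 : 0;

# \p{Punct} is General_Category=Punctuation only.
my $unicode_punct = $char =~ /\p{Punct}/ ? 1 : 0;

print "POSIX: $posix_punct, Unicode: $unicode_punct\n";
```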

We should probably consider warning about using block properties instead of script properties.
There are all kinds of version-dependent Unicode character properties that nothing ever warns you about, and this drives me crazy. For example, you cannot use \p{space} or \p{Horiz_Space} or \h or \R in v5.8.8.
There was a shift in the allowable user-defined properties being restricted to InFoo or IsFoo at one point, which is not noted. Although I hope this will get better soon.
Should there be a policy that advises double-barrelled property names for the ones that apply? I find that that might help people who are confused about the script versus block properties. For example, \p{Latin} actually means \p{Script=Latin}, not any of \p{Block=Basic_Latin}, \p{Block=Latin_1}, \p{Block=Latin_1_Supplement}, \p{Block=Latin_Extended_A}, \p{Block=Latin_Extended_Additional}, \p{Block=Latin_Extended_B}, \p{Block=Latin_Extended_C}, or \p{Block=Latin_Extended_D}.
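
To make the script-versus-block confusion concrete, here is a small sketch with two sample characters of my own choosing. The double-barrelled forms make it obvious which property is being tested.

```perl
use strict;
use warnings;

my $digit   = "1";        # U+0031: Basic Latin block, but Script=Common
my $e_acute = "\x{E9}";   # U+00E9: Script=Latin, but Latin-1 Supplement block

my $digit_in_block   = $digit   =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $digit_in_script  = $digit   =~ /\p{Script=Latin}/      ? 1 : 0;
my $eacute_in_block  = $e_acute =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $eacute_in_script = $e_acute =~ /\p{Script=Latin}/      ? 1 : 0;

print "digit:   block=$digit_in_block script=$digit_in_script\n";
print "e-acute: block=$eacute_in_block script=$eacute_in_script\n";
```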
Similarly, I wonder about the GC categories. I just love \pL — and unlove the gratuitously embraced \p{L} — but when people write that or something like \p{Ll}, should there be a way to suggest that it would be more clearly written \p{GC=L}, \p{GC=Ll}, \p{General_Category=L}, \p{General_Category=Ll}, \p{General_Category=Letter}, or \p{General_Category=Lowercase_Letter}?
And even more importantly, shouldn't those be \p{Lowercase} or \p{lower}, since you otherwise miss the 159 lowercase code points that are not \p{GC=Ll}?
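
One of those lowercase code points outside GC=Ll, as a quick sketch. The choice of U+2170 is mine for illustration; it is a letter number (GC=Nl) that still carries the Lowercase property.

```perl
use strict;
use warnings;

# U+2170 SMALL ROMAN NUMERAL ONE: GC=Nl, yet Lowercase.
my $roman_one = "\x{2170}";

my $is_gc_ll     = $roman_one =~ /\p{GC=Ll}/     ? 1 : 0;
my $is_lowercase = $roman_one =~ /\p{Lowercase}/ ? 1 : 0;

print "GC=Ll: $is_gc_ll, Lowercase: $is_lowercase\n";
```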
The approach of \PM\pM* instead of \X does not work.
Just as we tell people that /./ instead of /(?s:.)/ may sometimes get them into trouble, something that mentions that sometimes you really want /\X/ instead of /./ would be good. I just don’t know when to tell them that in a general not a specific case. I mean, I can look at it and know, but a program? Dunno.
Not using the \X-aware versions of substr, index, rindex, length can be a problem in some programs.
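
As a rough sketch of why this matters, here is a regex-only way to count and pick out grapheme clusters. It is not the \X-aware substr or length from any particular module, just plain \X, and the sample string is made up.

```perl
use strict;
use warnings;

# "cafe" followed by COMBINING ACUTE ACCENT: 5 code points, 4 graphemes.
my $text = "cafe\x{301}";

# length counts code points.
my $code_points = length $text;

# Counting \X matches counts user-visible characters instead.
my $graphemes = () = $text =~ /\X/g;

# The last grapheme is "e" plus its accent, two code points long.
my ($last) = $text =~ /(\X)\z/;
my $last_length = length $last;

print "code points: $code_points, graphemes: $graphemes, ",
      "last grapheme length: $last_length\n";
```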
I have no earthly idea what to do about the maxim to NFD on the way in and NFC on the way out. At the least, people need to understand that unless they normalize their two variables, not even string comparisons between them will work. Well, barring Unicode::Collate, which is awfully heavy-weight but actually works. And don’t forget Unicode::Collate::Locale, either.
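
A minimal sketch of the comparison problem, using the core Unicode::Normalize module. The two strings are my own example of one character written two ways.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";   # "e" plus COMBINING ACUTE ACCENT

# Without normalization the two spellings compare as different.
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# Normalizing both sides to the same form makes eq work.
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal\n";
```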
I can get rather nervous about seeing literal \n, \r, \f, \r\n in code. That really should be using \R but that only works in a regex.
```
$ unichars '/ ^ \R $ /x'
U+000A  -- GC=Cc LINE FEED (LF)
U+000B  -- GC=Cc LINE TABULATION
U+000C  -- GC=Cc FORM FEED (FF)
U+000D  -- GC=Cc CARRIAGE RETURN (CR)
U+0085  -- GC=Cc NEXT LINE (NEL)
U+2028  -- GC=Zl LINE SEPARATOR
U+2029  -- GC=Zp PARAGRAPH SEPARATOR
```
Also, do note that "\r\n" =~ / \A \R \Z /x is also true, so like \X, \R can be more than one code point in length.
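
A short sketch of both points, assuming perl v5.10 or later for \R. The sample string mixes several line terminators and is invented for the example.

```perl
use strict;
use warnings;

# CRLF is matched by a single \R, not two.
my $crlf_is_one = "\r\n" =~ /\A\R\z/ ? 1 : 0;

# Mixed terminators: CRLF, LF, and LINE SEPARATOR.
my $mixed = "one\r\ntwo\nthree\x{2028}four";
my @fields = split /\R/, $mixed;

print "CRLF is one \\R: $crlf_is_one\n";
print "fields: ", scalar(@fields), "\n";
```

Because \R takes "\r\n" as a unit, the split does not produce an empty field between the CR and the LF.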
Until Karl gets some support for $/ = q(\R) and readline and chomp into a future release, we’re stuck with this ugliness:
```
@lines = do { local $/; split /\R/, <INPUT> };
```
which is a big problem because it just doesn’t scale to superbig files or to interactive handles like two-way socket communications. And lord knows what $\ = q(\R) would ever mean for the output record separator. :)
I do have an inspiration for a PerlIO::via::Unicode_Linebreak I/O layer to address all this, but haven’t thought it through.
Getting things like sorting right is hard. Under some operating systems, your cmp operator (and thus default sort, etc.) actually does work correctly provided that all these apply:
1. You have done a POSIX::setlocale(LC_ALL, "en_US.utf8") somewhere in your program.
2. Your string-comparison operators are within the lexical scope of a use locale pragma.
3. (Maybe; haven’t checked.) Your internal Latin1 data is in its 2-byte UTF8 representation, not the 1-byte Perl cheat/short-cut.
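
Those conditions depend on the operating system's locale data. For comparison, here is a hedged sketch of the heavier Unicode::Collate route mentioned earlier, which does not use locales at all. The word list is made up, and this swaps in a different technique from the use locale approach described in the list.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical word list; "\x{E9}" is e with acute.
my @words = ("zebra", "\x{E9}clair", "apple");

# Plain sort compares code points, so U+00E9 lands after "z".
my @by_code_point = sort @words;

# The default collation puts it between "apple" and "zebra".
my @by_collation = Unicode::Collate->new->sort(@words);

print "code point order: @by_code_point\n";
print "collation order:  @by_collation\n";
```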
I really, really want to do something about encodings. Opening a text file without stating its encoding somewhere or other is a recipe for failure. The problem is that it can be stated in so many places: a PERL_UNICODE environment variable, a use open pragma, a middle argument to open, or binmode, just to name a few. There are other ways, too.
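
A minimal sketch of the "middle argument to open" option, using a temporary file so it is self-contained. The file contents and variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

# Write text with the encoding stated in the open mode.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} "caf\x{E9}";
close $out or die "close: $!";

# Read it back as text, again stating the encoding.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $decoded = <$in>;
close $in or die "close: $!";

# Read it back as raw bytes, with no decoding at all.
open my $raw, '<:raw', $path or die "open raw: $!";
my $bytes = <$raw>;
close $raw or die "close: $!";

printf "characters: %d, bytes: %d\n", length($decoded), length($bytes);
```

The decoded string has four characters while the raw read has five bytes, which is the sort of discrepancy that goes unnoticed when no encoding is declared.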

So even though there is a lot that cannot be done about Unicode, there is also a lot that could be done with Unicode.

I really do think there could be some more P::C policies in these areas.

In reply to Re^2: Where are the Perl::Critic Unicode policies for the Camel? by tchrist
in thread Where are the Perl::Critic policies for the Camel? by tchrist
