Well, actually, I recently wrote a fairly large application (20k lines) that allows Unicode everywhere and handles it correctly.

But it does not need 90% of the things that you listed.

That is probably because my application does not try to analyze text data (it only stores, converts, compares, and re-encodes it), so it needs neither sorting nor fc-aware comparison.

Your case is probably something that analyzes text (right now I can only imagine something related to natural language processing, a word processor, or maybe a dictionary).

So I think different applications need different levels of Unicode support.

Below are some cases where the policies you listed can be wrong:
lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
Something like a-z should often be \p{Ll} or \p{lower}
If you write, say, code that has to parse HTTP headers (no, that's not reinventing the wheel like yet another HTTP library; it can be a proxy server or a REST library), then "cmp" and "a-z" would be the correct choice, and fc() and \p{lower} can introduce bugs (say, with "ß" vs "ss").

Other examples are unit tests, where you usually deal with predefined data sets; internal program metadata, which is always plain ASCII; and comparison of MD5/SHA hex values.
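
To illustrate the header-parsing case, here is a minimal sketch. The helper ascii_ieq is hypothetical (my own name, not from any library), and it assumes Perl 5.16 or later for fc. It shows that fc treats "ß" and "ss" as equal, while an ASCII-only comparison does not:

```perl
use strict;
use warnings;
use utf8;              # the source below contains a literal "ß"
use feature 'fc';      # fc() needs Perl 5.16+

# Hypothetical helper: ASCII-only case-insensitive comparison,
# the kind of thing you want for protocol tokens such as header names.
sub ascii_ieq {
    my ($x, $y) = @_;
    # tr/A-Z/a-z/ only touches ASCII letters; everything else stays verbatim
    (my $lx = $x) =~ tr/A-Z/a-z/;
    (my $ly = $y) =~ tr/A-Z/a-z/;
    return $lx eq $ly ? 1 : 0;
}

# Full Unicode case folding: "ß" folds to "ss", so these compare equal
my $fc_equal = fc("Straße") eq fc("STRASSE") ? 1 : 0;

# ASCII-only folding: the same two strings stay different
my $ascii_equal = ascii_ieq("Straße", "STRASSE");

# Ordinary header names still match as expected
my $header_equal = ascii_ieq("Content-Type", "content-type");

# The character classes differ in the same way
my $in_az = "ß" =~ /^[a-z]$/  ? 1 : 0;
my $in_ll = "ß" =~ /^\p{Ll}$/ ? 1 : 0;

print "fc: $fc_equal, ascii: $ascii_equal, header: $header_equal\n";
print "a-z: $in_az, Ll: $in_ll\n";
```

Which behaviour is "right" depends on the data: for human text the fc result is what you want, for protocol tokens it is a bug.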
Opening a text file without stating its encoding somewhere or other is a recipe for failure.
Unless it's a binary file.
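
A small sketch of the difference, assuming a throwaway temp file that I create just for the example: the same bytes are read once as UTF-8 text with an explicit encoding layer, and once as binary with :raw.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a temp file holding the UTF-8 bytes for U+00E9 plus a newline
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "\xC3\xA9\n";
close($out) or die "close failed: $!";

# Text file: state the encoding explicitly, get characters back
open(my $text, '<:encoding(UTF-8)', $path) or die "open failed: $!";
my $line = <$text>;
close($text);
chomp $line;
my $chars = length $line;    # counted in characters

# Binary file: no encoding, just bytes
open(my $bin, '<:raw', $path) or die "open failed: $!";
my $raw = do { local $/; <$bin> };
close($bin);
my $bytes = length $raw;     # counted in bytes, newline included

print "characters in line: $chars, bytes in file: $bytes\n";
```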


@lines = do { local $/; split /\R/, <INPUT> };
Hm. I don't think it is correct to treat something like U+2028 as a line separator for files.

You need code like this if you read from a text file. But a text file is something whose lines are separated by LF or CRLF; other combinations are not portable.

If you are writing a word processor that has to handle U+2028, you should not mix this with system file I/O. Instead, introduce your own logic for splitting data into "lines" and paragraphs.

I don't see where it could be correct to mix "lines" from your word processor logic with the lines of a text file on disk (or from a socket).
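
To make the difference concrete, here is a short sketch with a made-up input string. Splitting on \R also breaks at U+2028, while splitting on LF/CRLF only leaves it inside the line:

```perl
use strict;
use warnings;

# Made-up sample: one U+2028 inside the first line, then CRLF and LF endings
my $data = "first\x{2028}still first\r\nsecond\n";

# \R matches CRLF, LF, CR and other vertical whitespace, including U+2028
my @by_R = split /\R/, $data;

# File-level line endings only: LF or CRLF
my @by_eol = split /\r?\n/, $data;

printf "split on \\R: %d lines, split on LF/CRLF: %d lines\n",
    scalar(@by_R), scalar(@by_eol);
```

With \R the U+2028 inside the first line becomes a line break of its own, which is fine for word processor logic but not for the lines of a file on disk.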

In reply to Re^3: Where are the Perl::Critic Unicode policies for the Camel? by vsespb
in thread Where are the Perl::Critic policies for the Camel? by tchrist
