Re^3: Curious about Perl's strengths in 2018
by raiph (Chaplain) on May 20, 2018 at 19:48 UTC
To keep my commentary as short as I reasonably can while still doing the topic justice, I've sharply narrowed the discussion to: characters in a Unicode string; Perl 6; and Python 3. See my notes at the end for discussion of this narrowing.
What's a character?
For a while last century, "character", in the context of computing, came close to being synonymous with an ASCII byte. But that was always a string implementation detail, and it rests on an assumption that's broken in the general case. A character is an ASCII byte only if you stick to a very limited view of text, one that ignores most of the world's text. That includes even English text if it contains arbitrary Unicode characters, e.g. tweets, which may look English but are allowed to contain arbitrary characters.
For a while this century, "character", in the context of contemporary mainstream programming languages and developer awareness, has come close to being synonymous with a Unicode codepoint. Unfortunately, assuming a codepoint is a character in the ordinary sense is again a broken assumption in the general case. Even if you're dealing with Unicode text, a character does not correspond to a Unicode codepoint, unless you continue to stick to a very sharply limited view of text and characters that again excludes arbitrary Unicode text.
So, just what is a "character" given some Unicode string?
"What a user thinks of as a character"
If we're talking about Unicode, it's helpful to consider Unicode's precisely chosen vocabulary for describing text, and in particular, characters. Unicode's definition of "what a user thinks of as a character" is that it's a sequence of codepoints selected according to rules (algorithms) and data defined by Unicode. So a character might be just one codepoint, but it might -- often will be -- many. And you can't tell unless you iterate through a given text string, calculating the start and end of individual characters according to the relevant general Unicode rules and data (and locale specific overrides).
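A minimal Python 3 sketch of that "iterate and find the boundaries" idea follows. It is a deliberate simplification, not the Unicode algorithm: the function name rough_graphemes is hypothetical, and the only rule it applies is "a codepoint whose general category is a mark (Mn, Mc, Me) attaches to the codepoint before it". The real annex #29 rules also cover things like Hangul syllables, emoji sequences and regional indicators, which this ignores.

```python
import unicodedata

def rough_graphemes(text):
    """Split text into approximate user-perceived characters.

    Assumption of this sketch: a cluster is one codepoint followed by
    any codepoints whose general category is a mark (Mn, Mc or Me).
    Real UAX #29 segmentation has many more rules than this.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "\u0928\u093F"               # Devanagari NA + vowel sign I
print(len(word))                    # 2 codepoints
print(len(rough_graphemes(word)))   # 1 approximate grapheme
```

Even this toy version has to walk the whole string before it can say where the Nth character starts, which is the cost the rest of this post is about.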
This latter reality -- an individual character can be comprised of multiple codepoints -- is why a character=codepoint assumption is a 21st century error that's similar to the 20th century one of assuming character=byte. The codepoint=character assumption allows for fast indexing -- but it's increasingly often wrong, leading to broken code and corrupt data.
What Perl (5 + 6) and Python (2 + 3) think of as a character
Unicode made the (frankly terrible, but at this point irreversible) choice to use the word "grapheme" to denote a character in the ordinary sense, i.e. what we all meant by the word before bytes and codepoints confused the issue. Armed with that knowledge, one can begin to get some sense of the level of support for ordinary character handling in any given programming language by searching within its resources for "grapheme".
Google searches for "grapheme+<prog-lang-web-home>" and "grapheme+<prog-lang-goes-here>":
An example that's No F💩💩king Good
Given the discussion thus far, it should come as no surprise that the built in string type, functions, and standard libraries of both Python 2 and Python 3 will yield the wrong result for string length, character indexing, and substrings if A) what you're interested in is character=grapheme processing as contrasted with character=codepoint processing and B) a string contains a grapheme that isn't a single codepoint.
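To make that concrete, here is a small Python 3 sketch using only built-in string operations. The sample string is an illustrative choice, not one from the talk: "café" spelled with a plain e followed by COMBINING ACUTE ACCENT. This particular string could be rescued by NFC normalization, but the same breakage applies to graphemes that have no single-codepoint form.

```python
s = "cafe\u0301"   # "café" written as e + COMBINING ACUTE ACCENT (U+0301)

print(len(s))      # 5 codepoints, though a reader sees 4 characters
print(s[3])        # "e": indexing splits the accent off its letter
print(s[:4])       # "cafe": the substring silently drops the accent
print(s[::-1])     # reversal puts the accent first, detached from the e
```

All four results are correct at the codepoint level and wrong at the character=grapheme level.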
One fun way to see this in action is to view Patrick Michaud's lightning talk about text processing that's No F💩💩king Good. If you don't have 5 minutes, the following link takes you right to the point where Patrick spends 30 seconds trying Python 3. Of the three simple tests used in his talk it gets two "wrong".
Part of the fun is that this example may be No F**king Good in a manner not at all intended by Jonathan Worthington who wrote the presentation or me when I originally included it here. Prompted by a reader who challenged several aspects of this post, including this one, my brief investigation thus far suggests that the specific example of a D with double dots is actually a "degenerate case" -- one that "never occurs in practice", or at least one that will generally only occur in artificial/accidental scenarios such as the test in the video.
(It looks like it may have been naively taken from the "Basic Examples" table in Unicode annex #15 on the mistaken assumption that it's not degenerate. Perhaps it's in the table instead as an example in which normalization to a single character is not appropriate, because that character doesn't appear in practice. If you can confirm or deny its degenerate nature, please comment.)
Does this mean the thrust of this post -- about character=grapheme vs character=codepoint -- is essentially invalid? No. While D with double dots may be an especially poorly chosen example, the problem does occur for a huge number of non-degenerate characters as demonstrated by the reported length of a common single character in Devanagari, one of the world's most used scripts, in Python 2 and Python 3 (2 codepoints after normalization), and Perl 6 (1, i.e. correct).
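The effect of normalization on both cases can be checked with Python 3's standard unicodedata module. In this sketch the Devanagari sample नि (NA + vowel sign I) is a stand-in chosen here, since the character that was actually measured isn't named above; the D example is D + dot above + dot below.

```python
import unicodedata

d_dots = "D\u0307\u0323"   # D + COMBINING DOT ABOVE + COMBINING DOT BELOW
ni = "\u0928\u093F"        # Devanagari NA + vowel sign I (assumed sample)

for label, text in [("D with two dots", d_dots), ("Devanagari ni", ni)]:
    nfc = unicodedata.normalize("NFC", text)
    # Print the codepoint count before and after NFC normalization.
    print(label, len(text), len(nfc))
```

NFC shrinks the D example from 3 codepoints to 2 (D-with-dot-below plus a combining dot above) and leaves the Devanagari one at 2, because no precomposed codepoint exists for it. Neither comes out as 1, so normalizing first does not turn len() into a character count.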
Does it matter that Perl 6's character and substring accessing time is O(1)?
If you watch the whole of Patrick's talk you'll see he covers the point that Perl 6 has "O(1) substring, index, etc.".
But for most things, other langs are faster than Perl 6 -- a lot faster. So does O(1) indexing matter?
Imo it does. It's taken years to get the architecture of P6 and the Rakudo compiler right, but the hard part is now done. NFG (Normalization Form Grapheme), along with all the other innovative elements in P6 and Rakudo, is in place and getting increasingly battle hardened and optimized.
If character processing in general matters, then presumably O(1) character indexing, substring processing, and regexing matters. And if so, the Perl 6 and nqp languages, and the Rakudo / NQP / MoarVM compiler stack, are all in a great place, given that they're the first (and I believe only) programming languages and compiler stack in the world with O(1) performance for these grapheme-based operations.
(As far as I know the indexing, substring and regexing performance of Swift and Elixir -- the only other languages I'm aware of that have adopted "what a user thinks of as a character" as their standard string type's character abstraction -- is still O(n) or worse.)
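To illustrate what the O(1) versus O(n) distinction means in practice, here is a hedged Python 3 sketch. It is not how NFG or MoarVM works; it only shows the general trade: pay an O(n) segmentation cost once, store the boundaries, and every later lookup by character position needs no rescan. The class name GraphemeIndexed is hypothetical, and the segmentation rule (marks attach to the preceding codepoint) is the same rough approximation of annex #29 as before, not the full algorithm.

```python
import unicodedata

class GraphemeIndexed:
    """Hypothetical wrapper: segment once, then index without rescanning."""

    def __init__(self, text):
        self._text = text
        # Codepoint offsets where each approximate grapheme starts.
        # Assumption: a mark (category M*) never starts a grapheme,
        # except at position 0 where there is nothing to attach to.
        self._starts = [
            i for i, ch in enumerate(text)
            if i == 0 or not unicodedata.category(ch).startswith("M")
        ]
        self._starts.append(len(text))   # end sentinel

    def __len__(self):
        return len(self._starts) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError("grapheme index out of range")
        # Two list lookups locate the grapheme; no scan of the text.
        return self._text[self._starts[i]:self._starts[i + 1]]

g = GraphemeIndexed("\u0928\u093F\u0928")   # "नि" followed by "न"
print(len(g))    # 2
print(g[1])      # the second character, found without rescanning
```

Without the stored boundaries, each g[i] would have to re-segment from the start of the string, which is the O(n) behaviour described above.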
What about third-party add-ons for this functionality in Python?
The primary source of guidance, reference implementations, and locale specific data related to Unicode, including annex #29, is ICU (code in C/C++ and Java) and CLDR (locale specific data related to text segmentation, including of characters). Many languages rely on bindings/wrappers of these resources for much of their Unicode support. In the Python case the PyICU project is a binding/wrapper with a long history that credibly (to me, just an onlooker) claims production status.
I'm unsure about the status of other projects. The pure Python uniseg has a PR, and a reply to that PR, from this year, but the library itself hasn't been updated since 2015. Since then Unicode has substantially updated annex #29 in ways that require conforming implementations to change. Another library, simpler but newer, is grapheme, as introduced in this blog post. In some ways this is the most promising library I found. That said, it's currently marked as Alpha status.
Note that neither PyICU nor uniseg nor grapheme provides anything remotely like the ergonomic simplicity and deep integration that the Perl 6 language provides for character=grapheme indexing, substring handling, regexing, etc.
Furthermore, ICU, and thus any module that builds directly on its code (which I believe is true of PyICU, uniseg and grapheme), does not provide O(1) grapheme-based indexing, substring and regexing performance. (cf the grapheme library's comment that "Execution times may improve in later releases, but calculating graphemes is and will continue to be notably slower than just counting unicode code points".)
Perhaps my overall point has gotten lost as I've tried to provide substantive detail.
The bottom line is that Perl has long been a leader in text processing capabilities and in that regard, as in many others, it's in great shape, including and perhaps especially in how it compares with Python.
Sorry it took me so long to spot your reply and write this comment. (And because of that I'm not going to simultaneously start another sub-thread about another topic as I originally said I would if you replied. Let's see if you spot this reply and then maybe we can wrap this sub-thread first and only start another if we're both interested in doing so.)
To keep my commentary as short as I reasonably can while still doing the topic justice, I sharply narrowed the discussion above to characters in a Unicode string; Perl 6; and Python 3: