
Re^3: Curious about Perl's strengths in 2018

by raiph (Chaplain)
on May 20, 2018 at 19:48 UTC

in reply to Re^2: Curious about Perl's strengths in 2018
in thread Curious about Perl's strengths in 2018

To keep my commentary as short as it can be while still doing the topic justice, I've sharply narrowed the discussion to: characters in a Unicode string; Perl 6; and Python 3. See my notes at the end for discussion of this narrowing.

What's a character?

For a while last century, "character", in the context of computing, came close to being synonymous with an ASCII byte. But that was always a string implementation detail, one that rests on an assumption that's broken in the general case: a character is not an ASCII byte unless you stick to a very limited view of text, one that ignores most of the world's text -- including even English text, if it contains arbitrary Unicode characters, e.g. tweets, which may look English but are allowed to contain arbitrary characters.

For a while this century, "character", in the context of contemporary mainstream programming languages and developer awareness, has come close to being synonymous with a Unicode codepoint. Unfortunately, assuming a codepoint is a character in the ordinary sense is again a broken assumption in the general case. Even if you're dealing with Unicode text, a character does not correspond to a Unicode codepoint, unless you continue to stick to a very sharply limited view of text and characters that again excludes arbitrary Unicode text.

So, just what is a "character" given some Unicode string?

"What a user thinks of as a character"

If we're talking about Unicode, it's helpful to consider Unicode's precisely chosen vocabulary for describing text and, in particular, characters. Unicode's definition of "what a user thinks of as a character" is a sequence of codepoints selected according to rules (algorithms) and data defined by Unicode. So a character might be just one codepoint, but it might be -- and often will be -- many. And you can't tell which unless you iterate through a given text string, calculating the start and end of individual characters according to the relevant general Unicode rules and data (and locale-specific overrides).
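To make that iteration concrete, here's a deliberately naive Python sketch (my own illustration, not Unicode reference code) that handles only the simplest case -- attaching combining marks to the preceding base codepoint. Full annex #29 segmentation involves many more rules (Hangul jamo, ZWJ emoji sequences, regional indicators, and so on):

```python
import unicodedata

def naive_graphemes(s):
    """Very rough grapheme segmentation sketch: attach any codepoint
    with a non-zero Unicode combining class to the preceding base
    codepoint. Real UAX #29 segmentation has many more rules."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # combining mark joins the previous cluster
        else:
            clusters.append(ch)  # new base codepoint starts a new cluster
    return clusters

# "e" + COMBINING ACUTE ACCENT + "tude": 6 codepoints, 5 characters
print(naive_graphemes("e\u0301tude"))  # 5 clusters from 6 codepoints
```

Even this toy version shows the key point: you can't know a string's character count, or where character N starts, without walking the whole string.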

This latter reality -- that an individual character can be composed of multiple codepoints -- is why the character=codepoint assumption is a 21st century error similar to the 20th century error of assuming character=byte. The codepoint=character assumption allows for fast indexing -- but it's increasingly often wrong, leading to broken code and corrupt data.

What Perl (5 + 6) and Python (2 + 3) think of as a character

Armed with the knowledge that Unicode made the (frankly terrible, but irreversible) choice to use the word "grapheme" to denote a character in the ordinary sense -- what we all meant by "character" before bytes and codepoints confused the issue -- one can begin to get some sense of the level of support for ordinary character handling in any given programming language by searching within its resources for "grapheme".

Google searches for "grapheme+<prog-lang-web-home>" and "grapheme+<prog-lang-goes-here>":

  • grapheme+perl yields interesting reading that highlights Perl's world-leading Unicode support (including support for processing characters aka graphemes).

  • grapheme+perl6 yields a different set of matches, with some overlap with the plain perl search, but this time highlighting Perl 6's leadership in processing graphemes, both within the Perl world and outside it.

  • grapheme+python.org yields nothing but "Missing: grapheme" matches. In other words, there are zero matches. No matches in the PEPs that drove Python 3's design, including PEP 393 -- Flexible String Representation ("There are two classes of complaints about the current implementation of the unicode type"). The Python 3 Unicode HOWTO? No matches. The entire python.org site? No matches.

  • grapheme+python shows the reality for Python. The opening "Did you mean: graphene python" is perhaps just amusing. Likewise the Grapheme Toolkit that allows "visualization of complex molecular interactions ... that is the most natural to chemists". The relative lack of upvotes (and absence of comments) on the /r/python post introducing the grapheme library is a bit more telling. Likewise the paucity of useful links in general. The bug report that ends in February 2018 with "We missed 3.7 train. ... I have many shine features I want in 3.7 and I have no time to review all. Especially, I need to understand tr29. It was hard job to me." suggests a striking lack of focus on this issue in the Python community at large.

An example that's No F💩💩king Good

Given the discussion thus far, it should come as no surprise that the built-in string type, functions, and standard libraries of both Python 2 and Python 3 will yield the wrong result for string length, character indexing, and substrings if A) what you're interested in is character=grapheme processing as contrasted with character=codepoint processing, and B) a string contains a grapheme that isn't a single codepoint.

One fun way to see this in action is to watch Patrick Michaud's lightning talk about text processing that's No F💩💩king Good. If you don't have 5 minutes, the following link takes you right to the point where Patrick spends 30 seconds trying Python 3. Of the three simple tests used in his talk, Python 3 gets two "wrong".
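You can reproduce the same kind of failure in any stock Python 3 using the D-with-double-dots example from the talk, i.e. U+0044 followed by combining U+0308:

```python
# LATIN CAPITAL LETTER D + COMBINING DIAERESIS: one character to a reader
s = "D\u0308"

print(len(s))    # 2 -- Python counts codepoints, not characters
print(s[0])      # 'D' -- indexing splits the grapheme apart
print(s[::-1])   # reversal moves the diaeresis onto the wrong codepoint
```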

Part of the fun is that this example may be No F**king Good in a manner not at all intended by Jonathan Worthington who wrote the presentation or me when I originally included it here. Prompted by a reader who challenged several aspects of this post, including this one, my brief investigation thus far suggests that the specific example of a D with double dots is actually a "degenerate case" -- one that "never occurs in practice", or at least one that will generally only occur in artificial/accidental scenarios such as the test in the video.

(It looks like it may have been naively taken from the "Basic Examples" table in Unicode annex #15 on the mistaken assumption that it's not degenerate, when instead (perhaps) it's in the table as an example in which normalization to a single character is not appropriate because that character doesn't appear in practice. If you can confirm or deny its degenerate nature, please comment.)

Does this mean the thrust of this post -- about character=grapheme vs character=codepoint -- is essentially invalid? No. While D with double dots may be an especially poorly chosen example, the problem does occur for a huge number of non-degenerate characters. This is demonstrated by the reported length of a common single character in Devanagari, one of the world's most used scripts: 2 codepoints (even after normalization) in Python 2 and Python 3, but 1 -- i.e. correct -- in Perl 6.
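To make that concrete with a non-degenerate case, here's one such Devanagari character in Python 3 (the specific letter is my choice for illustration; the linked example may use a different one):

```python
import unicodedata

# DEVANAGARI LETTER NA + DEVANAGARI VOWEL SIGN I: one character to a reader
s = "\u0928\u093F"

print(len(s))                                 # 2 -- codepoint count, not character count
print(len(unicodedata.normalize("NFC", s)))   # still 2 -- no precomposed form exists
```

Perl 6's .chars reports 1 for the same string, because its string type counts graphemes rather than codepoints.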

Does it matter that Perl 6's character and substring accessing time is O(1)?

If you watch the whole of Patrick's talk you'll see he covers the point that Perl 6 has "O(1) substring, index, etc.".

But for most things, other langs are faster than Perl 6 -- a lot faster. So does O(1) indexing matter?

IMO it does. It's taken years to get the architecture of Perl 6 and the Rakudo compiler right, but the hard part is now done, and NFG (Normalization Form Grapheme), along with all the other innovative elements in Perl 6 and Rakudo, is in place and getting increasingly battle-hardened and optimized.

If character processing in general matters, then presumably O(1) character indexing, substring processing, and regexing matters. And if so, the Perl 6 and nqp languages, and Rakudo / NQP / MoarVM compiler stack, are all in a great place given that they're the first (and I believe only) programming languages and compiler stack in the world with O(1) performance.

(As far as I know the indexing, substring and regexing performance of Swift and Elixir -- the only other languages I'm aware of that have adopted "what a user thinks of as a character" as their standard string type's character abstraction -- is still O(n) or worse.)
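The essential idea behind O(1) grapheme indexing can be sketched in a few lines of Python: pay the O(n) segmentation cost once up front, then index into the precomputed clusters. (This is only a conceptual illustration of my own; Perl 6/MoarVM's NFG actually uses synthetic codepoints rather than lists of strings, and real segmentation follows annex #29, not the naive combining-mark rule below.)

```python
import unicodedata

class GraphemeString:
    """Sketch: segment once at construction (O(n)), then answer
    length and indexing queries in O(1). Segmentation here is naive
    (combining marks only), not full UAX #29."""
    def __init__(self, s):
        self._clusters = []
        for ch in s:
            if self._clusters and unicodedata.combining(ch):
                self._clusters[-1] += ch
            else:
                self._clusters.append(ch)

    def __len__(self):
        return len(self._clusters)   # O(1): just a list length

    def __getitem__(self, i):
        return self._clusters[i]     # O(1): direct list indexing

s = GraphemeString("D\u0308ies")
print(len(s))   # 4 -- graphemes, not the 5 codepoints
print(s[0])     # D plus combining diaeresis, kept together as one unit
```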

What about third-party add-ons for this functionality in Python?

The primary source of guidance, reference implementations, and locale specific data related to Unicode, including annex #29, is ICU (code in C/C++ and Java) and CLDR (locale specific data related to text segmentation, including of characters). Many languages rely on bindings/wrappers of these resources for much of their Unicode support. In the Python case the PyICU project is a binding/wrapper with a long history that credibly (to me, just an onlooker) claims production status.

I'm unsure about the status of other projects. The pure-Python uniseg has seen a PR, and a reply to that PR, this year, but its code hasn't been updated since 2015 -- and Unicode has since substantially updated annex #29 in ways that require conforming implementations to change. Another simpler but newer library is grapheme, as introduced in this blog post. In some ways this is the most promising library I found. That said, it's currently marked as Alpha status.

Note that neither PyICU nor uniseg nor grapheme provides anything remotely like the ergonomic simplicity and deep integration that the Perl 6 language provides for character=grapheme indexing, substring handling, regexing, etc.

Furthermore, neither ICU, nor any module that builds directly on its code -- which I believe is true of PyICU, uniseg, and grapheme -- provides O(1) grapheme-based indexing, substring, and regexing performance. (Cf. the grapheme library's comment that "Execution times may improve in later releases, but calculating graphemes is and will continue to be notably slower than just counting unicode code points".)


Perhaps my overall point has gotten lost as I've tried to provide substantive detail.

The bottom line is that Perl has long been a leader in text processing capabilities and in that regard, as in many others, it's in great shape, including and perhaps especially in how it compares with Python.


Sorry it took me so long to spot your reply and write this comment. (And because of that I'm not going to simultaneously start another sub-thread about another topic as I originally said I would if you replied. Let's see if you spot this reply and then maybe we can wrap this sub-thread first and only start another if we're both interested in doing so.)

To keep my commentary as short as it can be while still doing the topic justice, I sharply narrowed the discussion above to characters in a Unicode string; Perl 6; and Python 3:

  • I only discuss one very narrow topic, namely indexing characters in a Unicode string per "what a user thinks of as a character" as discussed in Unicode annex #29.

  • Of the Perls, I only discuss Perl 6 even though Perl 5 is much more mature, with broad and deep Unicode support in terms of userland modules, and is generally much faster than Perl 6. Note that the two Perls can be used together to produce best-of-both-Perls solutions.

  • Perl 6 has world leading character handling features with outstanding ergonomics and O(1) performance. This includes sped up and simplified character indexing, substring processing, and regexing. (Perl 6 has other great Unicode features too but I don't discuss these, or indeed the substring or regex features. It all builds on the fundamental character abstraction used by Perl 6 and that's all I discuss.)

  • I contrast Perl 6's support for Unicode characters with Python 3's. Python 3 is considered by many to have adequate Unicode support, on par with most mainstream languages, and significantly better support than Python 2's, especially/ironically with regard to characters. So if you like Perl 6's advantage over Python 3 then you should like Perl's advantages over most mainstream languages, including both Pythons.
