Bruder Savigny has asked for the wisdom of the Perl Monks concerning the following question:

I tried to use a UTF-8 non-breaking space (between day and name of month) in the format argument of POSIX::strftime, and hit (with Perl v5.32.0 and a UTF-8-encoded script file, without any non-default encoding settings) upon the following two oddities:

  1. a non-breaking space alone comes out as something unprintable (according to Emacs, Unicode 65533 (decimal) REPLACEMENT CHARACTER, but when examined in a hex-mode, looks like hexadecimal EFBFBD)
  2. when other non-ASCII characters figure in the format, they come out correctly, and this seems "infectious": the non-breaking space then comes out correctly as well! However, in that case, a string that is concatenated to what strftime returns gets garbled (perhaps erroneously encoded from an assumed iso-latin-1 (but really already utf-8) to utf-8), which does not happen in case 1

These behaviours can be demonstrated with the following script (The comments apply to the transparent space character in the format; the innocent-looking - inner, i.e. not syntactical - quotes in lines 3 and 4 are Unicode LEFT and RIGHT SINGLE QUOTATION MARK, the same as in the $string):

use POSIX qw(strftime);
$string = 'hailed an ‘über’ cab on ';
@t = (0, 0, 0, 23, 5, 2020, 4);
print $string . strftime(  '%d/%b', @t), "\n";
print $string . strftime(  '%d %b', @t), "\n"; # UTF-8 nbsp
print $string . strftime('‘%d %b’', @t), "\n"; # UTF-8 nbsp
print $string . strftime('‘%d %b’', @t), "\n"; # ASCII space

This outputs (line numbers added):

1 hailed an ‘über’ cab on 23/Jun
2 hailed an ‘über’ cab on 23�Jun
3 hailed an über cab on 23Jun
4 hailed an über cab on 23 Jun

(I have deleted the complaints about wide characters in print for lines 3 and 4 for brevity.)

I am guessing, rather vaguely, that this comes down to strftime essentially being the C function, which is not Unicode-aware, and maybe also to the way Perl identifies how strings are encoded and then "upgrades" some of them so as to harmonise their encodings (in this case under a wrong assumption), but:

The behaviour with a non-breaking space alone vs. (also) other non-ASCII characters seems definitely inconsistent. Why is the behaviour different between the non-breaking space and typographical quotation marks, which are all outside the ASCII block?

Also, can anything be done about it, i.e. is it possible to use non-breaking spaces in a format for strftime such that they come out correctly (and without having to resort to inserting extra, likely unwanted, non-ASCII characters), and is it possible to use any non-ASCII character in the format argument without confusing Perl? (Actually, I can think only of non-breaking spaces as useful, but other cultures may very plausibly have other use cases.)


Replies are listed 'Best First'.
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by choroba (Archbishop) on Sep 21, 2020 at 19:08 UTC
    When I add use utf8; and correctly set the encoding of the output, it seems to work:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use open OUT => ':encoding(UTF-8)', ':std';
    use POSIX qw(strftime);

    my $string = 'hailed an ‘über’ cab on ';
    my @t = (0, 0, 0, 23, 5, 2020, 4);
    my $nbsp = chr 160;
    print $string . strftime(  '%d/%b', @t), "\n";
    print $string . strftime(  "%d$nbsp%b", @t), "\n";
    print $string . strftime("‘%d$nbsp%b’", @t), "\n";
    print $string . strftime('‘%d %b’', @t), "\n";

    Update: I used the $nbsp here, as PerlMonks replaces the non-breaking space with a normal ASCII space, but it works with the nbsp character directly in the script, too.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
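What `use utf8` changes can be seen by looking at string lengths. This is a minimal sketch, assuming the script file itself is saved as UTF-8; the variable names are made up for illustration. Without the pragma a non-ASCII literal is a sequence of bytes, with it the literal is a sequence of characters.

```perl
use strict;
use warnings;

# The same literal, compiled once without and once with the utf8 pragma.
# The pragma is lexically scoped, so each do-block gets its own setting.
my $as_bytes = do { no utf8;  'ü' };   # two bytes: 0xC3 0xBC
my $as_chars = do { use utf8; 'ü' };   # one character: U+00FC

printf "without utf8: length %d\n", length $as_bytes;
printf "with utf8:    length %d\n", length $as_chars;
```

Once literals are character strings, an output layer such as the one set by `use open OUT => ':encoding(UTF-8)', ':std'` is what turns them back into UTF-8 bytes when printing.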
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by perlfan (Vicar) on Sep 21, 2020 at 18:53 UTC
    strftime is implemented in glibc or your standard C library, so I don't think this issue is related specifically to Perl. Perhaps you can use POSIX::strftime for _just_ the numerical/time bits, then feed this into sprintf.
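The suggestion above could look roughly like the following. It is only a sketch under stated assumptions: `day_month` is a hypothetical helper name, the time list uses 120 for the year because POSIX::strftime counts years from 1900, and `%b` is locale-dependent, so month abbreviations containing non-ASCII letters might still need decoding.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: build "day<nbsp>month" without ever passing a
# non-ASCII character to strftime.
sub day_month {
    my @t = @_;
    my $day   = strftime('%d', @t);    # ASCII digits only
    my $month = strftime('%b', @t);    # locale-dependent abbreviation
    return sprintf("%s\x{A0}%s", $day, $month);   # U+00A0 is added by Perl
}

binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output
my @t = (0, 0, 0, 23, 5, 120);         # 23 June 2020 (year counts from 1900)
print day_month(@t), "\n";
```

The non-breaking space never passes through the C library here, so its handling depends only on Perl's own string semantics and the output layer.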

      Many thanks to both of you, and sorry for the somewhat belated answer. I have to admit I had worked under the impression that UTF-8 works out of the box with Perl, and had to read up a lot on that. I have now understood that you should not rely on that, even if it mostly looks like it. The fix using both use utf8 and use open ':encoding(UTF-8)' worked perfectly for me as well. The suggestion of using strftime for the numbers only would definitely have been a workable fallback solution that I hadn't thought of.

      The only thing that really puzzles me is the different outcome between the non-breaking space and the other non-ASCII characters. But then, it seems Perl has to do quite complex things around Unicode.

      Thanks again, and best wishes!
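One possible reading of the remaining puzzle, not confirmed anywhere in this thread, rests on two general Perl behaviours: joining a byte string with a character string treats the bytes as Latin-1, and printing to a handle without an encoding layer writes single bytes when every character is below 256 but falls back to UTF-8 otherwise. The sketch below only demonstrates those two behaviours. That strftime returned a character string in the original script is an assumption (it may depend on Perl version and locale), and `printed_hex` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: the bytes print() writes to a handle that has no
# encoding layer, shown as hex.
sub printed_hex {
    my ($str) = @_;
    no warnings 'utf8';    # silence "Wide character in print"
    open my $fh, '>', \my $buf or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close $fh;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buf;
}

my $bytes = "\xC3\xBC";    # UTF-8 bytes of "ü", as in a script without utf8

# Case 1: only U+00A0 is appended. Every character is below 256, so print
# writes one byte each, and the lone A0 is not valid UTF-8.
print printed_hex($bytes . "\x{A0}"), "\n";

# Case 2: quotation marks above 255 are appended as well. print now writes
# UTF-8 for everything, which double-encodes the original bytes.
print printed_hex($bytes . "\x{2018}\x{A0}\x{2019}"), "\n";
```

Under these assumptions the first case leaves the leading string intact but emits an invalid byte for the non-breaking space, while the second emits a correct non-breaking space and garbles the string in front of it, which matches the two oddities described in the question.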