Bruder Savigny has asked for the wisdom of the Perl Monks concerning the following question:
I tried to use a UTF-8 non-breaking space (between day and name of month) in the format argument of POSIX::strftime, and hit (with Perl v5.32.0 and a UTF-8-encoded script file, without any non-default encoding settings) upon the following two oddities:
These behaviours can be demonstrated with the following script. (The comments apply to the transparent space character in the format; the innocent-looking inner, i.e. not syntactical, quotes in lines 3 and 4 are Unicode LEFT and RIGHT SINGLE QUOTATION MARK, the same as in the $string.)
use POSIX qw(strftime);
$string = 'hailed an über ‘cab’ on ';
@t = (0, 0, 0, 23, 5, 2020, 4);
print $string . strftime( '%d/%b', @t), "\n";
print $string . strftime( '%d %b', @t), "\n"; # UTF-8 nbsp
print $string . strftime('‘%d %b’', @t), "\n"; # UTF-8 nbsp
print $string . strftime('‘%d %b’', @t), "\n"; # ASCII space
This outputs (line numbers added):
1 hailed an über ‘cab’ on 23/Jun
2 hailed an über ‘cab’ on 23�Jun
3 hailed an über âcabâ on ‘23 Jun’
4 hailed an über âcabâ on ‘23 Jun’
Note that
(I have deleted complaints about the wide characters in print for line 3 and 4 for brevity.)
I am guessing, rather vaguely, that this is down to strftime essentially being the C function, which is not Unicode-aware, and maybe also to the way Perl identifies how strings are encoded and then "upgrades" some of them so as to harmonise their encodings (in this case under a wrong assumption), but ... :
The behaviour with a non-breaking space alone vs. (also) other non-ASCII characters seems definitely inconsistent. Why is the behaviour different between the non-breaking space and typographical quotation marks, which are all outside the ASCII block?
Also, can anything be done about it, i.e. is it possible to use non-breaking spaces in a format for strftime such that they come out correctly (and without having to resort to inserting extra, likely unwanted, non-ASCII characters), and is it possible to use any non-ASCII character in that format argument without confusing Perl? (Actually, I can think only of non-breaking spaces as useful, but other cultures may very plausibly have other use cases.)
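One possible approach, offered only as a sketch, is to keep bytes and characters strictly apart: encode the format to UTF-8 bytes before it reaches strftime, decode whatever comes back, and put an encoding layer on STDOUT. The helper name strftime_chars is hypothetical and not part of POSIX. The sketch assumes that the LC_TIME locale is either C or a UTF-8 locale, and it writes the non-ASCII characters as \x{...} escapes so that the encoding of the script file does not matter.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Encode qw(encode decode);

# Hypothetical helper (not part of POSIX): takes a format given as a
# character string and returns the formatted date as a character string.
sub strftime_chars {
    my ($format, @time) = @_;

    # Hand the C-level function UTF-8 bytes rather than Perl characters.
    my $result = strftime(encode('UTF-8', $format), @time);

    # Depending on Perl version and locale, the result may already be
    # flagged as characters; decode it only if it is still a byte string.
    return utf8::is_utf8($result) ? $result : decode('UTF-8', $result);
}

# Encode on output so that every character is written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my @t = (0, 0, 0, 23, 5, 120);    # 23 June 2020; the year is counted from 1900
my $string = "hailed an \x{FC}ber \x{2018}cab\x{2019} on ";

print $string . strftime_chars("%d\x{A0}%b", @t), "\n";
print $string . strftime_chars("\x{2018}%d\x{A0}%b\x{2019}", @t), "\n";
```

With every string held as characters and a single encoding step on output, the non-breaking space and the quotation marks should be treated alike, and $string is no longer concatenated with a string in a different representation. Checking utf8::is_utf8 is a pragmatic guard against decoding twice rather than a general-purpose test.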
Replies are listed 'Best First'.
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by choroba (Archbishop) on Sep 21, 2020 at 19:08 UTC
Re: strftime does not handle Unicode characters in format argument properly (at least, not consistently)
by perlfan (Vicar) on Sep 21, 2020 at 18:53 UTC
by Bruder Savigny (Initiate) on Sep 24, 2020 at 00:07 UTC