
RFC: system calls on Unicode filesystem

by daxim (Chaplain)
on Feb 27, 2018 at 12:37 UTC ( #1210032=perlmeditation )

I'm about to email the pumpking for an intervention as a personal favour. Because I'm convinced that a half-arsed solution is better than no solution, it's past time that this more than 20-year-old embarrassment gets fixed:
› ver
Microsoft Windows [Version 10.0.16299.125]

› chcp 65001
Aktive Codepage: 65001.

› type αω.bat
@echo hiαω

› node -p "require('child_process').execSync('αω.bat').toString()"

› perl6 -e "run 'αω.bat'"

› php -r "system('αω.bat');"

› python -c "import subprocess;'αω.bat')"

› ruby -e "system 'αω.bat'"

› perl -Mutf8 -e "system 'αω.bat'"
Der Befehl "a?.bat" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.

› perl -Mutf8 -MWin32::Unicode::Process=systemW -e "systemW('αω.bat')"
Der Befehl "a?.bat" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.

Plan of attack: use 5.028 enables use feature 'just-make-it-work-already-dammit', which checks $^O eq 'MSWin32' and then replaces all the broken built-ins (chdir, mkdir, open, opendir, rename, rmdir, system, unlink, utime, -X, stat etc.) with the working equivalent code from Win32::Unicode. Somehow this also has to work for -e, not just for code executed from files.
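
For illustration only, the "check $^O, then swap in the working code" step can be approximated today at module level. This is a minimal sketch and not how a core feature would be implemented: the package name WideBuiltins is made up, only systemW (from the session above) is wired in, and whether systemW is a true drop-in for system is an assumption. It relies on the documented rule that a built-in is overridden by importing a sub of the same name at compile time.

```perl
package WideBuiltins;    # hypothetical name, not a real CPAN module
use strict;
use warnings;

# Runs at compile time via "use WideBuiltins" or BEGIN { WideBuiltins->import }.
# Importing a sub named like a built-in from another package overrides that
# built-in in the importing package only.
sub import {
    my $target = caller;
    return unless $^O eq 'MSWin32';    # every other platform stays untouched

    # Built-in name => [ module, replacement function ].
    # Only systemW is taken from the post; a real patch would extend this
    # table to chdir, mkdir, open and the rest.
    my %replacement = (
        system => [ 'Win32::Unicode::Process', 'systemW' ],
    );

    for my $name (sort keys %replacement) {
        my ($module, $func) = @{ $replacement{$name} };
        eval "require $module; 1" or die $@;
        no strict 'refs';
        *{"${target}::${name}"} = \&{"${module}::${func}"};
    }
    return;
}

package main;
BEGIN { WideBuiltins->import }

print "running on $^O\n";
```

This does nothing for -e, which is the part that still needs the "somehow".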

Now tell me why this is a stupid idea, but keep in mind that

  • if all the other languages can hack it, then so can we, no matter how shitty and insufficient you think the initial patch is
  • the better is the enemy of the good and a "better" solution did not turn up for decades
  • if I simply file a perlbug it just gets marked by p5p as a duplicate of a discussion whose proposed "better" solution did not turn up for decades

Replies are listed 'Best First'.
Re: RFC: system calls on Unicode filesystem
by dasgar (Priest) on Feb 27, 2018 at 23:37 UTC

    I'm not familiar with the history about p5p's previous views and actions (or inaction) related to this topic, so I'll refrain from comments on that aspect of your post.

    Since I primarily work in Windows, the idea of getting Perl to work better in Windows is appealing. I especially like the idea of having Perl default to the filesystem API that supports the longer path lengths and Unicode. However, I don't have the expertise or time to help with efforts to achieve that goal.

    My concern about your idea is the proposed use of Win32::Unicode. When I was trying to work with long paths in Windows, I had issues getting Win32::Unicode to work. That was quite a while ago, so I don't remember what in particular I had problems with. Ignoring that, the latest version was released back in 2012. That by itself might not be an issue. But it becomes an issue when there are a lot of failures on CPAN Testers for newer versions of Perl and the author hasn't really responded to submitted issues. The combination of all of those facts leaves an impression that the module is broken and abandoned - at least that's the impression that I personally have.

    I think a better candidate might be Win32::LongPath, whose author does credit the author of Win32::Unicode. It has a much better success record for newer versions of Perl on CPAN Testers, has no currently open issues, and the latest release was about 2 weeks ago.

    I'm not going to try to pressure anyone not to use Win32::Unicode. But I thought I'd share my view that Win32::LongPath would be a better choice, along with the reasons why I prefer it over Win32::Unicode.

      This is great feedback, thank you. I didn't know about Win32::LongPath yet. It does not have systemL, so at least that part needs to be integrated from Win32::Unicode.
Re: RFC: system calls on Unicode filesystem
by salva (Abbot) on Feb 28, 2018 at 08:06 UTC
    The broken handling of Unicode on file system operations is not just a Windows issue. It is not done correctly in Unix/Linux either:

    See how the UTF8 flag is completely ignored in scalars passed as arguments to built-ins performing file system operations in the following session:

    salva@atun:/tmp/unicode$
    $ $a="a\xf1o"
    a�o
    $ $b = $a
    a�o
    $ use Devel::Peek
    $ utf8::upgrade($b)
    4
    $ Dump $a
    SV = PV(0x557576d4e2d0) at 0x557576812a80
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x557576d516c0 "a\361o"\0
      CUR = 3
      LEN = 10
      COW_REFCNT = 0
    $ Dump $b
    SV = PV(0x557576d4e2a0) at 0x557576d4b7d0
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x557576d6e6e0 "a\303\261o"\0 [UTF8 "a\x{f1}o"]
      CUR = 4
      LEN = 10
    $ open A, ">$a";
    1
    $ open B, ">$b";
    1
    $ system "ls"
    año a?o
    0
    $ $a eq $b
    1

      This is wrong: it has nothing to do with the UTF8 flag; see the demo program below. open A in your example creates a file whose name is broken for display purposes only¹ because you (implicitly) used an encoding (namely, Latin1) that does not agree with the encoding of the filesystem² (namely, UTF-8).

      If a programmer does the straightforward thing that comes to mind, namely just using a string without even bothering to encode it to octets, then there is no problem. Both strings and UTF-8 encoded octet sequences work. I personally haven't even run into the problem during the last ten years, because triggering it requires special action to set oneself up for failure.

      To come back to the topic expressed in the root post, I think nothing needs to be done for Posix. This is a Windows only bugfix.

      ¹ The big difference from the Windows shitshow, where system calls do not work and return error messages, is that on Posix they do work even if you somehow fuck up the encoding. The file will exist, will be enumerable, will create a filehandle you can operate on; merely the name will display wrong.

      ² De jure the encoding of a Posix filesystem is unknown and unknowable, but de facto it is UTF-8. I remember dwheeler is doing work to get UTF-8 as default into the next version of the standard, but I can't find the relevant document on the Web.

      use 5.026;
      use utf8;
      use Devel::Peek qw(Dump);
      my $string = 'string-año';
      #   FLAGS = (POK,IsCOW,pPOK,UTF8)
      #   PV = 0x12f4760 "string-a\303\261o"\0 [UTF8 "string-a\x{f1}o"]
      my $octets_latin1 = "octets-latin1-a\x{f1}o";
      #   FLAGS = (POK,IsCOW,pPOK)
      #   PV = 0x12f4420 "octets-latin1-a\361o"\0
      my $octets_utf8 = "octets-utf8-a\x{c3}\x{b1}o";
      #   FLAGS = (POK,IsCOW,pPOK)
      #   PV = 0x12ebb00 "octets-utf8-a\303\261o"\0
      my $fh;                         # ↓ displays in shell as: ↓
      open $fh, '>', $string;         # string-año
      open $fh, '>', $octets_latin1;  # 'octets-latin1-a'$'\361''o'
      open $fh, '>', $octets_utf8;    # octets-utf8-año
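
Footnote ¹ can be checked with a short sketch. It assumes a Linux-style filesystem that accepts arbitrary octets in file names; other filesystems may reject or normalise such a name. The temporary directory and the variable names are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed again when the program exits.
my $dir = tempdir(CLEANUP => 1);

# A Latin-1 encoded name: not valid UTF-8, so it displays wrong in a UTF-8 shell.
my $name = "octets-latin1-a\xf1o";

# The system call itself succeeds and gives a usable filehandle.
open my $fh, '>', "$dir/$name" or die "open failed: $!";
print {$fh} "hi\n";
close $fh or die "close failed: $!";

# The file exists and is enumerable under exactly the octets we passed in.
opendir my $dh, $dir or die "opendir failed: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

my $size = -s "$dir/$name";
print "entries: ", scalar(@entries), ", size: $size\n";
```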
        My point is that two variables that eq sees as containing the same value create two different files just because of their internal representation.

        The representation used by perl internally should not affect external interactions.

        Update: BTW, note that in my sample repl above I have used utf8::upgrade which just changes the representation. Quoting utf8:

        Converts in-place the internal representation of the string from an octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The logical character sequence itself is unchanged.
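
A minimal sketch of the disagreement above, and of one way to take the internal representation out of the picture: encode the name explicitly before it reaches a filesystem built-in. It assumes the filesystem expects UTF-8 names; the variable names are only illustrative, and no file is created.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $native   = "a\xf1o";     # three characters, stored as Latin-1 octets
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, UTF-8 internal representation

# eq compares characters, so the two scalars are equal ...
my $equal = $native eq $upgraded ? 1 : 0;

# ... although their internal flags differ.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# encode() works on the characters, so both yield the same octets,
# which could then be passed to open() without ambiguity.
my $name_native   = encode('UTF-8', $native);
my $name_upgraded = encode('UTF-8', $upgraded);
my $same_octets   = $name_native eq $name_upgraded ? 1 : 0;

print "equal=$equal flags=$flag_native/$flag_upgraded same_octets=$same_octets\n";
```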
Re: RFC: system calls on Unicode filesystem
by choroba (Bishop) on Feb 27, 2018 at 12:49 UTC
    > › perl -Mutf8 -e "system 'αω.bat'"

    What happens if you omit the -Mutf8? (No MSWin around to test myself.)

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      same error message
