
Unify windows filenames

by Sewi (Friar)
on Sep 19, 2009 at 20:31 UTC (#796321=perlquestion)
Sewi has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I need to detect unique filenames. On Windows, the following are the same file:
  • C:\dir\
  • C:/dir/
  • C:\dir\File.PL

I already fixed the / \ problem, but the last case is really complicated. I could do an lc() on everything and be done, but as this app should run on many OSes, I'd really prefer using a module which covers as many OSes as possible.

Is there such a module where I feed in the given filename and it returns the true one? I didn't find anything really useful on CPAN which supports at least Linux and Windows.

    Replies are listed 'Best First'.
    Re: Unify windows filenames
    by graff (Chancellor) on Sep 20, 2009 at 01:57 UTC
      There are some subtleties in handling this sort of thing across different OS's. As you've indicated, using Windows file paths in Perl leads to lots of "synonymy" (different ways of expressing the same path: all-lower-case == mixed-case == all-upper-case, "\" == "/" ), whereas on linux and "traditional" unix, there is no such "synonymy": if two path strings differ in any way on any one character, they represent distinct paths/files.

      And then there's the Mac OS X adaptation of BSD Unix, where (in contrast to every other type of Unix-based or Unix-like OS I know of) case distinctions are ignored. I just tried this on my Mac, and was quite saddened by the result:

      $ echo foobar > /tmp/junk
      $ cat /tmp/junk
      foobar
      $ cat /tmp/JUNK
      foobar
      $ echo hello > /tmp/JUNK
      $ cat /tmp/junk
      hello
      $ cat /tmp/JUNK
      hello

      I see that File::Spec has a "case_tolerant()" function, which (according to the man page) is supposed to return "a true or false value indicating, respectively, that alphabetic case is not or is significant when comparing file specifications."

      But having tried that just now on macosx, I find that it returns false (case is significant), despite the fact that case is demonstrably not significant for distinguishing paths on macosx. I think that's something that needs to be fixed in File::Spec.

      (I'm still using the Perl 5.8.8 that shipped with macosx 10.5.8 -- maybe this was fixed in Perl 5.10?)
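
For reference, here is a minimal sketch of how case_tolerant() can be queried, both for the running platform and for specific File::Spec subclasses (which can be loaded on any OS). What File::Spec->case_tolerant returns on a given system depends on the installed File::Spec version, so no output is claimed here.

```perl
use strict;
use warnings;
use File::Spec;
use File::Spec::Unix;
use File::Spec::Win32;

# Ask File::Spec about the platform this script is running on.
printf "%s: case_tolerant = %s\n", $^O, File::Spec->case_tolerant ? 'yes' : 'no';

# The per-platform subclasses can be asked directly, regardless of $^O.
for my $class (qw(File::Spec::Unix File::Spec::Win32)) {
    printf "%s: case_tolerant = %s\n", $class, $class->case_tolerant ? 'yes' : 'no';
}
```

Running this on each target machine is a quick way to see whether the answer matches how the file system actually behaves.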

      So to succeed in your cross-platform intentions, you yourself have to check the value of $^O -- if it's "darwin", you have to fold case; if it's "MSWin32", you have to fold case and slashes; anything else, you leave all characters as-is.
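
A minimal sketch of that rule of thumb, assuming the goal is only a comparison key and not a usable path. The helper name path_key and its optional $os argument are made up for illustration; it does not deal with trailing separators, and on darwin it assumes the default case-insensitive file system.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a path into a comparison key.
# $os defaults to the running platform ($^O).
sub path_key {
    my ($path, $os) = @_;
    $os = $^O unless defined $os;
    if ($os eq 'MSWin32') {
        (my $key = lc $path) =~ tr{/}{\\};   # fold case and slashes
        return $key;
    }
    return lc $path if $os eq 'darwin';      # fold case only
    return $path;                            # everything else: leave as-is
}

# Keep only the first spelling of each file.
my %seen;
for my $p ('C:\dir\File.PL', 'c:/DIR/file.pl') {
    print "$p\n" unless $seen{ path_key($p, 'MSWin32') }++;
}
```

Passing the OS name explicitly, as in the loop above, makes it possible to test the Windows rules on a non-Windows machine.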

      (updated to fix link to module page)
      (updated to change "mswin" to "MSWin32", which is what File::Spec looks for)

          Ah -- right, of course. Okay, I got that, and it isn't fixed. Guess I should report to the maintainer -- I think the fix in "File/Spec/ is pretty simple:
          193c193
          < sub case_tolerant { 0 }
          ---
          > sub case_tolerant { ( $^O eq 'darwin' ) ? 1 : 0 }

          There's even more strangeness about the darwin path/name "logic": the bash shell's file-name completion function is case-sensitive, even though the underlying OS file-name handling is not. So for most of the command-line stuff I do (which is most of what I do), it feels (and I have to type) as if the file names are case-sensitive, even though they aren't. Go figure.

    Re: Unify windows filenames
    by Anonymous Monk on Sep 19, 2009 at 22:36 UTC
    Re: Unify windows filenames
    by Anonymous Monk on Sep 20, 2009 at 03:35 UTC

      You'll have to lc as appropriate but this covers path normalization pretty well.

      use Path::Class;

      my @file = qw( C:\dir\ C:/dir/ C:\dir\File.PL );
      for my $file ( @file ) {
          my $path = Path::Class::File->new($file);
          print $path->as_foreign("Win32")->absolute, $/;
      }
      # C:\dir\
      # C:\dir\
      # C:\dir\File.PL

    Re: Unify windows filenames
    by ELISHEVA (Prior) on Sep 20, 2009 at 18:32 UTC

      At present I don't know of a good out-of-the-box solution to this problem if you really want all platforms and absolute uniqueness.

      First, if you really need to verify that two paths point to the same file, there are more factors to consider than just path name syntax. On Win32 systems, every file has several different names. Depending on the application providing input to your program a file might be identified by any of the following:

      • a short path name using 8.3 notation
      • a long "case-preserving" path name
      • one or more UNC path names, e.g. \\MyMachine\C$\Public\Foo.txt or \\Public\Foo.txt
      • a device path name, e.g. \\.\HarddiskVolume1\Public\Foo.txt
      • multiple NT path names, e.g. \DosDevice\C:\Public\Foo.txt, \Device\C:\Public\Foo.txt or \??\C:\Public\Foo.txt
      • a pathname beginning with '\\?\', e.g. \\?\UNC\Public\Foo.txt or \\?\C:\Public\Foo.txt

      Note that all of the above paths are 'case preserving' but case insensitive. That means you can safely lower case the entire path name and Win32 will still be able to find the files. In addition, XP path names can contain "junction points" (roughly equivalent to symbolic links to directories). Starting with Vista, symbolic links to files and directories are also supported.

      *nix systems don't have the huge range of path name syntax, but the same file can still have a variety of names via hard links and symbolic links. On Cygwin systems, you also have to take into account mounted paths: any Win32 drive or directory can be mounted as a *nix path, e.g. "/cygdrive/c/Public/foo.txt" and "C:\Public\foo.txt" might refer to the same file.

      Some, but not all, of these issues can be handled in a portable way using Cwd::abs_path. It doesn't handle all of the Win32 path variants, and there is a reported bug for mounted Unix drives (see the bugs link on the Cwd page for details): using the routine on a mounted drive may fail if changing the current directory to a mounted drive changes the effective GID or UID. Additionally, it relies on File::Spec for path normalization.
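
As a rough sketch of the Cwd::abs_path approach: resolve both paths and compare the results. The helper name same_resolved_path is made up for illustration, and the eval is there because abs_path may die rather than return undef for unresolvable paths on some platforms. Hard links are not detected this way.

```perl
use strict;
use warnings;
use Cwd qw(abs_path);

# Hypothetical helper: true if both paths resolve to the same absolute path.
# Returns false if either path cannot be resolved.
sub same_resolved_path {
    my ($p1, $p2) = @_;
    my ($r1, $r2) = map { scalar eval { abs_path($_) } } $p1, $p2;
    return defined $r1 && defined $r2 && $r1 eq $r2;
}

print same_resolved_path('.', './.') ? "same\n" : "different\n";
```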

      If you are only interested in normalizing path names (rather than identifying the "official" name for a file), the solution recommended by most Perl documentation, including perlport, is to use the File::Spec module for portability. For portability between *nix and Win32 it appears to be fairly reliable. However, if you start including platforms like Cygwin, VMS, Darwin, and Mac Classic, some of its decisions may not be entirely portable. Path::Class and Path::Classy are both wrappers around File::Spec, so many of the same issues apply.

      If you do use File::Spec, you can use File::Spec->canonpath() to make path name syntax more regular. File::Spec tends to assume that all paths can and should be converted to the syntax of *nix paths. This can sometimes produce incorrect results:

      • Cygwin: Cygwin is a *nix-alike layer that runs on Win32 systems. Cygwin's implementation of canonpath treats '/' as the canonical separator, and this can sometimes produce illegal paths. Cygwin supports both POSIX (*nix) and Win32 style paths. Win32 style paths need to keep at least one backslash ('\') in the path or else Cygwin won't be able to recognize that the path is meant to be a Win32 style path. Without the backslash it reads 'C:/foo' as a relative path starting with a directory that just happens to be named 'C:'. Of course, such a directory is unlikely to exist, so you'll get path-not-found errors if you canonicalize Win32 style paths using the File::Spec::Cygwin module.

      • VMS: VMS path syntax is very different from either Win32 or Unix. For example, A::B:[C.D.E]F.DAT;32 would mean the 32nd version of the F.DAT file found in the directory "C.D.E" (which is /C/D/E in *nix paths) on the node B within the host A.

        Modern VMS systems can mount both case sensitive and case insensitive drives. If you look at the bug list for File::Spec you will see a lot of discussion about how to deal with this but no entirely satisfactory solutions. Another problem has arisen with the introduction of the ODS5 file system. The implementation of File::Spec::VMS also tries to unixify paths before canonizing them. On ODS2 (the older native path syntax used on VMS) there were only a limited number of characters and one could reliably convert back and forth between *nix and VMS style paths. However, ODS5 allows for a much wider range of pathname characters and there is no way to do lossless path conversions. "..." on *nix could mean the ODS5 path ^.^.^., ^..^., or ^.^..

        Even in ODS2 there were conversion problems. The code that converts paths to *nix syntax needs to tell the difference between *nix paths and VMS paths, but certain paths are ambiguous: "perl_5.8.10" could mean the *nix executable (or directory) "perl_5.8.10", or it could mean version 10 of the VMS file "perl_5.8". The ambiguity arises because ODS2 uses "." both as the separator between file name and extension and as the separator between file name+ext and file version number.

      • Macs: Mac machines have many of the same problems as Cygwin and VMS. Older versions of the Mac operating system (Mac Classic) supported only Apple's native HFS path name syntax, which uses ":" as a path name separator. It also has a few other oddities: you can't represent rooted paths without specifying a disk drive; the equivalent of 'a/..' in *nix is ":a::", among others. The newer version of the Mac operating system (known as Darwin or Mac OS X) supports both the older HFS paths and *nix paths.

        perlport classifies Darwin as a *nix-alike, but both path syntaxes are used. In particular, paths being fed to Perl from a user interface application are likely to be in HFS format. There is no 100% reliable way to tell which path is *nix and which is HFS because the HFS path separator ':' is a valid character in *nix file names and the *nix separator '/' is a valid character in HFS file names. That is, 'HD1:May/2009' could be an absolute HFS path identifying a file named 'May/2009' on drive "HD1" or a relative *nix path identifying a file named "2009" in the directory named "HD1:May".

        Depending on how you configure the system and the file system installed on each of your disk partitions, the *nix paths can be either case sensitive or case insensitive. As discussed above by graff and YourMother, File::Spec doesn't seem to take this into account.
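
To make the File::Spec->canonpath() discussion above concrete, here is a small sketch that calls the platform-specific subclasses directly, so the same script can be tried on any OS. It only shows that each subclass applies its own syntax rules; it says nothing about case or about whether two paths name the same file.

```perl
use strict;
use warnings;
use File::Spec::Unix;
use File::Spec::Win32;

my @paths = ('C:\dir\\', 'C:/dir/', 'C:\dir\File.PL');

# Win32 rules: both separators are recognized and normalized.
print "Win32: ", File::Spec::Win32->canonpath($_), "\n" for @paths;

# Unix rules: a backslash is an ordinary file name character,
# so the first and second paths do not collapse to the same string.
print "Unix:  ", File::Spec::Unix->canonpath($_), "\n" for @paths;
```

Calling File::Spec->canonpath() instead picks the subclass for the running platform, which is where the Cygwin, VMS and Mac caveats in the list above come into play.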

      Best, beth

      Update: added a discussion of Cwd::abs_path.

        There is also Win32::AbsPath, in case Cwd::abs_path() doesn't fit.
