PerlMonks  

Detecting whether two pathes refer to the same file

by rovf (Priest)
on Sep 10, 2010 at 09:28 UTC ( [id://859612]=perlquestion )

rovf has asked for the wisdom of the Perl Monks concerning the following question:

I have two variables, $x and $y, containing file paths, and would like to find out whether they refer to the same file. The solution should ideally work under Unix and Windows. It need not work for symlinks too (though, of course, that would be nice).

I thought of using stat to get the inode numbers and see whether they are the same, but is this a reliable method on Windows too?

I could touch one of the files with a certain time stamp, and then stat the other file to see whether it got the same time stamp. This looks like a terrible hack, though.

In Java, there is a function JavaFileManager.isSameFile, which does exactly this, and I vaguely remember having seen something similar for Perl too, but I cannot find it anymore.

Any suggestions?
-- 
Ronald Fischer <ynnor@mm.st>

Replies are listed 'Best First'.
Re: Detecting whether two pathes refer to the same file
by ambrus (Abbot) on Sep 10, 2010 at 09:37 UTC

    The inode number alone is not enough, you also have to compare the device number from stat, because the inode numbers are unique only within each filesystem.

    The dev-inode pair usually identifies the file uniquely, but of course there are some caveats.

    • Perls are normally compiled with 64-bit file access, and you might need that for this trick too, because it makes inode numbers wider.
    • Device numbers might not be constant over a reboot. (They often are, but not always.)
    • Under Linux, if you see a file through a bind mount, it gets the same device-inode pair as the file on the original mount, so it's possible to have two file names that refer to the same underlying file but one of them is read-only.
    • Probably more stuff I don't know about.
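
The device-plus-inode comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: `same_dev_ino` is a hypothetical helper name, not something from the thread or from CPAN, and it is only meaningful on systems where stat returns real inode numbers (so not on Windows, as discussed further down). None of the caveats in the list above are handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if both paths have
# the same device and inode number, 0 otherwise or if either stat fails.
# stat() follows symlinks, so a symlink compares equal to its target.
sub same_dev_ino {
    my ($x, $y) = @_;
    my @sx = stat($x) or return 0;
    my @sy = stat($y) or return 0;
    # Field 0 is the device number, field 1 the inode number.
    # Compare as strings so very large inode numbers are not rounded.
    return ($sx[0] eq $sy[0] && $sx[1] eq $sy[1]) ? 1 : 0;
}
```

Called as `same_dev_ino($x, $y)`, it should report hard links to one file as the same and two separate files with identical content as different.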
      Thank you for the quick reply!
      Perls are normally compiled with 64-bit file access, and you might need that for this trick too, because it makes inode numbers wider.
      In what way could this affect me? I just have to compare the inode numbers (and device numbers, as you pointed out), as they are returned from stat, don't I?
      Device numbers might not be constant over a reboot.
      As I apply the call to stat to both variables in succession, there is no possibility to be interrupted by a reboot.

      -- 
      Ronald Fischer <ynnor@mm.st>
        In what way could this affect me? I just have to compare the inode numbers (and device numbers, as you pointed out), as they are returned from stat, don't I?

        If your perl doesn't support 64-bit inodes, stat will fail in weird ways; in the worst case, it could return erroneous inodes and wrongly report different files as the same.
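
To see how a given perl was built, the core Config module can be queried. The sketch below assumes that the `uselargefiles` and `ivsize` build settings are a reasonable first hint; they do not state the inode width directly, and `file_access_summary` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: summarise two build settings of the running perl.
# uselargefiles is 'define' when perl was built with large file support;
# ivsize is the size of perl's native integer type in bytes.
sub file_access_summary {
    my $large = ($Config{uselargefiles} // '') eq 'define' ? 'yes' : 'no';
    return sprintf 'large file support: %s, integer size: %d bytes',
        $large, $Config{ivsize};
}

print file_access_summary(), "\n";
```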

Re: Detecting whether two pathes refer to the same file
by BioLion (Curate) on Sep 10, 2010 at 09:40 UTC

    A quick look on CPAN turns up File::Same and File::is, both of which do file comparisons (including checking if it is the 'same' file in different locations). There is no specific mention of symlinks, but they use the inode approach for comparisons, so it should be workable.

    File::is is much newer though. I haven't tested/used either, and there may well be others, but this seems to be a problem a lot of other people have thought about, so I am sure a good solution exists out there! Hope this helps!

    Just a something something...
      A quick look on CPAN turns up File::Same and File::is
      File::Same doesn't check whether two paths name the same file, but whether two paths point to (possibly different) files that have the same content.

      File::is tries to verify the identity of the files, but only by inode number, which (as we have learned from ambrus) is not even correct on Unix, and does not work at all on Windows :-(

      -- 
      Ronald Fischer <ynnor@mm.st>
      Thanks a lot, I will have a look at them!

      -- 
      Ronald Fischer <ynnor@mm.st>
Re: Detecting whether two pathes refer to the same file
by BrowserUk (Patriarch) on Sep 10, 2010 at 09:57 UTC

    Inode numbers don't have any meaning and are always 0 on Windows.

    There is an equivalent unique identifier available (the FileIndexHigh/Low fields in the structure returned by GetFileInformationByHandle()), but you'd need to use Win32::API or Inline::C to get at it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Detecting whether two pathes refer to the same file
by cdarke (Prior) on Sep 10, 2010 at 10:53 UTC
    On Windows NTFS there is a similar concept to an inode number, but it is a 64-bit index number split into two 32-bit numbers "Low" and "High". Like inodes, these numbers are only unique within a partition, so you also need to take into account the Volume Serial Number (VSN).

    See Win32::IdentifyFile. The information returned by IdentifyFile is Volume Serial Number, File Index High, File Index Low. I am the author, so let me know if you have any issues.
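
The comparison itself is simple once the three values are available. The sketch below deliberately does not call any Win32 module, because the exact calling convention is not shown in this thread. Instead it assumes a caller-supplied code reference that returns the triple (Volume Serial Number, File Index High, File Index Low) for a path. Both `same_by_file_id` and the `$identify` callback are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: $identify is a code ref supplied by the caller that
# returns (volume serial number, file index high, file index low) for a
# path, or an empty list on failure. Two paths name the same file only if
# all three values agree.
sub same_by_file_id {
    my ($identify, $x, $y) = @_;
    my @id_x = $identify->($x);
    my @id_y = $identify->($y);
    return 0 unless @id_x == 3 && @id_y == 3;
    for my $i (0 .. 2) {
        return 0 unless $id_x[$i] eq $id_y[$i];
    }
    return 1;
}
```

Keeping the lookup behind a callback means the same comparison would work whether the triple comes from Win32::IdentifyFile, Win32::API, or Inline::C.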

    On FAT I don't think you have any options like this.
      On FAT I don't think you have any options like this.

      Right. While it is possible to create a hardlink within a FAT filesystem using a disk editor, most -- if not all -- FAT filesystem checkers consider this a filesystem error. FAT is defined in a way that each file must have exactly one directory entry, and so the fully qualified file name (relative to the filesystem) uniquely identifies a file. With VFAT (i.e. FAT + long names), things become a little bit different, because each path element may have a "long" and a "short" name. Normalizing the paths of two files to use only "short" or only "long" elements gives comparable, unique identifiers.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Would your module also work on files which reside on, say, Linux, but are accessed from Windows via WinNFS, Samba or CIFS? This wouldn't be NTFS either, would it?

      -- 
      Ronald Fischer <ynnor@mm.st>
        I have never tried it, and don't currently have a working Samba link to help. I doubt it, but I have been surprised before. I wouldn't like to bet one way or the other.

        Let me know if you try it!
Re: Detecting whether two pathes refer to the same file
by kejohm (Hermit) on Sep 10, 2010 at 13:20 UTC

    If you are looking to test whether the two paths are referring to the same location (as opposed to the two paths referring to two separate, but identical, files) you could use Cwd::abs_path() on both, then test whether the paths returned are the same. Example:

    #!perl
    use strict;
    use warnings;
    use feature qw(:5.12);
    use Cwd qw(abs_path);

    say abs_path('/foo') eq abs_path('/bar/../foo');

    You could also check out some of the File::* modules for working with files, directories, paths, etc.

    Update: Link fixed.

      It's a good solution, but I can think of two limitations.

      • That will work for symbolic links (on most devices), but not for hard links (other than "." and "..").

        $ echo foo > a
        $ ln a b
        $ cat b
        foo
        $ perl -MCwd=abs_path -E'say abs_path($_) for @ARGV' a b
        /tmp/a
        /tmp/b

        You might be able to check stat's device plus inode fields to address this limitation on some devices on some systems.

      • It also won't necessarily work across devices (since you could access the same file via two devices). The following all refer to the same file:

        C:\Temp\file
        \\?\C:\Temp\file          # Via UNC path
        \\localhost\C$\Temp\file  # Via localhost
        \\tribble\C$\Temp\file    # Via domain name
        \\10.0.0.6\C$\Temp\file   # Via IP address
        \\localhost\share\file    # Via share
        Z:\file                   # Given subst Z: C:\Temp

        In general, this isn't solvable.

        Good points. It depends on whether or not the OP would ever come across those situations.

Re: Detecting whether two pathes refer to the same file
by afoken (Chancellor) on Sep 10, 2010 at 11:53 UTC
    It's not necessary that it works for symlinks too (though, of course, it would be nice).

    Wrong. It must work for symlinks or it will easily break on non-windows systems. Normalising paths with symlinks is relatively easy.
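
A minimal sketch of that normalisation, assuming Cwd::abs_path (already mentioned elsewhere in this thread) is acceptable for resolving symlinks, "." and "..". The helper name `same_resolved_path` is hypothetical. It compares names only, so two hard links to one file still count as different.

```perl
use strict;
use warnings;
use Cwd qw(abs_path);

# Hypothetical helper: returns 1 if both paths resolve to the same
# canonical path name after symlinks, "." and ".." are resolved.
# Returns 0 if either path cannot be resolved.
sub same_resolved_path {
    my ($x, $y) = @_;
    my $rx = abs_path($x);
    my $ry = abs_path($y);
    return 0 unless defined $rx && defined $ry;
    return $rx eq $ry ? 1 : 0;
}
```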

    In Java, there is a function JavaFileManager.isSameFile, which performs exactly this function

    I don't think so. JavaFileManager is an interface, not a class. No code, just conventions. And by the way, JavaFileManager.isSameFile() is a method, not a function. It would require a lot of knowledge about the underlying operating system and the used filesystems. And it would get really complicated with files on network file systems.

    From the documentation at http://download.oracle.com/javase/6/docs/api/javax/tools/JavaFileManager.html#isSameFile%28javax.tools.FileObject,%20javax.tools.FileObject%29:

    isSameFile

    boolean isSameFile(FileObject a, FileObject b)

    Compares two file objects and return true if they represent the same underlying object.

    Parameters:
        a - a file object
        b - a file object
    Returns:
        true if the given file objects represent the same underlying object
    Throws:
        IllegalArgumentException - if either of the arguments were created with another file manager and this file manager does not support foreign file objects

    The method compares objects, not files. It may throw an exception when the objects weren't created by the same file manager, giving you no useful information at all.

    The interface makes no statement about network file systems. What happens when a filesystem is mounted in two different places, e.g. /mnt/a and /mnt/b? Will the method detect that /mnt/a/foo and /mnt/b/foo are the same file? Even if /mnt/a is a directory exported via NFS from one server and /mnt/b is the same directory exported via CIFS from the same server? Even if the NFS server and the CIFS server use different names, different network interfaces and different addresses?

    Another problem is that each part of the filename may be case sensitive, case preserving, or case insensitive, depending on the filesystems. File::Spec does not take that problem into account and assumes, for example, that all filenames on Unix derivatives are always case sensitive (see Re^5: Unify windows filenames).

    So whenever you cross a filesystem border, the rules for normalising and comparing may change, depending on the filesystem itself and the mount options.
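
The per-platform assumption is easy to see in code. The sketch below uses `File::Spec->case_tolerant`, which answers once for the whole operating system rather than per filesystem or per mount, so it illustrates the limitation rather than fixing it. `names_match` is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: compare two path strings, folding case only when
# File::Spec reports the current platform as case tolerant. This is a
# platform-wide guess and ignores the actual filesystem and mount options.
sub names_match {
    my ($x, $y) = @_;
    ($x, $y) = (lc $x, lc $y) if File::Spec->case_tolerant;
    return $x eq $y ? 1 : 0;
}
```

A case-insensitive filesystem mounted on a Unix system would be compared case-sensitively by this helper, which is exactly the kind of mismatch described above.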

    Regarding the Windows API, Re: Detecting whether two pathes refer to the same file shows you a way to handle local NTFS systems. I don't know how Windows handles files on other file systems. Does it emulate the NTFS IDs like Linux does when it mounts a FAT file system? What about ISO9660 (CDROM) and UDF (DVD)? What about network filesystems?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Detecting whether two pathes refer to the same file
by ambrus (Abbot) on Sep 10, 2010 at 21:12 UTC

    <joke>

    $ cat samefile.c
    /* samefile -- decides whether two files given as arguments are the same.
    follows symlinks.  both files must be valid and you must have permission
    to read them, and even then it may fail with an error or hang.
    exit code 0 is same, 1 if different, higher on an error. */
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    int main(int argc, char *argv[]) {
        int d0, d1, sch;
        struct flock lk;
        pid_t ppa, pch;
        if (3 != argc)
            return 10;
        if (-1 == (d0 = open(argv[1], O_WRONLY)))
            return 4;
        if (-1 == (d1 = open(argv[2], O_RDONLY)))
            return 3;
        lk.l_type = F_WRLCK; lk.l_whence = SEEK_SET; lk.l_start = 0; lk.l_len = 1;
        if (-1 == fcntl(d0, F_SETLK, &lk))
            return 5;
        ppa = getpid();
        if (-1 == (pch = fork()))
            return 9;
        if (!pch) {
            lk.l_type = F_RDLCK; lk.l_whence = SEEK_SET; lk.l_start = 0; lk.l_len = 1;
            if (-1 == fcntl(d1, F_GETLK, &lk))
                return 4;
            return !(F_WRLCK == lk.l_type && ppa == lk.l_pid);
        } else {
            if (pch != wait(&sch))
                return 8;
            if (!WIFEXITED(sch))
                return 7;
            return WEXITSTATUS(sch);
        }
    }
    $ gcc -Wall -O -o samefile samefile.c
    $ ./samefile /usr/local/libexec/git-core/git-{diff,pull}; echo $?
    1
    $ ./samefile /usr/local/libexec/git-core/git-{diff,commit}; echo $?
    0
    $

    </joke>

    $ test /usr/local/libexec/git-core/git-diff -ef /usr/local/libexec/git-core/git-pull; echo $?
    1
    $ test /usr/local/libexec/git-core/git-diff -ef /usr/local/libexec/git-core/git-commit; echo $?
    0

    NOTE: test -ef won't follow symlinks

    Update: exercise to the reader: explain why the fork is necessary in the above program.
