Identical Files to Symbolic Links

by PetaMem (Priest)
on Nov 09, 2005 at 11:34 UTC

Category: File-Tools
Author/Contact Info PetaMem Corporation
Description: This is a reimplementation of the diffy script, posted here a long time ago. It is now more targeted at the needs we have when traversing a filesystem and wanting to "compress" identical files into symlinks; run it with an optional directory argument (it defaults to the current directory). It is way faster and behaves much better (no link cascades).
#!/usr/bin/perl -w
# For Emacs: -*- mode:cperl; mode:folding; -*-

# {{{ use block
use strict;

use File::Spec;


# }}}

my $wd = shift @ARGV;

$wd = '.' if(!defined $wd);
$wd = File::Spec->rel2abs($wd);

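# map each file under $wd to its size, then order the names by
# decreasing size (ties broken by name):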
my %size   = map { $_ => -s $_ } @{&files($wd)};
my @fnames = sort { $size{$b} <=> $size{$a} ||
                    $a cmp $b } keys %size;


#
# The only speedup potential I see is to replace the cmp system call with
# something faster ???
# And to get rid of the File::Spec->abs2rel with some leaner/faster code
#
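# Walk the size-ordered list: compare each file against the following
# files of equal size and symlink true duplicates to the first copy.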
while(@fnames) {
  my $f = shift @fnames;

  next if(-l $f);

  for my $f2 (@fnames) {
    next if(-l $f2);
    last if($size{$f} != $size{$f2});

    if(system("cmp -s '$f' '$f2'") == 0) {

      # $f is the file to be LINKED TO, so we need this as a relative path
      # $f2 is the file to be possibly LINKED FROM, this can/should be absolute

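      # split $f2 into its directory part ($1) and filename part ($2):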
      $f2 =~ m|^ ( (?: .* / (?: \.\.?\Z(?!\n) )? )? ) ([^/]*) |xs;
      my $new = File::Spec->abs2rel($f,$1);

      `ln -sf '$new' '$f2'`;
    }
  }
}

# {{{ files
#
sub files {
  my $path = shift;
  my $f;
  my @files = ();

  opendir(DIR, $path);                     # open directory
  for $f (readdir(DIR)) {
    next if($f =~ /^\.\.?$/);     # skip . and ..

    if(-f "$path/$f") {
      push @files, "$path/$f";
    } elsif(-d "$path/$f") {
      push @files, @{&files("$path/$f")};
    }
  }
  closedir DIR;                            # close directory

  return \@files;
}
# }}}


Re: Identical Files to Symbolic Links
by merlyn (Sage) on Nov 09, 2005 at 11:59 UTC
    If possible, you should try to hardlink rather than symlink. It uses even less space, because the two directory entries merely have the same inode number, instead of wasting a block to contain the symlink destination. Also, if you elect to delete or rename one of the files, you might be deleting one that is pointed at by others. And it's also faster to access, because the OS doesn't have to do a new name lookup with the symlink pointer.
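    A minimal sketch of the difference, using Perl's built-in link and symlink (the file names are hypothetical, and the duplicate must be removed before either call):

    my ( $orig, $dup ) = ( 'keep.txt', 'duplicate.txt' );   # hypothetical names
    unlink $dup or die "unlink $dup: $!";

    link $orig, $dup or die "link: $!";   # hard link: $dup now shares
                                          # $orig's inode; no extra block
    # the symlink version would instead be:
    #   symlink 'keep.txt', $dup or die "symlink: $!";   # new inode that
    #                                       # stores the *name* to look up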

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Not to pour Kool-aid in the swimming pool, but aside from being technically correct, are there situations where such things matter? That is, have you found systems where someone said "hmm... things would be 200% faster if we used hard links instead of symlinks"? If so, that must've been one helluva tight spec!

      thor

      Feel the white light, the light within
      Be your own disciple, fan the sparks of will
      For all of us waiting, your kingdom will come

        Not sure why you focussed on the speed. I just included that for completeness. More important is the space and delete/rename problems. But over NFS, long symlinks are very expensive, if you're curious and unaware.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

      Bill Stearns's freedups works well for when you want hard links: http://www.stearns.org/freedups/freedups.pl
      The only reason there are symbolic and not hard links is that I assume(!) hard links cause problems when you tar the fileset.

      In our application, a lot of files are generated, then this dupe-eliminator rushes over them, then the result is tarballed and moved to the production server. I would not want to untar copies instead of references there.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: Identical Files to Symbolic Links
by jdporter (Canon) on Nov 09, 2005 at 16:58 UTC
    while(@fnames) {
      my $f = shift @fnames;
      next if(-l $f);
      for my $f2 (@fnames) {
        next if(-l $f2);

    Correct me if I'm wrong, but it seems to me that you are simply ignoring all existing symlinks. If that is the case, then why not eliminate them from the original list? E.g.

    @fnames = grep { ! -l $_ } @fnames;
    We're building the house of the future together.
Re: Identical Files to Symbolic Links
by jdporter (Canon) on Nov 09, 2005 at 17:18 UTC
    while(@fnames) {
      my $f = shift @fnames;
      for my $f2 (@fnames) {
        last if($size{$f} != $size{$f2});

    That's O(n²)! You'd get a significant boost by making a set of lists of files with the same size:

    my %filesets;
    for ( @fnames ) {
        push @{ $filesets{ -s $_ } }, $_;
    }
    # filter out any lists that have only one element:
    for ( keys %filesets ) {
        @{$filesets{$_}} <= 1 and delete $filesets{$_};
    }

    You could use checksums to get each list down to a set of "highly likely" candidate duplicates:

    my %filesets;
    for ( @fnames ) {
        my $size = -s $_;
        my $csum = `sum "$_"`;
        push @{ $filesets{$size.$csum} }, $_;
    }

    But you'd probably still want to do actual file comparisons (`cmp`) to ensure actual duplicates.
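    A possible final pass over the %filesets built above (a sketch; the list form of system sidesteps shell-quoting trouble):

    # confirm real duplicates within each candidate set:
    for my $set ( values %filesets ) {
        my @candidates = @$set;
        while ( my $f = shift @candidates ) {
            for my $f2 ( @candidates ) {
                next if system( 'cmp', '-s', $f, $f2 ) != 0;
                print "duplicate: $f == $f2\n";
            }
        }
    }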

    We're building the house of the future together.

      sum has to read the entire file anyway, so there’s no gain from checksumming them to decide whether you want to compare them.

      The right method to do this would be to put all the files in one set to start out with, read them byte-for-byte, and whenever files disagree, split the set into one set for each byte value encountered. Whenever a set consists of just one file, you can drop it. When you get to the end of any of the files and still have sets with more than one file in them, each of the sets is a group of identical files.

      Of course, this is unworkable when you have more files than available handles. But in that case, all solutions I can think of (I wrote up and deleted three so far) are ugly as sin. Checksumming will usually get you out of the bind, but in edge cases with a huge number of identical files, that approach is really painful. Hmm.
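      A rough sketch of the set-splitting idea (identical_groups is a hypothetical name; it assumes the candidates were already grouped by equal size, so they all reach end-of-file on the same pass, and that every handle fits in memory):

      sub identical_groups {
          my @names = @_;                      # files of one equal-size set
          my @groups;
          my @start = map {
              open my $fh, '<', $_ or die "open $_: $!";
              binmode $fh;
              [ $_, $fh ];                     # pair up name and handle
          } @names;
          my @queue = ( \@start );
          while ( my $set = shift @queue ) {
              next if @$set < 2;               # singleton sets are dropped
              my ( %split, $eof );
              for my $pair ( @$set ) {
                  my $n = read( $pair->[1], my $chunk, 8192 );
                  defined $n or die "read $pair->[0]: $!";
                  $eof = 1 if $n == 0;
                  push @{ $split{$chunk} }, $pair;   # agreeing files stay together
              }
              if ( $eof ) {                    # all files ended together: done
                  push @groups, [ map { $_->[0] } @$_ ]
                      for grep { @$_ > 1 } values %split;
              }
              else {
                  push @queue, values %split;  # a disagreement splits the set
              }
          }
          return @groups;                      # arrayrefs of identical files
      }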

      Makeshifts last the longest.

        Oh, but there is! It's much better to read each file once (O(n)) than to compare all the pairs of files (O(n²)).

        We're building the house of the future together.
        sum has to read the entire file anyway, so there’s no gain from checksumming them to decide whether you want to compare them.

        Indeed, in my own duplicate-searching script (currently it only deletes duplicates, but I plan to make it more flexible one day) I make clusters of files based on size, since that is much cheaper, and I calculate checksums within clusters to decide whether the files are identical or not. This is not 100% certain, as is well known, but it is enough for me. If I ever decide to make it into a serious thing, I'd add an option for full comparison...

        Funny: it seems just about everybody has rolled his or her own version of this thing...

Re: Identical Files to Symbolic Links
by graff (Chancellor) on Nov 10, 2005 at 04:05 UTC
    Just a few nit-picks to add on top of (well, below) the other replies:
    • When the user happens to enter a command-line arg that is not a usable directory name, nothing happens -- not even an error message to the user saying something like "you were supposed to supply a directory path". A little error checking might help.

    • If you decide to stick with symlinks (rather than merlyn's suggestion of hard links), you could use the perl-internal "symlink" function, instead of shelling out to "ln -s" -- that can save some time, if you end up making a lot of links; see the sketch after this list. (Update: of course, you first have to use "unlink" on the file being replaced, but I'd expect these two calls together are still cheaper than a whole backtick subprocess.)

    • You are doing too many stat calls. You could get by just stat'ing each file once, re-using the stat structure as needed, and keeping info you want to use later; if the loop in the "files()" sub goes like this:
      for $f (grep { ! ( /^\.{1,2}$/ or -l "$path/$_" ) } readdir(DIR)) {
        if ( -f _ ) {
          push @files, "$path/$f " . -s _;
        } elsif ( -d _ ) {
          push @files, @{&files( "$path/$f" )};
        }
      }
      then you have just one stat per file, and the map block in the caller would just be "split" instead of yet another round of stat calls. (perldoc -f stat explains about using the underscore to refer to "the existing stat structure")

      Final update: these really are very minor issues -- they could be "optimizations" in some situations, but probably won't make a noticeable difference in how fast this app goes, given that most of the run time will be spent comparing file contents. Still, if it's easier to write code that runs a little faster, why not write it that way?
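    On the second point, a minimal sketch of replacing the backticked "ln -sf" with the built-ins (relink is a hypothetical helper name):

    # unlink the duplicate, then symlink it to the surviving file, in-process:
    sub relink {
        my ( $target, $file ) = @_;   # relative target, absolute duplicate
        unlink $file
            or do { warn "unlink $file: $!\n"; return };
        symlink $target, $file
            or warn "symlink $target -> $file: $!\n";
    }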

Re: Identical Files to Symbolic Links
by jdporter (Canon) on Nov 10, 2005 at 17:50 UTC

    Here's another rewrite, in which the list partitioning algorithms have been abstracted.

    This doesn't do any rel2abs/abs2rel stuff, which you may find necessary.

    (Note: untested!)

    use strict;
    use warnings;
    use File::Find;
    use List::Util qw( reduce );

    # these partition functions treat the input data as having
    # already been partitioned; they partition the data further.

    sub partition_by_calculated_key(&@)
    {
        my $cr = shift;
        map {
            my %h;
            push @{ $h{ &$cr } }, $_ for @$_;
            values %h
        } @_
    }

    sub partition_by_comparison_function(&@)
    {
        my $cr = shift;
        map {
            my %h;
            for ( @$_ )
            {
                my $added;    # reset for each file
                for my $seen ( keys %h )
                {
                    local( $a, $b ) = ( $_, $seen );
                    if ( $cr->( $_, $seen ) )    # returns true if "equal"
                    {
                        push @{ $h{$seen} }, $_;
                        $added = 1;
                        last;
                    }
                }
                $added or push @{ $h{$_} }, $_;
            }
            values %h
        } @_
    }

    #####################################

    my @root = @ARGV;
    @root or @root = ('.');

    my @files;
    find( sub { push @files, $File::Find::name }, $_ ) for @root;

    # eliminate directory names:
    @files = grep { ! -d $_ } @files;
    # eliminate existing symlinks:
    @files = grep { ! -l $_ } @files;

    my @a = ( \@files );

    # first do the cheap one:
    @a = partition_by_calculated_key { -s $_ } @a;
    @a = grep { @$_ > 1 } @a;    # filter out singleton lists

    # then the expensive one:
    @a = partition_by_calculated_key { qx( sum "$_" ) } @a;
    @a = grep { @$_ > 1 } @a;    # filter out singleton lists

    # now group those which are "equal":
    @a = partition_by_comparison_function { ! system qq( cmp -s "$a" "$b" ) } @a;
    @a = grep { @$_ > 1 } @a;    # filter out singleton lists

    # finally, hardlink the files in each group (-f replaces the existing dupe):
    reduce { system qq( ln -f "$a" "$b" ); $b } @$_ for @a;

    One of the fun things about this functional approach is that the main program can be re-ordered as follows:

    map { reduce { system qq( ln -f "$a" "$b" ); $b } @$_ }
    grep { @$_ > 1 }
    partition_by_comparison_function { ! system qq( cmp -s "$a" "$b" ) }
    grep { @$_ > 1 }
    partition_by_calculated_key { qx( sum "$_" ) }
    grep { @$_ > 1 }
    partition_by_calculated_key { -s $_ }
    [ grep { ! -l $_ } grep { ! -d $_ } @files ]
    We're building the house of the future together.
