PerlMonks
Removing duplicate filenames in different directories from an array

by Anonymous Monk
on Jan 30, 2011 at 20:15 UTC [id://885156]

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have an array of filenames that looks something like this:
/data/node12/file-29-2.txt
/data/node12/file-34-2.txt
/data/node12/file-50-2.txt
/data/node30/file-34-2.txt
/data/node30/file-60-2.txt
/data/node30/file-62-2.txt
/data/node34/file-29-2.txt
etc. I want to remove duplicates from this array, in the sense that files with the same -##- number are identical even when they live in different directories. So in the example above, I would want to eliminate /data/node30/file-34-2.txt and /data/node34/file-29-2.txt. I can think of ways to do this, but they are probably inefficient, and since the actual array contains ~10^6 filenames, efficiency matters. I believe there is an easy way to do this with hashes, but I can't remember it. Thanks!

Replies are listed 'Best First'.
Re: Removing duplicate filenames in different directories from an array
by wind (Priest) on Jan 30, 2011 at 20:29 UTC
    The code can probably be condensed, but this will get you what you want.
    use strict;
    use File::Basename qw(basename);

    my @files = qw(
        /data/node12/file-29-2.txt
        /data/node12/file-34-2.txt
        /data/node12/file-50-2.txt
        /data/node30/file-34-2.txt
        /data/node30/file-60-2.txt
        /data/node30/file-62-2.txt
        /data/node34/file-29-2.txt
    );

    my %seen = ();
    foreach my $file (@files) {
        my $bn = basename($file);
        if ( $bn =~ /(\d+)/ ) {
            my $id = $1;
            if ( $seen{$id}++ ) {
                print "$file needs to be deleted\n";
            }
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    - Miller
      ... condensed ...
      >perl -wMstrict -le
      "my @filenames = qw(
          /data/node12/file-29-2.txt
          /data/node12/file-34-2.txt
          /data/node12/file-50-2.txt
          /data/node30/file-34-2.txt
          /data/node30/file-60-2.txt
          /data/node30/file-62-2.txt
          /data/node34/file-29-2.txt
          );
      ;;
      my %seen;
      my @unique = grep { m{ (-\d\d-) }xms; !$seen{$1}++ } @filenames;
      print qq{'$_'} for @unique;
      "
      '/data/node12/file-29-2.txt'
      '/data/node12/file-34-2.txt'
      '/data/node12/file-50-2.txt'
      '/data/node30/file-60-2.txt'
      '/data/node30/file-62-2.txt'

      If the file names are in a file with one name per line, this could even be a one-liner.

Re: Removing duplicate filenames in different directories from an array
by chrestomanci (Priest) on Jan 30, 2011 at 21:15 UTC

    You could put the filenames to keep in an array indexed by the -##- number. Perl will waste a bit of space on empty slots, but so long as the largest number is not huge it should be efficient.

    my @filenames;
    foreach my $file (@files) {
        if ( $file =~ m:/file-(\d+)-2\.txt$: ) {
            $filenames[$1] = $file;
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    foreach my $file (@filenames) {
        print "$file\n" if defined $file;
    }

    This method will silently discard duplicates, which may or may not be a problem for what you are trying to do, but it will be fast.

Node Type: perlquestion [id://885156]
Approved by toolic