Re: Duplicates in Directories

If I'm reading your description correctly, you are passing over the list of files two or three times (possibly doing steps 1 & 2 in one pass). Step 3 also does a linear search over the file names for each file, which makes the overall processing quadratic in the number of files.

Maybe try to get all the info in one pass, store the details in a hash structure, and then iterate over that hash. Hash lookups handle the exact matches for you and avoid the linear searches.

use strict;
use warnings;

#  I assume the actual code will obtain this list using
#  glob or similar
my @allfiles = qw /
baz.txt
baz.epub
baz.doc
baz.pdf
bar.epub
boo.epub
boo.txt
/;

my %by_ext;
#  %by_fbase is actually redundant below,
#  but is maybe useful for other things
#  so I have left it in
my %by_fbase;  

foreach my $file (@allfiles) {
    #  should use a proper filename parser here
    #  like File::Basename, but a split will serve
    #  for the purposes of an example.
    my ($name, $ext) = split /\./, $file;
    
    $by_ext{$ext}{$name} = $file;
    $by_fbase{$name}{$ext} = $file;
}

foreach my $name (keys %by_fbase) {
    #print "$name\n";
    no autovivification;  #  CPAN module, not in core
    #  could use exists in this check if you want to avoid autoviv,
    #  but file names should evaluate to true if they have
    #  an extension, even if the name part evaluates to false
    if ($by_fbase{$name}{pdf} && $by_fbase{$name}{epub}) {
        #  do stuff
        print "$name has epub and pdf extensions: "
          . "$by_fbase{$name}{epub} $by_fbase{$name}{pdf}\n";
        #  now do stuff like moving files since you can iterate over
        #  the values of the relevant subhash
        foreach my $file (values %{$by_fbase{$name}}) {
            print "now do something to $file\n";
        }
        
    }
}

That code prints:

baz has epub and pdf extensions: baz.epub baz.pdf
now do something to baz.epub
now do something to baz.doc
now do something to baz.txt
now do something to baz.pdf

Update: Edited incomplete comment starting with "now do stuff"
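
A minimal sketch of the File::Basename route mentioned in the code comments, for file names that contain extra dots or directory parts. The file list is made up for illustration, and using exists in the check is just one option.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical file list; names with extra dots or directory parts
# are the cases a bare split on "." gets wrong.
my @allfiles = (
    'baz.pdf',
    'baz.epub',
    'some.author - a.title.epub',
    'some.author - a.title.pdf',
    'books/boo.epub',
);

my %by_fbase;
foreach my $file (@allfiles) {
    # The suffix pattern picks up only the last ".ext" component.
    my ($name, $dir, $suffix) = fileparse($file, qr/\.[^.]*/);
    (my $ext = lc $suffix) =~ s/^\.//;
    $by_fbase{$name}{$ext} = $file;
}

foreach my $name (sort keys %by_fbase) {
    # exists avoids relying on the truthiness of the stored value
    next unless exists $by_fbase{$name}{pdf} && exists $by_fbase{$name}{epub};
    print "$name has both: $by_fbase{$name}{epub} $by_fbase{$name}{pdf}\n";
}
```

Note that the directory part is dropped from the key here, so files with the same name in different directories would collide. Include $dir in the key if that matters.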


Re^2: Duplicates in Directories by kel (Sexton) on Oct 10, 2017 at 07:57 UTC
First, thank you all for your suggestions. The problem has been one of algorithm. I am iterating @select files from an @allfile loop, and parsing for equality conditions. As the actual code is over 300 lines, I have included an edited snippet.

This code is derived from an earlier script where I needed to parse for regexes in files, not necessarily exact matches, and not necessarily at the beginning. Parsing @selectexpr against @allfiles made sense there.

Hashes are an excellent idea. With them I can parse foo-bar-baz.doc as a hash directly against all foo keys, with proper splitting and filtering, of course. This would allow me to scale up more efficiently. I would, however, prefer if possible to keep the matching to a regexp rather than an equality operator. Please ignore syntax errors in the code below; it has been abbreviated.

if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

For an author-title pair, the matching would be done in the title (value) portion rather than the key, which would be expected to be identical (though there might be exceptions). I need to hit the books on hashes here, as I haven't really dealt much with them outside of a 20,000+ listing database with about two dozen hash fields.

opendir(DIR, $dir2 ) or die $!;
while ( $file = readdir(DIR)) {
    if (-f $file) {    # read only files
        chomp($file);
        $file =~ s/^\s+\|\s+$//g;
        $filenam = "" ;
        push ( @srcarray, $file) ;
        if ($file =~ m/\.mobi$/ig ) { &typefiles($file, "mobifile"); }
        if ($file =~ m/\.azw3$/ig ) { &typefiles($file, "azw3file"); }

sub typefiles( $tfile , $filetype ) {
    ($tfile, $filetype ) = @_ ;
    if ($filetype eq "mobifile" ) { push ( @mobiarray, $file) ; }    # End mobifiles

# Main body - parsing directory listing and performing actions
foreach $authf (@srcarray){
    if ($authf =~ m/\.pl$/) { next; }
    if ($authf =~ m/\.epub/ig ) {
        our $authf2 = $authf ;
        foreach my $myfilt (@mobiarray){
            my $mymobi = $myfilt;
            my $myepub = $authf2;
            $mymobi = &extfilter($mymobi);
            $myepub = &extfilter($myepub);

sub extfilter($line) {
    ($line) = @_;
    $line =~ s/\.mobi//ig ;
    $line =~ s/\.epub//ig ;
    $line =~ s/^\s+\|\s+$//g;
    $line = lc $line;
    return $line;
}
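
One possible way to combine the hash idea with regexp matching is to index the files by a normalised base name in a single pass, use key lookups for the exact duplicates, and apply a regexp to the keys for the looser cases. This is only a sketch: the file names, the %by_name structure and the author pattern are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical directory listing; the real script would get this
# from readdir or glob.
my @allfiles = (
    'Foo - Bar Baz.epub',
    'foo - bar baz.mobi',
    'Foo - Other Title.mobi',
    'Quux - Something.epub',
);

# One pass: index every file by its trimmed, lower-cased base name.
my %by_name;
foreach my $file (@allfiles) {
    my ($name, $ext) = $file =~ /\A(.*)\.([^.]+)\z/s or next;
    $name =~ s/^\s+|\s+$//g;
    push @{ $by_name{lc $name}{lc $ext} }, $file;
}

# Exact duplicates: same base name in both formats, found by key lookup.
my @exact = grep { $by_name{$_}{epub} && $by_name{$_}{mobi} } sort keys %by_name;

# Looser matching: a regexp applied to the keys, one scan per pattern.
my $author = qr/\Afoo\b/;
my @by_author = grep { $_ =~ $author } sort keys %by_name;

print "exact duplicate: $_\n"        for @exact;
print "matches author pattern: $_\n" for @by_author;
```

The regexp scan is still linear in the number of keys, but it runs once per pattern rather than once per pair of files.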
Re^3: Duplicates in Directories by hippo (Bishop) on Oct 10, 2017 at 08:20 UTC
if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

No sample data means no solution. Here's the SSCCE you could have provided:

use strict;
use warnings;
use Test::More tests => 2;

my $mymobi = 'Hello World!';
my $myepub = 'Hello World!';
ok ($mymobi =~ m/($myepub)/);
ok ($mymobi eq $myepub);

See how both the string equality and the regular expression matches are true? So they both "work". Your task is now to provide the values for $mymobi and $myepub for which one or other doesn't match. At that point it should become clear to you what the difference between an exact string match and a regular expression match is (and why one or the other is preferable in different situations, because they serve different purposes).
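
As an illustration of the kind of values that separate the two tests, here is a sketch with a made-up title containing parentheses. Whether this is what happens with the original data is an assumption, since no sample names were posted; it is simply one common way for an interpolated regexp to fail where eq succeeds.

```perl
use strict;
use warnings;

# Hypothetical titles; the parentheses are the important part.
my $mymobi = 'dune (book 1)';
my $myepub = 'dune (book 1)';

my $eq_match = $mymobi eq $myepub ? 1 : 0;

# Interpolated as-is, "(" and ")" become a capture group rather
# than literal characters, so the pattern expects "dune book 1".
my $raw_match = $mymobi =~ m/($myepub)/ ? 1 : 0;

# \Q...\E escapes the metacharacters so they match literally.
my $quoted_match = $mymobi =~ m/(\Q$myepub\E)/ ? 1 : 0;

print "eq: $eq_match, raw regex: $raw_match, quoted regex: $quoted_match\n";
```

This prints "eq: 1, raw regex: 0, quoted regex: 1". Other metacharacters in file names, such as +, ? or [ ], can cause similar mismatches.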
Re^3: Duplicates in Directories by swl (Parson) on Oct 11, 2017 at 07:52 UTC
You're welcome. However, it is unclear to me why, given you want to use regexp matching, your regexp match apparently does not work and exact equality does:

if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

The regexp will match anything containing your title, so for a title like "blert" you will be matching all of "blert", "blertblartblort", "foobarblertbaz" etc. Perhaps you need to filter the file names for possible partial matches when you read them? Or if you know there are spelling errors then have a look at Text::Fuzzy and similar. Even then you would perhaps be best to flag them somewhere for cleanup or modification before automated processing.

Some other points are:

There is no need to call your subroutines using the `&foo()` notation unless your perl is very old. `foo()` will work fine in your case.

You seem not to really be using subroutine signatures, so

sub typefiles( $tfile , $filetype ) {
    ($tfile, $filetype ) = @_ ;
    #etc...
}

can simply be

sub typefiles {
    ($tfile, $filetype ) = @_ ;
    # etc...
}
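
To make the partial-match point concrete, here is a small sketch comparing an unanchored regexp with one anchored at both ends. The sub names same_title and contains_title are made up for the example, and the subs take their arguments into lexicals with my, which is a suggestion beyond what the original snippet does.

```perl
use strict;
use warnings;

# True when $candidate contains $title anywhere (case-insensitive).
sub contains_title {
    my ($title, $candidate) = @_;
    return $candidate =~ m/\Q$title\E/i ? 1 : 0;
}

# True only when $candidate is the whole of $title (case-insensitive);
# \A and \z anchor the match to the start and end of the string.
sub same_title {
    my ($title, $candidate) = @_;
    return $candidate =~ m/\A\Q$title\E\z/i ? 1 : 0;
}

my $title = 'blert';
for my $candidate (qw(blert Blert blertblartblort foobarblertbaz)) {
    printf "%-16s contains=%d same=%d\n",
        $candidate,
        contains_title($title, $candidate),
        same_title($title, $candidate);
}
```

All four candidates report contains=1, but only "blert" and "Blert" report same=1.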
