PerlMonks  

Re: Duplicates in Directories

by swl (Parson)
on Oct 09, 2017 at 10:21 UTC ( [id://1200979]=note )


in reply to Duplicates in Directories

If I'm reading your description correctly, you are passing over the list of files two or three times (possibly doing steps 1 & 2 in one pass). Step 3 also does a linear search over the file names, which is going to go quadratic in terms of processing.

Maybe try to get all the info in one pass, store the details in a hash structure, and then iterate over that hash. That will take care of the need for exact matches and avoid linear searches.

use strict;
use warnings;

# I assume the actual code will obtain this list using
# glob or similar
my @allfiles = qw /
    baz.txt baz.epub baz.doc baz.pdf
    bar.epub boo.epub boo.txt
/;

# %by_ext is actually redundant below,
# but is maybe useful for other things
# so I have left it in
my %by_ext;
my %by_fbase;

foreach my $file (@allfiles) {
    # should use a proper filename parser here
    # like File::Basename, but a split will serve
    # for the purposes of an example.
    my ($name, $ext) = split /\./, $file;
    $by_ext{$ext}{$name}   = $file;
    $by_fbase{$name}{$ext} = $file;
}

foreach my $name (keys %by_fbase) {
    #print "$name\n";
    no autovivification;
    # could use exists in this check if you want to avoid autoviv,
    # but file names should evaluate to true if they have
    # an extension, even if the name part evaluates to false
    if ($by_fbase{$name}{pdf} && $by_fbase{$name}{epub}) {
        # do stuff
        print "$name has epub and pdf extensions: "
            . "$by_fbase{$name}{epub} $by_fbase{$name}{pdf}\n";
        # now do stuff like moving files since you can iterate over
        # the values of the relevant subhash
        foreach my $file (values %{$by_fbase{$name}}) {
            print "now do something to $file\n";
        }
    }
}

That code prints:
baz has epub and pdf extensions: baz.epub baz.pdf
now do something to baz.epub
now do something to baz.doc
now do something to baz.txt
now do something to baz.pdf
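
The comment in the code above suggests File::Basename rather than a bare split. A minimal sketch of that variant follows, assuming only the final extension matters (so "foo.bar.epub" has the base name "foo.bar"). The helper name group_by_basename is made up for illustration and is not part of the original post.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Group file names by base name, then by lower-cased extension.
# group_by_basename is a hypothetical helper name.
sub group_by_basename {
    my @files = @_;
    my %by_fbase;
    for my $file (@files) {
        # The suffix pattern captures only the last ".ext" part.
        my ($name, $dir, $suffix) = fileparse($file, qr/\.[^.]*/);
        next if $suffix eq '';              # skip files without an extension
        my $ext = lc substr($suffix, 1);    # drop the leading dot
        $by_fbase{$name}{$ext} = $file;
    }
    return \%by_fbase;
}

my $groups = group_by_basename(qw(baz.txt baz.epub baz.pdf foo.bar.epub boo.txt));
for my $name (sort keys %$groups) {
    # $name is already a key, so exists on the inner level does not autovivify
    next unless exists $groups->{$name}{epub} && exists $groups->{$name}{pdf};
    print "$name has epub and pdf\n";
}
```

Using exists here removes the dependency on the autovivification pragma, which is a CPAN module rather than part of core Perl.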

Update: Edited incomplete comment starting with "now do stuff"

Replies are listed 'Best First'.
Re^2: Duplicates in Directories
by kel (Sexton) on Oct 10, 2017 at 07:57 UTC

    First, thank you all for your suggestions. The problem has been one of algorithm. I am iterating @select files from an @allfile loop, and parsing for equality conditions.

    As the actual code is over 300 lines, I have included an edited snippet. This code is derived from an earlier script where I needed to parse for regexes in files, not necessarily exact matches, and not necessarily at the beginning. Parsing @selectexpr against @allfiles made sense there.

    Hashes are an excellent idea. With them I can parse foo-bar-baz.doc as a hash directly against all foo keys, with proper splitting and filtering, of course. This would allow me to scale up more efficiently.

    I would, however, prefer, if possible, to keep the matching to a regexp rather than an equality operator.


    Please ignore syntax errors in the code below; it has been abbreviated.

    if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
    if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

    For an author-title pair, the matching would be done in the title (value) portion rather than the key, which would be expected to be identical (though there might be exceptions).
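
One way the author-title idea could look is sketched below. It assumes file names of the form "author-title.ext" split at the first hyphen, which is only a guess at the real naming scheme. The helper names index_by_author and find_titles are hypothetical. The author is an exact hash key, and the regexp is applied only to the titles stored under that key.

```perl
use strict;
use warnings;

# Hypothetical layout: "author-title.ext", split at the first hyphen.
sub index_by_author {
    my @files = @_;
    my %titles_by_author;
    for my $file (@files) {
        (my $stem = lc $file) =~ s/\.[^.]+\z//;       # drop the extension
        my ($author, $title) = split /-/, $stem, 2;   # limit 2 keeps hyphens in the title
        next unless defined $title;                   # no hyphen, so no author-title pair
        push @{ $titles_by_author{$author} }, { title => $title, file => $file };
    }
    return \%titles_by_author;
}

# Return the files by one author whose title matches the given regexp.
sub find_titles {
    my ($index, $author, $title_re) = @_;
    return map  { $_->{file} }
           grep { $_->{title} =~ $title_re }
           @{ $index->{ lc $author } || [] };
}

my $index = index_by_author(qw(foo-bar-baz.doc foo-bar-baz.epub foo-quux.pdf other-bar.txt));
print "$_\n" for find_titles($index, 'foo', qr/^bar/);
```

The hash lookup narrows the candidates to one author in constant time, so the regexp only runs over that author's titles instead of the whole directory listing.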

    I need to hit the books on hashes here, as I haven't really dealt much with them outside of a 20,000+ listing database with about 2 dozen hash fields.

    opendir(DIR, $dir2 ) or die $!;
    while ( $file = readdir(DIR)) {
      if (-f $file) { # read only files
        chomp($file);
        $file =~ s/^\s+|\s+$//g;
        $filenam = "" ;
        push ( @srcarray, $file) ;
        if ($file =~ m/\.mobi$/ig ) { &typefiles($file, "mobifile"); }
        if ($file =~ m/\.azw3$/ig ) { &typefiles($file, "azw3file"); }

    sub typefiles( $tfile , $filetype ) {
      ($tfile, $filetype ) = @_ ;
      if ($filetype eq "mobifile" ) { push ( @mobiarray, $file) ; } # End mobifiles

    # Main body - parsing directory listing and performing actions
    foreach $authf (@srcarray){
      if ($authf =~ m/\.pl$/) { next; }
      if ($authf =~ m/\.epub/ig ) {
        our $authf2 = $authf ;
        foreach my $myfilt (@mobiarray){
          my $mymobi = $myfilt;
          my $myepub = $authf2;
          $mymobi = &extfilter($mymobi);
          $myepub = &extfilter($myepub);

    sub extfilter($line) {
      ($line) = @_;
      $line =~ s/\.mobi//ig ;
      $line =~ s/\.epub//ig ;
      $line =~ s/^\s+|\s+$//g;
      $line = lc $line;
      return $line;
    }

      if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
      if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

      No sample data means no solution. Here's the SSCCE you could have provided:

      use strict;
      use warnings;
      use Test::More tests => 2;

      my $mymobi = 'Hello World!';
      my $myepub = 'Hello World!';
      ok ($mymobi =~ m/($myepub)/);
      ok ($mymobi eq $myepub);

      See how both the string equality and the regular expression matches are true? So they both "work". Your task is now to provide the values for $mymobi and $myepub for which one or other doesn't match. At that point it should become clear to you what the difference between an exact string match and a regular expression match is (and why one or the other is preferable in different situations - because they serve different purposes).
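
As one possible pair of values where the two tests disagree, consider a title containing regexp metacharacters. The title below is invented for illustration, since the thread never shows the failing data, so this is a guess at the cause rather than a diagnosis. Interpolated into m/.../, the parentheses become a capture group instead of literal characters, so the pattern no longer matches its own source string. Quoting with \Q...\E makes the interpolated text literal again.

```perl
use strict;
use warnings;

# Hypothetical title containing regexp metacharacters.
my $mymobi = 'perl basics (2nd ed.)';
my $myepub = 'perl basics (2nd ed.)';

my $eq_matches     = ($mymobi eq $myepub)          ? 1 : 0;
my $raw_matches    = ($mymobi =~ m/($myepub)/)     ? 1 : 0;   # parens act as a group
my $quoted_matches = ($mymobi =~ m/(\Q$myepub\E)/) ? 1 : 0;   # parens are literal

print "eq: $eq_matches, raw regex: $raw_matches, quoted regex: $quoted_matches\n";
# prints "eq: 1, raw regex: 0, quoted regex: 1"
```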

      You're welcome. However, it is unclear to me why, given you want to use regexp matching, your regexp match apparently does not work and exact equality does:

      if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
      if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

      The regexp will match anything containing your title, so for a title like "blert" you will be matching all of "blert", "blertblartblort", "foobarblertbaz" etc. Perhaps you need to filter the file names for possible partial matches when you read them? Or if you know there are spelling errors then have a look at Text::Fuzzy and similar. Even then you would perhaps be best to flag them somewhere for cleanup or modification before automated processing.
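
If the goal is to keep a regexp but stop it matching in the middle of other names, anchoring is one option. The sketch below contrasts an unanchored match with a whole-string match and a prefix match. The helper names is_same_title and starts_with_title, and the choice of whitespace, underscore and hyphen as separators, are assumptions for illustration only.

```perl
use strict;
use warnings;

# Whole-string, case-insensitive, literal comparison done with a regexp.
sub is_same_title {
    my ($candidate, $title) = @_;
    return $candidate =~ m/\A\Q$title\E\z/i ? 1 : 0;
}

# Title at the start, followed by end of string or a separator character.
sub starts_with_title {
    my ($candidate, $title) = @_;
    return $candidate =~ m/\A\Q$title\E(?:\z|[\s_-])/i ? 1 : 0;
}

for my $name (qw(blert blertblartblort foobarblertbaz blert-v2)) {
    printf "%-16s contains=%d exact=%d prefix=%d\n",
        $name,
        ($name =~ /blert/ ? 1 : 0),
        is_same_title($name, 'blert'),
        starts_with_title($name, 'blert');
}
```

The \A and \z anchors pin the match to the start and end of the string, and \Q...\E keeps any punctuation in the title from being read as pattern syntax.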

      Some other points are:


      There is no need to call your subroutines using the &foo() notation unless your perl is very old. foo() will work fine in your case.


      You seem not to really be using subroutine signatures, so

      sub typefiles( $tfile , $filetype ) {
          ($tfile, $filetype ) = @_ ;
          #etc...
      }

      can simply be

      sub typefiles {
          ($tfile, $filetype ) = @_ ;
          # etc...
      }
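
If signatures are actually wanted, a minimal sketch looks like the following. It assumes Perl 5.20 or later, where the signatures feature is available (it was experimental until 5.36). The name classify_file and the %files_by_type hash are hypothetical stand-ins for typefiles and the per-type arrays.

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';   # silences the warning on Perl 5.20 to 5.34

my %files_by_type;

# classify_file is a hypothetical replacement for typefiles.
# The signature declares lexical parameters, so no unpacking of @_ is needed.
sub classify_file ($file, $filetype) {
    push @{ $files_by_type{$filetype} }, $file;
    return scalar @{ $files_by_type{$filetype} };   # count stored so far
}

classify_file('one.mobi', 'mobifile');
classify_file('two.mobi', 'mobifile');
print "mobifile: @{ $files_by_type{mobifile} }\n";
```

With a signature in place, calling the sub with the wrong number of arguments dies at run time instead of silently leaving a parameter undefined.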
