Re^2: Duplicates in Directories

First, thank you all for your suggestions. The problem has been one of algorithm. I am iterating over @select files inside an @allfile loop, and parsing for equality conditions.

As the actual code is over 300 lines, I have included an edited snippet. This code is derived from an earlier script where I needed to parse for regexes in files, not necessarily exact matches, and not necessarily at the beginning. Parsing @selectexpr against @allfiles made sense there.

Hashes are an excellent idea. With them I can parse foo-bar-baz.doc as a hash directly against all foo keys, with proper splitting and filtering, of course. This would allow me to scale up more efficiently.
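A minimal sketch of the hash idea, assuming a duplicate means "same name once the extension is dropped and case is ignored". The helper names `name_key` and `find_duplicates` and the sample file names are hypothetical, not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a file name to a comparison key by
# dropping the extension, trimming whitespace and lower-casing.
sub name_key {
    my ($file) = @_;
    (my $key = $file) =~ s/\.(?:mobi|epub|azw3)$//i;
    $key =~ s/^\s+|\s+$//g;
    return lc $key;
}

# Group file names by key; any key holding more than one name
# is a set of duplicates.
sub find_duplicates {
    my @files = @_;
    my %by_key;
    push @{ $by_key{ name_key($_) } }, $_ for @files;

    my %dups;
    for my $key (keys %by_key) {
        $dups{$key} = $by_key{$key} if @{ $by_key{$key} } > 1;
    }
    return \%dups;
}

# Made-up sample names for illustration only.
my $dups = find_duplicates('Foo-Bar-Baz.mobi', 'foo-bar-baz.epub', 'Other.mobi');
for my $key (sort keys %$dups) {
    print "DUPLICATE FOUND: @{ $dups->{$key} }\n";
}
```

Each file is visited once, so this avoids comparing every epub against every mobi in a nested loop.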

I would, however, prefer if possible to keep the matching to a regexp rather than an equality operator.

Please ignore syntax errors in the code below; it has been abbreviated.

if ($mymobi =~ m/($myepub)/) {
    print "DUPLICATE FOUND !\n";
    &movetodir($myfilt, $dupdir);
}
# Does NOT work

if ($mymobi eq $myepub) {
    print "DUPLICATE FOUND !\n";
    &movetodir($myfilt, $dupdir);
}
# Works

For an author-title pair, the matching would be done in the title (value) portion rather than the key, which would be expected to be identical (though there might be exceptions).

I need to hit the books on hashes here, as I haven't really dealt much with them outside of a 20,000+ listing database with about 2 dozen hash fields.

opendir(DIR, $dir2) or die $!;
while ($file = readdir(DIR)) {
    if (-f $file) {    # read only files
        chomp($file);

        $file =~ s/^\s+|\s+$//g;
        $filenam = "";
        push(@srcarray, $file);
        if ($file =~ m/\.mobi$/ig) {
            &typefiles($file, "mobifile");
        }

        if ($file =~ m/\.azw3$/ig) {
            &typefiles($file, "azw3file");
        }


sub typefiles( $tfile , $filetype ) {
    ($tfile, $filetype) = @_;
    if ($filetype eq "mobifile") {
        push(@mobiarray, $file);
    } # End mobifiles

# Main body - parsing directory listing and performing actions
foreach $authf (@srcarray) {

    if ($authf =~ m/\.pl$/) {
        next;
    }

    if ($authf =~ m/\.epub/ig) {
        our $authf2 = $authf;

        foreach my $myfilt (@mobiarray) {
            my $mymobi = $myfilt;
            my $myepub = $authf2;

            $mymobi = &extfilter($mymobi);
            $myepub = &extfilter($myepub);


sub extfilter($line) {
    ($line) = @_;
    $line =~ s/\.mobi//ig;
    $line =~ s/\.epub//ig;
    $line =~ s/^\s+|\s+$//g;
    $line = lc $line;
    return $line;
}
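One thing worth checking in the directory scan: readdir returns bare file names, so `-f $file` only tests the right file when the script runs from inside `$dir2`. The sketch below prepends the directory before testing and uses `fileparse` from File::Basename to split off the extension. The helper name `files_by_type`, the temporary directory and the file names are assumptions made for the demo, not part of the original script.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: list plain files in $dir, grouped by lower-cased extension.
sub files_by_type {
    my ($dir) = @_;
    my %by_type;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    while (defined(my $file = readdir $dh)) {
        # readdir returns bare names, so -f needs the directory prepended
        next unless -f File::Spec->catfile($dir, $file);
        my (undef, undef, $suffix) = fileparse($file, qr/\.[^.]*/);
        push @{ $by_type{ lc $suffix } }, $file;
    }
    closedir $dh;
    return \%by_type;
}

# Demo against a throwaway directory with made-up file names.
my $dir = tempdir(CLEANUP => 1);
for my $name ('One.mobi', 'one.epub', 'Two.MOBI') {
    open(my $fh, '>', File::Spec->catfile($dir, $name)) or die $!;
    close $fh;
}
my $types = files_by_type($dir);
print "$_: @{[ sort @{ $types->{$_} } ]}\n" for sort keys %$types;
```

Grouping by extension in one hash replaces the separate @mobiarray-style arrays and the type-string argument.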


Replies are listed 'Best First'.
Re^3: Duplicates in Directories by hippo (Bishop) on Oct 10, 2017 at 08:20 UTC
    if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ;
    &movetodir($myfilt,$dupdir );     }
    #Does NOT work

    if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ;
    &movetodir($myfilt,$dupdir );     }
    #Works

No sample data means no solution. Here's the SSCCE you could have provided:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $mymobi = 'Hello World!';
    my $myepub = 'Hello World!';

    ok ($mymobi =~ m/($myepub)/);
    ok ($mymobi eq $myepub);

See how both the string equality and the regular expression matches are true? So they both "work". Your task is now to provide the values for $mymobi and $myepub for which one or other doesn't match. At that point it should become clear to you what the difference between an exact string match and a regular expression match is (and why one or the other is preferable in different situations - because they serve different purposes).
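One possible pair of values, offered only as a guess since no sample data was posted: file names that contain regex metacharacters such as parentheses. When `$myepub` is interpolated into the pattern unquoted, those characters are read as regex syntax rather than literal text. The strings below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample values: identical strings containing regex metacharacters.
my $mymobi = 'hello world (2nd ed.)';
my $myepub = 'hello world (2nd ed.)';

# Exact string comparison.
my $eq_match = $mymobi eq $myepub ? 1 : 0;

# Unquoted interpolation: the parentheses become a capture group,
# so the pattern no longer contains literal "(" and ")".
my $raw_match = $mymobi =~ m/($myepub)/ ? 1 : 0;

# \Q...\E escapes metacharacters so the title is matched literally.
my $quoted_match = $mymobi =~ m/(\Q$myepub\E)/ ? 1 : 0;

print "eq: $eq_match, raw regex: $raw_match, quoted regex: $quoted_match\n";
```

With these values the string comparison and the quoted regex succeed while the unquoted regex fails, which would fit the "eq works, regex does not" symptom if the real file names look similar.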
Re^3: Duplicates in Directories by swl (Parson) on Oct 11, 2017 at 07:52 UTC
You're welcome. However, it is unclear to me why, given you want to use regexp matching, your regexp match apparently does not work and exact equality does:

    if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ;
    &movetodir($myfilt,$dupdir );     }
    #Does NOT work

    if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ;
    &movetodir($myfilt,$dupdir );     }
    #Works

The regexp will match anything containing your title, so for a title like "blert" you will be matching all of "blert", "blertblartblort", "foobarblertbaz" etc. Perhaps you need to filter the file names for possible partial matches when you read them? Or if you know there are spelling errors then have a look at Text::Fuzzy and similar. Even then you would perhaps be best to flag them somewhere for cleanup or modification before automated processing.

Some other points are:

There is no need to call your subroutines using the `&foo()` notation unless your perl is very old. `foo()` will work fine in your case.

You seem not to really be using subroutine signatures, so

    sub typefiles( $tfile , $filetype ) {
        ($tfile, $filetype ) = @_ ;
        #etc...
    }

can simply be

    sub typefiles {
        ($tfile, $filetype ) = @_ ;
        # etc...
    }
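To illustrate both points, here is a small sketch of a whole-string regex match written as a plain subroutine, called without `&` and defined without a parameter list. The name `is_same_title` is hypothetical, and the sample names are the ones from the "blert" example above.

```perl
use strict;
use warnings;

# Plain subroutine: no "&" at the call site, no prototype in the definition.
# is_same_title is a hypothetical name used only for this sketch.
sub is_same_title {
    my ($candidate, $title) = @_;
    # \A and \z anchor the match to the whole string;
    # \Q...\E keeps the title literal.
    return $candidate =~ m/\A\Q$title\E\z/ ? 1 : 0;
}

my @names = ('blert', 'blertblartblort', 'foobarblertbaz');

# Unanchored substring match: every name containing "blert".
my @loose = grep { m/blert/ } @names;

# Anchored match: only names that are exactly "blert".
my @anchored = grep { is_same_title($_, 'blert') } @names;

print "loose: @loose\n";
print "anchored: @anchored\n";
```

Anchored and quoted like this, the regex behaves the same as `eq`. The regex form starts to pay off once you add something `eq` cannot do, such as the `/i` modifier for a case-insensitive comparison.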

