Re^2: Duplicates in Directories

by kel (Sexton)
on Oct 10, 2017 at 07:57 UTC ( [id://1201072]=note )


in reply to Re: Duplicates in Directories
in thread Duplicates in Directories

First, thank you all for your suggestions. The problem has been one of algorithm. I am iterating over @select files from an @allfile loop, and parsing for equality conditions.

As the actual code is over 300 lines, I have included an edited snippet. This code is derived from an earlier script where I needed to parse for regexes in files, not necessarily exact matches, and not necessarily at the beginning. Parsing @selectexpr against @allfiles made sense there.

Hashes are an excellent idea. With them I can parse foo-bar-baz.doc as a hash directly against all foo keys, with proper splitting and filtering, of course. This would allow me to scale up more efficiently.
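A minimal sketch of the hash idea, under stated assumptions: the file list is hypothetical sample data (the real names would come from readdir), and name_key, %seen and %dups are illustrative names, not part of the original script. Each file is reduced to a normalised key, and every key that collects more than one file is reported as a candidate duplicate.

```perl
use strict;
use warnings;

# Hypothetical sample listing; real names would come from readdir().
my @files = ('Foo-Bar-Baz.epub', 'foo-bar-baz.mobi', 'Other Title.mobi', 'notes.txt');

# Normalise a file name into a comparison key: drop the extension,
# trim surrounding whitespace, and lower-case the result.
sub name_key {
    my ($name) = @_;
    $name =~ s/\.(?:mobi|epub|azw3)$//i;
    $name =~ s/^\s+|\s+$//g;
    return lc $name;
}

# Map each key to the list of e-book files that share it.
my %seen;
for my $file (@files) {
    next unless $file =~ /\.(?:mobi|epub|azw3)$/i;
    push @{ $seen{ name_key($file) } }, $file;
}

# Keep only the keys that collected more than one file.
my %dups;
for my $key (keys %seen) {
    $dups{$key} = $seen{$key} if @{ $seen{$key} } > 1;
}

for my $key (sort keys %dups) {
    print "DUPLICATE FOUND: @{ $dups{$key} }\n";
}
```

This makes one pass over the listing instead of comparing every file against every other file, which is where the scaling benefit comes from.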

I would, however, prefer if possible to keep the matching to a regexp rather than an equality operator.


Please ignore syntax errors in the code below, it has been abbreviated.

if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

For an author-title pair, the matching would be done in the title (value) portion rather than the key, which would be expected to be identical (though there might be exceptions).

I need to hit the books on hashes here, as I haven't really dealt much with them outside of a 20,000+ listing database with about 2 dozen hash fields.

opendir(DIR, $dir2 ) or die $!;
while ( $file = readdir(DIR)) {
    if (-f $file) {    # read only files
        chomp($file);
        $file =~ s/^\s+|\s+$//g;
        $filenam = "" ;
        push ( @srcarray, $file) ;
        if ($file =~ m/\.mobi$/ig ) { &typefiles($file, "mobifile"); }
        if ($file =~ m/\.azw3$/ig ) { &typefiles($file, "azw3file"); }

sub typefiles( $tfile , $filetype ) {
    ($tfile, $filetype ) = @_ ;
    if ($filetype eq "mobifile" ) {
        push ( @mobiarray, $file) ;
    }    # End mobifiles

# Main body - parsing directory listing and performing actions
foreach $authf (@srcarray){
    if ($authf =~ m/\.pl$/) { next; }
    if ($authf =~ m/\.epub/ig ) {
        our $authf2 = $authf ;
        foreach my $myfilt (@mobiarray){
            my $mymobi = $myfilt;
            my $myepub = $authf2;
            $mymobi = &extfilter($mymobi);
            $myepub = &extfilter($myepub);

sub extfilter($line) {
    ($line) = @_;
    $line =~ s/\.mobi//ig ;
    $line =~ s/\.epub//ig ;
    $line =~ s/^\s+|\s+$//g;
    $line = lc $line;
    return $line;
}

Replies are listed 'Best First'.
Re^3: Duplicates in Directories
by hippo (Bishop) on Oct 10, 2017 at 08:20 UTC
    if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
    if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

    No sample data means no solution. Here's the SSCCE you could have provided:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $mymobi = 'Hello World!';
    my $myepub = 'Hello World!';
    ok ($mymobi =~ m/($myepub)/);
    ok ($mymobi eq $myepub);

    See how both the string equality and the regular expression matches are true? So they both "work". Your task is now to provide the values for $mymobi and $myepub for which one or other doesn't match. At that point it should become clear to you what the difference between an exact string match and a regular expression match is (and why one or the other is preferable in different situations - because they serve different purposes).
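One class of values for which the two tests can disagree is a title containing regex metacharacters. The sketch below is only an illustration with a hypothetical title, since the thread never shows the real file names: parentheses in the interpolated string are treated as grouping rather than as literal characters, so the unquoted match fails even though the strings are identical. Wrapping the variable in \Q...\E makes the pattern literal.

```perl
use strict;
use warnings;

# Hypothetical title containing regex metacharacters (the parentheses).
my $mymobi = 'some title (book 1)';
my $myepub = 'some title (book 1)';

# Exact string comparison: true, the strings are identical.
my $eq_match = ($mymobi eq $myepub) ? 1 : 0;

# Interpolated as a pattern: "(book 1)" becomes a capture group that
# matches "book 1" without the literal parentheses, so this fails.
my $raw_match = ($mymobi =~ m/($myepub)/) ? 1 : 0;

# \Q...\E quotes the metacharacters, so the title is matched literally.
my $quoted_match = ($mymobi =~ m/(\Q$myepub\E)/) ? 1 : 0;

print "eq: $eq_match, raw regex: $raw_match, quoted regex: $quoted_match\n";
```

Whether this is the actual cause in the original script cannot be confirmed without the sample data requested above.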

Re^3: Duplicates in Directories
by swl (Parson) on Oct 11, 2017 at 07:52 UTC

    You're welcome. However, it is unclear to me why, given you want to use regexp matching, your regexp match apparently does not work and exact equality does:

    if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
    if ($mymobi eq $myepub) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

    The regexp will match anything containing your title, so for a title like "blert" you will be matching all of "blert", "blertblartblort", "foobarblertbaz" etc. Perhaps you need to filter the file names for possible partial matches when you read them? Or if you know there are spelling errors then have a look at Text::Fuzzy and similar. Even then you would perhaps be best to flag them somewhere for cleanup or modification before automated processing.
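The containment behaviour described above can be seen in a short sketch using the same example names. The variable names are illustrative, and the anchored pattern is one possible way to restrict a regexp to whole-name matches, not something taken from the original script.

```perl
use strict;
use warnings;

my $title = 'blert';
my @names = ('blert', 'blertblartblort', 'foobarblertbaz');

# Unanchored: matches every name that contains the title anywhere.
my @contains = grep { /\Q$title\E/ } @names;

# Anchored at both ends with \A and \z: matches only the whole name.
my @whole = grep { /\A\Q$title\E\z/ } @names;

print "contains: @contains\n";
print "whole:    @whole\n";
```

An anchored, quoted pattern behaves like eq, so a regexp is only worth keeping when a looser pattern (optional suffixes, alternate separators and so on) is really wanted.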

    Some other points are:


    There is no need to call your subroutines using the &foo() notation unless your perl is very old. foo() will work fine in your case.


    You seem not to really be using subroutine signatures, so

    sub typefiles( $tfile , $filetype ) {
        ($tfile, $filetype ) = @_ ;
        #etc...
    }

    can simply be

    sub typefiles {
        ($tfile, $filetype ) = @_ ;
        # etc...
    }
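For comparison, here is a sketch of both forms under strict. Without the signatures feature enabled, a parenthesised list after the sub name is parsed as a prototype rather than as a parameter list, which is why the plain form is the safer default. The name typefiles_sig is hypothetical, and pushing $tfile (rather than the global $file used in the posted snippet) is an assumption about the intent.

```perl
use strict;
use warnings;

my @mobiarray;

# Plain form: unpack @_ into lexical variables.
sub typefiles {
    my ($tfile, $filetype) = @_;
    push @mobiarray, $tfile if $filetype eq 'mobifile';
    return;
}

# Signature form: needs the 'signatures' feature (Perl 5.20 or later).
# The warning category is silenced for perls where it is experimental.
use feature 'signatures';
no warnings 'experimental::signatures';

sub typefiles_sig ($tfile, $filetype) {
    push @mobiarray, $tfile if $filetype eq 'mobifile';
    return;
}

# Both are called without the leading ampersand.
typefiles('a.mobi', 'mobifile');
typefiles_sig('b.mobi', 'mobifile');
typefiles('c.azw3', 'azw3file');

print "mobi files: @mobiarray\n";
```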
