PerlMonks
Correct code for deleting old file?

by markuswinter (Novice)
on Jun 15, 2012 at 09:29 UTC ( #976397=perlquestion )
markuswinter has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

sorry, total newbie question:

We have a script which gets an array of URLs (@taxonomy_file_url) and downloads the new files:

foreach (@taxonomy_file_url) { &getFile($_, "taxonomy", $local_taxonomy_directory); }

However there is a problem: the script does not delete the old file first, so the newly downloaded file becomes taxonomyFile1, then taxonomyFile2, then taxonomyFile3 etc etc

As the main app ONLY looks for taxonomyFile it will never see the new file.

So I want to delete the old file (if it exists):

foreach (@taxonomy_file_url) {
    if (-e $_) {        # does the file exist?
        unlink($_);     # delete it
    }
    &getFile($_, "taxonomy", $local_taxonomy_directory);
}

Is this the correct way of doing it?

Thanks for any advice you have!

Markus

Re: Correct code for deleting old file?
by flexvault (Parson) on Jun 15, 2012 at 10:00 UTC

    Welcome markuswinter,

    I have reformatted your code to my style; the one real difference is that I have replaced '$_' with '$file'. Using '$_' in a one-liner is fine, but once you use it in several places it is easy to clobber its value without noticing. I did not test your code, so it may work fine as it is.

    Since you are new to Perl, I'm just suggesting an alternate way to do this. I have changed this style several times in the past 15 years, and always because, when you have a multi-thousand-line script and the compiler tells you 'Missing a right curly...', you start looking for ways to improve your style. YMMV.

    foreach my $file (@taxonomy_file_url) {
        if (-e $file)       # does the file exist?
        {
            unlink($file);  # delete it
        }
        &getFile($file, "taxonomy", $local_taxonomy_directory);
    }

    You don't show 'getFile', but it must be testing for the existence of the file, so you might just be able to remove that code, since an 'open' with '>' will overwrite the original file anyway.

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

      Thanks for your help. I know there is a bug in the overall code, as the taxonomy file never gets replaced (that's acknowledged by the software company), but as it is an older version they won't fix the bug. The getFile code is:
sub getFile {
    my $url = shift;
    my $type = shift;
    my $download_directory = shift;
    my($remoteName, $remotePath, $sysCall, $target, $rootName, $current_directory,
       $matchLineSave, $thisFileName, @listing, $template, $matchName, $matchLine,
       @history, $i, $key, $value);

    &checkPath($download_directory);

    # split download URL into directory and name
    if ($url =~ /(.*\/)(.+?)$/) {
        $remotePath = $1;
        $remoteName = $2;
    } else {
        &lastWords("Cannot parse filename from $url");
    }

    if ($url =~ /^http:/i) {
        if ($remoteName =~ /[\*\?]/) {
            &lastWords("Wild card filenames not allowed for http downloads");
        }
        $matchName = $remoteName;
    } else {
        # get a directory listing from FTP sites
        unlink("$download_directory/.listing");
        unlink("$download_directory/index.html");
        $sysCall = $wget;
        if ($ENV{'WINDIR'}) {
            $sysCall =~ s#/#\\#g;
        }
        while (($key, $value) = each(%wget_options)) {
            if (length($value)) {
                $sysCall .= " $key=$value";
            } else {
                $sysCall .= " $key";
            }
        }
        $sysCall .= " --dont-remove-listing";
        $sysCall .= " --directory-prefix=$download_directory";
        $sysCall .= " \"$remotePath\"";
        if (system($sysCall)) {
            &lastWords("Could not retrieve directory listing with $sysCall");
        }
        unless (-e "$download_directory/.listing" && -s "$download_directory/.listing") {
            &lastWords("Failed to retrieve directory listing with $sysCall");
        }

        # got a directory listing from remote site, slurp local download .history into array
        if (open(HISTORY, "<$download_directory/.history")) {
            @history = <HISTORY>;
            close HISTORY;
        }

        # make filename into regular expression
        $template = $remoteName;
        $template =~ s/(\W)/\\$1/g;
        $template =~ s/\\\?/\./g;
        $template =~ s/\\\*/\.\*/g;

        # work through directory .listing
        # exit with first match that isn't already in .history
        $matchName = "no match";
        if (open(LISTING, "$download_directory/.listing")) {
            @listing = <LISTING>;
            close LISTING;
            foreach (@listing) {
                if (/\s+($template)\s*$/ || /\s+($template)\s+->/) {
                    # got a match from .listing
                    $matchName = $1;
                    $matchLine = $_;
                    for ($i = 0; $i <= $#history; $i++) {
                        if ($history[$i] =~ /\s+($matchName)\s*$/ || $history[$i] =~ /\s+($matchName)\s+->/) {
                            if ($history[$i] eq $matchLine) {
                                # this file already downloaded
                                $matchName = "";
                            } else {
                                # new version of file available
                                $history[$i] = $matchLine;
                                $matchLineSave = $matchLine;
                                $matchLine = "";
                            }
                            last;
                        }
                    }
                    if ($matchName && $matchName ne "no match") {
                        last;
                    }
                }
            }
        }

        # if $matchName is empty, there is no new update
        # if $matchName is "no match", $remoteName was not found in the directory listing
        # if $matchLine is empty, no need to append to .history
        if ($matchName eq "no match") {
            &lastWords("No match to $url on remote server");
        }
        if ($matchName && $matchLine) {
            push @history, $matchLine;
            $matchLineSave = $matchLine;
        }
        if ($matchName) {
            # replace (potentially) wildcard $url with actual download path
            $url = "$remotePath$matchName";
        } else {
            # nothing to retrieve
            &message("No update available for $url\n");
            return 0;
        }
    }

    # download the file
    $sysCall = $wget;
    if ($ENV{'WINDIR'}) {
        $sysCall =~ s#/#\\#g;
    }
    while (($key, $value) = each(%wget_options)) {
        if (length($value)) {
            $sysCall .= " $key=$value";
        } else {
            $sysCall .= " $key";
        }
    }
    # seems like wget cannot cope with --retr-symlinks and --timestamping
    if ($matchLineSave =~ /\s+($matchName)\s*$/) {
        $sysCall .= " --timestamping";
    } else {
        $sysCall .= " --retr-symlinks";
    }
    $sysCall .= " --directory-prefix=$download_directory";
    # Only use --ignore-length option if desperate
    # if ($url =~ /^http:/i) {
    #     $sysCall .= " --ignore-length";
    # }
    $sysCall .= " \"$url\"";

    # Try each download twice.
    # The second attempt will normally not result in a download because timestamping is on.
    # The exception is when a file is updated on the remote server during downloading:
    # wget resumes the download, and ends up with a mixed file of the correct size,
    # but the file date is that of the original file.
    # Don't try to do this if the download file is a link
    if ($matchLineSave =~ /\s+($matchName)\s*$/) {
        if (system($sysCall)) {
            &lastWords("Error return from: $sysCall");
        }
    }
    if (system($sysCall)) {
        &lastWords("Error return from: $sysCall");
    } else {
        $logEntry .= "Downloaded $url to $download_directory\n";
        print "Resting for 10 seconds\n";
        sleep 10;
    }

    $thisFileName = "$download_directory/$matchName";

    # ensure file is writeable
    chmod 0644, $thisFileName;

    if ($thisFileName =~ /(.*)\.gz$/i || $thisFileName =~ /(.*)\.Z$/i) {
        $rootName = $1;
        # decompress with forced overwrite
        $sysCall = "$gzip -df $thisFileName";
        if ($ENV{'WINDIR'}) {
            $sysCall =~ s#/#\\#g;
        }
        # try up to 3 times before giving up
        if (system($sysCall) && system($sysCall) && system($sysCall)) {
            &lastWords("Error return from: $sysCall");
        } else {
            &message("Expanded $thisFileName\n");
            $thisFileName = $rootName;
        }
    }

    if ($type eq "taxonomy") {
        # taxonomy file may be a tar archive
        if ($thisFileName =~ /\.tar$/i) {
            # unpack if tar archive
            # have to cd to $local_taxonomy_directory because -C option of tar unreliable in DOS
            if ($ENV{'WINDIR'}) {
                $current_directory = `cd`;
                $current_directory =~ s#\\#/#g;
            } else {
                $current_directory = `pwd`;
            }
            chomp($current_directory);
            if ($local_taxonomy_directory =~ /^(\w:)(.*)/) {
                chdir($1);
                chdir($2);
            } else {
                chdir($local_taxonomy_directory);
            }
            $thisFileName =~ /.*\/(.+?)$/;
            $sysCall = "$tar -xf $1";
            if ($ENV{'WINDIR'}) {
                $sysCall =~ s#/#\\#g;
            }
            # try up to 3 times before giving up
            if (system($sysCall) && system($sysCall) && system($sysCall)) {
                &lastWords("Error return from: $sysCall");
            } else {
                &message("Unpacked $thisFileName\n");
            }
            unlink($thisFileName);
            # cd back to original directory (just in case)
            if ($current_directory =~ /^(\w:)(.*)/) {
                chdir($1);
                chdir($2);
            } else {
                chdir($current_directory);
            }
        }
    } elsif ($type eq "unigene") {
        # nothing more to do for UniGene
    } else {
        # sequence database files need to be renamed
        if ($type eq "fasta") {
            $target = "$local_incoming_directory/$db_name\_xyzzy.fasta";
        } elsif ($type eq "name") {
            $target = "$local_incoming_directory/$db_name\_xyzzy.nam";
        } elsif ($type eq "reference") {
            # use the existing extension, if any, otherwise use .ref
            # lose ".complete" from MSDB
            $thisFileName =~ s/msdb\.ref\.complete\./msdb\.ref\./;
            if ($thisFileName =~ /.*\.(.+?)$/) {
                $target = "$local_incoming_directory/$db_name\_xyzzy." . lc($1);
            } else {
                $target = "$local_incoming_directory/$db_name\_xyzzy.ref";
            }
        }
        if (move($thisFileName, $target)) {
            &message("Renamed $thisFileName to $target\n");
        } else {
            &lastWords("Error return from renaming $thisFileName to $target");
        }
    }

    # if we get here, download must have succeeded, so can now write out
    # the updated .history, to avoid getting the same file again
    if (@history) {
        open(HISTORY, ">$download_directory/.history");
        foreach (@history) {
            print HISTORY $_;
        }
        close HISTORY;
    }

    # all done
    return 1;
}

        markuswinter,

        I took a quick look at the code, and I think your original solution of deleting the file may be the better choice. There is still a lot of code that isn't defined here and is probably compiled in by a 'use' statement.

        Can you test the script in a non-destructive way?

        Usually on PM we try to give hints rather than working code. In your case, it sounds like production code that you want to modify. I strongly suggest you set up a 'sandbox' (testing environment) before you modify production code. If I'm misunderstanding your problem, please correct me.

        Thank you

        "Well done is better than well said." - Benjamin Franklin

        Your code contains a lot of bad practices (a short sketch of safer idioms follows the list):

        • &function(...). Correct in Perl 4, still working but with really unwanted side effects in Perl 5. Get rid of all ampersands in front of function names, except when you want to make a reference to a function.
        • Two-argument open. Use three-argument open.
        • Lack of error checks, especially after open. Unless you use autodie, add or die "Could not open $filename: $!" after open.
        • Single-argument system() with hand-crafted quoting. Always begs for trouble, because shells tend to have very different quoting rules. At least pass arguments for the invoked program as a list of arguments. See also Re^3: Perl Rename for a little background, http://www.in-ulm.de/~mascheck/various/ for more.
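        For instance, a minimal sketch of the safer idioms above (the file name, wget options and URL here are made up for illustration, not taken from your script):

        use strict;
        use warnings;

        my $listing = '/tmp/downloads/.listing';    # hypothetical path

        # three-argument open with an error check
        open(my $fh, '<', $listing)
            or die "Could not open $listing: $!";
        my @lines = <$fh>;
        close $fh;

        # system() with a list of arguments: no shell, no hand-crafted quoting
        my @cmd = ('wget', '--dont-remove-listing',
                   '--directory-prefix=/tmp/downloads',
                   'ftp://ftp.example.org/pub/');
        system(@cmd) == 0
            or die "wget failed with status $?";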

        Also, you seem to go to great pains to handle URLs and to issue HTTP and FTP requests using wget. Perl has LWP, so you do not have to mess with the shell and external programs at all. Look first at LWP::Simple; that should be nearly sufficient to replace wget. If you need more control, look at LWP::UserAgent. As you seem to be writing a spider, also have a look at LWP::RobotUA.
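        As a rough sketch of how LWP::Simple could replace the wget call (the URL and local path are invented for the example):

        use strict;
        use warnings;
        use LWP::Simple;

        my $url  = 'http://example.org/pub/taxonomy/taxdump.tar.gz';   # hypothetical
        my $file = '/tmp/downloads/taxdump.tar.gz';                    # hypothetical

        # mirror() only downloads when the remote file is newer than the
        # local copy, roughly what the --timestamping dance with wget does
        my $rc = mirror($url, $file);
        if ($rc != 200 && $rc != 304) {    # 200 = fetched, 304 = local copy still current
            die "Download of $url failed with status $rc";
        }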

        Perl can also decompress gzip files and unpack tar files, if you like even both with one package, Archive::Tar, and of course without having to mess with the shell.
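        A sketch of the pure-Perl route (file names are hypothetical; Archive::Tar needs IO::Zlib to read .tar.gz archives directly):

        use strict;
        use warnings;
        use Archive::Tar;
        use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

        # unpack a (possibly gzipped) tar archive into the current directory
        my $tar = Archive::Tar->new('/tmp/downloads/taxdump.tar.gz')
            or die "Cannot read archive: ", Archive::Tar->error;
        $tar->extract();

        # expand a plain .gz file without the shell
        gunzip('/tmp/downloads/names.dmp.gz' => '/tmp/downloads/names.dmp')
            or die "gunzip failed: $GunzipError";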

        For URL handling, look at URI and its helpers.
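        For instance (hypothetical URL), URI can do the path splitting that getFile() currently does with a regex:

        use strict;
        use warnings;
        use URI;

        my $uri = URI->new('ftp://ftp.example.org/pub/taxonomy/taxdump.tar.gz');
        print $uri->scheme, "\n";                  # ftp
        print $uri->host, "\n";                    # ftp.example.org
        print +($uri->path_segments)[-1], "\n";    # taxdump.tar.gz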

        And for my personal taste, the getFile() function is about 200 lines longer than it should be. Too much code for one function, too deeply nested, too many variables. As a rule of thumb, a function should not exceed one screen, because otherwise, you will get lost in the code. At my university, a screen was defined as a VT420, i.e. 25 lines of no more than 80 characters, but in the real world, my editor shows about 40 lines of about 110 characters. Consider splitting the function into smaller parts.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Anything that relies on the Mk I eyeball to ensure braces match is doomed to fail, or at least to consume a great deal of time. It is much better to use a syntax-highlighting editor with smart indentation. However, any such assertion will raise someone's hackles.

      At the end of the day it generally comes down to what works for you. Note though that often "what works for you" is really "what works for the team" or "what works for the community". In the case of Perl, "what works for the community" is K&R indentation with a four-space indent.

      Especially for Perl (due to the excellent Perl::Tidy) a good answer can be to use a pretty printer to normalise code for "public" consumption, and even to renormalise code for personal use.
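      For example, a small sketch using the Perl::Tidy module directly (file names are made up; the same can be done with the perltidy command-line tool):

      use strict;
      use warnings;
      use Perl::Tidy;

      # write a normalised copy of a script, using a four-space indent
      Perl::Tidy::perltidy(
          source      => 'myscript.pl',
          destination => 'myscript.tidy.pl',
          argv        => '-i=4',
      );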

      True laziness is hard work
Re: Correct code for deleting old file?
by kcott (Abbot) on Jun 15, 2012 at 10:07 UTC

    That looks like you're trying to unlink $some_url, rather than unlink $some_file. I suspect unlink should probably be in getFile(). Please show the code for that subroutine.

    -- Ken

      <Homer Simpson moment>Doh!</Homer Simpson moment>

      You are right of course. So much for my perl skills :-P

      The code of the subroutine is posted above in answer to another reply.

      Thanks!

Re: Correct code for deleting old file?
by rovf (Priest) on Jun 15, 2012 at 10:57 UTC
    If you don't check the result of unlink (i.e. whether it works or not), you can equally well omit the test for existence. Hence, I would write it either as
    unlink(@taxonomy_file_list);
    or as
    (unlink($_)==1 || report_unlink_error($_,$!)) for @taxonomy_file_list;


    -- 
    Ronald Fischer <ynnor@mm.st>
      Kind of depends on what you want to report as an error though. I'd think you'd want to be quiet if the file doesn't exist yet, but whine if it exists and unlink fails (maybe the file or its directory became read-only for you somehow).

      I'm not sure about this. My information may be a bit out of date.

      Calling unlink on some non-file types (directories or similar) still has "unspecified" consequences on some operating systems, doesn't it? And -e only checks for existence, whereas -f would check specifically for a plain file, making the call to unlink safer, right?

      "You have reached the Monastery. All our helpdesk monks are busy at the moment. Please press "1" to instantly donate 10 currency units for a good cause or press "2" to hang up. Or you can dial "12" to get connected directly to second level support."
        Basically I agree with you. I had assumed that, from your knowledge of your application, you know for sure that the names collected in your file list certainly refer to plain files (and not directories, symbolic links or anything else). Only in that case does it make sense to omit the file test.


        If this is not the case, I personally would be interested in *which* non-file entries made it into the file list, because this (possibly) signals an error. In that case, I would do a -f test for each file, but I would also issue an error message for those entries which are not plain files.
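        A rough sketch of that approach (the list name comes from this thread; the reporting is just illustrative):

        for my $path (@taxonomy_file_list) {
            next unless -e $path;                   # nothing to delete yet
            unless (-f $path) {
                warn "Not a plain file, skipping: $path\n";
                next;
            }
            unlink($path) or warn "Could not delete $path: $!\n";
        }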

        -- 
        Ronald Fischer <ynnor@mm.st>
Re: Correct code for deleting old file?
by morgon (Deacon) on Jun 15, 2012 at 16:07 UTC
    Note that you delete the file before you have retrieved a new version, and the retrieval might fail.

    I would not delete the old file but rename it (or move it somewhere else) so that you can fall back on it.

    You can rename or move a file with the move function from File::Copy.
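    A minimal sketch, assuming the downloaded file ends up as taxonomyFile in $local_taxonomy_directory (the directory value here is invented):

    use strict;
    use warnings;
    use File::Copy qw(move);

    my $local_taxonomy_directory = '/data/taxonomy';    # hypothetical
    my $file = "$local_taxonomy_directory/taxonomyFile";

    # keep the previous file around as a fallback before fetching the new one
    if (-e $file) {
        move($file, "$file.bak")
            or die "Could not move $file aside: $!";
    }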

Re: Correct code for deleting old file?
by CountZero (Bishop) on Jun 16, 2012 at 06:36 UTC
    I really doubt this is the right way of deleting the file(s) @taxonomy_file_url points at. The getFile subroutine does so much work on the filename (which I suspect is not a filename at all, but rather a URL or URI, and may even have wildcards in it) that handing the raw value to unlink could cause all kinds of mayhem on your system.

    As someone already said: much better to check for the existence of the file inside the getFile subroutine and handle that situation then and there.
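    As a sketch only: inside getFile(), once $remoteName has been parsed out of $url, something like the following could go in (this assumes the old copy lives at "$download_directory/$remoteName", which is what the http branch of the code above implies):

    my $old_copy = "$download_directory/$remoteName";
    if (-f $old_copy) {
        unlink($old_copy)
            or warn "Could not remove stale copy $old_copy: $!\n";
    }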

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
