http://www.perlmonks.org?node_id=368761

new_monk has asked for the wisdom of the Perl Monks concerning the following question:

Novice here. I wrote a script that downloads and zips log files from a server. Sometimes the files are 40MB or more, and the script takes a very long time to add the big files to the zip file. Is there any way to increase the buffer, or any other way to make the script add the files faster? I use ActiveState Perl on Windows 2000. Here is my code...
#!/usr/local/ActivePerl-5.6/bin/perl -w

use File::glob;
use File::Copy;
use Net::FTP;
use Archive::Zip;

$root = "/";

#Servers and login info
$host_disco     = "server";
$user_disco     = "user";
$password_disco = "password";
$dirlogs_disco  = "/folder/folder";

$host_gw     = "server";
$user_gw     = "user";
$password_gw = "password";
$dirlogs_gw  = "/folder/folder";

$host_roc     = "server";
$user_roc     = "user";
$password_roc = "password";
$dirlogs_roc  = "/folder/folder";

$host_err     = "server";
$user_err     = "user";
$password_err = "password";
$dirlogs_err  = "/folder/folder";

$date    = "";
$logname = "";

#Call 4 subroutines that backup the logs. Pass them server and directory
#info for each environment(Schema). Zip up the logs.
print("Working...\n");

#get a date timestamp
$date    = datestamp();
$logname = "$date.zip";

my $schema = "Disco";

#calls sub.'s and passes server info to login
backuplogs($host_disco, $user_disco, $password_disco, $dirlogs_disco, $schema, $logname);

$schema = "GW";
backuplogs($host_gw, $user_gw, $password_gw, $dirlogs_gw, $schema, $logname);

$schema = "ROC";
backuplogs($host_roc, $user_roc, $password_roc, $dirlogs_roc, $schema, $logname);

$schema = "ERR";
backuplogs($host_err, $user_err, $password_err, $dirlogs_err, $schema, $logname);

sub backuplogs {
    my ($host, $user, $pass, $dirlogs, $schema, $logname) = @_;
    my @files = "";
    my $file1 = "";

    #connect to the server
    my $ftp = Net::FTP->new($host) or die "Can't open $host: $@\n";
    $ftp->login($user, $pass) or die "Couldn't login: @{[ $ftp->message ]}";
    $ftp->ascii();
    $ftp->cwd($root)    or die "Couldn't cwd to $root: @{[ $ftp->message ]}\n";
    $ftp->cwd($dirlogs) or die "Couldn't cwd to $dirlogs: @{[ $ftp->message ]}\n";

    $, = "\n";
    @files = $ftp->ls;

  LINE: foreach $file1 (@files) {
        #if this file then skip it
        if ($file1 =~ /mdctxuapp54_cb.stderr/) {
            next LINE;
        }

        #do this for all files that are .stderr or .log
        if ($file1 !~ /.stderr/ || $file1 !~ /.log/) {
            $ftp->get($file1) || die "Can't get files from $dirlogs :@{[ $ftp->message ]}\n";
            $ftp->delete($file1);    #remove them from the server
        }
    }
    $ftp->close();

    #zip the files
    my $file = "";
    my $zip  = Archive::Zip->new();

    while ($file = <C:/data/*.log>) {
        #add all log files to zip file
        my $filename = $file;
        $filename =~ s/C.*\///;
        $zip->addFile( $filename );
        $zip->writeToFileNamed( $logname );
        $filename = "";
    }

    while ($file = <C:/data/*.stderr>) {
        #add all stderr files to zip file
        my $filename = $file;
        $filename =~ s/C.*\///;
        $zip->addFile( $filename );
        $zip->writeToFileNamed( $logname );
        $filename = "";
    }

    #copy zip file to respective folder out on sharedrive
    copy($logname, "G:/some folder/$schema");
    copy($logname, "C:/data/some folder/$schema");

    #delete local files
    unlink <*.log>;
    unlink <*.stderr>;
    unlink <*.zip>;

    print("$schema has been backed up.\n");
}

#get date timestamp
sub datestamp {
    my ($Second, $Minute, $Hour, $Day, $Month, $Year, $WeekDay, $DayOfYear, $IsDST) = localtime(time);
    my $RealMonth = $Month + 1;
    my $FullYear  = $Year + 1900;
    my $AMorPM    = "";

    if ($Day < 10) {
        $Day = "0" . $Day;
    }
    if ($RealMonth < 10) {
        $RealMonth = "0" . $RealMonth;
    }

    my $date = "";
    $date = "$RealMonth\_$Day\_$FullYear\_$Hour$Minute";
    return $date;
}


Re: How Could I Speed Up An Archive Script?
by arden (Curate) on Jun 22, 2004 at 16:09 UTC
    new_monk, a few different things come to mind with your script.

    Since each $schema is from a different server, why not perform each of the FTP downloads at the same time? I doubt you're using all of your machine's bandwidth for a single FTP, so you might save some time by performing multiple FTPs at once. Also, one of the longer FTP sessions could still be downloading (relatively light on the CPU) while another $schema has moved on to the compression stage (heavy on CPU, light on disk).
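
    One way to do that is with the threads module -- a minimal sketch, assuming a threads-enabled Perl build and the backuplogs() sub from the original post (each job would also need its own working directory, since the sub currently downloads everything into C:/data):

    use threads;

    # Launch one download-and-compress job per schema, then wait for all of them.
    my @jobs = (
        [ $host_disco, $user_disco, $password_disco, $dirlogs_disco, 'Disco', $logname ],
        [ $host_gw,    $user_gw,    $password_gw,    $dirlogs_gw,    'GW',    $logname ],
        [ $host_roc,   $user_roc,   $password_roc,   $dirlogs_roc,   'ROC',   $logname ],
        [ $host_err,   $user_err,   $password_err,   $dirlogs_err,   'ERR',   $logname ],
    );
    my @workers = map { threads->create( \&backuplogs, @{$_} ) } @jobs;
    $_->join for @workers;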

    Before you do the $ftp->get on the files, you have an if statement that will be TRUE for everything. You have if($file1 !~ /.stderr/ || $file1 !~ /.log/), which says "if it does NOT match .stderr OR does NOT match .log"; since no file matches both patterns, the condition is true for every file. I doubt that's what you really wanted to happen.
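
    A minimal correction, assuming the intent was "fetch only files ending in .stderr or .log" (note the =~ instead of !~, and the escaped dots so they match literally):

    if ( $file1 =~ /\.stderr$/ || $file1 =~ /\.log$/ ) {
        $ftp->get($file1)
            or die "Can't get $file1 from $dirlogs: @{[ $ftp->message ]}\n";
        $ftp->delete($file1);    # remove it from the server
    }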

    After you've compressed your files, you write the resultant .zip file out to the local harddrive, then copy it to another drive based on $schema twice, then you delete the .zip file. Why not just write it to where you want it with  $zip->writeToFileNamed( "G:/some folder/$schema/$logname" ); instead? Then you can copy it to the second location.

    I don't think you're going to speed up the compression much. You might wish to include a few more print statements so you know where in your script the process is. Statements like  print "Starting FTP $host\n"; and  print "Finished FTP, starting compression on $schema\n"; would help. Maybe even a  print "Writing compressed file $logname to disk\n"; too. It'll help you figure out where your bottleneck is.

    - - arden.
    p.s. please include <READMORE> tags on large sections of code. It really matters if your node gets front-paged!

      Also note that the dot in those regexes is not a literal dot; unescaped, it matches any character except a newline.

      We're not really tightening our belts, it just feels that way because we're getting fatter.
      Thank you for your assistance, oh wise ones.

      I thought "!~" was the same as "=~"? I have debugged my code before; the bottleneck is definitely where the files are getting added to the zip file. Getting the files from the server takes just a few seconds.

      This is not going to speed things up, but how do I spawn threads in Perl?

      On copying the files: I like to create the zip locally first, then copy it to my backup location and to the network share drive. That way, if the file server is down, I still have a local copy.

Re: How Could I Speed Up An Archive Script?
by chromatic (Archbishop) on Jun 22, 2004 at 18:44 UTC

    I wonder if writing and rewriting the files is hurting you. Here's an untested simplification of the relevant section of your code:

    foreach my $file ( <C:/data/*.log>, <C:/data/*.stderr> ) {
        # add all log and stderr files to the zip file
        $file =~ s/C.*\///;
        $zip->addFile( $file );
    }
    $zip->writeToFileNamed( $logname );

    This'll only write the zip file once for each subroutine invocation. It may trade memory for speed, but I'm not sure how Archive::Zip handles open zip files internally, so it may be a moot point.

Re: How Could I Speed Up An Archive Script?
by waswas-fng (Curate) on Jun 22, 2004 at 15:25 UTC
    As stated here, the tradeoff is usually compression size versus speed. The default is level 6; you can drop that to 1 to make compressing faster at the cost of a larger zip file. You should also investigate where the bottleneck is -- it could be disk, memory, or CPU. Compression usually hits the CPU harder than disk or memory.
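
    A hedged sketch of how the level could be set with Archive::Zip, reusing the glob patterns from the original script (addFile returns the new member, whose desiredCompressionLevel can be lowered):

    use Archive::Zip qw( :ERROR_CODES );

    my $zip = Archive::Zip->new();
    for my $file ( <C:/data/*.log>, <C:/data/*.stderr> ) {
        ( my $name = $file ) =~ s{^.*/}{};        # strip the C:/data/ prefix
        my $member = $zip->addFile( $file, $name );
        $member->desiredCompressionLevel( 1 );    # fastest; the default is 6
    }
    $zip->writeToFileNamed( $logname ) == AZ_OK
        or die "Could not write $logname\n";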


    -Waswas
      Thanks. I had the same problem, and your advice helped me speed up the zip creation process :)
Re: How Could I Speed Up An Archive Script?
by NetWallah (Canon) on Jun 22, 2004 at 18:14 UTC
    • You could run the sub "backuplogs" as a thread, and do all 4 in parallel.
    • Just a style pointer: adding a leading zero in the datestamp sub is easier and clearer with sprintf:
    my ($day, $month) = (5, 6);
    for ($day, $month) {
        $_ = sprintf('%2.2d', $_);    # Adds leading zero if necessary
    }
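
    Taking that a step further, here is a sketch of the whole datestamp sub collapsed into a single sprintf (this also zero-pads the hour and minute, which the original format string does not):

    sub datestamp {
        my ( $min, $hour, $day, $month, $year ) = ( localtime() )[ 1 .. 5 ];
        return sprintf '%02d_%02d_%04d_%02d%02d',
            $month + 1, $day, $year + 1900, $hour, $min;
    }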

    Earth first! (We'll rob the other planets later)

Re: How Could I Speed Up An Archive Script?
by new_monk (Sexton) on Jun 22, 2004 at 18:59 UTC
    Thank you for your suggestions, oh wise ones. I should have mentioned this before: the major bottleneck is not the logic of my code or the way I manage the files, but the actual adding of the files to the zip archive. I have verified this. I am testing setting the compression level to 1 to see if that helps. Any other suggestions, or is there a quicker way to compress large files?
      You can also exclude file types that are barely compressible (like .jpg, .pdf, .mp3, .zip, etc.) from compression and store them in the archive uncompressed.
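
      With Archive::Zip that could look roughly like this -- a sketch that assumes the $zip, $file, and $name variables from the compression-level example above, with an illustrative extension list:

      use Archive::Zip qw( :CONSTANTS );

      my %already_compressed = map { $_ => 1 } qw( jpg pdf mp3 zip gz );
      my ($ext)  = $name =~ /\.([^.]+)$/;
      my $member = $zip->addFile( $file, $name );
      $member->desiredCompressionMethod(
          ( $ext && $already_compressed{ lc $ext } )
              ? COMPRESSION_STORED       # keep as-is, skip the deflate pass
              : COMPRESSION_DEFLATED     # compress everything else
      );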
Re: How Could I Speed Up An Archive Script?
by elwarren (Priest) on Jun 23, 2004 at 17:56 UTC
    It's a significant change, but have you considered streaming your download filehandle directly into a gzip filehandle? It would remove several file-touching steps from your code...

    I don't think zip allows this, but gzip or tar will. gzip compresses better than zip, and most software that handles zip files will handle gz these days...
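
    A rough sketch of that idea with Net::FTP's retr data connection feeding Compress::Zlib's gzopen, assuming $ftp is the logged-in connection and $file1 a remote file name as in the original script (the uncompressed data never touches the local disk):

    use Compress::Zlib;

    my $data = $ftp->retr($file1)
        or die "retr $file1 failed: @{[ $ftp->message ]}\n";
    my $gz = gzopen( "$file1.gz", "wb" )
        or die "Could not open $file1.gz: $gzerrno\n";

    my $buf;
    while ( $data->read( $buf, 64 * 1024 ) ) {    # pull the file down in 64KB chunks
        $gz->gzwrite($buf);
    }
    $data->close;
    $gz->gzclose;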