http://www.perlmonks.org?node_id=588386

neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

I have a directory where all files, except .txt, are gzipped. I want to have Perl gunzip them, process them and then gzip them again. I've stripped out all of the intermediate processing to try to debug the problem.
#!/usr/bin/perl
use warnings;
use strict;
use Date::Calc qw(:all);

my $PRINTFILE_DIRECTORY = "/home/lprjob1/filest";

# Directories to work in
#my @dirs = ('dpi', 'dsc', 'dscpprod');
my @dirs = ('dscpprod');
my $dir;
my ($year, $month, $day, $control_file, $pday, $pdatestr);

# Form date string with arguement or default to today.
my $dateform;
if (scalar(@ARGV) eq 1 && $ARGV[0] =~ /\d\d\d\d-\d\d-\d\d/) {
    $dateform = $ARGV[0];
}
else {
    ($year, $month, $day) = Today();
    $dateform = sprintf("%d-%02d-%02d",$year,$month,$day);

    # Create current previous day's date string for later use
    # This will be needed for the control file.
    $pday = $day - 1;
    $year =~ s/^\d{2}(\d{2})/$1/;
    $pdatestr = sprintf("%02d%02d%02d",$year,$month,$pday);
    $control_file = "CONTROL_$pdatestr.txt";
}

foreach (@dirs){
    $dir = $_;
    print "Should $dir be processed?\n";

    #If the control file does not exist then skip
    next unless ( -e "$PRINTFILE_DIRECTORY/$dir.$dateform/$control_file" );
    print "Yes, $dir should be processed?\n";

    # Unzip files
    gzip('unzip', "$PRINTFILE_DIRECTORY/$dir.$dateform");

    # Zip files again
    gzip('zip', "$PRINTFILE_DIRECTORY/$dir.$dateform");
}

sub gzip {
    my $cmd = shift;
    my $dir = shift;
    my $file;
    my $fullfile;

    print "$cmd files in $dir...\n";

    # Make sure command is secure and predictable.
    if ( $cmd eq 'zip' ){
        $cmd = 'gzip';
    }
    elsif ( $cmd eq 'unzip' ){
        $cmd = 'gunzip';
    }
    else {
        warn "zip or unzip command not given $!";
    }

    opendir DIR, $dir or warn "Cannot open $dir $!";
    while ( defined ( $file = readdir(DIR) ) ){
        $fullfile = "$dir/$file";

        if ( !-f $fullfile || $file =~ m/\.txt$/ || $file !~ m/^[\w-]+(\.[\w-]+)*$/ ){
            print "skipping $fullfile\n";
            next;
        }
        #print "$cmd $fullfile";
        system ( "$cmd $fullfile") == 0 or warn "Cannot $cmd $fullfile : $!";
    }
    closedir DIR;
}
I consistently get seemingly random errors.
[lprjob1@tor-lx-sftp report-printing]$ ./test.pl
Should dscpprod be processed?
Yes, dscpprod should be processed?
unzip files in /home/lprjob1/filest/dscpprod.2006-12-07...
skipping /home/lprjob1/filest/dscpprod.2006-12-07/.
skipping /home/lprjob1/filest/dscpprod.2006-12-07/..
skipping /home/lprjob1/filest/dscpprod.2006-12-07/CONTROL_061206.txt
gunzip: /home/lprjob1/filest/dscpprod.2006-12-07/cns061206: unknown suffix -- ignored
Cannot gunzip /home/lprjob1/filest/dscpprod.2006-12-07/cns061206 : at ./test.pl line 75.
skipping /home/lprjob1/filest/dscpprod.2006-12-07/print.files.txt
skipping /home/lprjob1/filest/dscpprod.2006-12-07/print.archive.txt
zip files in /home/lprjob1/filest/dscpprod.2006-12-07...
skipping /home/lprjob1/filest/dscpprod.2006-12-07/.
skipping /home/lprjob1/filest/dscpprod.2006-12-07/..
skipping /home/lprjob1/filest/dscpprod.2006-12-07/CONTROL_061206.txt
gzip: /home/lprjob1/filest/dscpprod.2006-12-07/dd061206.gz already has .gz suffix -- unchanged
Cannot gzip /home/lprjob1/filest/dscpprod.2006-12-07/dd061206.gz : at ./test.pl line 75.
skipping /home/lprjob1/filest/dscpprod.2006-12-07/print.files.txt
skipping /home/lprjob1/filest/dscpprod.2006-12-07/print.archive.txt
gzip: /home/lprjob1/filest/dscpprod.2006-12-07/dl061206.GGDHSETD.gz already has .gz suffix -- unchanged
Cannot gzip /home/lprjob1/filest/dscpprod.2006-12-07/dl061206.GGDHSETD.gz : at ./test.pl line 75.
gzip: /home/lprjob1/filest/dscpprod.2006-12-07/dq061206.new.gz already has .gz suffix -- unchanged
Cannot gzip /home/lprjob1/filest/dscpprod.2006-12-07/dq061206.new.gz : at ./test.pl line 75.
gzip: /home/lprjob1/filest/dscpprod.2006-12-07/hs061206.v4.gz already has .gz suffix -- unchanged
Cannot gzip /home/lprjob1/filest/dscpprod.2006-12-07/hs061206.v4.gz : at ./test.pl line 75.
gzip: /home/lprjob1/filest/dscpprod.2006-12-07/ncdsc061206.gz already has .gz suffix -- unchanged
I see no pattern to the errors. The files associated with the errors are different each time. What have I missed?

Neil Watson
watson-wilson.ca

Re: maddening system call gzip/gunzip problems
by jasonk (Parson) on Dec 07, 2006 at 16:51 UTC

    Your errors fall into two categories.

    1. Gunzip won't do anything with files that don't have the correct extension, so running gunzip on the file 'cns061206' (for example) doesn't work. If you want to uncompress something that doesn't have a .gz extension, try zcat cns061206 > newfilename instead.

    2. You are attempting to compress files that already have a .gz extension, so gzip won't try to compress them again. Without more details, I would guess that either you didn't unzip these files in the first part of the program, or that some external system has continued to create new .gz files in the directory while you were processing.

    Depending on what kind of processing you are doing, you might want to consider taking a better approach than 'unzip everything', 'process everything', 'zip everything'. For example, if you are not actually changing the contents, you can just run 'zcat' on the file and read the output, using something like open( IN, "zcat $file |" ) (or, better yet, something like Compress::Zlib.) If you are modifying the files, you might consider processing them one at a time, rather than a directory at a time, which could avoid problems with new files being created while your program is running.
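
    A minimal sketch of the "read it without unzipping it on disk" idea. It uses the core module IO::Uncompress::Gunzip rather than Compress::Zlib or a zcat pipe, which is a substitution for the sake of a short example. The temporary directory and the file name sample.gz are made up so the snippet can run on its own.

    ```perl
    use strict;
    use warnings;
    use IO::Compress::Gzip     qw(gzip $GzipError);
    use IO::Uncompress::Gunzip qw($GunzipError);
    use File::Temp             qw(tempdir);

    # Build a small gzipped sample so the sketch is self-contained
    # (sample.gz is a hypothetical name, not one of the OP's files).
    my $dir  = tempdir( CLEANUP => 1 );
    my $file = "$dir/sample.gz";
    my $text = "first line\nsecond line\n";
    gzip( \$text => $file ) or die "gzip failed: $GzipError";

    # Read the compressed file line by line; no uncompressed copy is
    # written to the directory, so the directory contents never change.
    my $in = IO::Uncompress::Gunzip->new($file)
        or die "Cannot open $file: $GunzipError";
    my @lines;
    while ( my $line = <$in> ) {
        chomp $line;
        push @lines, $line;
    }
    $in->close;

    print scalar(@lines), " lines read\n";
    ```

    Because nothing is created or renamed in the directory, this style of reading sidesteps the gunzip/gzip round trip entirely when the files only need to be inspected.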


    We're not surrounded, we're in a target-rich environment!
      Although gunzip reports that the file has no .gz extension, it did have one before the script was run. That file was a gzip file before the script started. Similarly, the files are all unzipped before the gzip part is run. It seems that the script attempts to gunzip or gzip some files twice. I would love to rewrite all of the processing code for this script. Alas, time constraints do not permit (the rest of the code is very poor, undocumented, and not mine).

      Neil Watson
      watson-wilson.ca

        You are using readdir() to iterate over all of the files in the directory, but adding files to the directory inside of your loop. Remember that g(un)zip creates a new file to receive the output of the (de)compression. This is why things appear to be getting processed twice.
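
        A rough sketch of one way around this: read the complete listing into an array first, close the directory handle, and only then start changing files. The helper name snapshot_files and the temporary directory are invented for the example, and a rename stands in for the real gunzip call so the snippet runs anywhere.

        ```perl
        use strict;
        use warnings;
        use File::Temp qw(tempdir);

        # Return the plain files in $dir that $want accepts. The whole
        # listing is read before the caller modifies the directory.
        # (snapshot_files is a hypothetical helper, not from the OP's code.)
        sub snapshot_files {
            my ( $dir, $want ) = @_;
            opendir my $dh, $dir or die "Cannot open $dir: $!";
            my @names = grep { -f "$dir/$_" && $want->($_) } readdir $dh;
            closedir $dh;
            return sort @names;
        }

        # Demonstration on a throwaway directory.
        my $dir = tempdir( CLEANUP => 1 );
        for my $name (qw(cns061206.gz dd061206.gz CONTROL_061206.txt)) {
            open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
            close $fh;
        }

        my @to_unzip = snapshot_files( $dir, sub { $_[0] =~ /\.gz$/ } );

        for my $name (@to_unzip) {
            # Stand-in for system('gunzip', "$dir/$name"): like gunzip, the
            # rename adds a new directory entry, but @to_unzip is already fixed.
            ( my $new = $name ) =~ s/\.gz$//;
            rename "$dir/$name", "$dir/$new" or die "rename $name: $!";
        }

        print "processed: @to_unzip\n";
        ```

        With the real commands the loop body would call system('gunzip', "$dir/$name"). Note that when the child runs but exits non-zero, the failure is reported in $? and not in $!, which is why the warnings in the posted output end in an empty message.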


        The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
Re: maddening system call gzip/gunzip problems
by madbombX (Hermit) on Dec 07, 2006 at 16:47 UTC
    There are easier ways to do certain parts of your existing program. If you want to find all gzip'd files in a directory, use File::Find:
    use File::Find;

    my $dir = "/home/lprjob1/filest";
    my @files;
    # Collect every plain file whose name ends in .gz
    find( sub { push @files, $File::Find::name if -f && /\.gz$/ }, $dir );

    Then I would look into using some of the CPAN modules associated with gzip like Tie::Gzip. If the files you are looking to process are text files, you can tie them to a gzip'd filehandle and read them:

    use Tie::Gzip;

    for my $log (@files) {
        tie *LOG, 'Tie::Gzip';
        open (\*LOG, '<', "$log") or die("Cannot open $log: $!");
        while (my $line = <LOG>) {
            # Iterate over the files here and do blah
        }
        close (LOG);
        untie *LOG;
    }
      If you want to find all gzip'd files in a directory, use File::Find:

      (sigh) No. Stick with readdir, like the OP wants, but use it with grep. And as for actually handling the data, my favorite is PerlIO::gzip:

      use PerlIO::gzip;    # provides the :gzip layer

      my @target_files;
      my ( $imode, $omode );
      opendir DIR, $dir or warn "Cannot open $dir $!";
      if ( $cmd eq 'gzip' ) {
          @target_files = grep { !/\.gz$/ and -f "$dir/$_" } readdir DIR;
          $imode = "<";
          $omode = ">:gzip";
      }
      else {
          @target_files = grep /.\.gz$/, readdir DIR;
          $imode = "<:gzip";
          $omode = ">";
      }
      closedir DIR;
      for my $name ( @target_files ) {
          # readdir returns bare names, so prepend the directory
          my $ifile = "$dir/$name";
          # the output name starts as a copy of the input name
          my $ofile = $ifile;
          if ( $cmd eq 'gzip' ) {
              $ofile .= ".gz";
          }
          else {
              $ofile =~ s/\.gz$//;
          }
          open( I, $imode, $ifile ) or die "$ifile: $!";
          open( O, $omode, $ofile ) or die "$ofile: $!";
          while(<I>) { print O }
          close I;
          close O;
          unlink $ifile;
      }

      (update: Having read the later replies after I posted this, I should point out that this approach (using grep with readdir) avoids the problem of modifying the directory contents before being completely done with readdir -- we get all the files into an array first, then work on the files. I did update my code snippet to put an explicit "closedir" before the "for" loop, to clarify this point.)

      UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via::gzip. (see PerlIO::gzip or PerlIO::via::gzip).

Re: maddening system call gzip/gunzip problems
by swampyankee (Parson) on Dec 07, 2006 at 17:48 UTC

    One minor issue is that your regex,

    $file !~ m/^[\w-]+(\.[\w-]+)*$/

    will, if I read the regex correctly, consider this:
    cns061206
    to be acceptable: the name may end with zero or more groups, each a dot followed by at least one word character or hyphen. That is, the (\.[\w-]+)* portion of the regex successfully matches the empty string, so a name with no extension at all passes the test. You could fix this by replacing the final asterisk in the regex with a plus sign.
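
    To see the difference, here is a small check of both forms of the pattern against file names taken from the posted output. The variable names are only for illustration.

    ```perl
    use strict;
    use warnings;

    my $with_star = qr/^[\w-]+(\.[\w-]+)*$/;    # pattern from the original script
    my $with_plus = qr/^[\w-]+(\.[\w-]+)+$/;    # final * replaced by +

    # Report, for each name, whether each variant of the pattern matches.
    for my $name (qw(cns061206 dd061206.gz dl061206.GGDHSETD.gz)) {
        printf "%-22s star=%s plus=%s\n", $name,
            ( $name =~ $with_star ? 'match' : 'no match' ),
            ( $name =~ $with_plus ? 'match' : 'no match' );
    }
    ```

    Only cns061206 is treated differently: the starred pattern accepts it and the plus version rejects it, while the names with at least one dotted suffix match both.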

    Something like this may do what you want (note: NOT TESTED)

    sub zip {
        (my $cmd, my $dir) = @_[0,1];
        # Map the requested action to the program that performs it
        my %program = ( unzip => 'gunzip', zip => 'gzip' );
        my $count = 0;
        if($cmd eq 'unzip') {
            foreach my $file (glob($dir . '/*.gz')){ # get only gzipped files
                if(system($program{$cmd}, $file) == 0) {
                    $count++;
                }
                else {
                    warn "Could not $cmd $file\n";
                }
            }
            return $count;
        }
        elsif($cmd eq 'zip') {
            foreach my $file (glob($dir . '/*')) {
                next if $file =~ /\.gz$/;
                next if $file =~ /\.txt$/;
                next unless $file =~ /(\.\w+)+$/; # which will skip files without an extension
                if(system($program{$cmd}, $file) == 0){
                    $count++;
                }
                else {
                    warn "Could not $cmd $file\n";
                }
            }
            return $count;
        }
        else {
            warn "Unknown command $cmd\n";
        }
    }

    You could also check out Tie::Gzip, PerlIO::gzip, or IO::Uncompress::AnyInflate. Note that I've not used any of these, so I'm not endorsing any one of them.


    NOTE: akin to what idsfa said about readdir, the glob calls should probably be moved out of the foreach loop control statements. That is, replace:
    foreach my $file (glob($dir . '/*')) {

    and
    foreach my $file (glob($dir . '/*.gz')) {

    with
    my @list = glob($dir . '/*');
    foreach my $file (@list) {

    and
    my @list = glob($dir . '/*.gz');
    foreach my $file (@list) {

    respectively.

    emc

    At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

    —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.