Re: system "gzip $full_name" is taking more time
by kcott (Archbishop) on Dec 07, 2013 at 15:46 UTC
G'day dushyant,
You say "system "gzip $full_name" is taking more time." but you don't say more time than what.
More time than some other compression program? More than the same command from the command line? More time than it took last week?
Your code shows nothing to indicate how you are measuring this time. How are you doing this?
You've provided no indication of how big your files are.
Perhaps 2 hours is perfectly reasonable for individually compressing 90,000 of your files.
I suggest you do a comparison of the times taken to compress a representative file from the command line (e.g. time gzip filename) and from a Perl script (Time::HiRes might be useful here).
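For instance, a minimal Time::HiRes harness along these lines (the file name is hypothetical) is one way to take the in-script measurement:
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $file = '/var/log/app/sample.log';    # hypothetical test file

my $t0 = [gettimeofday];
system 'gzip', $file;                    # list form: no shell involved
printf "gzip took %.3f seconds\n", tv_interval($t0);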
If you do find a distinct difference, show us how you did the timings and what the results were.
A few other points regarding your code (a combined sketch follows this list):
- Don't prepend an ampersand ('&') to your function names as you've done here (see perlsub). Also, there's no need to quote variable arguments. So, go_dir($_) instead of &go_dir("$_").
- You may find the -M file test operator easier than all the date calculations you're doing.
- Read the subroutine arguments once (e.g. my ($dir) = @_;) rather than repeatedly accessing elements of @_ (e.g. you've used $_[0] in your foreach loop).
- As well as excluding /^\./, you might want to also exclude /\.gz$/: you don't want to compress files that have already been compressed.
- I see no need for the chomp in the foreach loop (readdir just returns the file names; no newline is added); however, using it for the `date '+%Y%m%d'` return value would be correct.
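Pulling those points together, the routine might look something like this (a rough sketch only: go_dir and the directory recursion are from your posted code, the 3-day threshold is illustrative):
sub go_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Can't opendir $dir: $!";
    for my $name (readdir $dh) {
        next if $name =~ /^\./;      # skip dot files
        next if $name =~ /\.gz$/;    # skip already-compressed files
        my $full_name = "$dir/$name";
        if (-d $full_name) {
            go_dir($full_name);      # recurse into subdirectories
        }
        elsif (-f $full_name && -M $full_name > 3) {
            system 'gzip', $full_name;
        }
    }
    closedir $dh;
}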
Re: system "gzip $full_name" is taking more time
by fishmonger (Chaplain) on Dec 07, 2013 at 15:30 UTC
There are several things you can do to improve your script.
The first thing I'd do would be to use the File::Find module. It handles the recursion to traverse a directory tree.
I'd also use another CPAN module, such as IO::Compress::Gzip, to replace your system call to gzip.
Rather than compressing each individual file by itself, I'd probably group them into one or more tar.gz archives. Maybe one archive per directory.
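A rough sketch of that combination (the starting directory and age threshold are illustrative, not taken from the posted code):
use strict;
use warnings;
use File::Find;
use IO::Compress::Gzip qw(gzip $GzipError);

find(sub {
    return unless -f $_;
    return if /\.gz$/;              # already compressed
    return unless -M $_ > 3;        # illustrative age threshold
    gzip $_ => "$_.gz"              # note: leaves the original in place
        or warn "gzip failed for $File::Find::name: $GzipError";
}, '/some/log/tree');               # hypothetical starting directory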
I tried using IO::Compress::Gzip, but it does not replace the original file; it creates a separate compressed file. So I have to make another call to remove the original, and that will not improve the performance.
I can't create one compressed file for all of them. I have to follow certain rules and standards in my production environment.
I am thinking of running 5-6 copies of the same script, dividing the 120 directories between them.
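Perhaps with something like Parallel::ForkManager from CPAN instead of literally separate copies; a rough, untested sketch (the glob pattern and worker count are placeholders):
use strict;
use warnings;
use Parallel::ForkManager;

my @dirs = glob '/data/logs/*';             # the 120 directories
my $pm   = Parallel::ForkManager->new(6);   # at most 6 workers at once

for my $dir (@dirs) {
    $pm->start and next;    # parent: spawn a child, move to next dir
    go_dir($dir);           # child: process one directory
    $pm->finish;            # child exits
}
$pm->wait_all_children;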
There are other similar modules to choose from and I'm sure at least one of them would be able to replace the original file. I have not looked over the related modules in any detail, so I can't say which one would be better suited for your needs.
What do you think would be a reasonable amount of time to compress an average-sized file from one of the directories from the command line? Would you say that a tenth of a second would be reasonable? Multiply that time by 90,000 and that will give you a very rough estimate of the required time per directory: at a tenth of a second average, 0.1 s x 90,000 = 9,000 s, i.e. 2.5 hours per directory, and that's not including the overhead of executing the system function/call.
Having an average of 90,000 files per directory seems to be a major factor in the overall problem. Can you rework your policies to work on 30-day intervals rather than 90-day intervals?
Re: system "gzip $full_name" is taking more time
by NetWallah (Canon) on Dec 07, 2013 at 15:49 UTC
Try running the gzip in the background by changing the line to:
system "gzip $full_name&";
This will achieve the same effect as multiple copies.
When in doubt, mumble; when in trouble, delegate; when in charge, ponder. -- James H. Boren
Re: system "gzip $full_name" is taking more time
by Laurent_R (Canon) on Dec 07, 2013 at 16:45 UTC
Although this is unlikely to change things tremendously, it is very inefficient to do your date calculations 90,000 times. You could get the current system date, calculate just once what the current date minus 90 days is (or 93 days, whatever you really need; your code is at variance with your description of the requirement on that point), and only then start to look at your directories and the files' last-change dates.
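A sketch of that approach (the 93-day figure is the one from the posted code):
my $now    = time;
my $cutoff = $now - 93 * 86400;    # computed once, before any loop

# then, per file, a single integer comparison:
my $mtime = (stat $full_name)[9];
if ($mtime < $cutoff) {
    # older than 93 days
}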
Thanks Laurent_R,
I will change that time-calculation part and check.
Re: system "gzip $full_name" is taking more time
by wazat (Monk) on Dec 08, 2013 at 02:16 UTC
Someone has already mentioned checking that the file is not already a gzip file. I see you have a test for that, but your regular expression does not correctly test whether the file name ends in '.gz', only whether it contains '.gz'.
I suggest:
next if /\.gz$/;
Your test of the file age might be more readable if you use '-M' (see perldoc perlfunc).
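For example, with the thresholds from the posted code:
# -M gives the file age in days, relative to script start time
if (-M $full_name > 93) {
    unlink $full_name;
}
elsif (-M $full_name > 3) {
    system 'gzip', $full_name;
}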
As others suggest, do some timing measurements. The fact that the directory has 90,000 files might slow down directory-related operations, especially if it is network mounted. unlink and gzip could both be affected by this, as they perform directory updates.
If you comment out the gzip portion, how long does the unlinking take on these large directories?
Measure how long gzip takes with a typical file. Also, when running your script, check the processes. Is there a single long-running gzip? You should be able to come up with some rough order-of-magnitude figures (I assume you have examined typical directories and have an idea of the number and sizes of files).
Also consider whether these directories are overdue for cleanup. If so, the first run of your script may take a lot longer than future runs.
To continue on wazat's post: use substr and eq instead of regexes for such ridiculously simple patterns. It will be faster.
my $file_time = (stat($full_name))[9];
my $diff = $now - $file_time;
$diff = $diff / 86400;
my $read = localtime($file_time);
Combine statements 1, 2, and 3. Something like:
my $file_time;
my $diff = ($now - ($file_time = (stat($full_name))[9])) / 86400;
my $read = localtime($file_time);
Fewer assignments/reads and fewer pp_nextstate ops.
$diff = $diff / 86400;
my $read = localtime($file_time);
if ( $diff > 93 ) {
    print FILE "$full_name : $diff : $read\n";
    unlink "$full_name";
} elsif ( $diff > 3 ) {
    next if (/\.gz/);
Optimize out the division by multiplying 93 and 3 by 86400 and comparing against those larger numbers, rather than doing the division. And use substr/eq instead of the .gz regex.
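Something like this, keeping the 93/3-day thresholds from the posted code:
use constant {
    DELETE_SECS   => 93 * 86400,    # 93 days in seconds
    COMPRESS_SECS =>  3 * 86400,    #  3 days in seconds
};

my $age = $now - $file_time;              # stays in seconds: no division
if ($age > DELETE_SECS) {
    # delete the file
}
elsif ($age > COMPRESS_SECS) {
    next if substr($_, -3) eq '.gz';      # substr/eq instead of /\.gz$/
    # compress the file
}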
unlink "$full_name";
That quoting looks bizarre, like something from someone who has never done Perl before; don't do that. unlink $full_name; is all you need.
my $read = localtime($file_time);
Don't do that either; don't print the converted time to the log, just the Unix time. If someone really wants to read the log they can do the conversion themselves.
IDK whether you can do it with stat() or not, but get the -f, the -d, and the stat on $full_name down to exactly ONE syscall, save the results to lexicals, then process the results. Don't do redundant I/O calls.
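Perl's special _ filehandle does exactly this: after a stat (or any file test), further tests on _ reuse the buffered results instead of issuing a new syscall. A sketch:
my @st = stat $full_name;    # one syscall populates the stat buffer
if (-f _) {                  # reads the buffer: no new syscall
    my $file_time = $st[9];
    # ... process the file ...
}
elsif (-d _) {               # same buffer again
    go_dir($full_name);
}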
I would guess that if you used Devel::NYTProf (you should have done that BEFORE coming to perlmonks), you'd find your script is either I/O bound on the disk/filing system or CPU bound in gzip's compression algorithm.
Others may make arguments for benchmarking various points above.
... don't print the converted time to console, just the unix time. If someone really wants read the log they can do the conversion themselves. -- bulk88
Not converting the time would be good enough if there will be no further use of the log. Otherwise, it places a high cost on whoever reads the log, forcing them to do the time conversions just to find the relevant entries. In short: have a damn functioning log.
Re: system "gzip $full_name" is taking more time
by Anonymous Monk on Dec 07, 2013 at 15:15 UTC
Why do you think that a gzip process should finish in about the same time as it takes to remove a file? (Mind that gzip creates a temporary file and then replaces the existing one with it; that is two file system operations before you even account for the time taken by the compression itself.)
Re: system "gzip $full_name" is taking more time
by taint (Chaplain) on Dec 07, 2013 at 20:24 UTC
My personal preference, with regard to commands already provided by the system, is to compare their results against those of the Perl modules and the like. In the case of tar/gzip, I have personally found two things: 1.) a system call to tar is quicker (assuming your routine is efficient); 2.) LZMA, though this may depend on your system's implementation, provides on average a 30% better compression rate. On the second point: this won't necessarily speed up your process, except to the extent that unlinking them will be quicker. In case you're interested, the options I give to tar are
tar -cvJ --options xz:9 -f <filename>.tar.xz <directory||filename>
You will probably have no use for the v switch, except to the extent that it will help with "debugging" possible errors.
HTH
--Chris
Hey. I'm not completely useless. I can be used as a bad example.
Re: system "gzip $full_name" is taking more time
by dushyant (Acolyte) on Dec 07, 2013 at 17:02 UTC
90,000 files will take more than 2 hours, that's true. I was comparing the gzip task with the unlink task.
I will update the scripts as suggested by you all.
I will also check the benchmark as suggested by kcott.
Thanks.
Zipping a file will generally take more time than deleting it, often much more time, especially if the file is large. But this has nothing to do with Perl; it is just as true if you do it manually in your shell (or under Windows, for that matter).
Re: system "gzip $full_name" is taking more time
by fishmonger (Chaplain) on Dec 07, 2013 at 16:13 UTC
Re: system "gzip $full_name" is taking more time
by taint (Chaplain) on Dec 08, 2013 at 10:53 UTC
Something else occurred to me after making my last response that I think will help you greatly, in more ways than one.
Consider "batching" the jobs/tasks. For example, build a list of the files that will be processed: both for timing tests and, more importantly, within your current script. You should end up with a much more efficient process overall. This way you can batch the tar command: build up the file names and hand the whole list to tar in one go. The same goes for the rest. Slurping up a list should be pretty easy, and pretty quick. In short, try to consolidate as much as possible.
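A sketch of the idea, assuming GNU tar's -T/--files-from switch (the selection, list path, and archive name are all illustrative):
use strict;
use warnings;

my @files = grep { -f } glob '/some/dir/*';    # hypothetical selection

# dump the list, one name per line
open my $fh, '>', '/tmp/batch.list' or die "open: $!";
print {$fh} "$_\n" for @files;
close $fh;

# one tar invocation for the whole batch
system 'tar', '-cJf', '/some/dir/batch.tar.xz', '-T', '/tmp/batch.list';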
Best wishes.
--Chris
Hey. I'm not completely useless. I can be used as a bad example.