PerlMonks  

Perl Script Causing high CPU usage

by smart_amorist (Initiate)
on Sep 29, 2013 at 12:50 UTC (#1056216=perlquestion)
smart_amorist has asked for the wisdom of the Perl Monks concerning the following question:

Hi Team, I have written the code below to move, archive, or remove files based on a filename pattern and the last-modified time of each file. Every day there are more than 200,000 files in a directory matching the pattern (ABC_XYZ_?????????.xml). If most of the files are older than, say, 7 days, then most of them are handled by the code below (i.e. moved, archived, or deleted, and the CPU gets a rest while mv, zip, or unlink runs). In that case CPU usage is 4-11%, which still looks good. But if only 10,000 of the 200,000 files are older than 7 days, then for the remaining 190,000 files the while loop just runs the print statement, since no archive, move, or delete action is required. In that case CPU usage reaches 25-83%, which is too high.

Is there a better way to control this, or a better way to achieve it? Also, with Perl's glob the files are only matched, not sorted by modification time. Is it possible to sort files by modification time with glob? And is there a way to read only the files older than 7 days, avoiding the extra 190,000 iterations of the while loop?

    while (glob("$MatchPattern")) {
        my $ObjectName = $_;
        my $age  = sprintf("%.6f", -M $ObjectName) if ($ObjectName);
        my $size = -s $ObjectName                  if ($ObjectName);
        my $Display    = ($DtFmt) ? $DtFmt : "%Y-%m-%d";
        my $ObjModTime = strftime("$Display", localtime(stat($ObjectName)->mtime));
        if ($age > 0 && $age > $AgeLimit && $size <= $SizeLimit) {
            if (($AutoAction eq 'AutoArchive') && ((-f $ObjectName) || (-d $ObjectName))) {
                print "[$ObjModTime] Age & Size = $age > $AgeLimit And $size < $SizeLimit :: Marked to Zip\n";
                qx(gzip $ObjectName)                 if (-f $ObjectName && $Type eq 'Sys_File');
                qx(zip -rmT $ObjectName $ObjectName) if (-d $ObjectName && $Type eq 'Sys_Dir');
                $ObjCounter++;
            }
            elsif (($AutoAction eq 'AutoDelete') && (-f $ObjectName)) {
                print "Age & Size = $age > $AgeLimit And $size < $SizeLimit :: Marked to Remove\n";
                unlink($ObjectName);
                $ObjCounter++;
            }
            elsif (($AutoAction eq 'AutoMove') && $MoveToPath && ($MoveToPath ne 'Nil')
                   && ((-f $ObjectName) || (-d $ObjectName))) {
                if (-e $MoveToPath) {
                    print "[$ObjModTime] Age & Size = $age > $AgeLimit And $size < $SizeLimit :: Marked to Move\n";
                    qx(mv $ObjectName $MoveToPath/.) if (-f $ObjectName && $SignOff eq 'Yes');
                    $ObjCounter++;
                }
            }
            else {
                print "[$ObjModTime] Age & Size = $age > $AgeLimit And $size < $SizeLimit :: Marked No Action\n";
                next;
            }
        }
        else {
            next;
        }
    } # End of while

Re: Perl Script Causing high CPU usage
by marinersk (Chaplain) on Sep 29, 2013 at 14:51 UTC
    In my early days with Perl on Windows, glob behaved in some undesirable fashion -- I no longer remember what the problems were.

    You might try to isolate whether glob is causing the high CPU utilization -- I suspect it is NOT, but worth checking. You can replace glob with something like this:

    # ... Create a regular expression to represent the wildcard string
    # normally submitted to glob
    my $MatchPatternRegex = qr/^ABC_XYZ_.{9}\.xml$/;
    if (opendir CURDIR, $directoryName) {
        while (my $nextFilename = readdir CURDIR) {
            if ($nextFilename =~ $MatchPatternRegex) {
                # ... Do your stuff here
            }
        }
        closedir CURDIR;
    }
    If using this instead of glob eliminates your CPU problem, you know something new about your environment, and perhaps you'll want to craft around glob. If this does not change the behavior, you've ruled out glob as the problem, which could be useful information as you troubleshoot this.
Re: Perl Script Causing high CPU usage
by roboticus (Canon) on Sep 29, 2013 at 15:29 UTC

    smart_amorist:

    Doing a stat on every file is too time-consuming. I'd suggest that, rather than rescanning the entire directory tree and checking ages, you keep a text file containing the filename and last-modified date of every file known. Then you can quickly ignore many of the files without doing a stat. For the files that *are* candidates for removal, do the stat and update your text database. If a file is new (it wasn't in the last scan), it's obviously not older than the previous scan, so you can simply enter it into your table with the previous run date. Something like:

    # Read text file from last run
    my %Files;
    open my $FH, '<', 'my_database';
    while (<$FH>) {
        my ($YYYYMMDD, $FName) = split;
        $Files{$FName} = $YYYYMMDD;
    }
    close $FH;

    # Retrieve last runtime
    my $LASTRUN = $Files{TheLastRunTime};
    my $OLDEST_FILE_DATE = function_getting_cutoff_date();

    # Scan file tree
    for my $FName (function_retrieving_file_list()) {
        if (! exists $Files{$FName}) {
            # New file, just add it to the database
            $Files{$FName} = $LASTRUN;
        }
        elsif ($Files{$FName} lt $OLDEST_FILE_DATE) {
            # Last recorded "modified" time is too old
            # ...code to check last date and call archiver as needed...
        }
        else {
            # Nothing to do, we know this file was touched within
            # the cutoff period, so leave it until next time.
        }
    }

    # Rewrite the database file for next time
    $Files{TheLastRunTime} = function_returning_now_as_YYYYMMDD();
    rename 'my_database', "my_database.$Files{TheLastRunTime}";
    open $FH, '>', 'my_database';
    for my $K (keys %Files) {
        print $FH "$Files{$K} $K\n";
    }
    close $FH;

    If you typically have 10,000 of 190,000 files needing to be archived, then you're saving many stat calls--on some operating systems, anyway. I don't know the details of yours, but if the stat calls typically take a significant amount of time, this approach will save you 180,000 stat calls per run.

    Hope this helps...

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Another option to reduce the stat() calls would be to call it once at the beginning of the loop, saving the results in an array, and then referencing the array after that. Alternatively, call it once with the filename as the argument, and then for the rest of the loop replace the filename with the special filehandle "_" (underscore), which checks the cached results of the latest stat call. See stat for more info.
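      A minimal sketch of the "_" idiom (the pattern and cutoff are examples, and old_files is a hypothetical helper, not the OP's code): the first file test does the one stat, and later tests read its cached result.

```perl
use strict;
use warnings;

# Return files matching a glob pattern that are older than $days days,
# doing only one stat() per file: the first file test (-f) performs the
# stat, and "-M _" reuses its cached result instead of hitting the disk.
sub old_files {
    my ($pattern, $days) = @_;
    my @old;
    for my $file (glob $pattern) {
        next unless -f $file;              # the one stat() happens here
        push @old, $file if -M _ > $days;  # "_" = cached stat buffer
    }
    return @old;
}
```

      The same trick applies to -s, -d, and friends; each `_` test is a memory lookup, not a disk touch.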

      Personally, on any system with unix-ish tools available, I'd move most of this into a find command, then use xargs to pass the list of only the files to be deleted/archived/whatever to a script which would do that.
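      A sketch of that division of labour (the path is an example, and gzip stands in for whatever the archiving script would do): find applies the name and age filters itself with -mtime +7, so only candidate files are ever handed to xargs.

```shell
#!/bin/sh
# find does the pattern and age filtering; xargs batches the survivors.
# /data/incoming and gzip are placeholders for the real path and action.
find /data/incoming -maxdepth 1 -type f \
     -name 'ABC_XYZ_?????????.xml' -mtime +7 -print0 |
  xargs -0 -r gzip
```

      The -print0/-0 pairing keeps odd filenames safe, and -r (GNU xargs) skips the command entirely when nothing matches.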

      Aaron B.
      Available for small or large Perl jobs; see my home node.

Re: Perl Script Causing high CPU usage
by bulk88 (Priest) on Sep 29, 2013 at 15:40 UTC
    Do a backtick call (qx//) to ls or dir, then parse the output. Those two tools will use a faster, lower-overhead OS call than Perl would. You are also touching the disk repeatedly and unnecessarily by running -f $ObjectName and -d $ObjectName many times in the logic tree, instead of doing them once and caching the results in scalars before entering the logic tree.
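    A sketch along those lines (the directory and the classify helper are hypothetical): ls output is parsed once, and each object is file-tested once, with the second test reading the cached stat via the "_" handle.

```perl
use strict;
use warnings;

# Parse `ls` output (instead of glob) and run each file test only once:
# -f does the stat, and -d reads the cached result via the "_" handle.
# Returns a hash mapping each name in $dir to 'file', 'dir', or 'other'.
sub classify {
    my ($dir) = @_;
    my %kind;
    for my $name (split /\n/, qx(ls $dir)) {
        my $path    = "$dir/$name";
        my $is_file = -f $path;   # one stat() per object...
        my $is_dir  = -d _;       # ...reused here, no second disk touch
        $kind{$name} = $is_file ? 'file' : $is_dir ? 'dir' : 'other';
    }
    return %kind;
}
```

    The logic tree then branches on $is_file/$is_dir scalars instead of repeating the tests.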
Re: Perl Script Causing high CPU usage
by CountZero (Bishop) on Sep 29, 2013 at 19:14 UTC
    Did you think of using File::Find::Rule?
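    A sketch of what that might look like (assuming File::Find::Rule is installed; the helper name is made up): the rule combines the name match and the age cutoff in one pass, so only files older than 7 days are ever returned.

```perl
use strict;
use warnings;
use File::Find::Rule;

# Return matching files in $dir whose mtime is older than $days days.
# The mtime() test takes a Number::Compare string like "<1380000000".
sub files_older_than {
    my ($dir, $days) = @_;
    my $cutoff = time - $days * 24 * 60 * 60;
    return File::Find::Rule->file
                           ->name('ABC_XYZ_?????????.xml')
                           ->mtime("<$cutoff")
                           ->in($dir);
}
```

    Note it still stats every file under the hood; the win is clarity, not fewer system calls.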

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Perl Script Causing high CPU usage
by graff (Chancellor) on Sep 30, 2013 at 06:14 UTC
    If I understand you correctly, you're saying that when the job is i/o bound (because it's doing lots of "mv, zip or unlink"), CPU usage is "okay" (at 4-11%), but when there's relatively little need to alter disk content, CPU usage shoots up as high as 83%, which is bad for some reason.

    The replies above about reducing how often you invoke "stat" will help some, but I would also tend to worry about having so many files in a single directory that everything gets bogged down, just because it takes so long to scan a directory that has so many files in it (especially if you're doing multiple stat calls on every file, instead of using all the information you get from a single stat call).

    If you're seeing 190,000 files created in less than a week (over 27,000 a day), you might want to see if you can divide that up among different directories, to limit the number of files per directory. File age seems to be most important, and it's apparently OK to move things around, so you might try creating a directory for each date and moving files into daily directories according to their age. That would make things a lot easier to manage, in addition to reducing the overall load on both the CPU and the disk system.
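    A sketch of that daily-bucketing idea (paths and the helper name are hypothetical): each file is moved into a YYYY-MM-DD subdirectory named after its modification date.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);
use POSIX qw(strftime);

# Move each matching file into a YYYY-MM-DD subdirectory derived from
# its mtime, so no single directory grows without bound.
sub bucket_by_date {
    my ($dir) = @_;
    for my $file (glob "$dir/ABC_XYZ_*.xml") {
        my $day = strftime '%Y-%m-%d', localtime((stat $file)[9]);
        make_path("$dir/$day");
        move($file, "$dir/$day/" . basename($file))
            or warn "could not move $file: $!";
    }
}
```

    Expiring a whole day then becomes one directory operation instead of 27,000 per-file checks.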

      Hi All, thanks a lot for so many ideas. To be more clear: in a single day the system generates 5-6 lakh files in just 3-4 directories, consuming around 10-14 GB of space per day. Individual files are not huge (sizes are a 5-6 digit number of bytes). There are files of 4-5 different patterns, but the count is huge. There is no date timestamp in the file names (e.g. ABC_YYYYMMDD.log), just a random 8-digit number (i.e. ABC_????????.log).

      As per my observation, glob is not consuming high CPU. As soon as we enter the while loop, I can see the CPU spike. After commenting out mv, zip, and unlink, keeping just the print statements, CPU was still around 80%. After adding the following in the "next" section, CPU stays under control at 10-12%, sometimes as low as 4%:

          select(undef, undef, undef, 0.250);

      But the process already runs slowly because of the huge number of files and stat calls, as highlighted by you all above, so I can't afford the sleep. Once again thanks to all of you; I will try out the various ideas you have given. Thanks for letting me know about File::Find::Rule as well.

        Good luck in the hunt -- if the loop is using lots of CPU just to do print statements, then I don't know that there's much you can do to limit its CPU utilization. Sounds like the OS is giving your script everything it needs; are you sure it's a problem?

        As a complete side note, I am deadly curious -- what is a lakh file? Is that some kind of special log format or something?

Node Type: perlquestion [id://1056216]
Approved by ww