Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Merge log files causing Out of Memory

by malokam (Novice)
on Aug 24, 2016 at 15:13 UTC ( #1170330=perlquestion: print w/replies, xml ) Need Help??

malokam has asked for the wisdom of the Perl Monks concerning the following question:

I have this script that merges log files. It reads the list of dirs from a file and looks for a certain log file and merges all the content and saves to another file (in a separate directory)

While the script itself is sound and is working on my sandbox, it goes "Out of Memory!" on the machine I am trying to run this script.

Can I optimize the script any further? Or is there some other way to do the same more effectively?

#!/usr/bin/perl ###################### Globals ##################################### +#### %keydirs; ## Hash for use in tracking directories @d; @ed; $profile = `\. ~/.profile`; # Load user profile $log_for_days = eval(86400 * 91); ## 91 is the number of days for rete +ntion ###################### Globals ##################################### +#### &date_calc; &get_dirs; foreach $dir1_(@d) { chomp $dir1_; @dir_ = split(/\//,$dir1_); $num = eval(@dir_ - 1); $logfile = "@dir_[$num].$myLog"; $delfile = "@dir_[$num].$delTime"; if(-e "$dir1_/logs/$delfile") { $del_old_recs = `rm -f $dir1_/logs/$delfile`; } chop $dir1_; get_ldat($dir1_); open(ARCLOG,">$dir1_/logs/$logfile"); print ARCLOG "@data1"; close(ARCLOG); undef @data1; } ############################ Subroutines ########################### +#### sub date_calc { ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time + - 86400); $syear = $year+1900; $smon = sprintf '%02d',$mon+1; $smday = sprintf '%02d',$mday; $shour = sprintf '%02d', $hour; $smin = sprintf '%02d', $min; $ssec = sprintf '%02d', $sec; $myTime = "$smon/$smday/$syear $shour:$smin:$ssec"; $myLog = "$syear"."$smon"."$smday"; ($fsec,$fmin,$fhour,$fmday,$fmon,$fyear,$fwday,$fyday,$fisdst) = local +time(time); $fsyear = $fyear+1900; $fsmon = sprintf '%02d',$fmon+1; $fsmday = sprintf '%02d',$fmday; $fshour = sprintf '%02d', $fhour; $fsmin = sprintf '%02d', $fmin; $fssec = sprintf '%02d', $fsec; $fmyTime = "$fsmon/$fsmday/$fsyear"; ($dsec,$dmin,$dhour,$dmday,$dmon,$dyear,$dwday,$dyday,$disdst) = local +time(time - $log_for_days); $delyear = $dyear+1900; $delmon = sprintf '%02d',$dmon+1; $delmday = sprintf '%02d',$dmday; $delhour = sprintf '%02d',$dhour; $delmin = sprintf '%02d',$dmin; $delsec = sprintf '%02d',$dsec; $delTime = "$delyear"."$delmon"."$delmday"; } sub get_dirs { open(DIRFILE,"</opt/scripts/dirs") || die "Cannot open file for readin +g:$^E\n"; @directories = <DIRFILE>; close(DIRFILE); foreach $directory(@directories) { next unless $directory !~ /\#/ig; chomp $directory; @dire = split(/(encrypt|decrypt|\:)/,$directory); if(! exists($keydirs{$dire[2]})) { $keydirs{$dire[2]} = 1; @d = (@d,$dire[2]); } } } sub get_ldat { $dirtoget = "@_"; chomp $dirtoget; if(-e "$dirtoget/encrypt") { @encdir = `ls -Af $dirtoget/encrypt 2>&1`; } else { @encdir = ""; } foreach $endir(@encdir) { chomp $endir; if($endir !~ /\./) { if(-e "$dirtoget/encrypt/$endir/sample.log") { @data = `cat $dirtoget/encrypt/$endir/sample.log`; $cnt=0; $dcnt= eval(@data - 1); foreach $row(@data) { if($row !~ /$fmyTime/i) { $cnt++; } else { last; } } $cnt = eval($cnt - 1); $icnt=0; for($icnt=0;$icnt<= $cnt;$icnt++) { push @data1,$data[$icnt]; } $cnt = eval($cnt + 1); for($icnt2=$cnt;$icnt2<=$dcnt;$icnt2++) { push @data2,$data[$icnt2]; } open(LOG,">$dirtoget/encrypt/$endir/sample.log"); print LOG "@data2"; close(LOG); } } undef $cnt; undef $dcnt; undef @data; undef $row; undef $icnt; undef $irow; undef $irow2; undef @data2; } if(-e "$dirtoget/decrypt/sample.log") { @data = `cat $dirtoget/decrypt/sample.log`; $cnt=0; $dcnt=@data; foreach $row(@data) { if($row !~ /$fmyTime/i) { $cnt++; } else { last; } } $icnt=0; for($icnt=0;$icnt< $cnt;$icnt++) { push @data1,$data[$icnt]; } for($icnt2=$cnt;$icnt2<=$dcnt;$icnt2++) { push @data2,$data[$icnt2]; } open(LOG,">$dirtoget/decrypt/sample.log"); print LOG "@data2"; close(LOG); } undef $cnt; undef $dcnt; undef @data; undef $row; undef $icnt; undef $irow; undef $irow2; undef @data2; }

Replies are listed 'Best First'.
Re: Merge log files causing Out of Memory
by afoken (Abbot) on Aug 24, 2016 at 21:33 UTC
    • Your code is very strangely indented, quite hard to read. You may want to use perltidy and/or a better editor.
    • It actually looks like someone ported a shell script, sprinkled some parts of a C program over it, and copied some pieces of ancient perl scripts of dubious origin. Yes, perl will understand what you want, but it does not look very perlish and is not very efficient.
    • Your code uses lots of global variables where a local scope (see my) is sufficient.
    • use strict; and use warnings; are missing
    • $profile = `\. ~/.profile`;  # Load user profile does not do what the comment says. And it is most likely completely useless.
    • eval is useless in $log_for_days = eval(86400 * 91);, in $dcnt=eval(@data-1); and all other places.
    • &date_calc; and &get_dirs; is Perl4-style. Avoid the ampersand <UPDATE>when calling functions</UPDATE>, it usually does not do what you intent to do.
    • Your code use BAREWORD file handles and the insecure two-argument open, better use the three-argument form and scalar file handles. Also use or instead of || here. And use $! instead of $^E: open my $dirfile,'<','/opt/scripts/dirs') or die "Can't open /opt/scripts/dirs: $!";
    • Neither the i flag nor the g flag are needed for matching # in the lines read from /opt/scripts/dirs.
    • get_dirs could be implemented using grep to both find unique directory names and to remove comments. Returning right after reading the file's contents automatically closes the file handle IF you use a scalar file handle: return grep { !/\#/ && !$seen{$_}++ } <$dirfile>;
    • Using backticks (``) introduces a lot of security problems, especially if you run the script as root. Shells don't always parse data pasted into backticks as you intent. Try to use perl functions instead: `rm -f ...` should be replaced by unlink or by the remove_tree() function from File::Path. `cat ...` should be replaced by open, readline alias <$handle>, close. `ls` should be replaced by opendir, readdir, closedir, or maybe glob.
    • @array=`cat ...` reads the entire file into memory. If your log files are large, you may quite easily run out of memory. After that, you iterate over the array line by line. Better use open, as explained above, and use a while loop to read the file line by line. This way, you have only one line in memory, even if your log file contains terabytes of data:
      open my $handle,'<',$filename or die "Can't open filename: $!"; while (my $line=<$handle>) { chomp $line; if ($line is interesting ...) { # do something with $line ... } } close $handle;
    • @encdir=`ls .... 2>&1` is even worse than just @encdir=`ls ...`, you end with error messages as directory names in @encdir. You REALLY don't want that, use opendir/readdir/closedir or glob, as explained above.
    • You are doing really weird things with arrays and strings, stringifying arrays just to split them back into arrays. Avoid that. Perl functions can return arrays, don't pass function results around by writing to global variables.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      as always you are very complete and competent, but (as other times) i must dissent on:

      &date_calc; and &get_dirs; is Perl4-style. Avoid the ampersand, it usually does not do what you intent to do.

      is not Perl4, is perfectly valid Perl5, and is reffered by official docs as optional in modern perl

      It is still not optional in three cases:

      # while naming a sub like in: defined &my_sub_name; # doing indirect sub call (but $subref->() is another valid option) &$subref(); # making a reference to a sub $coderef = \&handler;

      ..it usually does not do what you intent to do.
      just means that &get_dirs receive the current @_ even if no args are specified. The programmer must be aware of this and the feature can be also used in a profitable way.

      The example in perlsub is exahustive:

      &foo(1,2,3); # pass three arguments foo(1,2,3); # the same foo(); # pass a null list &foo(); # the same &foo; # foo() get current args, like foo(@_) !! foo; # like foo() IFF sub foo predeclared, else "foo"

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        is not Perl4, is perfectly valid Perl5

        Yes, it is perfectly valid Perl5. However it is also perfectly valid Perl4 and as afoken points out this chimes with all the rest of the script looking like Perl4 too. I don't have a perl4 installation to hand to test this on but would not be surprised if that's what the script is (or at least how it started out).

        Ampersand notation without good reason (such as the exceptions you have correctly quoted from the docs) is very much Perl4-style.

        Edited for typo and clarity

        I've updated my posting to limit the "avoid the ampersand" rule to function calls.

        just means that &get_dirs receive the current @_ even if no args are specified

        Exactly for this reason. &function; has this surprising behaviour.

        The programmer must be aware of this

        Correct. The programmer must be aware of this, because it is nastily surprising behaviour due to backwards compatibility with Perl 4. In other words: Don't expect a novice to understand what happens here.

        Compare to what we tell our little children: "Don't touch the knife." ("Don't use ampersands in function calls.") You don't explain an average three-year old the physics of cutting, the medical problems of infections and loss of large amounts of blood, and so on. Later in the life, you or someone else explains the child (or the novice) why knifes (or ampersands in front of function calls) are dangerous, and how knifes (or ampersands in front of function calls) can be useful.

        i must dissent on:
        &date_calc; and &get_dirs; is Perl4-style. Avoid the ampersand, it usually does not do what you intent to do.
        is not Perl4, is perfectly valid Perl5

        I never said that &get_dirs; was Perl 4, and I never said it was illegal in Perl 5. I named it Perl4-style, because that's how functions had to be called in Perl 4.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        There's one more context where &amp; isn't optional:
        goto &func;

        See goto.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thank you very much for the response. I really do appreciate the inputs.

      Obviously, I am newbie to PERL . This is a script that I created from looking at 3-4 existing scripts in place (read 10 years or above) and picking parts that would suit my needs.

      I will go through each and every point of yours and try to update it.

Re: Merge log files causing Out of Memory
by neilwatson (Priest) on Aug 24, 2016 at 15:18 UTC
      You might try Test::Memory::Cycle to look for issues.

      How could Test::Memory::Cycle help finding inefficient (and insecure) code like @data = `cat $dirtoget/decrypt/sample.log`; that reads probably huge log files into memory?

      The most complex data structures used in this script, apart from the $seen{$_}++ variant in get_dirs(), are arrays of strings. There is not a single trace of using references that could create circular references. How could Test::Memory::Cycle help here?

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1170330]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (5)
As of 2019-05-21 17:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you enjoy 3D movies?



    Results (136 votes). Check out past polls.

    Notices?
    • (Sep 10, 2018 at 22:53 UTC) Welcome new users!