Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

Memory Management Problem

by PrimeLord (Pilgrim)
on Nov 20, 2003 at 21:27 UTC ( #308711=perlquestion: print w/replies, xml ) Need Help??
PrimeLord has asked for the wisdom of the Perl Monks concerning the following question:

Monks I come to you seeking your wisdom once again. I am having a bit of a memory management issue I was hoping you could help me with. I have written a script that produces a daily report on a system scan it does. It first reads in a benchmark file from the scan the day before into a hash. It then runs the daily scan dumping the info it finds into the new benchmark file and also compares the information to the data from the day before in the hash.

The problem I am running into is the data being read into the hash can be several hunder megabytes in size. I need to find a more efficent way to handle this data. Here is an example of the code I have written.
use strict; sub _read_benchmark { my %yesterday; open BENCH, "benchmark_file" or die "$!"; while (<BENCH>) { chomp; $yesterday{$_}++; } close BENCH or warn "$!"; return \%yesterday; } sub _scan_system { my $yesterday = shift; my %today; open BENCH, "> benchmark_file" or die "$!"; open IN, "find / $search_files -print |" or die "$!"; while (<IN>) { chomp; print BENCH "$_\n"; if (exists $yesterday->{$_}) { delete $yesterday->{$_}; } else { $today{$_}++; } } close IN or warn "$!"; close BENCH or warn "$!"; return \%today, $yesterday; } sub _print_report { ... } my $yesterday = _read_benchmark; my ($today, $yesterday) = _scan_system($yesterday); _print_report($today, $yesterday);
I believe there is a way I can tie the benchmark file to a hash and that would probably be a huge improvement, but I am not sure how to do that. Any suggestions on how to make this less of a memory hog would be very appreciated. Thanks!


Replies are listed 'Best First'.
Re: Memory Management Problem
by PodMaster (Abbot) on Nov 20, 2003 at 21:37 UTC
Re: Memory Management Problem
by Roger (Parson) on Nov 21, 2003 at 00:07 UTC
    If you want to minimize memory usage, you will have to suffer in speed somewhere, somehow. There is a trade off that you have to make.

    I think one of the easiest method is to use a proper database to record your files, like MySQL, Postgres, Oracle, etc. Create a table with an *index* on the filename to speed up SQL query. Let the database do the search optimization for you.

    #!/usr/local/bin/perl -w use strict; use DBI; use DBD::Sybase; use File::Find; my $dbh = ....; # connect to database my $sth = $dbh->prepare("select filename from bench where filename=?") +; # look for new files my @new_files; find( { follow => 1, no_chdir => 1, wanted => sub { if (! /\.$/) { # ignore unwanted . or .. $sth->execute($_); my $file_exists; while (my @res = $sth->fetchrow_array()) { $file_exists+ ++ } push @new_files if !$file_exits; } } }, '/'); $sth->finish; # insert new files into the database $sth = $dbh->prepare("insert into bench (filename) values (?)"); $sth->execute($_) for @new_files; $sth->finish; # do stuff with @new_files ....
    I can think of another solution, although the performance in speed is not great, that uses resonable amount of memory.

    #!/usr/local/bin/perl -w use strict; use File::Find; my $bench_file = 'bench.txt'; my @new_files; find( { follow => 1, no_chdir => 1, wanted => \&callback }, '/'); sub callback { if (! /\.$/) { # ignore unwanted . or .. if ( ! `grep '$_' $bench_file` ) { push @new_files, $_; # remember this file } } } # append unseen filenames to bench.txt file open BENCH, ">>bench.txt" or die "Can not append to bench.txt"; print BENCH "$_\n" foreach (@new_files); close BENCH;
Re: Memory Management Problem
by Zaxo (Archbishop) on Nov 20, 2003 at 21:40 UTC

    How about dumping the new report to a file, then generating the comparison of the two reports line-by-line? That will remove all but the wildest memory constraints. You can rename files after you're done with all that.

    After Compline,

Re: Memory Management Problem
by duff (Parson) on Nov 20, 2003 at 21:51 UTC

    Depending on the actual content of your report, you may want to push some of the work you're currently doing in perl out to the find command. You could use the -cnewer option to find to get a list of files that are newer than the timestamp of some other file. So, after each scan, you touch a special file (maybe that's your benchmark_file) and then use that next time to find out which are newer. man find.

    Using that information (and some other standard unix utilities) you should be able to generate a file with all of the files that were there yesterday and another file with all of the files that are new today. And then just read those two files a line at a time for your report (again, depending on the exact output of your report).

    Just some ideas ...

Re: Memory Management Problem
by swngnmonk (Pilgrim) on Nov 21, 2003 at 07:02 UTC


    Can you provide a little more information about the contents of the Bench file, and what print_report() is doing?

    From what I gather, the bench file is simply a list of absolute file paths on the filesystem (since you're using a find call to populate %today). What exactly are you trying to track?

    Another question - have you confirmed your find command on your machine? On my box (redhat 9), that call to find (assuming $search_files is a scaler for a text match of some kind) would return every file on the filesystem. Are you sure you're getting the correct results?

    Now that I think about it, I've got an idea on a general approach, assuming you've got access to the standard Unix utils - use sort, uniq, and diff, and parse the output of the diff. e.g.

    `cat benchmark_files|sort|uniq -c > benchmark_counted`; `find / $search_files -print |sort | uniq -c > todays_find`; open IN, "diff benchmark_counted todays_find|" or die "$!"; while (<IN>) { ## parse diff output into %yesterday and %today ## an exercise for the reader } close IN;

    By using the unix tools, you've now got the same output as you had after the call to _scan_system(). Note - diff will flag identical lines with different counts (that's what the -c option to uniq does) - you'd have to account for that when parsing the diff output.

    This assumes, of course, that the real memory hog is %yesterday, before a pile of keys are deleted in building %today. If I'm wrong, and at the end of processing %yesterday and %today are both too big to handle by print_report(), you may well need to look at some kind of BerkeleyDB-type solution, but realize it's going to slow things down by a lot.

    I hope this helps - sort/diff/uniq can be a great way to reduce the load on perl when processing large files.

        Doh. Point made. :)

        let that line read:

        `sort benchmark_files | uniq -c > benchmark_counted`;

        And all can be right in the world.

Re: Memory Management Problem
by thospel (Hermit) on Nov 21, 2003 at 19:35 UTC
    Your literal question about using DB ties has already been answered, so I'll skip that part here, but I will consider the bigger problem.

    Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.

    Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:

    # warning: untested code # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *CURRENT, "find / $search_files -print | sort") || die "Could not start find: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; my $yesterday = <YESTERDAY> || INFINITY; local $_; while (<CURRENT>) { print TODAY $_; while ($yesterday lt $_) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $_ if ($yesterday gt $_) { print "New file $_"; } else { $yesterday = <YESTERDAY> || INFINITY; } } if ($yesterday ne INFINITY) { print "Lost file $yesterday"; print "Lost file $_" while <YESTERDAY>; }

    Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).

    In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories

    # Again untested, so take care ! # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); my $yesterday; sub walk_dir { # dir argument is assumed to already end on / my $dir = shift; opendir(local *DIR, $dir) || die "Could not opendir $dir: $!"; for (sort readdir(DIR)) { next if $_ eq "." || $_ eq ".."; my $f = "$dir$_"; if (-d $f) { walk_dir("$f/"); } else { $f .= "\n"; print TODAY $f; while ($yesterday lt $f) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $f if ($yesterday gt $f) { print "New file $f"; } else { $yesterday = <YESTERDAY> || INFINITY; } } } } open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; $yesterday = <YESTERDAY> || INFINITY; walk_dir("/"); if ($yesterday ne INFINITY) { print "Lost file $yesterday"; local $_; print "Lost file $_" while <YESTERDAY>; }

    Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).

    It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://308711]
Approved by delirium
Front-paged by broquaint
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2018-07-18 14:52 GMT
Find Nodes?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?

    Results (393 votes). Check out past polls.