comment on

Your literal question about using DB ties has already been answered, so I'll skip that part here, but I will consider the bigger problem.

Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.

Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:

# warning: untested code

# A string that will sort beyond any returned file (they all start wit
+h /)
use constant INFINITY => chr(ord("/")+1);

open(local *YESTERDAY, "<", $yesterday_file) || 
    die "Could not open $yesterday_file: $!";
open(local *CURRENT, "find / $search_files -print | sort") ||
    die "Could not start find: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";
my $yesterday = <YESTERDAY> || INFINITY;

local $_;
while (<CURRENT>) {
    print TODAY $_;
    while ($yesterday lt $_) {
        print "Lost file $yesterday";
        $yesterday = <YESTERDAY> || INFINITY;
    }
    # Now $yesterday ge $_
    if ($yesterday gt $_) {
        print "New file $_";
    } else {
        $yesterday = <YESTERDAY> || INFINITY;
    }
}
if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    print "Lost file $_" while <YESTERDAY>;
}
[download]

Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).

In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories

# Again untested, so take care !
# A string that will sort beyond any returned file (they all start wit
+h /)
use constant INFINITY => chr(ord("/")+1);

my $yesterday;

sub walk_dir {
    # dir argument is assumed to already end on /
    my $dir = shift;
    opendir(local *DIR, $dir) || die "Could not opendir $dir: $!";
    for (sort readdir(DIR)) {
        next if $_ eq "." || $_ eq "..";
        my $f = "$dir$_";
        if (-d $f) {
            walk_dir("$f/");
        } else {
            $f .= "\n";
            print TODAY $f;
            while ($yesterday lt $f) {
                print "Lost file $yesterday";
                $yesterday = <YESTERDAY> || INFINITY;
            }
            # Now $yesterday ge $f
            if ($yesterday gt $f) {
                print "New file $f";
            } else {
                $yesterday = <YESTERDAY> || INFINITY;
            }
        }
    }
}

open(local *YESTERDAY, "<", $yesterday_file) ||
    die "Could not open $yesterday_file: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";
$yesterday = <YESTERDAY> || INFINITY;

walk_dir("/");

if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    local $_;
    print "Lost file $_" while <YESTERDAY>;
}
[download]

Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).

It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.

In reply to Re: Memory Management Problem by thospel
in thread Memory Management Problem by PrimeLord

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Syntactic Confectionery Delight
	PerlMonks