Your literal question about using DB ties has already been answered,
so I'll skip that part here, but I will consider the bigger problem.
Basically it seems like your files represent sets, where order isn't
relevant. Comparing two big sets is easiest if both are sorted,
since you can then keep a pointer into each sorted sequence and
advance the two in tandem.
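To make the tandem walk concrete, here is a minimal sketch on in-memory arrays. The name diff_sorted and its return shape are made up for this illustration, and it assumes both lists are sorted with perl's plain string comparison and contain no duplicates.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two sorted, duplicate-free lists.
# Returns (\@lost, \@added): items only in $old, items only in $new.
sub diff_sorted {
    my ($old, $new) = @_;
    my (@lost, @added);
    my ($i, $j) = (0, 0);
    while ($i < @$old && $j < @$new) {
        my $cmp = $old->[$i] cmp $new->[$j];
        if    ($cmp < 0) { push @lost,  $old->[$i++] }  # only in old
        elsif ($cmp > 0) { push @added, $new->[$j++] }  # only in new
        else             { $i++; $j++ }                 # in both
    }
    # Whatever remains on either side has no partner
    push @lost,  @{$old}[$i .. $#$old];
    push @added, @{$new}[$j .. $#$new];
    return (\@lost, \@added);
}

my ($lost, $added) = diff_sorted([qw(/a /b /d)], [qw(/b /c /d /e)]);
print "Lost file $_\n" for @$lost;
print "New file $_\n"  for @$added;
```

Each element is looked at once, so the comparison itself is linear. Any cost beyond that comes from getting the two lists sorted in the first place.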
That leaves the question of how to sort the sets. One way is to use
unix sort, which normally will not load a big file completely into
memory. That idea leads to code like:
# warning: untested code
# A string that will sort beyond any returned file (they all start with /)
use constant INFINITY => chr(ord("/")+1);

open(local *YESTERDAY, "<", $yesterday_file) ||
    die "Could not open $yesterday_file: $!";
# The trailing | makes perl read the output of the pipeline.
# LC_ALL=C makes sort order by bytes, which is how perl's lt and gt compare
open(local *CURRENT, "find / $search_files -print | LC_ALL=C sort |") ||
    die "Could not start find: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";

my $yesterday = <YESTERDAY> || INFINITY;
local $_;
while (<CURRENT>) {
    print TODAY $_;
    while ($yesterday lt $_) {
        print "Lost file $yesterday";
        $yesterday = <YESTERDAY> || INFINITY;
    }
    # Now $yesterday ge $_
    if ($yesterday gt $_) {
        print "New file $_";
    } else {
        $yesterday = <YESTERDAY> || INFINITY;
    }
}
if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    print "Lost file $_" while <YESTERDAY>;
}
Because of the sort this still has complexity O(n*log(n)) in the
number of files. It would be nice if find had an option to walk the
directories in lexical order, since then sorting would only have
to happen at the directory level, which very likely makes the
logarithmic factor very low. Failing that, you can make perl do the
find work itself. This means you miss out on many of the clever
optimizations find style programs can do, so it might not always be
a gain (considering the number of files you process it probably is,
though).
In perl you can do a directory walk using File::Find, and you can even
use find2perl to convert a find specification to equivalent perl code.
But as a quick and dirty demo I'll show the code with a hand-rolled
loop here, in which I list all names that aren't directories:
# Again untested, so take care !
# A string that will sort beyond any returned file (they all start with /)
use constant INFINITY => chr(ord("/")+1);
my $yesterday;

sub walk_dir {
    # dir argument is assumed to already end on /
    my $dir = shift;
    opendir(local *DIR, $dir) || die "Could not opendir $dir: $!";
    # Give real directories (not symlinks to them, which could loop)
    # a trailing / before sorting. Their contents are listed as "name/...",
    # so this makes the walk order match plain string order of the full
    # paths, which is what the lt/gt comparisons below rely on
    my @names;
    for (readdir(DIR)) {
        next if $_ eq "." || $_ eq "..";
        push @names, (-d "$dir$_" && !-l "$dir$_") ? "$_/" : $_;
    }
    closedir(DIR);
    for (sort @names) {
        my $f = "$dir$_";
        if ($f =~ m{/\z}) {
            walk_dir($f);
        } else {
            $f .= "\n";
            print TODAY $f;
            while ($yesterday lt $f) {
                print "Lost file $yesterday";
                $yesterday = <YESTERDAY> || INFINITY;
            }
            # Now $yesterday ge $f
            if ($yesterday gt $f) {
                print "New file $f";
            } else {
                $yesterday = <YESTERDAY> || INFINITY;
            }
        }
    }
}

open(local *YESTERDAY, "<", $yesterday_file) ||
    die "Could not open $yesterday_file: $!";
open(local *TODAY, ">", $today_file) ||
    die "Could not create $today_file: $!";
$yesterday = <YESTERDAY> || INFINITY;
walk_dir("/");
if ($yesterday ne INFINITY) {
    print "Lost file $yesterday";
    local $_;
    print "Lost file $_" while <YESTERDAY>;
}
Update
I forgot to stress that in this last solution there is no longer any
place that would be expected to use a lot of memory (as a solution
based on the shell's sort still would). Real memory use will probably
be only a few megabytes (I'm assuming no directory is huge).
It might in fact still be interesting to split the task into two
processes, one running a perl based find to generate the ordered list
of files and one running the set difference, so that the diff style
work can overlap in time with the directory scanning. This would allow
you to do useful work during the disk I/O wait periods.
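As a rough sketch of that two process split, assuming a unix-like system where fork is available: the helper name spawn_lister is made up for this example, and the code ref passed to it is only a stand-in for the real directory walk.

```perl
use strict;
use warnings;

# Hypothetical helper: run $producer (a code ref that prints lines to
# STDOUT) in a child process and return a handle that reads its output.
sub spawn_lister {
    my ($producer) = @_;
    # "-|" forks; the child's STDOUT becomes the write side of a pipe
    my $pid = open(my $from_child, "-|");
    die "Could not fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: generate the ordered list, then leave without returning
        $producer->();
        exit 0;
    }
    return $from_child;    # parent: read side of the pipe
}

# Stand-in for the real walk: prints an already ordered list of names
my $lister = spawn_lister(sub {
    print "$_\n" for qw(/etc/hosts /etc/passwd /tmp/x);
});
while (my $file = <$lister>) {
    # The set difference against yesterday's list would go here; it can
    # run while the child is still waiting on the disk
    print "Seen $file";
}
close($lister) || warn "Lister exited with status $?";
```

The pipe gives you the overlap for free: the parent blocks only when the child has nothing ready yet, and the child blocks only when the pipe buffer is full.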