|Problems? Is your data what you think it is?|
Re: Memory Management Problemby thospel (Hermit)
|on Nov 21, 2003 at 19:35 UTC||Need Help??|
Your literal question about using DB ties has already been answered, so I'll skip that part here, but I will consider the bigger problem.
Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.
Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:
Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).
In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories
Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).
It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.