Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re: Memory Management Problem

by thospel (Hermit)
on Nov 21, 2003 at 19:35 UTC ( #309016=note: print w/replies, xml ) Need Help??

in reply to Memory Management Problem

Your literal question about using DB ties has already been answered, so I'll skip that part here, but I will consider the bigger problem.

Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.

Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:

# warning: untested code # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *CURRENT, "find / $search_files -print | sort") || die "Could not start find: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; my $yesterday = <YESTERDAY> || INFINITY; local $_; while (<CURRENT>) { print TODAY $_; while ($yesterday lt $_) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $_ if ($yesterday gt $_) { print "New file $_"; } else { $yesterday = <YESTERDAY> || INFINITY; } } if ($yesterday ne INFINITY) { print "Lost file $yesterday"; print "Lost file $_" while <YESTERDAY>; }

Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).

In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories

# Again untested, so take care ! # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); my $yesterday; sub walk_dir { # dir argument is assumed to already end on / my $dir = shift; opendir(local *DIR, $dir) || die "Could not opendir $dir: $!"; for (sort readdir(DIR)) { next if $_ eq "." || $_ eq ".."; my $f = "$dir$_"; if (-d $f) { walk_dir("$f/"); } else { $f .= "\n"; print TODAY $f; while ($yesterday lt $f) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $f if ($yesterday gt $f) { print "New file $f"; } else { $yesterday = <YESTERDAY> || INFINITY; } } } } open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; $yesterday = <YESTERDAY> || INFINITY; walk_dir("/"); if ($yesterday ne INFINITY) { print "Lost file $yesterday"; local $_; print "Lost file $_" while <YESTERDAY>; }

Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).

It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://309016]
[Corion]: stonecolddevin: I think Jack White is highly regarded, but ...
[Corion]: ... I'm not really knowledgeable about good guitar players
[stonecolddevin]: I've been on a Stevie Ray Vaughan kick lately: https://www. v=wVjdMLAMbM0
[stonecolddevin]: Corion I haven't heard much of his work to be honest.
[erix]: here is a nice cover, stevieb
[planetscape]: hello, Corion
[Corion]: Hi planetscape!
[stevieb]: Corion I like the groundbreaking ones (guitar players). I have the ability to pick up on sounds that are groundbreaking or specific to a person, thanks to my years of doing recording/mixing/ sampling (hip-hop mind you, but years of it...
[stevieb]: ...has honed in my skills of recognizing sound
[stevieb]: All of the early members are coming out of the woodwork today :) Hey, planetscape

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (9)
As of 2017-06-22 21:26 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (531 votes). Check out past polls.