Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
more useful options
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Your literal question about using DB ties has already been answered, so I'll skip that part here, but I will consider the bigger problem.

Basically it seems like your files represent sets, and order isn't relevant. Comparing two big sets is easiest if both sets are sorted since you can then simply keep an active pointer in each sorted sequence and progress them in tandem.

Remains the question of how to sort the sets. One way is to use unix sort, which normally will not load a big file completely in memory. So that idea leads to code like:

# warning: untested code # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *CURRENT, "find / $search_files -print | sort") || die "Could not start find: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; my $yesterday = <YESTERDAY> || INFINITY; local $_; while (<CURRENT>) { print TODAY $_; while ($yesterday lt $_) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $_ if ($yesterday gt $_) { print "New file $_"; } else { $yesterday = <YESTERDAY> || INFINITY; } } if ($yesterday ne INFINITY) { print "Lost file $yesterday"; print "Lost file $_" while <YESTERDAY>; }

Due to the sort it still has complexity O(n*log(n)) in the number of files. It would be nice if find had an option to walk the directories in lexical order, since then the sorting only needs to happen on the directory level, which very likely makes the logaritmic factor very low. Instead you could make perl do the find work. This causes you to miss out on many of the clever optimizations find style programs can do though, so this might not always be a gain (considering the amount of files you process it probably is though).

In perl you can do a directory walk using File::Find and you can even use find2perl to convert a find specification to equivalent perl code. But as a quick and dirty demo I'll show the code with a handrolled loop here where I list all names that aren't directories

# Again untested, so take care ! # A string that will sort beyond any returned file (they all start wit +h /) use constant INFINITY => chr(ord("/")+1); my $yesterday; sub walk_dir { # dir argument is assumed to already end on / my $dir = shift; opendir(local *DIR, $dir) || die "Could not opendir $dir: $!"; for (sort readdir(DIR)) { next if $_ eq "." || $_ eq ".."; my $f = "$dir$_"; if (-d $f) { walk_dir("$f/"); } else { $f .= "\n"; print TODAY $f; while ($yesterday lt $f) { print "Lost file $yesterday"; $yesterday = <YESTERDAY> || INFINITY; } # Now $yesterday ge $f if ($yesterday gt $f) { print "New file $f"; } else { $yesterday = <YESTERDAY> || INFINITY; } } } } open(local *YESTERDAY, "<", $yesterday_file) || die "Could not open $yesterday_file: $!"; open(local *TODAY, ">", $today_file) || die "Could not create $today_file: $!"; $yesterday = <YESTERDAY> || INFINITY; walk_dir("/"); if ($yesterday ne INFINITY) { print "Lost file $yesterday"; local $_; print "Lost file $_" while <YESTERDAY>; }

Update I forgot to stress that in this last solution there is no place anymore that would be expected to use a lot of memory (like e.g. a shell sort based one still would do). Real memory use will probably be only a few megabytes (I'm assuming no directory is huge).

It might in fact still be interesting to split up the task in two processes, one running a perl based find to generate the ordered list of files, and one to run the set difference, so that the diff style work can overlap in time with the directory scanning. This would allow you to do usefull work during the disk I/O wait periods.


In reply to Re: Memory Management Problem by thospel
in thread Memory Management Problem by PrimeLord

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others having an uproarious good time at the Monastery: (8)
    As of 2014-04-16 11:21 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      April first is:







      Results (424 votes), past polls