Re: Splitting up a filesystem into 'bite sized' chunks

by zork42 (Monk)
on Jul 10, 2013 at 09:02 UTC


in reply to Splitting up a filesystem into 'bite sized' chunks

I'm thinking in terms of using 'File::Find' to build up a list, but this seems ... well, a bit inefficient to me - traversing a whole directory structure, in order to feed a virus scanner a list that'll... then traverse the file structure again.
But "traversing a whole directory structure" will take only a tiny fraction (much smaller than 1/1,000,000) of the time that "traversing a whole directory structure + scanning each file" will take, so does this inefficiency matter?

If you can efficiently feed the virus scanner one file (or, better, a list of, say, 1,000 files), you might as well use 'File::Find' to build up a list of all files (even if this takes a few hours) and then feed the virus scanner from that list.
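
Something along those lines, as a minimal sketch (run_scanner() is just a stand-in for however your scanner actually accepts a list of files):

    use strict;
    use warnings;
    use File::Find;

    # Build the full file list once, up front.
    my @files;
    find( sub { push @files, $File::Find::name if -f }, '/path/to/filesystem' );

    # Feed the scanner in batches of 1,000 paths.
    while ( my @batch = splice @files, 0, 1000 ) {
        run_scanner(@batch);
    }

    sub run_scanner {
        my @paths = @_;
        print "would scan: $_\n" for @paths;    # stand-in for the real scanner call
    }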

If you can only give the virus scanner a path to scan, you could have a folder called 'scan' or something, create (hard or symbolic?) links to the next N files (*) in the 'File::Find' list, tell the virus scanner to scan the 'scan' folder, and when it has finished, remove the links and repeat.
(*) If you wanted each scan to take a similar-ish time, then rather than taking the next N files, you would be better off adding files until their total size fell between a lower and an upper MB limit.
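
A rough sketch of that link-and-scan loop (the scan folder, the size limit, and the 'virus_scanner' command line are all assumptions):

    use strict;
    use warnings;
    use File::Basename qw(basename);

    my @files     = @ARGV;                 # or the list built by File::Find as above
                                           # (paths need to be absolute for the links to resolve)
    my $scan_dir  = '/tmp/scan';           # staging folder the scanner is pointed at
    my $max_bytes = 500 * 1024 * 1024;     # assumed upper limit per batch (~500 MB)

    while (@files) {
        my $total = 0;
        my @batch;
        while ( @files && $total < $max_bytes ) {
            my $file = shift @files;
            my $size = -s $file;
            $total += $size if defined $size;
            push @batch, $file;
        }

        # Symlinks work across filesystems; hard links (link()) only within one.
        my $n = 0;
        symlink $_, sprintf( '%s/%06d_%s', $scan_dir, $n++, basename($_) ) for @batch;

        system( 'virus_scanner', '--path', $scan_dir );    # hypothetical invocation
        unlink glob "$scan_dir/*";
    }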

I've got the feeling I must have missed something here...


Re^2: Splitting up a filesystem into 'bite sized' chunks
by sundialsvc4 (Monsignor) on Jul 10, 2013 at 11:54 UTC

    Uh huh... it actually could, in this case. NFS is a monster.

Re^2: Splitting up a filesystem into 'bite sized' chunks
by Preceptor (Chaplain) on Jul 10, 2013 at 18:16 UTC

    I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue.

    My major requirement is breaking down a 10 TB, 70-million-file monster filesystem - taking time is less of a problem (though if I can optimise it, so much the better) than keeping track of progress and being able to resume processing. Just a 'find' takes a substantial amount of time on this filesystem (days). I'm considering whether File::Find will allow me to give it a 'start location' to resume processing.
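
    A bare-bones sketch of that producer/consumer layout might look like this (two workers is arbitrary, and scan_file() is only a placeholder):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use File::Find;

        my $queue = Thread::Queue->new();

        # Worker threads pull paths off the queue until they see undef.
        my @workers = map {
            threads->create( sub {
                while ( defined( my $path = $queue->dequeue() ) ) {
                    scan_file($path);
                }
            } );
        } 1 .. 2;

        # Producer: File::Find feeds the queue as it walks the tree.
        find( sub { $queue->enqueue($File::Find::name) if -f }, '/big/filesystem' );

        # One undef per worker tells it to exit once the queue drains.
        $queue->enqueue(undef) for @workers;
        $_->join() for @workers;

        sub scan_file {
            my ($path) = @_;
            print "scanning $path\n";    # stand-in for the real per-file work
        }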

      I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue. My major requirement is breaking down a 10 TB, 70-million-file monster filesystem

      Given the size of your dataset, using an in-memory queue is a fatally flawed plan from both the memory-consumption and the persistence/restartability points of view.

      I'd strongly advocate putting your file paths into a DB of some kind and having your scanning processes remove them (or mark them done) as they process them.

      That way, if any one element of the cluster fails, it can be restarted and pick up from where it left off.

      It also lends itself to doing incremental scans in subsequent passes.
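
      A sketch of that sort of restartable work list, using DBI with SQLite (the table layout and file names here are purely illustrative):

          use strict;
          use warnings;
          use DBI;
          use File::Find;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=scan_queue.db', '', '',
              { RaiseError => 1, AutoCommit => 1 } );

          $dbh->do('CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, done INTEGER DEFAULT 0)');

          # Population pass; INSERT OR IGNORE makes re-running it harmless.
          my $ins = $dbh->prepare('INSERT OR IGNORE INTO files (path) VALUES (?)');
          find( sub { $ins->execute($File::Find::name) if -f }, '/big/filesystem' );

          # Each scanner pulls a chunk of unfinished paths and marks them done as it goes.
          my $todo = $dbh->selectcol_arrayref('SELECT path FROM files WHERE done = 0 LIMIT 1000');
          for my $path (@$todo) {
              scan_file($path);
              $dbh->do( 'UPDATE files SET done = 1 WHERE path = ?', undef, $path );
          }

          sub scan_file { print "scanning $_[0]\n" }    # stand-in for the real scan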



        I was thinking I could stall the find process in order to simply buffer, rather than maintain... well, any full lists, be they database or flat file. After all, in a sense, my filesystem is a data source, and mtime tells me whether I need to rescan.
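
        For an incremental sweep, an mtime check inside the wanted sub is about all it takes; a sketch (the '.last_scan_stamp' file is an assumed way of recording when the previous pass ran):

            use strict;
            use warnings;
            use File::Find;

            # Timestamp of the previous sweep, read from an assumed stamp file.
            my $last_scan = ( stat '.last_scan_stamp' )[9] || 0;

            find(
                sub {
                    return unless -f;
                    my $mtime = ( stat _ )[9];    # reuse the stat from the -f test
                    queue_for_scan($File::Find::name) if $mtime > $last_scan;
                },
                '/big/filesystem'
            );

            sub queue_for_scan { print "rescan: $_[0]\n" }    # stand-in for the real queueing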

Re^2: Splitting up a filesystem into 'bite sized' chunks
by Preceptor (Chaplain) on Jul 10, 2013 at 18:32 UTC

    One of the suggestions I've picked up is the notion of belief propagation. I'm not entirely sure how well it'll apply, but it's something I'll look into.

    I'm looking at a node-distributed solution for bulk filesystem traversal and processing. NFS makes this easier on one hand, but harder on the other. I've got a lot of spindles and controllers behind the problem, though, especially if I'm able to make it quiesce during peak times. (Which is no small part of why I'm trying to do 'clever stuff' with it: virus scanning 100k files per hour or so is going to take me nearly a year if I'm doing 2 billion of the blasted things, but that might be acceptable if I can then treat it as a baseline and do incremental sweeps thereafter.)

    I think however I slice the problem, it's still going to be big and chewy.
