Splitting up a filesystem into 'bite sized' chunks

by Preceptor (Deacon)
on Jul 09, 2013 at 19:57 UTC ( [id://1043370] )

Preceptor has asked for the wisdom of the Perl Monks concerning the following question:

I've got a need to run a virus scan on my server estate. I have some rather large filesystems to deal with - 70 million files, 30 TB of data sort of large. I've also got rather a lot of filesystems to process. So I hit the perennial problem of the virus checkers taking so long to run their scheduled scan that they never actually finish.

What I've started doing is pulling together a set of 'scanning servers', with a view to distributing the workload. For smaller filesystems this is good enough - I can do one at a time, with a crude queuing mechanism. But when I hit the larger filesystems, I need to break the problem down into manageable slices. I don't need to be particularly precise - if a chunk is between 1,000 and 100,000 files, that's good enough granularity.

I'm thinking in terms of using 'File::Find' to build up a list, but this seems... well, a bit inefficient to me - traversing a whole directory structure in order to feed a virus scanner a list that will then traverse the file structure again. Can anyone offer better suggestions for how to 'divide up' an NFS filesystem without doing a full directory tree traversal?

Update: OS is Linux, but I can probably wangle a rebuild to a Windows platform if it's useful. The virus scanner I'm using is Sophos, but it's triggered in much the same way as 'find' - hand it a path to traverse and it 'does the business'.
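
For what it's worth, this is roughly the File::Find chunking I had in mind - only a sketch, with the chunk size and output file naming as arbitrary placeholders:

    use strict;
    use warnings;
    use File::Find;

    # Walk the mount point once and write the paths out in fixed-size
    # chunk files, each of which can later be queued to a scanning server.
    my $root       = shift @ARGV or die "usage: $0 /mnt/nfs/volume\n";
    my $chunk_size = 50_000;                    # anywhere in the 1k-100k range
    my ( $chunk_no, $in_chunk, $fh ) = ( 0, 0, undef );

    sub next_chunk {
        close $fh if $fh;
        open $fh, '>', sprintf( 'chunk_%05d.lst', ++$chunk_no )
            or die "open chunk file: $!";
        $in_chunk = 0;
    }

    next_chunk();
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $_;            # plain files only
                print {$fh} "$File::Find::name\n";
                next_chunk() if ++$in_chunk >= $chunk_size;
            },
        },
        $root
    );
    close $fh;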

Replies are listed 'Best First'.
Re: Splitting up a filesystem into 'bite sized' chunks
by ambrus (Abbot) on Jul 09, 2013 at 21:22 UTC

      Looks promising - it's very similar to what I'm thinking of. I'll have a look through. Thanks.

Re: Splitting up a filesystem into 'bite sized' chunks
by BrowserUk (Patriarch) on Jul 09, 2013 at 21:17 UTC
    I'm thinking in terms of using 'File::Find' to build up a list, but this seems... well, a bit inefficient to me - traversing a whole directory structure in order to feed a virus scanner a list that will then traverse the file structure again. Can anyone offer better suggestions for how to 'divide up' an NFS filesystem without doing a full directory tree traversal?

    Doesn't your virus scanner have a 'scan this file only' option?

    Beyond that, I'd look at giving the scanner one drive at a time, rather than (bits of) one file system. Drives come in a small number of fixed sizes, so that would make the capacity planning for your distributed system fairly simple.

    I realise that *nix file systems are logically single entities, but it is surely possible to mount individual drives/raid units so that each appears as a single subdirectory within the file system.



      Unfortunately, I'm pulling NFS mounts off a NAS, so I can't easily subdivide my volumes. I've got a few places where they're separated into (known size) subdirectories, and that's fine. But almost by the nature of the thing, the most unwieldy filesystems are the ones with silly numbers of TB and files within a single structure. I can subdivide the mountpoints, but I'd rather not do it by hand.

      It's a good point though - my scanner probably does have 'per file' scanning, which would mean I could stream a file list from a single source to multiple scanning engines. So perhaps that's the way to go.

      In the grand scheme of things though, the biggest problem isn't so much parallelising the scans on a single filesystem - that'll just create contention - as having a good notion of a process that can be resumed part way through. It's not such a big deal that it finishes within a defined time window; it's more that I can track progress and ensure everything _does_ get scanned eventually.
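
      As a first cut at the resumable part, something like the sketch below is what I'm picturing. The scan command is just a placeholder for whatever per-file (or per-list) invocation the scanner actually supports, and the 'done' log is simply a text file of completed paths:

          use strict;
          use warnings;

          # Feed a pre-built file list to the scanner in batches, logging each
          # completed path so an interrupted run can pick up where it left off.
          my ( $list, $done_log ) = @ARGV;
          die "usage: $0 filelist done.log\n" unless $list and $done_log;

          my $scan_cmd   = '/usr/local/bin/scan-files';    # placeholder
          my $batch_size = 1_000;

          my %done;                                        # already-scanned paths
          if ( -e $done_log ) {
              open my $d, '<', $done_log or die "read $done_log: $!";
              while (<$d>) { chomp; $done{$_} = 1 }
          }

          open my $in,  '<',  $list     or die "read $list: $!";
          open my $out, '>>', $done_log or die "append $done_log: $!";

          my @batch;
          while ( my $path = <$in> ) {
              chomp $path;
              next if $done{$path};
              push @batch, $path;
              flush_batch( \@batch, $out, $scan_cmd ) if @batch >= $batch_size;
          }
          flush_batch( \@batch, $out, $scan_cmd );

          sub flush_batch {
              my ( $batch, $log, $cmd ) = @_;
              return unless @$batch;
              system( $cmd, @$batch ) == 0
                  or warn "scanner returned non-zero for a batch\n";
              print {$log} "$_\n" for @$batch;             # mark them done
              @$batch = ();
          }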

Re: Splitting up a filesystem into 'bite sized' chunks
by zork42 (Monk) on Jul 10, 2013 at 09:02 UTC
    I'm thinking in terms of using 'File::Find' to build up a list, but this seems ... well, a bit inefficient to me - traversing a whole directory structure, in order to feed a virus scanner a list that'll... then traverse the file structure again.
    But "traversing a whole directory structure" will take a very tiny fraction (much smaller than 1/1,000,000) of the amount of time that "traversing a whole directory structure + scaning each file" will take, so does this inefficiency matter?

    If you can efficiently feed the virus scanner 1 file (or, better, a list of say 1,000 files), you might as well use 'File::Find' to build up a list of all files (even if this takes a few hours), then feed the virus scanner from that list.

    If you can only give the virus scanner a path to scan, you could have a folder called 'scan' or something, create (hard / symbolic?) links to the next N files (*) in the 'File::Find' list, tell the virus scanner to scan the 'scan' folder, and when it's finished remove the links and repeat.
    (*) If you wanted each scan to take a similar-ish time, rather than taking the next N files you'd be better off adding files until they totalled between a lower MB limit and an upper MB limit.

    I've got the feeling I must have missed something here...
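
    Something like the sketch below is what I mean by the link-and-scan loop. It assumes the scanner will follow symbolic links; the scan command, scratch directory and batch size are placeholders:

        use strict;
        use warnings;
        use File::Basename qw(basename);
        use File::Path     qw(make_path remove_tree);

        # Link the next N files from the list into a scratch folder,
        # scan that folder, remove the links, repeat.
        my $list       = shift @ARGV or die "usage: $0 filelist\n";
        my $scan_dir   = '/tmp/scan_batch';
        my $scan_cmd   = '/usr/local/bin/scan-tree';       # placeholder
        my $batch_size = 1_000;

        open my $fh, '<', $list or die "open $list: $!";
        chomp( my @files = <$fh> );
        close $fh;

        while ( my @batch = splice @files, 0, $batch_size ) {
            remove_tree($scan_dir) if -d $scan_dir;
            make_path($scan_dir);

            my $n = 0;
            for my $path (@batch) {
                # prefix a counter so duplicate basenames don't collide
                symlink $path,
                    sprintf( '%s/%06d_%s', $scan_dir, $n++, basename($path) )
                    or warn "symlink $path: $!";
            }

            system( $scan_cmd, $scan_dir ) == 0
                or warn "scanner returned non-zero for this batch\n";
        }
        remove_tree($scan_dir) if -d $scan_dir;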

      One of the suggestions I've picked up is the notion of belief propagation. I'm not entirely sure how well it'll apply, but it's something I'll look into.

      I'm looking at a node-distributed solution for bulk filesystem traversal and processing. NFS makes this easier on one hand, but harder on the other. I've got a lot of spindles and controllers behind the problem, though, especially if I'm able to make it quiesce during peak times. (Which is no small part of why I'm trying to do 'clever stuff' with it - virus scanning 100k files per hour or so is going to take me nearly a year if I'm doing 2 billion of the blasted things, although that might be acceptable if I can then treat it as a baseline and do incremental sweeps thereafter.)

      I think however I slice the problem, it's still going to be big and chewy.

      I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue.

      My major requirement is breaking down a 10 TB, 70-million-file monster filesystem - taking time is less of a problem (though if I can optimise it, so much the better) than keeping track of progress and being able to resume processing. Just a 'find' takes a substantial amount of time on this filesystem (days). I'm considering whether File::Find will allow me to give it a 'start location' to resume processing.
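
      Roughly along the lines of the sketch below - very much a work in progress, with the worker count and the per-file scan command as placeholders, plus a crude back-pressure check so the in-memory queue doesn't balloon:

          use strict;
          use warnings;
          use threads;
          use Thread::Queue;
          use File::Find;

          my $root    = shift @ARGV or die "usage: $0 /mnt/nfs/volume\n";
          my $workers = 4;                           # guess, tune to taste
          my $q       = Thread::Queue->new;

          # Worker threads: pull paths off the queue and scan them.
          my @threads = map {
              threads->create( sub {
                  while ( defined( my $path = $q->dequeue ) ) {
                      system( '/usr/local/bin/scan-file', $path ) == 0    # placeholder
                          or warn "scan failed: $path\n";
                  }
              } );
          } 1 .. $workers;

          # Producer: File::Find feeds the queue as it walks the tree.
          find(
              {
                  no_chdir => 1,
                  wanted   => sub {
                      return unless -f $_;
                      sleep 1 while $q->pending > 10_000;  # crude throttle
                      $q->enqueue($File::Find::name);
                  },
              },
              $root
          );

          $q->enqueue(undef) for 1 .. $workers;      # one 'stop' token per worker
          $_->join for @threads;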

        I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue. My major requirement is breaking down a 10Tb, 70million file monster filesystem

        Given the size of your dataset, using an in-memory queue is a fatally flawed plan from both the memory consumption and the persistence/re-startability points of view.

        I'd strongly advocate putting your file paths into a DB of some kind and having your scanning processes remove them (or mark them done) as they process them.

        That way, if any one element of the cluster fails, it can be restarted and pick up from where it left off.

        It also lends itself to doing incremental scans in subsequent passes.
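
        For illustration, a rough sketch of that kind of DB-backed queue, using SQLite via DBI (DBD::SQLite assumed to be installed). The table layout and scan command are invented for the example; a multi-host cluster would want a proper database server rather than an SQLite file:

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=scan_queue.db', '', '',
                { RaiseError => 1, AutoCommit => 1 } );

            # One row per path; done = 0 until a scanner marks it finished.
            $dbh->do(q{
                CREATE TABLE IF NOT EXISTS files (
                    path TEXT PRIMARY KEY,
                    done INTEGER NOT NULL DEFAULT 0
                )
            });

            # (Loading phase, run once: INSERT OR IGNORE each path produced
            #  by File::Find or a plain 'find' into the table.)

            my $pick = $dbh->prepare('SELECT path FROM files WHERE done = 0 LIMIT 1000');
            my $mark = $dbh->prepare('UPDATE files SET done = 1 WHERE path = ?');

            while (1) {
                my $batch = $dbh->selectcol_arrayref($pick);
                last unless @$batch;                         # nothing left to do

                system( '/usr/local/bin/scan-files', @$batch ) == 0    # placeholder
                    or warn "scanner returned non-zero for this batch\n";

                $dbh->begin_work;
                $mark->execute($_) for @$batch;              # mark the batch done
                $dbh->commit;
            }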


Re: Splitting up a filesystem into 'bite sized' chunks
by sundialsvc4 (Abbot) on Jul 12, 2013 at 04:37 UTC

    Maybe I should adopt the principle of writing every single terse comment that I am prone to as a splendiferously loquacious paragraph, or three, in a vain attempt to forestall the “down-vote demons.” I dunno. But, wrapped up in the terse comment “NFS is a monster” is a very valid point: NFS is a network file system that does not (unlike, say, Microsoft’s famous system) pretend to be otherwise.

    With NFS, filesystems can be unfathomably large, and network transports can be slow, and NFS will still work. However, all that having been said... your (Perl-implemented) algorithms must match. You must, for example, come up with a plausible strategy for “splitting up a filesystem into bite-sized chunks,” whatever that strategy might be, that assumes both that you cannot immediately ascertain how many files/directories are in any particular area of that filesystem, and that you cannot obtain such a count in a timely fashion. Instead of an algorithm, therefore, you are obliged to make use of a heuristic.
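
    One such heuristic, sketched very roughly below: count nothing at all; simply descend a fixed number of directory levels and treat each subtree found at that depth as one work unit. The depth is an arbitrary choice, and loose files sitting in the intermediate directories would still need a shallow pass of their own.

        use strict;
        use warnings;

        my $root  = shift @ARGV or die "usage: $0 /mnt/nfs/volume\n";
        my $depth = 2;                        # arbitrary

        my @chunks = ($root);
        for ( 1 .. $depth ) {
            my @next;
            for my $dir (@chunks) {
                opendir my $dh, $dir or do { push @next, $dir; next };
                my @subdirs = grep { -d $_ && !-l $_ }
                              map  { "$dir/$_" }
                              grep { $_ ne '.' && $_ ne '..' } readdir $dh;
                closedir $dh;
                # a directory with no subdirectories stays a chunk of its own
                push @next, @subdirs ? @subdirs : $dir;
            }
            @chunks = @next;
        }

        print "$_\n" for @chunks;             # one scanning unit per line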

      NFS does have its limitations - one of these is the transport layer. You can do 10G multi-channel Ethernet, but you still have a price to pay in connection latency. With a large storage environment you get a lot of spindles and controllers, but that only helps if you can go wide on your IO.

Re: Splitting up a filesystem into 'bite sized' chunks
by zork42 (Monk) on Jul 13, 2013 at 11:13 UTC
    Just a 'find' takes a substantial amount of time on this filesystem (days).
    1. Any idea why it takes so long?
    2. Does find or File::Find spend most of the time waiting for a response from the remote server?
    3. Would you be able to speed up File::Find by searching multiple directories simultaneously?
      If you go down a few directory levels and find (say) 100 subfolders, could you search each of those subfolders simultaneously?
    4. Would you be able to speed up File::Find by using RPCs?

      Contention and the sheer number of files, mostly. Parallel traversals will help if I divide the workload sensibly - I've got a lot of spindles and controllers. Some filesystems will work OK with a 'traverse down' approach, but others have a much more random distribution. I don't want to extend my batches too much, because of outages, glitches, etc.
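
      A rough sketch of that 'traverse down' parallelism, assuming Parallel::ForkManager from CPAN is available - the worker cap and the output file names are placeholders:

          use strict;
          use warnings;
          use File::Find;
          use Parallel::ForkManager;

          my $root = shift @ARGV or die "usage: $0 /mnt/nfs/volume\n";

          opendir my $dh, $root or die "opendir $root: $!";
          my @top = grep { !/^\.\.?$/ && -d "$root/$_" } readdir $dh;
          closedir $dh;

          my $pm = Parallel::ForkManager->new(8);   # 8 concurrent walks

          for my $dir (@top) {
              $pm->start and next;                  # parent: spawn and move on

              ( my $tag = $dir ) =~ s/\W+/_/g;
              open my $out, '>', "filelist_$tag.lst" or die "open: $!";
              find(
                  {
                      no_chdir => 1,
                      wanted   => sub { print {$out} "$File::Find::name\n" if -f $_ },
                  },
                  "$root/$dir"
              );
              close $out;

              $pm->finish;                          # child exits here
          }
          $pm->wait_all_children;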
