PerlMonks  

Re^2: Splitting up a filesystem into 'bite sized' chunks

by Preceptor (Chaplain)
on Jul 10, 2013 at 18:16 UTC (#1043521)


in reply to Re: Splitting up a filesystem into 'bite sized' chunks
in thread Splitting up a filesystem into 'bite sized' chunks

I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue.

My major requirement is breaking down a 10 TB, 70-million-file monster filesystem. Taking time is less of a problem (though if I can optimise it, so much the better) than keeping track of progress and being able to resume processing. Just a 'find' takes a substantial amount of time on this filesystem (days). I'm wondering whether File::Find will let me give it a 'start location' from which to resume processing.


Re^3: Splitting up a filesystem into 'bite sized' chunks
by BrowserUk (Pope) on Jul 10, 2013 at 19:17 UTC
    I'm working on something that uses File::Find to send file lists to another thread (or two) that's using Thread::Queue. My major requirement is breaking down a 10 TB, 70-million-file monster filesystem

    Given the size of your dataset, using an in-memory queue is a fatally flawed plan from both the memory-consumption and the persistence/restartability points of view.

    I'd strongly advocate putting your file paths into a DB of some kind and having your scanning processes remove them (or mark them done) as they process them.

    That way, if any one element of the cluster fails, it can be restarted and pick up from where it left off.

    It also lends itself to doing incremental scans in subsequent passes.
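A minimal sketch of the kind of DB-backed work list described above, assuming DBI with DBD::SQLite is installed (any DBI-supported database would do). The table name `todo`, its columns, and the sub names `open_queue`, `enqueue`, `claim_next`, `mark_done` and `requeue_stale` are all made up for illustration; this is one possible shape, not a recommendation of a specific schema.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::SQLite is available. $dbfile is a file name of your
# choosing (or ':memory:' for a throwaway test).
sub open_queue {
    my ($dbfile) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS todo (
            path  TEXT PRIMARY KEY,
            state TEXT NOT NULL DEFAULT 'pending'
        )
    });
    return $dbh;
}

# Scanner side: add paths in one transaction; paths already present are skipped.
sub enqueue {
    my ($dbh, @paths) = @_;
    my $sth = $dbh->prepare('INSERT OR IGNORE INTO todo (path) VALUES (?)');
    $dbh->begin_work;
    $sth->execute($_) for @paths;
    $dbh->commit;
}

# Worker side: pick one pending path and mark it as being worked on.
sub claim_next {
    my ($dbh) = @_;
    $dbh->begin_work;
    my ($path) = $dbh->selectrow_array(
        q{SELECT path FROM todo WHERE state = 'pending' ORDER BY path LIMIT 1});
    $dbh->do(q{UPDATE todo SET state = 'working' WHERE path = ?}, undef, $path)
        if defined $path;
    $dbh->commit;
    return $path;    # undef when nothing is pending
}

sub mark_done {
    my ($dbh, $path) = @_;
    $dbh->do(q{UPDATE todo SET state = 'done' WHERE path = ?}, undef, $path);
}

# After a crash: anything still marked 'working' goes back to 'pending'.
sub requeue_stale {
    my ($dbh) = @_;
    return $dbh->do(q{UPDATE todo SET state = 'pending' WHERE state = 'working'});
}
```

On restart, a call to `requeue_stale` followed by the usual `claim_next` loop picks up where the previous run stopped; rows already marked 'done' are never handed out again, which is also what makes incremental passes cheap.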


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I was thinking I could stall the find process in order to simply buffer, rather than maintain ... Well, any full lists, be they database or flat file. After all, in a sense, my filesystem is a datasource, and mtime tells me if I need to rescan.
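For illustration, the mtime idea could look roughly like this with File::Find. The sub name `changed_since` and its arguments are hypothetical. Note that a directory's mtime only changes when entries are added or removed, not when an existing file's contents change, so this sketch still has to walk the whole tree.

```perl
use strict;
use warnings;
use File::Find;

# Return the plain files under $root whose mtime is later than $since
# (seconds since the epoch). Symlinks are not followed.
sub changed_since {
    my ($root, $since) = @_;
    my @changed;
    find({
        no_chdir => 1,    # keep $File::Find::name usable as a path as-is
        wanted   => sub {
            my @st = lstat($File::Find::name) or return;
            # "_" reuses the lstat result above; $st[9] is the mtime
            push @changed, $File::Find::name if -f _ && $st[9] > $since;
        },
    }, $root);
    return @changed;
}
```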

        I was thinking I could stall the find process, in order to simply buffer, rather than maintain

        1. Processes and hardware do fail. Given the length of time this whole process is likely to take, it would be silly to risk getting to 90% and then have to start over because you ignored this possibility.
        2. Given the size of your dataset, you'd have to carefully manage the size of your queue to avoid running out of memory.

        ... Well, any full lists, be they database or flat file.

        The problem with flat files is that they make lousy queues. (Great FILOs but lousy FIFOs.)

        Removing records/lines at the beginning of a file is (for all intents and purposes) impossible, and marking records done means reading from the top each time to find the next piece of work to do: an O(n^2) process.

        Thus you would then need a second (pointer) file that tells you how far down the first file you've processed; and that file becomes a bottleneck of contention.
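To make the pointer-file idea concrete, here is a rough single-consumer sketch using only core Perl. The sub names (`load_offset`, `save_offset`, `process_list`) and the file layout (one path per line, byte offset stored in a side file) are assumptions for the example. With several consumers, that one checkpoint file is exactly the contention point described above.

```perl
use strict;
use warnings;

# Read the saved byte offset, or 0 if there is no usable checkpoint yet.
sub load_offset {
    my ($ckpt) = @_;
    open my $fh, '<', $ckpt or return 0;
    my $offset = <$fh>;
    close $fh;
    return 0 unless defined $offset && $offset =~ /^(\d+)/;
    return $1;
}

# Write the offset to a temp file and rename it into place, so a crash
# cannot leave a half-written checkpoint behind.
sub save_offset {
    my ($ckpt, $offset) = @_;
    open my $fh, '>', "$ckpt.tmp" or die "open $ckpt.tmp: $!";
    print {$fh} "$offset\n";
    close $fh or die "close $ckpt.tmp: $!";
    rename("$ckpt.tmp", $ckpt) or die "rename $ckpt.tmp: $!";
}

# Call $work->($path) for each line of $list, starting at the checkpoint.
# The offset is saved only after $work returns, so an item that dies
# part-way through is retried on the next run.
sub process_list {
    my ($list, $ckpt, $work) = @_;
    open my $in, '<', $list or die "open $list: $!";
    seek($in, load_offset($ckpt), 0) or die "seek $list: $!";
    while (defined(my $path = <$in>)) {
        chomp $path;
        $work->($path);
        save_offset($ckpt, tell($in));
    }
    close $in;
}
```

One caveat with the one-path-per-line layout: file names containing newlines would break it, so a real list would need some escaping or a different record separator.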

        As for file systems... I've often used (and advocated the use of) file systems for queues: the producer creates small (often zero-length) files in a todo directory; consumers rename the first file they find in that directory into a /consumerN.processing/ directory whilst they process it, and then rename it into a done directory (or just delete it) once they've finished. But again, given the size of your dataset, you'd have to very carefully manage the number of files you put into a single directory. And if you try to structure it, you're just moving the goal posts.
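The rename-based queue just described might be sketched like this, using core modules only. The sub names (`enqueue`, `claim`, `finish`) and the directory arguments are hypothetical, and the sketch assumes all three directories already exist on the same filesystem, since that is what lets rename() act as the atomic "claim" step.

```perl
use strict;
use warnings;
use File::Spec;

# Producer: drop a zero-length marker file named after the work item.
sub enqueue {
    my ($todo, $name) = @_;
    my $file = File::Spec->catfile($todo, $name);
    open my $fh, '>', $file or die "create $file: $!";
    close $fh;
}

# Consumer: rename the first entry we can grab into our own processing
# directory. If another consumer renamed it first, our rename fails and
# we simply try the next entry.
sub claim {
    my ($todo, $mine) = @_;
    opendir my $dh, $todo or die "opendir $todo: $!";
    while (defined(my $name = readdir $dh)) {
        next if $name eq '.' or $name eq '..';
        my $from = File::Spec->catfile($todo, $name);
        my $to   = File::Spec->catfile($mine, $name);
        if (rename($from, $to)) {
            closedir $dh;
            return $to;
        }
    }
    closedir $dh;
    return;    # nothing left to claim
}

# Consumer: move a finished item into the done directory.
sub finish {
    my ($claimed, $done) = @_;
    my (undef, undef, $name) = File::Spec->splitpath($claimed);
    rename($claimed, File::Spec->catfile($done, $name))
        or die "rename $claimed: $!";
}
```

After a crash, anything still sitting in a consumerN.processing directory can be renamed back into todo. The sketch does nothing about the directory-size problem mentioned above.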

        And what happens if your find/findfile process dies? Working out how far it got so you can avoid starting over is a problem.

