randomising file order returned by File::Find

by jefferis (Novice)
on Mar 01, 2011 at 11:39 UTC (#890729=perlquestion)
jefferis has asked for the wisdom of the Perl Monks concerning the following question:

I am using File::Find to process a directory hierarchy. Some of the files are symlinks that I must follow. What I want to do is randomise the processing order of the files, something like this:

    use File::Find;
    use List::Util 'shuffle';

    my $somedir = ".";
    sub handlefind { print $_, "\n"; }

    find({ wanted => \&handlefind, preprocess => \&shuffle, follow => 1 }, $somedir);

However, preprocess is a no-op when symlink following is turned on. Could anyone suggest an alternative approach to meet my goal? Thank you very much for your help,

Greg.

(first time poster)

PS If you are curious, this is part of a Perl script that is run > 100x on a cluster to process 1000s of 3D brain images. I use simple file locks to make sure that only one job is running on any image. Unfortunately, file locking on NFS is not very reliable (and I am already taking precautions). What happens is that if I start 100 jobs at once, they all try to grab the same file and it's a mess. So I want to scramble the processing order of the files to reduce the number of possible collisions and put less pressure on the file locking mechanism.
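
For reference, a minimal sketch of one common style of per-image lock file (this is only an assumption about what "simple file locks" means here, not the OP's actual code; even O_EXCL-style lock files are not bulletproof on older NFS):

    use strict;
    use warnings;
    use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

    # Try to claim an image by atomically creating "<image>.lock".
    # Returns true if we got the lock, false if another job beat us to it.
    sub claim_image {
        my ($image) = @_;
        sysopen(my $fh, "$image.lock", O_WRONLY | O_CREAT | O_EXCL)
            or return 0;
        print {$fh} "$$\n";    # record our pid for debugging
        close $fh;
        return 1;
    }

    # process $file only if claim_image($file) is true,
    # and unlink "$file.lock" when finished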

Re: randomising file order returned by File::Find
by roboticus (Canon) on Mar 01, 2011 at 11:44 UTC

    jefferis:

    How about having a single script dig through the directory tree and link the items it finds into 100 output directories? Then each of your 100 processing scripts can troll through its own input directory for items to process. That way you avoid most of the locking problems. The script that assigns items to queues could even watch the output directories and create each new image link in the smallest one, to balance the load in case some jobs process more quickly than others.
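
    A minimal sketch of that idea (the directory names, the symlink approach and doing the balancing at assignment time are all assumptions):

        use strict;
        use warnings;
        use File::Find;
        use File::Basename;

        my $srcdir = '/data/images';                 # assumed absolute source tree
        my @queues = map { "queue_$_" } 1 .. 100;    # one input directory per worker
        mkdir $_ for @queues;

        my %load = map { $_ => 0 } @queues;          # how many links each queue holds

        find({
            wanted => sub {
                return unless -f $File::Find::name;
                # link into whichever queue currently has the fewest entries
                my ($smallest) = sort { $load{$a} <=> $load{$b} } keys %load;
                symlink $File::Find::name,
                        "$smallest/" . basename($File::Find::name)   # assumes unique basenames
                    or warn "link failed for $File::Find::name: $!";
                $load{$smallest}++;
            },
            follow   => 1,
            no_chdir => 1,
        }, $srcdir);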

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: randomising file order returned by File::Find
by BrowserUk (Pope) on Mar 01, 2011 at 12:11 UTC
    perl script that is run > 100x on a cluster to process 1000s of 3d brain images.

    Even if the 1000s become low millions, it would be far more efficient to have a single script that scans the directory hierarchy, builds a big list in memory, partitions the matching files into 100+ lists (1 per cluster instance) and writes them to separate files. It then initiates the processes on the cluster instances, passing each one the name of one of those files. This is simple to implement and avoids the need for locking files entirely.
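
    A minimal sketch of the scan-and-partition step (the source directory, worker count and work-list file names are assumptions):

        use strict;
        use warnings;
        use File::Find;

        my $srcdir  = '/data/images';   # assumed absolute source tree
        my $workers = 100;

        # build the big list in memory
        my @all;
        find({ wanted   => sub { push @all, $File::Find::name if -f $File::Find::name },
               follow   => 1,
               no_chdir => 1 }, $srcdir);

        # partition it into one work-list file per cluster instance
        for my $n (0 .. $workers - 1) {
            open my $fh, '>', "worklist.$n" or die "worklist.$n: $!";
            print {$fh} "$all[$_]\n" for grep { $_ % $workers == $n } 0 .. $#all;
            close $fh;
        }

        # each cluster job is then started with its own worklist.<n> as argument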

    It could suffer from one problem though, that of imbalanced processing, if there is any great variability in the time taken to process individual images.

    If that were the case, I'd opt for a slightly more sophisticated scheme: have the directory scanning process open a server port and respond to inbound connections by returning the name of the next file to be processed. Each cluster instance then connects, gets the name of a file to process, closes the connection and processes the file, connecting again when it is ready to do another. Again, not a complicated scheme to program, but one that ensures balanced workloads across the cluster and completely avoids the need for locking or synchronisation.
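
    A minimal sketch of such a dispatcher (the port number and the one-filename-per-connection, plain-text protocol are assumptions):

        use strict;
        use warnings;
        use IO::Socket::INET;

        # read the list of files to hand out, one path per line, from a file
        # named on the command line (as produced by the directory scan)
        chomp(my @files = <>);

        my $server = IO::Socket::INET->new(
            LocalPort => 9000,          # assumed port
            Listen    => 10,
            ReuseAddr => 1,
        ) or die "cannot listen: $!";

        while (@files) {
            my $client = $server->accept or next;
            print {$client} shift(@files), "\n";   # hand out the next filename
            close $client;
        }
        # once the list is empty the server exits; clients treat a failed
        # connection as "no more work"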


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      "... [script] builds a big list in memory and then partitions the matching files into 100+ lists (1 per cluster instance) and writes the to separate files."

      This is pretty much what Hadoop does for you.

      jeffa


        The downside of that mechanism is control. If, as the OP says later, the need arises to suspend or terminate the processing early, you're stuck with starting the whole process over from scratch. The same applies if the number of workers varies up or down.

        With the server/clients approach, you can pause and restart the clients, or knock out half the clients (or double them), and the processing continues without duplication, automatically redistributing the work to accommodate the changes.



        Trouble with this is that it doesn't make the best use of your hardware if you have machines that run at different speeds or if some data files take longer to process than others.

        When I was trying to solve a similar problem (in my case, rendering individual frames of video) using whatever spare cycles were available across a whole bunch of machines, so that different amounts of CPU were available on different boxes and at different times, my solution was for the individual renderers to request work units from a master. Rather than just mounting the master's filesystem and hoping for the best, they made a request to my own application: a simple Perl script that they accessed over telnet. Because the script worked only on its local filesystem, locking was reliable, and it simply told each client which filename to work on next. The clients then grabbed that file over NFS.

        That's what I think you should do rather than randomising the list - randomising will reduce the problem, but won't eliminate it.
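
        A minimal sketch of the client side of such a scheme (the host name, port and "connection refused or empty reply means we're done" convention are assumptions):

            use strict;
            use warnings;
            use IO::Socket::INET;

            while (1) {
                my $sock = IO::Socket::INET->new(
                    PeerAddr => 'master.example.com',   # assumed dispatcher host
                    PeerPort => 9000,                   # assumed port
                ) or last;                              # no dispatcher => nothing left to do
                my $file = <$sock>;
                close $sock;
                last unless defined $file and $file =~ /\S/;
                chomp $file;
                process_image($file);    # whatever per-image work each job does
            }

            sub process_image { my ($file) = @_; warn "processing $file\n" }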

        However, if you do want to randomise, then the wanted function should build up a list instead of doing any processing on the files. You then shuffle that list, and only after that do you process the files.

Re: randomising file order returned by File::Find
by salva (Monsignor) on Mar 01, 2011 at 12:19 UTC
    1. Push the filenames into a list.
    2. Shuffle the list.
    3. Process the elements of the list.
    use File::Find;
    use List::Util qw(shuffle);

    my @files;
    find({ wanted => sub { push @files, $File::Find::name }, follow => 1 }, $somedir);
    @files = shuffle @files;
    process_file($_) for @files;
      Thank you very much indeed! This is what I was after and so simple when you think about it! Greg.

      PS To the others making suggestions about how to schedule multiple jobs etc.: your suggestions are good in theory, but the problem is more complicated than I outlined and pre-scanning at the beginning does not work in practice (for reasons including load balancing due to differing processing times, as mentioned; the fact that the set of files available to process can change; and that I often have to kill off jobs when other people want to use the cluster).

        pre-scanning at the beginning does not work in practice (for reasons including load balancing due to differing processing times, as mentioned; the fact that the set of files available to process can change; and that I often have to kill off jobs when other people want to use the cluster).

        Another advantage of the file scanner server idea is that if you need to pause the 100+ processing clients, you only need to instruct the server to stop responding to requests. The clients then stop as soon as they've finished with their current file and just sit dormant waiting for a response.

        When the cluster is free again, another single instruction to the server and they all kick off again, continuing their way down the list without any possibility of revisiting files already processed. Every file gets processed exactly once, with none missed, no time wasted locking files, and no possibility of race conditions.


Re: randomising file order returned by File::Find
by jethro (Monsignor) on Mar 01, 2011 at 12:33 UTC

    Your problem description sounds strange to me: why would you run the same script 100 times on every file? At least, that would be the result if you had 100 script invocations running on the same files, randomized or not. Or is there some code that ensures an already-processed file isn't done again? Or is the script run with different parameters?

    Do the scripts transform the images, i.e. rewrite these files? Otherwise file locking would be unnecessary. But I can't imagine 100 transformations of a single image without having to do them in a specific order.

Re: randomising file order returned by File::Find
by JavaFan (Canon) on Mar 01, 2011 at 13:03 UTC
    What I would do: open 100+ files for writing. Find your files using File::Find. Write the names to the 100 files in a round-robin fashion (that way, you don't need to store all the files in a huge list in memory). Close the files. Now start your jobs, each taking as its argument a different file containing the file names to process.
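
    A minimal sketch of that streaming round-robin split (the source directory, worker count and work-list file names are assumptions):

        use strict;
        use warnings;
        use File::Find;

        my $srcdir  = '/data/images';   # assumed absolute source tree
        my $workers = 100;

        # open all the work-list files up front
        my @fh = map {
            open my $fh, '>', "worklist.$_" or die "worklist.$_: $!";
            $fh;
        } 0 .. $workers - 1;

        my $i = 0;
        find({
            wanted => sub {
                return unless -f $File::Find::name;
                # write straight to the next file in rotation; nothing is
                # accumulated in memory
                print { $fh[ $i++ % $workers ] } "$File::Find::name\n";
            },
            follow   => 1,
            no_chdir => 1,
        }, $srcdir);

        close $_ for @fh;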
      a list of filenames with thousands of entries is not huge anymore!
Re: randomising file order returned by File::Find
by GrandFather (Cardinal) on Mar 01, 2011 at 21:46 UTC

    Reading between the lines a little, especially in some of your later replies, it looks to me like you are using a cluster of machines to do some heavy-duty image processing and that the individual machines are fighting over claiming images to process. If that is the case, then a better solution may be to consider one of the job scheduling modules such as Schedule::Load (disclaimer: I've not used the module, but the docs imply it does the right stuff).

    If you want to roll your own task management, a common technique for managing scheduling and locking is to use a database that provides transacted processing to mediate the locking and the task queue.

    For a build and automated-test system I've put together, I use a MySQL database. The system processes about 10,000 tasks a month across a build/test farm comprising six machines, using a mixture of Linux, Mac and Windows boxes. The locking can be done, but it is tricky and needs care to get right!
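
    A minimal sketch of how a worker could claim one task inside a transaction (DBI with MySQL/InnoDB; the connection details and the table and column names are assumptions):

        use strict;
        use warnings;
        use DBI;
        use Sys::Hostname;

        my $dbh = DBI->connect('dbi:mysql:database=jobs;host=dbhost',
                               'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });
        my $worker = hostname() . "-$$";

        $dbh->begin_work;
        # lock one unclaimed row; FOR UPDATE stops another worker's
        # FOR UPDATE from grabbing the same row until we commit
        my ($id, $file) = $dbh->selectrow_array(
            'SELECT id, filename FROM images
              WHERE claimed_by IS NULL
              LIMIT 1 FOR UPDATE');
        if (defined $id) {
            $dbh->do('UPDATE images SET claimed_by = ? WHERE id = ?',
                     undef, $worker, $id);
        }
        $dbh->commit;

        # if $file is defined, this worker now owns it and can process it,
        # marking the row done afterwards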

    True laziness is hard work
