PerlMonks  

Re^16: randomising file order returned by File::Find

by jeffa (Chancellor)
on Mar 02, 2011 at 03:06 UTC (#890888=note)


in reply to Re^15: randomising file order returned by File::Find
in thread randomising file order returned by File::Find

"Last year it was "use a database"; this year "ooh, ooh, ooh, Hadoop""

Just pointing out that you are rejecting the technology unfairly, without even having tried it. You have not tried it, even though you tailor your language to imply that you have.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)


Re^17: randomising file order returned by File::Find
by BrowserUk (Pope) on Mar 02, 2011 at 03:53 UTC

    I've not rejected Hadoop. I even suggested it here a couple of weeks ago.

    I'm simply not recommending it for this task, because it is inappropriate here. To summarise: you cannot force-fit variable-size (and typically huge) 3D image files into fixed-size aggregated packets; nor can you process images line by line from STDIN, as Hadoop streaming requires.
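    To make the STDIN point concrete, here is a hypothetical sketch (not the OP's code) of what a Hadoop streaming mapper looks like. The framework hands the mapper one text line at a time on STDIN and reads tab-separated key/value lines back on STDOUT; the record unit is a line of text, which is why a huge binary 3D image cannot be pushed through it as a single record:

    ```perl
    #!/usr/bin/perl
    # Sketch of the Hadoop streaming mapper contract. The function name
    # map_record is an illustrative assumption, not a Hadoop API.
    use strict;
    use warnings;

    # Turn one input line (e.g. a filename) into a "key\tvalue" record.
    sub map_record {
        my ($line) = @_;
        chomp $line;
        return "$line\t1";
    }

    # In a real streaming job this loop is driven by Hadoop:
    #   while ( my $line = <STDIN> ) { print map_record($line), "\n"; }
    ```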

    I'm not rejecting the tool; I'm rejecting the suggestion as a mis-application of that tool. The OP can read and make up his own mind on the matter.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      "I'm rejecting the suggestion as a mis-application of that tool."

      I merely pointed out that "... [script] builds a big list in memory and then partitions the matching files into 100+ lists (1 per cluster instance) and writes them to separate files" is what Hadoop gives you for free. Whether the OP's problem is CPU bound or IO bound is determined by exactly how the OP is processing said images, which has yet to be revealed. Rather than make assumptions, as you have done, I merely saw a chance to offer an idea.
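      For reference, the partitioning step being quoted could be sketched roughly like this in Perl. This is a guess at the shape of the OP's script, not the actual code; the function name, the round-robin strategy, and the counts are all assumptions:

      ```perl
      #!/usr/bin/perl
      # Hypothetical sketch: deal a list of matching filenames round-robin
      # into $n lists, one per cluster instance. Each bucket would then be
      # written to its own work-list file for one instance to consume.
      use strict;
      use warnings;

      sub partition_files {
          my ( $files, $n ) = @_;
          my @buckets = map { [] } 1 .. $n;
          my $i = 0;
          for my $file (@$files) {
              push @{ $buckets[ $i++ % $n ] }, $file;
          }
          return \@buckets;
      }

      # e.g. 10 image filenames dealt into 3 per-instance lists
      my $buckets = partition_files( [ map {"img$_.raw"} 1 .. 10 ], 3 );
      ```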

      And concurring with tilly's suggestion is not really suggesting it. tilly suggested it. You didn't even mention Hadoop in your link.

      jeffa

      
        "... is what Hadoop gives you for free."

        It isn't free if you don't already have the cluster set up to use it.

        And once you've gone through the cluster set-up process, just in order to deliver a couple of hundred kilobytes of filenames to the clients, they still have to get access to each of the huge image files. They cannot do that in situ; the files would have to be shipped to the local HDFS filesystem. And then Hadoop has nothing whatsoever to offer in the processing of those files.

        And if you can think of some legitimate reason for going through all of that in order to distribute a few thousand filenames...

        Let's face it. Your suggestion is a crock.


