Re^5: randomising file order returned by File::Find

by BrowserUk (Pope)
on Mar 01, 2011 at 22:37 UTC (#890851)


in reply to Re^4: randomising file order returned by File::Find
in thread randomising file order returned by File::Find

"True ... but Hadoop scales linearly, meaning what used to take multiple hours or days to run now only takes a few hours, maybe even a few minutes."

So does the server/clients scheme. The difference is in the level of control.

"Such termination becomes trivial."

For some types of processing. For other types, the cost of throwing away the results of a job when it is 99% complete and starting over can be very high.

"I do not know how familiar you are with Hadoop/cloud computing."

Not so much. But it isn't so different from the stuff I was doing 15 years ago on a server farm.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^6: randomising file order returned by File::Find
by jeffa (Chancellor) on Mar 01, 2011 at 22:42 UTC

    "So does the server/clients scheme. The difference is in the level of control."

    Right, with Hadoop, all that hard work is done for you. Why roll another wheel?

    "Not so much. But it isn't so different with stuff I was doing 15 years ago on a server farm."

    Except that now the hardware can (realistically) support the volumes of data being processed. You should check it out.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      "Right, with Hadoop, all that hard work is done for you."

      Hm. Only once you have turned over your cluster to a single-purpose, Java-based HDFS monoculture. And only if all your processing needs can be force-fitted into that monoculture's way of working.

      But, if your hardware resources have to serve a variety of needs and uses...

      Besides, it does not have to be "hard work". There is a very simple pattern to be followed, exemplified by a server instance something like:

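          #! perl -slw
          use strict;
          use threads;
          use threads::shared;   ## needed for the :shared attribute
          use IO::Socket;

          ## Toggled by typing "suspend" or "resume" on the server's console.
          my $pause :shared = 0;

          async {
              while( <STDIN> ) {
                  chomp;
                  if(    /^suspend/i ) { $pause = 1; }
                  elsif( /^resume/i  ) { $pause = 0; }
              }
          }->detach;

          my $lsn = IO::Socket::INET->new( Listen => 1, LocalPort => 12345 )
              or die "Cannot listen: $!";

          ## Hand out one work item per client connection.
          while( my $fname = <*.png> ) {
              sleep 1 while $pause;          ## hold back new work while suspended
              my $client = $lsn->accept;
              print $client $fname;
              close $client;
          }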

      And a client template:

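          #! perl -slw
          use strict;
          use IO::Socket;

          my $server = shift;    ## "host:port" of the server

          while( 1 ) {
              ## A failed connect means the server has finished handing out work.
              my $svr = IO::Socket::INET->new( $server ) or last;
              my $fname = <$svr>;
              close $svr;
              last unless defined $fname;
              chomp $fname;
              ## Process $fname.
          }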

      The work items distributed by the server can be anything you like besides filenames. The processing inside the client is the same bit you'd have to write yourself for your Hadoop application. Hardly onerous, but very flexible.

      "Except that now the hardware can (realistically) support the volumes of data being processed."

      The scale of things is relative. We didn't have as much data to process back then as now, but disks were smaller and machines slower and more expensive. But the trade-offs of monocultural versus flexible remain the same.

      Transaction engines like CICS--the Hadoops of that time--could process prodigious volumes of data through a very specific, narrow band of processing requirements, but couldn't handle the variety of general purpose programming requirements and applications that arose.

      The same problem arises today with map/reduce. If your requirements fit its way of working--datasets that are infinitely partitionable into fixed-size chunks that each take the same amount of time to process and don't require feedback loops--you're laughing.

      But, if your images vary widely in size and so do not fit into the fixed-size chunks of HDFS; and processing times vary greatly with size--the cost of many image processing algorithms grows much faster than linearly with the size of the image--then map/reduce scheduling algorithms get tied in knots.
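To put rough numbers on the scheduling point, here is a toy simulation. It is only a sketch under stated assumptions: the job costs, the worker count and both sub names (`static_makespan`, `pull_makespan`) are made up for illustration, and it is not a model of how any real map/reduce scheduler behaves. It compares handing each worker a fixed, equal-sized block of jobs up front against the pull model of the server/client pair above, where an idle worker simply asks for the next item.

```perl
use strict;
use warnings;
use List::Util qw( max sum );

## Hypothetical job costs in arbitrary time units:
## two large images among many small ones.
my @costs   = ( 100, 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 );
my $workers = 4;

## Static split: each worker gets an equal-sized contiguous block of jobs.
## Returns the finishing time of the slowest worker.
sub static_makespan {
    my( $n, @jobs ) = @_;
    my $per = int( ( @jobs + $n - 1 ) / $n );    ## jobs per worker, rounded up
    my @load;
    while( my @chunk = splice @jobs, 0, $per ) {
        push @load, sum @chunk;
    }
    return max @load;
}

## Pull model: whichever worker becomes idle first takes the next job.
sub pull_makespan {
    my( $n, @jobs ) = @_;
    my @busy_until = ( 0 ) x $n;
    for my $cost ( @jobs ) {
        ## Index of the worker that frees up earliest.
        my( $idle ) = sort { $busy_until[ $a ] <=> $busy_until[ $b ] } 0 .. $n - 1;
        $busy_until[ $idle ] += $cost;
    }
    return max @busy_until;
}

printf "static: %d  pull: %d\n",
    static_makespan( $workers, @costs ), pull_makespan( $workers, @costs );
## prints "static: 191  pull: 100"
```

With these made-up costs the static split leaves one worker holding both large jobs, while in the pull model the two large jobs land on different workers and the small ones are mopped up by whoever is free.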

      "You should check it out."

      My needs, resources and pockets do not lend themselves to such.



        "Only once you have turned over your cluster to single purpose, Java-based HDFS monoculture. And if all your processing needs can be force fitted into that monoculture's way of working."

        That is simply not true. Look into Hadoop streaming.
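For anyone who has not seen it: Hadoop streaming runs an ordinary executable as the mapper (or reducer), feeding it input records as lines on STDIN and reading tab-separated key/value lines back from STDOUT, so the mapper can be plain Perl. What follows is a minimal sketch, not a recommended layout: `map_line`, `run_mapper`, the sample paths and the choice of keying on the file extension are all assumptions made up for illustration.

```perl
use strict;
use warnings;

## Turn one input line (assumed here to be a file path) into a
## "key<TAB>value" record. Keying on the lower-cased extension is
## purely illustrative.
sub map_line {
    my( $line ) = @_;
    chomp $line;
    my( $ext ) = $line =~ m{\.([^./]+)$};
    return join "\t", lc( $ext // 'none' ), $line;
}

## Read records from $in and write mapped records to $out.
sub run_mapper {
    my( $in, $out ) = @_;
    print {$out} map_line( $_ ), "\n" while <$in>;
}

## Demonstration on an in-memory file. As a streaming mapper you would
## call run_mapper( \*STDIN, \*STDOUT ) instead.
my $text = "img/a.PNG\nimg/b.jpg\nREADME\n";
open my $sample, '<', \$text or die "Cannot open in-memory file: $!";
run_mapper( $sample, \*STDOUT );
```

The per-record work lives in `map_line`, which is the same "process one item" code the client template in the earlier reply would need.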

        "My needs, resources and pockets do not lend themselves to such."

        That's too bad. I am currently working for a company and we are in the process of replacing our traditional means of processing the extremely massive volumes of data we consume with Hadoop. It is beyond awesome. Again, you should check it out instead of making assumptions. Cheers! :)

        jeffa

