PerlMonks  

Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

by tphyahoo (Vicar)
on Oct 22, 2006 at 19:26 UTC (#579891)


in reply to Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

I wrote Ira Woodhead (the author of MapReduce):

Ira, I'm really interested in that module you recently posted.

I have to write a lot of code that runs in parallel, and I hate it. What appeals to me most about MapReduce is not so much that it lets you split work across a cluster (though that's great), but that the code itself is so clean. No locks, no guards, no name-your-poison. Just Reader/Map/Reduce. Simple. Clean. All the dirt is hidden away, where your rocket scientists can optimize things in your custom MapReduce code to their hearts' content.

I posted about this at

Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

What I'm wondering is if you could have this goodness working even when you just have one computer, by using threads/forks. I'm thinking ThreadedMapReduce / ForkedMapReduce. Maybe even all inheriting from your original module in a sane way.

This would do the same thing that Parallel::ForkManager, Thread::Queue, and countless other modules do, but with a much cleaner functional interface.

I'm thinking along the same lines as Joel in "Can Your Programming Language Do This?":

http://www.joelonsoftware.com/items/2006/08/01.html

Well, can perl do this?

What do you think? :)

Best, thomas.

******************************************************
******************************************************

Ira Woodhead responded:

Actually, yes! The version I'm currently getting ready to post has two config options "maps" and "reduces" which allow you to set the number of map task and reduce task processes per machine. To use it on a single machine you simply have a single-machine cluster. If you want threads, well, sorry but I'm not comfortable using perl threads, having experienced some pitfalls with them. Not to mention they require a recompile, and I really want to keep this simple to deploy. The relatively more expensive full forks are the way I'm going, at least for now.

I'm glad there is some interest in this. I was quite surprised to find no effort underway besides Hadoop to implement this model. And yes, I did see JSpolsky's post about mapreduce, it's a nice way of explaining it.

Currently I'm trying to figure out if there's a way to make the deployment even easier, and, more significantly, to somehow allow mapreduce operations to take place inside of a program, rather than just being a separate utility, so that multiple such operations can be used as part of a larger system. Right now it's one invocation of a command line "mr" program per operation, but wouldn't it be nice as more of a language feature?

Anyway, I'll be uploading the next version soon. Let me know if you have any comments or ideas.

Cheers!

******************************************************
******************************************************

I responded:

Ira, I think it's great what you're doing. The feature to run a controlled "cluster" on a single computer could potentially pay dividends for a lot of people, both in functionality and maintainability. The way I see it, it's basically threading/forking with the dirty bits hidden away, in a way that will scale.

My thoughts and ideas, and links to how they evolved, are summarized in a PerlMonks post I just made:

Using functional programming to reduce the pain of parallel-execution programming (with threads, forks, or name your poison)

If I could plug MapReduce into the "hashmap_parallel" function builder in the Map.pm code there (buried in a readmore), I would have my pony.

******************************************************
******************************************************

To sum up, there should soon be a way of doing what I want, using the single-machine cluster option Ira describes above.

Kudos to Ira for putting this together, and keeping the improvements coming.
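
To make the single-machine idea concrete, here is a minimal sketch of a forked map/reduce using nothing but core fork and pipes. This is not Ira's module or its API: the function name forked_map_reduce and its argument list are made up for illustration, and the last argument only loosely mirrors the "maps" option Ira mentions. It assumes each mapped value is a single-line scalar and that the reducer does not care about the order in which values arrive.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper, not part of any CPAN module: run $mapper over @$items
# in $maps forked child processes, then fold the results with $reducer.
sub forked_map_reduce {
    my ($mapper, $reducer, $init, $items, $maps) = @_;
    $maps = 1 if !$maps || $maps < 1;

    my @readers;
    for my $worker (0 .. $maps - 1) {
        pipe(my $read, my $write) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: map every $maps-th item, one result per line.
            close $read;
            for (my $i = $worker; $i < @$items; $i += $maps) {
                print {$write} $mapper->($items->[$i]), "\n";
            }
            close $write;        # flushes the pipe
            POSIX::_exit(0);     # leave without running the parent's END blocks
        }

        close $write;            # parent keeps only the read end
        push @readers, [$pid, $read];
    }

    # Parent: collect mapped values from each child in turn and reduce them.
    my $acc = $init;
    for my $r (@readers) {
        my ($pid, $read) = @$r;
        while (my $line = <$read>) {
            chomp $line;
            $acc = $reducer->($acc, $line);
        }
        close $read;
        waitpid($pid, 0);
    }
    return $acc;
}

my $sum_of_squares = forked_map_reduce(
    sub { $_[0] ** 2 },       # map
    sub { $_[0] + $_[1] },    # reduce
    0,                        # initial accumulator
    [1 .. 10],
    3,                        # number of map processes
);
print "$sum_of_squares\n";    # prints 385
```

The caller only ever sees a mapper and a reducer; the fork, pipe and waitpid dirt stays inside the helper, which is the whole appeal. Anything beyond plain scalars would need a serializer such as Storable on the pipe.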

