Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?
by tphyahoo (Vicar) on Oct 20, 2006 at 09:27 UTC
tphyahoo has asked for the
wisdom of the Perl Monks concerning the following question:
Okay, I'm revising my post a bit. Originally I just wanted to hear people's experiences with the Perl MapReduce module that was recently posted to CPAN.
But now I'm asking a more general question.
Basically, I'm thinking along the same lines as Joel in Can Your Programming Language Do This?
The MapReduce algorithm used by Google is an abstraction that simplifies code which needs to run in a massively parallel way. A simple example of this would be Distributed Grep, which works the same way as grep, but takes advantage of the fact that you have 80,000 machines at your disposal to speed things up a bit. Same thing with Distributed Sort. Etc, etc.
Now, I don't have 80,000 machines.
But I do have a single machine that can run multiple processes.
But I hate writing code that does this, because threads are painful, forks are painful, you get race conditions, you have to use locks... arrrg.
So, what I would like to do is hide my parallel (threaded/forked/whatever) code in a functional way using MapReduce, and hide all the complexity of the parallelization in my personal MapReduce module.
If I happen to have 80,000 computers (I'm a spam king with a zombie horde... well, bad example), I would use DistributedMapReduce and run code on multiple computers. If I only have a single computer, but have some extra CPU cycles and a place where it seems that threading would speed things up, I would use ThreadedMapReduce.
Maybe both these modules could inherit from MapReduce in a sane way.
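To make the inheritance idea concrete, here is a minimal sketch, written in Python rather than Perl only to keep it short. The class names follow the ones floated above, but the `map`/`run` methods and the `workers` parameter are assumptions made up for illustration, not the API of the CPAN MapReduce module. The base class runs serially, and the threaded subclass overrides only the map step, so calling code never touches a thread or a lock.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class MapReduce:
    """Serial baseline: map every item, then fold the results in order."""

    def map(self, func, items):
        return [func(x) for x in items]

    def run(self, map_func, reduce_func, items, initial):
        # Only the map step is a candidate for parallelism; the fold stays serial.
        return reduce(reduce_func, self.map(map_func, items), initial)


class ThreadedMapReduce(MapReduce):
    """Same interface, but the map step runs on a thread pool (hypothetical design)."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, func, items):
        # pool.map returns results in input order, so the fold sees the same
        # sequence the serial version would.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(func, items))


def add(a, b):
    return a + b


words = ["map", "reduce", "grep"]
serial_total = MapReduce().run(len, add, words, 0)
threaded_total = ThreadedMapReduce(workers=2).run(len, add, words, 0)
```

Swapping one class for the other is the only change the caller makes. Note that in CPython threads mainly pay off for I/O-bound map functions, which fits the caveat about bottlenecks further down.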
Does this make some kind of sense, or am I totally barking up the wrong tree?
If it makes sense, I say: let's build it and take some of the pain out of threading.
By the way, that last link was to a thread where BrowserUK -- who I consider sort of a threading guru -- writes some sample code to do something simple in a parallel way, but gets corrected by ikegami because he forgot to use locks. What I want is a kind of universe where this couldn't happen, because a "parallelize me" module in the background takes care of the messy details. Whether with threads, or distributed computers, or whatever.
UPDATE: Before someone else points this out, I am aware that adding threads to a slow-running program doesn't always speed things up. Sometimes it does, sometimes not. It depends where your bottleneck is.
For example, my gut feeling is that ThreadedMRGrep (running through ThreadedMapReduce) might be a little faster than grep for a long list, but ThreadedSort wouldn't be much of an improvement, because you'd hit the same computational bottleneck you hit with non-threaded sort. But I admit lack of experience and could be wrong here. But the example of downloading a bunch of web sites in threads and grepping them for desired values would *definitely* be faster when run through ThreadedMapReduce. So what I'm really after here is a framework for reducing program complexity, a powerful abstraction that can yield dividends when used with care.
And of course, I'd like it in perl :)
ORIGINAL POST: I'm experimenting with functional programming, and stumbled across MapReduce, piping fresh on CPAN.
Has anyone noticed and/or tried this module?
This was wishlisted at perlmonks a couple of years ago:
Now I'm wondering if I could use this technology to do something simple in a parallel way, like... using parallel processing to concatenate a string, where order of concatenation doesn't matter
UPDATE: I changed the title from "Has anyone noticed and/or tried the MapReduce recently uploaded to CPAN?"
UPDATE 2: After kicking this around some, I concluded that for my personal needs, just plain ThreadedReduce would be fine, and I think this is a bit of magic that would help lots of people and not just me. So, I am leaving ThreadedMap for another day, but I got started writing stub code for ThreadedReduce, including a failing test, further down in this thread: Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?. If someone can make my "not ok" an "ok" they will be my hero for today.
UPDATE 3: As a side note, I'm slowly starting to think that map (in the Perl sense) can actually be built on top of reduce with functional programming, just like in my stub below, where grep is built functionally on top of reduce. So, if you get parallelReduce, you also get parallelMap built on top of that with no mess. I still need to verify that by actually building map out of reduce, or having someone chime in that yes, that's right. But I'm almost 99% sure...
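For the purely sequential part of that claim, here is a quick sketch (Python for brevity; the function names are made up for illustration). It shows both map and grep expressed as folds over a list. It says nothing about whether the resulting fold can be run in parallel.

```python
from functools import reduce


def map_via_reduce(func, items):
    # Fold the input into a new list, appending func(x) at each step.
    return reduce(lambda acc, x: acc + [func(x)], items, [])


def grep_via_reduce(pred, items):
    # Same fold, but an element is appended only when the predicate holds.
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, items, [])


doubled = map_via_reduce(lambda x: x * 2, [1, 2, 3])
evens = grep_via_reduce(lambda x: x % 2 == 0, range(6))
```

Each step depends on the accumulator from the step before, which is exactly the ordering dependency that makes a naive reduce awkward to spread over threads.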
UPDATE 4: With regards to function builders, that last bit was not quite right. There really need to be two base functions to build other functions out of. In terms of the "Map/Reduce" nomenclature, "Map" processes a set, not necessarily in order. This can be parallelized. Reduce processes an array, or a vector, and order matters. This can't be parallelized -- or if it can, somebody please tell me how.
So for example, I could build ParallelGrep on top of ParallelMap, but not ParallelConcatenate. Concatenate would have to be built on top of reduce. And ParallelReduce as a function builder would be impossible.
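One case where a reduce can be split up, for what it's worth: if the combining operation is associative (string concatenation is, and so is addition), the list can be cut into ordered chunks, each chunk reduced on its own, and the partial results then reduced in the same order. Below is a minimal sketch under that assumption, in Python for brevity. `chunked_reduce` and its parameters are hypothetical names, and it does not work for a non-associative operation such as subtraction.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked_reduce(op, items, chunk_size=1000, workers=4):
    """Reduce with an ASSOCIATIVE op by folding ordered chunks in parallel."""
    items = list(items)
    if not items:
        raise ValueError("chunked_reduce needs at least one item")
    # Slice into consecutive chunks so the original order is preserved.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the partial results in chunk order.
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    # Final serial fold over the (much shorter) list of partial results.
    return reduce(op, partials)


joined = chunked_reduce(lambda a, b: a + b, list("abcdefg"), chunk_size=3)
total = chunked_reduce(lambda a, b: a + b, range(101), chunk_size=10)
```

Whether this is actually faster depends on the cost of the operation and on the threading model, so it is a statement about what is possible, not about performance.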
ParallelSort could be built on top of ParallelMap, because the cmp function for any two elements doesn't care about the other elements. I'm not sure if that would be a good idea, but once I had ParallelMap working it would take little work to try it and see.
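The ParallelGrep-on-ParallelMap idea from above can be sketched like this (Python for brevity; `parallel_map` and `parallel_grep` are illustrative names, not an existing module). The predicate is evaluated for every element on a thread pool, and the cheap filtering step then runs serially over the resulting flags.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=4):
    # Results come back in input order even though the calls may overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def parallel_grep(pred, items, workers=4):
    items = list(items)
    # The only parallel step: one predicate call per element.
    flags = parallel_map(pred, items, workers)
    return [x for x, keep in zip(items, flags) if keep]


matches = parallel_grep(lambda s: "a" in s, ["cat", "dog", "bat"])
```

The predicate must not depend on shared mutable state, otherwise the locking problems come straight back.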
So, as I see it now, the kernel of this saga boils down to the task of writing ParallelMap. But I'm writing another post about that, which shall hopefully be posted shortly.