Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I wrote Ira Woodhead (the author of MapReduce):

Ira, I'm real interested by that module you recently posted.

I am having to write a lot of code that works in a parallel way, and I hate it. What is most appealing to me about MapReduce is not so much that it allows you to split things up on a cluster (though that's great), but that the code itself is so clean. No locks, no guards, no name-your-poison. Just Reader/Map/Reduce. Simple. Clean. All the dirt is hidden away, where your rocket scientists can optimize things in your custom MapReduce code to their hearts content.

I posted about this at

Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

What I'm wondering is if you could have this goodness working even when you just have one computer, by using threads/forks. I'm thinking ThreadedMapReduce / ForkedMapReduce. Maybe even all inheriting from your original module in a sane way.

This would do the same thing that ParallelForkManager, and Thread::Queue, and countless other modules, but with a much cleaner functional interface.

I'm thinking along the same lines as Joel in "Can Your Programming Language Do This?":

Well, can perl do this?

What do you think? :)

Best, thomas.


Ira Woodhead responded:

Actually, yes! The version I'm currently getting ready to post has two config options "maps" and "reduces" which allow you to set the number of map task and reduce task processes per machine. To use it on a single machine you simply have a single-machine cluster. If you want threads, well, sorry but I'm not comfortable using perl threads, having experienced some pitfalls with them. Not to mention they require a recompile, and I really want to keep this simple to deploy. The relatively more expensive full forks are the way I'm going, at least for now.

I'm glad there is some interest in this. I was quite surprised to find no effort underway besides Hadoop to implement this model. And yes, I did see JSpolsky's post about mapreduce, it's a nice way of explaining it.

Currently I'm trying to figure out if there's a way to make the deployment even easier, and, more significantly, to somehow allow mapreduce operations to take place inside of a program, rather than just being a separate utility, so that multiple such operations can be used as part of a larger system. Right now it's one invocation of a command line "mr" program per operation, but wouldn't it be nice as more of a language feature?

Anyway, I'll be uploading the next version soon. Let me know if you have any comments or ideas.



I responded:

Ira, I think it's great what you're doing. The feature to run a controlled "cluster" on a single computer could potentially pay dividends for a lot of people, both in functionality and maintainability. The way I see it, it's basically threading/forking with the dirty bits hidden away, in a way that will scale.

My thoughts / idea, and links to the evolution thereof, are well summarized in a perlmonks post I just made

Using functional programming to reduce the pain of parallel-execution programming (with threads, forks, or name your poison)

If I could plug MapReduce into the "hashmap_parallel" function builder in the code there (buried in a readmore), I would have my pony.


To sum it up, there should be a way of doing what I want pretty soon, using the option for a single machine cluster described by Ira above.

Kudos to Ira for putting this together, and keeping the improvements coming.

In reply to Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by tphyahoo
in thread Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by tphyahoo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (2)
    As of 2018-12-16 04:03 GMT
    Find Nodes?
      Voting Booth?
      How many stories does it take before you've heard them all?

      Results (70 votes). Check out past polls.

      • (Sep 10, 2018 at 22:53 UTC) Welcome new users!