PerlMonks  

Google's MapReduce

by BioGeek (Hermit)
on Oct 27, 2004 at 05:06 UTC ( #402897=perlmeditation )

One of the things the people at Google noticed is that a lot of their problems amount to processing a large amount of input data to compute some derived data from it. For example, they process the ca. 4 billion web pages they have crawled and compute an index from them. The input data are very diverse - document records, log files, on-disk data structures, etc. - and require lots of CPU time to process. They have the infrastructure to deal with this, but they wanted a framework for automatic and efficient distribution and parallelization of the jobs across their clusters. Even better if it also provided fault tolerance and scheduled I/O. Status monitoring would be nice too.

So over the last year, they devised MapReduce, inspired by the map and reduce primitives present in functional languages such as Lisp. Most of their operations involve applying a map operation to each logical "record" in the input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key, in order to combine the derived data appropriately.
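To make the paradigm concrete, here is a toy word count in plain Perl, using the built-in map and List::Util's reduce. This is a single-process sketch of the idea only, not Google's distributed implementation:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my @documents = ("the quick brown fox", "the lazy dog", "the fox");

# Map phase: emit a [word, 1] pair for every word in every document.
my @pairs = map { my $doc = $_; map { [ $_, 1 ] } split ' ', $doc } @documents;

# Group the intermediate pairs by key.
my %grouped;
push @{ $grouped{ $_->[0] } }, $_->[1] for @pairs;

# Reduce phase: combine all values that share the same key.
my %counts = map { $_ => reduce { $a + $b } @{ $grouped{$_} } } keys %grouped;

print "$_: $counts{$_}\n" for sort keys %counts;
```

The grouping step in the middle is what the distributed framework does for you between the two user-supplied phases.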

More information can be found here, or seen in this video (some 23 minutes into the presentation).

I don't know in which programming language MapReduce was written, but could you write the basic functions in Perl?

Replies are listed 'Best First'.
Re: Google's MapReduce
by Anonymous Monk on Oct 27, 2004 at 09:10 UTC
    Not quite sure what you mean by the "basic functions", but map is a Perl primitive, and reduce is found in List::Util.
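    Both primitives in action, as a trivial sketch:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my @squares = map { $_ * $_ } 1 .. 5;       # (1, 4, 9, 16, 25)
my $sum     = reduce { $a + $b } @squares;  # 55

print "sum of squares: $sum\n";
```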

    But the paper isn't about map or reduce. It's about distributing work over a large set of unreliable (read 'cheap') computers. Since the work is distributed over stand-alone computers connected by an Ethernet network, you don't have any significant I/O. (Oh, sure, you've got lots of TCP/IP bandwidth, but that's nowhere near enough to share memory, or even disks.) This means you can only successfully distribute work that can be divided into parts, where each part can be worked on without needing the (intermediate) results of work on other parts. That's where map/reduce comes in. It separates the work that can be done independently (map) from the work that can't (reduce).

    For instance, take calculating the total number of links in a set of documents. Parsing each document and counting the links in it can be done independently of parsing any other document (the map phase). Note also that parsing a document is a task that can easily be restarted without having to undo any other work - an important property for Google, given the nature of their infrastructure. The summing of all the individual counts (the reduce phase) can't be done independently (because you need all the results), but that's a relatively small task compared to the parsing. (You might also recurse: sum the totals of groups of documents, then collect those results.)
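    That link-counting example, as a toy single-process sketch. (The pattern used to recognize a link below is just an assumption for illustration; real HTML parsing is another matter.)

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @documents = (
    '<a href="/a">one</a> <a href="/b">two</a>',
    'no links here',
    '<a href="/c">three</a>',
);

# Map phase: count links in each document independently. Each of
# these could run on a different machine and be restarted freely.
my @per_doc = map { my $n = () = /<a\s/gi; $n } @documents;

# Reduce phase: sum the per-document counts.
my $total = sum(@per_doc);

print "total links: $total\n";   # total links: 3
```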

    I've no doubt that you could write an implementation of MapReduce in Perl. (No doubt countless people have already implemented this on a much smaller scale, using threads on a single computer, but without the tolerance of failing components.) I'm not sure if you'd want to, though.

Re: Google's MapReduce
by radiantmatrix (Parson) on Oct 27, 2004 at 16:09 UTC

    I don't know in which programming language MapReduce was written
    On page 13 of the PDF you linked to, I find:
    #include "mapreduce/mapreduce.h"
    So I'm guessing one of the C variants.

    could you write the basic functions in Perl?

    Well, yes. Nothing about Perl limits your ability to write MapReduce: you have robust I/O, complex computational ability, networking capabilities, and comprehensive tools to manipulate data structures.

    I think the real question, given the low-level and high-performance nature of some of the MapReduce requirements, is "would a Perl version of this be fast enough?"

    Perl's runtime compiler and VM are pretty damn good, but they still can't (and shouldn't try to) compete with C for low-level system operations. It is those kinds of operations that make MapReduce efficient for the volumes of data Google deals with. Of course, if you needn't process terabytes of data in one go, Perl might perform acceptably, and you'd have all of Perl's advantages in exchange for waiting a bit.

    I think it's a worthy project to implement something like this in Perl. At the very least, it would be an interesting benchmark opportunity. Perl is a great language for prototyping, so even if it's too slow for practical use, having an open Perl implementation of MapReduce provides a great reference for porting to "faster" languages.
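    For illustration, here is a minimal single-machine skeleton of such a framework: the caller supplies a map function and a reduce function, and the driver handles the grouping by key in between. No distribution, no fault tolerance, and the names are made up, so treat it as a sketch, not a real implementation.

```perl
use strict;
use warnings;

# A toy MapReduce driver. $mapf takes one input record and returns a
# list of [key, value] pairs; $reducef takes a key and an arrayref of
# all that key's values and returns the combined result.
sub map_reduce {
    my ($mapf, $reducef, @records) = @_;
    my %grouped;
    for my $record (@records) {
        push @{ $grouped{ $_->[0] } }, $_->[1] for $mapf->($record);
    }
    return map { $_ => $reducef->($_, $grouped{$_}) } keys %grouped;
}

# Word count, the canonical example from the paper.
my %counts = map_reduce(
    sub { map { [ $_, 1 ] } split ' ', $_[0] },
    sub { my ($key, $values) = @_; my $n = 0; $n += $_ for @$values; $n },
    "to be or not to be",
);
print "$_ => $counts{$_}\n" for sort keys %counts;
```

    Everything Google's version adds - partitioning the input across machines, re-executing failed map tasks, shuffling the grouped pairs to the reduce workers - would live inside map_reduce, while the two user-supplied functions stay exactly this simple.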

    radiantmatrix
    require General::Disclaimer;
    "Users are evil. All users are evil. Do not trust them. Perl specifically offers the -T switch because it knows users are evil." - japhy

      Erm, read harder guys. :) Section 2.1 third paragraph.

      . . . The user's code is linked together with the MapReduce library (implemented in C++). . . .

        Read?! But that would mean I wasn't exercising the Noble Virtues of laziness (why read all of something?), impatience (I don't have time to read 13 pages of detail, just give me the code!), and hubris (besides, I already know everything.). ;-P.

        j/k, of course...

        radiantmatrix
        require General::Disclaimer;
        "Users are evil. All users are evil. Do not trust them. Perl specifically offers the -T switch because it knows users are evil." - japhy
Re: Google's MapReduce
by SpanishInquisition (Pilgrim) on Oct 27, 2004 at 13:18 UTC
    I don't know in which programming language MapReduce was written, but could you write the basic functions in Perl?
    It's Turing Complete.

      TC deals with computation in terms of functions. Anything with I/O is pretty much outside Turing's view. And you'll be doing a good deal of I/O in a distributed application like this. So just being TC isn't good enough.

      Just how much I/O you'll be doing depends on your application. There are some problems that could get sufficient bandwidth by having an intern load data off a floppy. Others are going to need high-speed fiber optic connections in order to keep up. Some problems are going to be just plain slower than doing it on a single machine.

      In any case, you could certainly do this with Perl. Would it be useful? If the application's bottleneck is I/O, then Perl would probably be a viable choice. However, good candidates for distributed systems are usually not I/O-bound. They're CPU-bound, like "take this DES-encrypted message and try decrypting it with keys x through x + 2**y, and let me know if any of them break the message". For something like that, you want a good number-cruncher language like C or FORTRAN.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        If the application's bottleneck is I/O, then Perl would probably be a viable choice.

        That depends on what kind of I/O we are talking about. There are various kinds of I/O, of which I will mention three:

        1. Network I/O. The least interesting category. If your application is bound by network I/O, there isn't much you can do except upgrade your network.
        2. Disk I/O. An interesting category, which brings us into the realm of SANs, fibre channel and multiple controllers. You're right that Perl might be a good fit for those applications.
        3. CPU-memory I/O. Perl would absolutely suck for that kind of application, as Perl is very memory hungry and gives the programmer very little control over what is stored where. It uses a gazillion pointers, storing stuff all over the place, resulting in a low cache hit ratio.

Node Type: perlmeditation [id://402897]
Approved by FoxtrotUniform