One of the things the people at Google noticed is that a lot of their problems amount to processing a lot of input data to compute some derived data from it. For example, they process the ca. 4 billion web pages on the web and compute an index from them. The input data are very diverse (document records, log files, on-disk data structures, etc.) and processing them requires lots of CPU time. They have the infrastructure to deal with it, but they wanted a framework for automatic and efficient distribution and parallelization of the jobs across their clusters. Even better if it provided fault tolerance and scheduled I/O. Status monitoring would also be nice.
So over the last year, they devised MapReduce, inspired by the map and reduce primitives present in the functional language Lisp. Most of their operations involve applying a map operation to each logical "record" in the input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key, in order to combine the derived data appropriately.
More information can be found here, or seen in this video (some 23 minutes into the presentation).
I don't know in which programming language MapReduce was written, but could you write the basic functions in Perl?
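To make the question concrete, here is a rough, single-process sketch of the model as I understand it; the names map_reduce, $mapper, and $reducer are just placeholders of mine, and a real implementation would of course distribute the work across many machines. The example is word counting, which is the standard example from the MapReduce paper:

#!/usr/bin/perl
use strict;
use warnings;

# A minimal, single-process sketch of the MapReduce model.
# map_reduce, $mapper, and $reducer are illustrative names only,
# not part of any real framework.
sub map_reduce {
    my ($mapper, $reducer, @records) = @_;

    # Map phase: apply the mapper to each input record to produce
    # intermediate key/value pairs, collected into groups by key.
    my %intermediate;
    for my $record (@records) {
        for my $pair ($mapper->($record)) {
            my ($key, $value) = @$pair;
            push @{ $intermediate{$key} }, $value;
        }
    }

    # Reduce phase: combine all values that share the same key.
    my %result;
    for my $key (keys %intermediate) {
        $result{$key} = $reducer->($key, $intermediate{$key});
    }
    return %result;
}

# Example: count word occurrences across a set of "documents".
my @documents = ('the quick brown fox', 'the lazy dog', 'the fox');

my %counts = map_reduce(
    sub {                       # mapper: emit one (word, 1) pair per word
        my ($doc) = @_;
        return map { [ $_, 1 ] } split ' ', $doc;
    },
    sub {                       # reducer: sum all values for a key
        my ($key, $values) = @_;
        my $sum = 0;
        $sum += $_ for @$values;
        return $sum;
    },
    @documents,
);

print "$_: $counts{$_}\n" for sort keys %counts;

Run as-is, this prints each word with its count (e.g. "the: 3"). But this is just my naive guess at the shape of the two basic functions, so corrections are welcome.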