This is a followup from two recent posts:
• Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?
• using parallel processing to concatenate a string, where order of concatenation doesn't matter

In a sense it is a followup from an older post as well: What is the fastest way to download a bunch of web pages?

The common theme to all these posts is doing something in parallel to accomplish a task, where each bit of the task is independent of the other bits. It turns out that this is the essence of the MapReduce algorithm that Google uses to build much of its most important code. Google uses cluster computing to accomplish this, but the same "parallelization" logic could also be used on a single computer running multiple processes, or even a single process with threads. (Ira Woodhouse has indicated that the next version of MapReduce released to the CPAN will probably have configuration options for single-computer clusters in order to do exactly that.)

Currently, the canonical example of a situation where this would be useful for me, on my job, is to download a bunch of web pages and verify that each page regex-matches something that it should. In other words, grep, where part of the grep test is to download a web page (a minimal serial sketch of this appears below).

Throughout my experience with perl, I have had numerous other situations where this basic idea -- break down a task into components that can run in parallel, run them, and then reassemble the results -- would have been helpful. Sometimes I would do this, but because I'm not an experienced thread programmer, and because sometimes I was on windows and sometimes not, sometimes I would have to recompile perl, etc. etc., it was always an unpleasant experience. There were many occasions when I thought: maybe I could use parallelization here... but it's painful... there isn't enough time to debug... I'll just make do with having it be a little slower.

An example of a time when I could have, but didn't, use functional programming based on parallelization to speed things up was when I had ~50,000 web pages to parse and reformat using a variety of functions based on HTML::TreeBuilder. None of the files cared what the other files looked like, so I could have processed several at the same time. But I got the basic system working serially, and it got the job done in a bit over 48 hours. This was too slow, so I did some simple things with forks and got it down to under a day, which was acceptable because the script only had to run once. But I remember thinking the forking code was the ugliest, and hardest to debug, in my program. If I had had 5 million web pages instead of 50,000, and had wanted to split the computation among multiple processors somehow, putting the job queue stuff together for it along the lines I had been using would have been a nightmare. Even though all I was doing was grep, with a little network communication inside the grep test function.

Later, when I came across the article, it all rang very true to me. Okay, I was working with forks, in a simple context, and this presentation is maybe more about threads with systems programming. But the basic difficulties apply to any situation where you are running stuff in an order that isn't guaranteed. It's different from running stuff through a simple, ordered, for loop in ways that are quite subtle and hard to detect. What frustrated me the most was the feeling that I couldn't encapsulate the logic that I want, that I have to keep writing the "ugly bits" again and again. This was before I had heard of functional programming.
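To make that canonical example concrete, here is roughly what the plain serial baseline looks like, before any parallelization. This is only a sketch: the URLs and the pattern are made-up placeholders, and page_matches is just an illustrative helper, not code from the post.

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Serial baseline: grep over URLs, where the test downloads the page
    # and checks it against a regex. Every iteration is independent of
    # the others, which is what makes a parallel version possible at all.
    sub page_matches {
        my ($url, $regex) = @_;
        my $html = get($url);              # returns undef on failure
        return defined $html && $html =~ $regex;
    }

    my @urls = ( 'http://example.com/a', 'http://example.com/b' );
    my @good = grep { page_matches( $_, qr/some expected text/ ) } @urls;
    print "passed: @good\n";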
Since then, I have been learning a lot about functional programming, and trying to incorporate it into my bag of tools. I originally became interested in this after reading Paul Graham's On Lisp. But I want to apply functional programming to make my life easier in perl. Joel Spolsky suggests in Can Your Programming Language Do This that functional programming is a good technique to hide the "ugly but important bits" of your code. In the article, Spolsky suggests that this is exactly what Google has done with their MapReduce algorithm. Google programmers can write code that says the equivalent of

    my @results = distributed_mapreduce_grep( $test_function, [ @in_array ] );

and this would do exactly what

    my @results = grep { $test_function($_) } @in_array;

would do in perl. Except that it works on a cluster. So you can process a lot more data, faster. And the "ugly bits" are hidden.

I want to start hiding the ugly bits of my code, using functional programming. The following is my attempt to do that. Unfortunately it still isn't working. But I think it's an interesting read, and I'm also hoping someone can plug something in there that will make it work. What's nice is that the "ugly bit" that isn't working is encapsulated. This is the function hashmap_parallel. Currently my "parallelization strategy" involves forking off processes and storing values in a DBM::Deep hard disk store. But actually I don't care about the implementation details, I just want it to work.

****************************
UPDATE: Thanks to LanceDeeply, I now have code that works, using threads for the parallelization. I kept the function that doesn't work as hashmap_parallel_forks, which I am still hoping to get working. The code that does work is called hashmap_parallel_threads. The test script has also been updated accordingly. If anyone else wants to shoot me some candidates for their favorite way of implementing transparent parallelization with map, I will add them to the catalog.
****************************

Hashmap here means essentially the same as the "map" half in MapReduce. It processes a set, where order doesn't matter. I have several hashmap functions in this code, two of which work, and one (the one that executes in parallel) which doesn't. These "mapping" functions are given as an argument to the function builder flexygrep, which returns a grepping function. So, as a consequence, two of my grepping functions work and one (the parallel one) doesn't. If I can get hashmap_parallel to work, I'm thinking I could theoretically use this to build other functions, like sort_parallel, permute_parallel, you name it. Sometimes, for efficiency reasons, this will make sense. A lot of the time it won't. It depends what your bottleneck is -- cpu, memory, network, disk io, etc. But the good news is that once the parallel mapping function -- or mapping functions -- work, you can just plug them in and try. That is a lot easier than writing threading code for all scenarios.

Again, the bit that I need to get to work -- but which will pay major dividends in maintainability when I do -- is hashmap_parallel. Now, here's a little test output.
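(To make the flexygrep / hashmap_* structure described above concrete: the following is only a rough, hypothetical sketch of that shape, not the actual code or test output from this post. hashmap_serial and hashmap_threads are stand-ins for my hashmap functions, and thread-per-item is just one naive way a thread-based mapper could work.)

    use strict;
    use warnings;
    use threads;    # assumes a threads-enabled perl

    # flexygrep: a function builder. Hand it any "hashmap" (an
    # order-insensitive map over a set) and it returns a grepping
    # function built on top of it.
    sub flexygrep {
        my ($hashmap) = @_;
        return sub {
            my ( $test, @items ) = @_;
            # the per-item block only looks at its own input, so the
            # mapper is free to run the items in any order, or in parallel
            return $hashmap->( sub { $test->( $_[0] ) ? $_[0] : () }, @items );
        };
    }

    # Plain serial mapper: same interface, no parallelism.
    sub hashmap_serial {
        my ( $code, @items ) = @_;
        return map { $code->($_) } @items;
    }

    # Naive thread-based mapper: one thread per item (fine for a handful
    # of items, wasteful for thousands).
    sub hashmap_threads {
        my ( $code, @items ) = @_;
        my @threads = map { threads->create( $code, $_ ) } @items;
        return map { $_->join } @threads;
    }

    # Build two greps from the same test logic; only the mapper differs.
    my $grep_serial   = flexygrep( \&hashmap_serial );
    my $grep_threaded = flexygrep( \&hashmap_threads );

    my @evens = $grep_threaded->( sub { $_[0] % 2 == 0 }, 1 .. 10 );
    print "@evens\n";    # 2 4 6 8 10

In the real code the interesting mappers are of course hashmap_parallel_forks and hashmap_parallel_threads rather than these toys, but the plugging-in works the same way.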
UPDATE: Seemingly relevant comments from Jenda at Re^3: Parrot, threads & fears for the future: "You can only transparently parallelize map{} if the executed block is side-effect-free....." I'm actually not sure if my code here is side-effect-free or not. Hm... (A small illustration of the distinction follows after the list below.)

*****************************************************

Posts I'm looking at to see if I can use something there to get hashmap_parallel to work:
• Re: Run N similar tasks in parallel
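For what it's worth, here is a tiny illustration (mine, not Jenda's) of that side-effect distinction:

    my @words = qw(foo bar foo baz);

    # Side-effect-free block: each evaluation depends only on its own
    # input, so the items could be processed in any order, or in parallel.
    my @lengths = map { length $_ } @words;

    # Not side-effect-free: the block reads and writes %seen, which is
    # shared across iterations, so the result depends on execution order.
    my %seen;
    my @first_uses = map { $seen{$_}++ ? () : $_ } @words;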