http://www.perlmonks.org?node_id=1068673

This post is aimed at the people who have heard of Perl threading, and think it intriguing - but haven't really gotten to grips with how it's done. I'm going to put together a ... template, if you like, for a very basic style of threading.

Parallel code is somewhat hazardous for the unwary. Because you're 'splitting' your program and letting different parts run at different speeds, you can end up with some incredibly frustrating and hard-to-track bugs. Every thread is a race condition waiting to happen, so all the bad habits you've picked up when coding in Perl may well come back to bite you if you 'thread it'.
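To make 'a race condition waiting to happen' a bit more concrete, here's a minimal sketch (not part of the ping example, and assuming a Perl built with thread support). Four threads bump one shared counter. The names '$counter' and 'bump' are made up for illustration. '$counter++' is a read, an add and a write, so without the 'lock' two threads can read the same old value and one of the increments gets lost.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

# Illustrative only: one counter visible to every thread.
my $counter :shared = 0;

sub bump {
    for ( 1 .. 1000 ) {
        # The lock is held until the end of the enclosing block,
        # i.e. for one pass through this loop.
        lock($counter);
        $counter++;
    }
}

my @workers = map { threads->create( \&bump ) } 1 .. 4;
$_->join() for @workers;

print "counter = $counter\n";
```

With the lock in place the total comes out at 4000 every time. Take the 'lock' line out and it may come up short, and not by the same amount on each run, which is exactly what makes this sort of bug so hard to track.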

The simplest thing to thread is what's known as an 'Embarrassingly Parallel' problem. It's a type of problem where there are multiple tasks, but no dependencies, communication or synchronisation needed between them.

When 'doing' parallel code, you start to think in terms of scalability and efficiency - every thread start has an overhead. So does every communication between threads. However, the most 'expensive' task is synchronising all your threads - they all have to wait until the slowest thread 'catches up'.

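Thankfully - an 'embarrassingly parallel' problem has almost none of these things. You still pay to start each thread, but the tasks don't need to talk to each other, and the only synchronisation is waiting for them all to finish at the end.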

An example I might use is pinging 1000 servers. You want to ping each of them, but you don't need to do so in any particular order. However, if a server is offline, then a 'ping' will wait for a timeout, making the process a lot slower.

The only thing you have to worry about is that if you ping them 'all at once', you might end up sending a lot of data across the network.

This is a near perfect example of a type of problem I encounter regularly, and so I give it as example code.

Perl actually has quite a good way of 'spotting' embarrassingly parallel stuff - the 'foreach' loop is often a good sign.

If you're doing the same thing to every item in a list, and no item depends on the result of another, then there's a good chance the loop is suitable for parallelisation. You may not gain a large advantage from doing it though - the real advantage of threading is in making use of multiple system resources - processors, network sockets, etc. It's not the only way of achieving that result, and it will - as a result - 'hog' more of a system's resources while it runs (but hopefully for less time).
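As a rough sketch of what that warning sign looks like, here is the serial shape of the problem. 'check_server' and the server names are hypothetical stand-ins (the real thing would run a ping); the point is only that each pass through the loop is independent of all the others.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the real work (e.g. a ping).
# Here a server 'fails' if its name starts with "down-".
sub check_server {
    my ($server) = @_;
    return $server !~ /^down-/;
}

# Made-up server names, purely for illustration.
my @servers = qw( alpha beta down-gamma delta );
my @failed;

# Same operation on every item, and no pass needs the result
# of any other pass: a candidate for parallelisation.
foreach my $server (@servers) {
    push @failed, $server unless check_server($server);
}

print "$_ failed\n" for @failed;
```

Run serially, every slow item holds up everything behind it in the list, which is where the time goes when some of the servers are offline.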

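To break down the task:

- Read the list of servers and put each one on a queue.
- Start a fixed number of worker threads.
- Each worker takes servers off the queue and pings them, putting any that fail on a second queue.
- Wait for all the workers to finish.
- Report the servers that failed.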

Which looks a bit like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use threads;
    use Thread::Queue;

    my $nthreads = 5;

    my $process_q = Thread::Queue -> new();
    my $failed_q  = Thread::Queue -> new();

    #this is a subroutine, but that runs 'as a thread'.
    #when it starts, it inherits the program state 'as is'. E.g.
    #the variable declarations above all apply - but changes to
    #values within the program are 'thread local' unless the
    #variable is defined as 'shared'.
    #Behind the scenes - Thread::Queue are 'shared' arrays.
    sub worker {
      #NB - this will sit a loop indefinitely, until you close the queue.
      #using $process_q -> end
      #we do this once we've queued all the things we want to process
      #and the sub completes and exits neatly.
      #however if you _don't_ end it, this will sit waiting forever.
      while ( my $server = $process_q -> dequeue() )
      {
        chomp ( $server );
        print threads -> self() -> tid(). ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ( $? ) { $failed_q -> enqueue ( $server ) }
        print $result;
      }
    }

    #insert tasks into thread queue.
    open ( my $input_fh, "<", "server_list" ) or die $!;
    $process_q -> enqueue ( <$input_fh> );
    close ( $input_fh );

    #we 'end' process_q - when we do, no more items may be inserted,
    #and 'dequeue' returns 'undefined' when the queue is emptied.
    #this means our worker threads (in their 'while' loop) will then exit.
    $process_q -> end();

    #start some threads
    for ( 1..$nthreads )
    {
      threads -> create ( \&worker );
    }

    #Wait for threads to all finish processing.
    foreach my $thr ( threads -> list() )
    {
      $thr -> join();
    }

    #collate results. ('synchronise' operation)
    while ( my $server = $failed_q -> dequeue_nb() )
    {
      print "$server failed to ping\n";
    }

Now, this _is_ a very simple model of a 'threaded' task - and it will only suit situations where there are no dependencies on the results.
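The comments in the worker code mention that changes are 'thread local' unless a variable is 'shared'. Here's a small sketch of what that means in practice, assuming a threaded Perl and the threads::shared module; '$plain' and '$shared' are made-up names for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my $plain = 0;              # each thread gets its own copy of this
my $shared :shared = 0;     # one value, visible to every thread

# The thread changes both variables, but only the shared one
# is visible to the main program afterwards.
threads->create( sub { $plain = 42; $shared = 42; } )->join();

print "plain = $plain, shared = $shared\n";    # plain = 0, shared = 42
```

This is why the example passes its results back through a Thread::Queue rather than pushing onto an ordinary array: an ordinary array would be filled in the worker's private copy and lost when the thread exits.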