Testing many devices - are threads the answer?

McDarren has asked for the wisdom of the Perl Monks concerning the following question:

Greetings :)

I have a challenge for which I _think_ the use of threads is possibly the answer. However, as threading is completely new and uncharted (and somewhat daunting) territory for me, I felt that it may be wise to consult the monks before diving in and getting a mouthful ;)

My problem can be summarised as follows:

I have a number of remote Wireless Access Controllers (WAC's) - currently ~200, but growing
Each controller is responsible for its own group of access points (AP's), which can number anywhere between 1 and ~50
Each of the AP's needs to be tested on a regular basis to determine whether or not it is alive.
The current method for testing whether an AP is alive or not is via telnet.
This is done WAC by WAC, AP by AP, one after the other - which means that it takes quite a while to test them all.

In other words....

# pseudocode
foreach my $wac (keys %controllers) {
    foreach my $ap (@{ $controllers{$wac}{access_points} }) {
        # Is $ap alive?
    }
}

Currently, I run my test every 20 minutes, and it takes ~15 minutes to test every single AP. Because I need to allow for a time out, the run time will increase or decrease depending on the number of AP's detected as off line on any given run. Clearly, this is not scalable, and I'm going to have problems as the number of devices to be tested increases. So I need to find a more efficient way to do it.

I've done a bit of reading on Perl threads, and in particular the threads tutorial, and I suspect that what I'm looking for is a "Work Crew" model. But again, I'm not sure, so I'm seeking your collective advice...

Are threads what I should be using, or should I consider something else?
If the "Work Crew" model is the right one for my particular problem, can somebody point me to some example code that demonstrates how this model is implemented?

Many thanks,
Darren :)

PS. I probably should make mention of the fact that my current code is implemented as an extension for the Big Brother Network Monitoring Tool. That is, the reports that my code produces are fed into Big Brother.

Update: I feel quite overwhelmed by the number of responses and the variety of potential solutions offered. Proves once again the TIMTOWTDIness of Perl, and the value of Perlmonks as a community ('sif we didn't know these already). For the time being, I have a very quick & dirty implementation of Parallel::ForkManager working, which has solved my immediate problem. However, I fully intend to investigate most of the other solutions offered. So a big thank you to all those that offered their advice and comment :)


Replies are listed 'Best First'.
Re: Testing many devices - are threads the answer? by BrowserUk (Patriarch) on May 12, 2009 at 13:58 UTC
Seems an ideal application for threads. Something like this should get you started. You'll need to fill in the blanks.

#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;

my $logSem :shared;
sub LOG {
    lock $logSem;
    print @_, "\n";
}

sub worker {
    my( $Q ) = @_;
    require Net::Telnet;
    my $tn = Net::Telnet->new( Timeout => 10, ... );
    while( my $apip = $Q->dequeue ) {
        if( $tn->open( $apip ) ) {
            LOG( "$apip OK" );
            $tn->close;
        }
        else {
            LOG( "$apip: Failed" );
        }
    }
}

our $W ||= 15; ## Default to 15 threads; Don't get carried away!

## Create a Q to supply workers with work
my $Q = Thread::Queue->new;

## Create the worker threads, passing the Q handle
my @workers = map threads->create( \&worker, $Q ), 1 .. $W;

## Push the IPs onto the queue (iterate over the WAC names)
for my $foo ( keys %controllers ) {
    for( @{ $controllers{ $foo }{ access_points } } ) {
        $Q->enqueue( $_ );
    }
}

## Terminate worker loops
$Q->enqueue( (undef) x $W );

## Wait for the workers and join them when they're done
$_->join for @workers;

If your serial code takes 15 minutes, this should reduce it to ~1 minute. But don't get carried away increasing the number of threads, as there are diminishing returns.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re: Testing many devices - are threads the answer? by gwadej (Chaplain) on May 12, 2009 at 13:29 UTC
Threading is actually a good solution for this kind of problem. It also happens to be one of the situations where threading is even useful on a single CPU machine. Each thread should spend most of its time waiting. You might need to test to be sure that the library you are using to do the communications doesn't have any weird restrictions that stop threading, ...

G. Wade
Re: Testing many devices - are threads the answer? by derby (Abbot) on May 12, 2009 at 13:30 UTC
You don't say what platform but if *nix, I would go with a Parallel::ForkManager solution before I did threads.

-derby
Re^2: Testing many devices - are threads the answer? by McDarren (Abbot) on May 12, 2009 at 14:25 UTC
Thanks for the pointer. I gave this a go, and found that implementing it into my existing code was incredibly easy - and had the desired result. Using Parallel::ForkManager with 10 "workers" reduced my total run time from ~15 minutes to just under 2 minutes. This has basically solved my problem, but I'm still curious about threads, so I plan to have a look at the example provided by BrowserUk next.

Thanks again,
Darren :)
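For readers wondering what such a Parallel::ForkManager setup might look like, here is a minimal sketch. It is not McDarren's actual code, which was not posted. The names `ap_is_alive` and `check_all`, the shape of the controllers hash, and the "alive means the telnet port accepts a TCP connection" test are all assumptions made for illustration.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use Parallel::ForkManager;

# Assumed liveness test: can we open a TCP connection to the given port?
sub ap_is_alive {
    my ($ap, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $ap,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Test every AP of every controller, at most $workers at a time.
# Returns a hash ref of AP address => 'alive' or 'dead'.
sub check_all {
    my ($controllers, $workers, $port, $timeout) = @_;
    my %status;
    my $pm = Parallel::ForkManager->new($workers);

    # Runs in the parent each time a child exits; $ident is the AP address.
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident) = @_;
        $status{$ident} = $exit_code == 0 ? 'alive' : 'dead';
    });

    for my $wac (sort keys %$controllers) {
        for my $ap (@{ $controllers->{$wac}{access_points} }) {
            $pm->start($ap) and next;    # parent: go on to the next AP
            # child: report the result through the exit code
            $pm->finish(ap_is_alive($ap, $port, $timeout) ? 0 : 1);
        }
    }
    $pm->wait_all_children;
    return \%status;
}

# Example run with hypothetical AP addresses given on the command line:
#   perl check_aps.pl 10.0.0.1 10.0.0.2
if (@ARGV) {
    my %controllers = ( wac1 => { access_points => [@ARGV] } );
    my $status = check_all(\%controllers, 10, 23, 5);
    print "$_: $status->{$_}\n" for sort keys %$status;
}
```

Each child tests one AP and exits, so a slow or offline AP only ties up one of the worker slots while the parent keeps handing out the rest.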
Re: Testing many devices - are threads the answer? by roboticus (Chancellor) on May 12, 2009 at 13:44 UTC
McDarren:

While threading can do the job, I'd do something much simpler: I'd simply assign each WAC to a group, and run the same program multiple times, each instance handling a single group. That way, you get through all your WACs quickly enough (for some definition of quickly enough), and without the headache of worrying about threads.

It's not that I'm afraid of threads (I rather like them for some things). It's just that problems like this don't really need it. I generally think of using threads when I have multiple tasks that need to be coordinated with each other. But in your case, you don't really care¹ if you test WAC #3 a couple more times per hour than WAC #7, so long as they're all tested at least X times per hour.

If after adding some WACs you can't hit your minimum test interval, just split off some groups and start a few more instances--so it's really not hard to scale. Using multiple instances of the program even makes it easy to distribute the testing among multiple servers with no great effort.

¹ At least, I don't think you care...

...roboticus
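One way the grouping described above could be done is a simple round-robin split, so that each instance of the program is started with its own index and the total number of instances. This is only a sketch: the function name `wacs_for_instance` and the WAC names are hypothetical, and roboticus did not specify how groups should be formed.

```perl
use strict;
use warnings;

# Return the WACs that instance number $index (0-based) out of $count
# instances should test. Round-robin over the sorted names keeps the
# assignment stable from run to run.
sub wacs_for_instance {
    my ($wacs, $index, $count) = @_;
    my @sorted = sort @$wacs;
    return map { $sorted[$_] } grep { $_ % $count == $index } 0 .. $#sorted;
}

# Hypothetical WAC names, split across three instances
my @wacs = map { "wac$_" } 1 .. 7;
for my $i (0 .. 2) {
    my @mine = wacs_for_instance(\@wacs, $i, 3);
    print "instance $i: @mine\n";
}
```

Adding an instance then only means starting the program with a larger count; every WAC still lands in exactly one group.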
Re^2: Testing many devices - are threads the answer? by McDarren (Abbot) on May 12, 2009 at 14:42 UTC
Thanks Roboticus,

This was actually an approach that I had already considered, and was probably going to be my fall back if I couldn't find a more elegant solution. I think my main issue with this approach is that it is very much "hit and miss" in terms of creating the groupings - insofar as you can't really predict how long the testing of any particular group will take, and there might be very large discrepancies in individual run times, which I'd really have no control over. My preference is for a solution that will (roughly) do them all in one go, which I can run at fairly frequent intervals - say every 5 minutes.

Cheers,
Darren :)
Re: Testing many devices - are threads the answer? by zentara (Archbishop) on May 12, 2009 at 15:05 UTC
The one thing I will mention, just because no one else really did, concerns the use of the term "devices". The one thing in Linux (and Windows) that can still lock up a machine is a hanging device driver. Google for it.

Anyways, how that plays into the question of threads vs. forks is probably going to be left up to experimentation, and the quality of the device driver code. Personally, I would fork each device off. Since threads share filehandles and other things, you may have better luck with forking and letting each fork write your device status(es) to a small database (or similar file). Additionally, you will not have to worry about one device malfunctioning and taking the rest down through the thread connection because the driver code just hangs the kernel.

Using alarm may be useful too.....but last I checked, alarm doesn't work well in threads....the parent thread intercepts all signals.....but there may be improvements in current thread code internals.

Update: I was informed by a knowledgeable monk that these are remote devices, and the device-driver hassle may not occur, since you will probably be accessing them through the regular TCP/IP channels (internal mini-web server, telnet, or ssh2). But this sort of thing has been asked before...and timeouts (alarms) are usually needed to make things foolproof.

I'm not really a human, but I play one on earth. Old Perl Programmer Haiku
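The alarm-based timeout mentioned above is usually written as an eval block with a local ALRM handler. The sketch below shows that common idiom; the wrapper name `with_timeout` is made up for illustration, and it assumes a single-threaded or forked process on a platform where alarm() interrupts the blocking call (the caveat about threads raised above still applies).

```perl
use strict;
use warnings;

# Run $code, giving up after $seconds.
# Returns (1, result) on success and (0, undef) on timeout.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    alarm 0;                            # make sure no alarm is left pending
    return (1, $result) if $ok;
    die $@ unless $@ eq "timeout\n";    # re-throw anything that is not our timeout
    return (0, undef);
}

# Usage: wrap any potentially blocking check
my ($ok, $value) = with_timeout(2, sub { return 'pong' });
print $ok ? "got $value\n" : "timed out\n";
```

In a forked worker this gives each device test a hard upper bound, independent of whatever timeout the underlying library applies.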
Re: Testing many devices - are threads the answer? by Herkum (Parson) on May 12, 2009 at 13:31 UTC
Concurrency gives you simultaneous execution, but not necessarily speed. The question you really need to ask is what is the bottleneck. If it is in the code formatting the results, threading will not help you at all. If you are spending time waiting for the devices to respond, then threading will probably help you.

I would suggest you look at your code with Devel::NYTProf to get a detailed look at what your program is doing and then decide what you want to do.
Re^2: Testing many devices - are threads the answer? by McDarren (Abbot) on May 12, 2009 at 14:33 UTC
"The question you really need to ask is what is the bottleneck"

Although I haven't done any proper profiling, I'm quite certain (through observation of the logs that I create) that the bottleneck is caused through waiting for timeouts as each "offline" device is encountered. Whenever this happens, the script basically blocks and waits, before proceeding.

The problem this creates for me (which I forgot to mention in my OP) is that I have been getting a significant number of false negatives, due to the time out being set too low. But increasing the time out by just a few seconds causes the total runtime to increase dramatically, hence my need to look at something like threading.

Cheers,
Darren :)
Re^3: Testing many devices - are threads the answer? by Herkum (Parson) on May 13, 2009 at 13:36 UTC
Chances were that timeouts would be the real issue, especially with hardware. However, I mentioned this anyway because, no offense, too many people jump to a solution without understanding the problem. I wanted to make sure that at least someone thought about this before heading down the wrong path.
Re: Testing many devices - are threads the answer? by shmem (Chancellor) on May 13, 2009 at 13:02 UTC
What kind of test are you doing via telnet? If you just want to check connectivity, a select loop on multiple sockets would suffice, no threads needed:

$\=$/;
use IO::Socket::INET;

my @hosts = qw(pantagruel gargantua foo bar quux);
my (%active,%sockets);
my $maxconn = 30; # or as much as your OS permits ;-)
my $timeout = 20;

sub setup_sock {
    my $fh = eval {
        IO::Socket::INET->new(
            PeerAddr => "$_[0]:23",
            Blocking => 0,
        )
    };
    unless ($fh) {
        print "$_[0] not reachable";
        return;
    }
    $active{$_[0]} = $fh;
    $sockets{fileno($fh)} = $_[0];
    # print "$_[0] fileno ",fileno($fh);
}

sub close_sock {
    my $fileno = shift;
    my $host = $sockets{$fileno};
    $active{$host}->close;
    delete $active{$host};
    delete $sockets{$fileno};
}

sub set_bits {
    my $rin = '';
    for (keys %active) {
        vec($rin, fileno($active{$_}),1) = 1;
    }
    $rin;
}

while (@hosts || keys %active) {
    # $c++; print "pass $c";
    setup_sock(shift @hosts) while ($maxconn > keys %active and @hosts);
    my $rin = set_bits;
    my $rout;
    my $nfound = select($rout=$rin,undef,undef,$timeout);
    # if there has been a timeout, there has been no response,
    # and $rout has no bit set. mark those hosts as unreachable
    if (!$nfound) { # select returns 0 on timeout
        for (keys %sockets) {
            print "no answer from $sockets{$_} (fileno $_)";
            close_sock($_);
        }
        next;
    }
    # check every open fileno, not just the first bit of the vector
    for (keys %sockets) {
        if (vec $rout,$_,1) {
            print "$sockets{$_} ($_) is alive";
            close_sock($_);
        }
    }
}

If there's more to it, i.e. chatting with the AP, then you might want to fire off threads for the check part (i.e. the `close_sock()` sub above.)

update: mhm. Maybe the bottleneck is moved to `IO::Socket::INET->new()` that way...

update2: nope ;-) ... you get to wait the full timeout only once per $maxconn if the %active hash is filled up with unresponsive peers.
Re: Testing many devices - are threads the answer? by okram (Monk) on May 13, 2009 at 13:34 UTC
Thought of using POE sessions in a tree formation? One master session creates (every 20 minutes so first with delay=>0, then with delay=>60*20) one worker session per WAC, and that worker session handles one session per physical AP. If you use the right POE Wheels, you'll then be able to run those blocking AP tests "concurrently" in each of the AP sessions. You'd even be able to specify what happens on error conditions. Have a look at poe.perl.org and see if this may suit you. POE is quite powerful.
Re: Testing many devices - are threads the answer? by wol (Hermit) on May 13, 2009 at 10:59 UTC
This sounds like a problem with lots of potential solutions. The approach I was thinking about depends on how you need to test each individual AP. You say it's via telnet, and (in a follow up posting) that you want to increase the timeout to avoid false negatives, so maybe the algorithm looks like this:

Open socket
Connect (TCP) to telnet port on AP
Send "Are you OK" message
Wait for long enough for healthy AP to respond
Read from socket
Work out from reply whether AP is OK or not
Close socket

If this is the algorithm for each AP, then this can be scaled up either by creating threads (or processes) which can all follow the same sequence (as described above) or alternatively just process an array of APs in parallel:

Open 10 sockets
Connect to 10 APs
Send 10 messages
Wait Once
...

Maybe this approach isn't suitable for you (maybe your AP testing is too wrapped up in a module/DLL) but it might be suitable for someone else with a generally similar problem who stumbles across these posts in future.

-- use JAPH; print JAPH::asString();
Re^2: Testing many devices - are threads the answer? by BrowserUk (Patriarch) on May 13, 2009 at 14:13 UTC
"Open 10 sockets"

You are still opening the 10 sockets serially, and if the first machine isn't there, you will have to wait for the open attempt to fail--timeout--before moving on to the second...
Re^3: Testing many devices - are threads the answer? by shmem (Chancellor) on May 13, 2009 at 14:31 UTC
"you will have to wait for the open attempt to fail--timeout--"

Not true if you open the socket in non-blocking mode.
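To make the non-blocking point concrete, here is a small sketch using IO::Select from the core distribution. It starts every connect without waiting, then waits once for all of them together. The function names `start_connect` and `probe_hosts` are hypothetical, and "alive" is again assumed to mean only that the TCP connect completed without error.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;
use Socket qw(SO_ERROR);

# Start a connect without waiting for it to finish. Returns the socket,
# or undef if the attempt failed immediately (e.g. connection refused).
sub start_connect {
    my ($host, $port) = @_;
    return IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Blocking => 0,
    );
}

# Wait up to $timeout seconds for all pending connects together.
# Returns a hash ref of host => 1 (connected) or 0 (failed or timed out).
sub probe_hosts {
    my ($timeout, $port, @hosts) = @_;
    my (%alive, %host_of);
    my $sel = IO::Select->new;
    for my $host (@hosts) {
        $alive{$host} = 0;
        my $sock = start_connect($host, $port) or next;
        $host_of{$sock} = $host;
        $sel->add($sock);
    }
    my $deadline = time + $timeout;
    while ($sel->count and (my $left = $deadline - time) > 0) {
        for my $sock ($sel->can_write($left)) {
            # A finished connect makes the socket writable;
            # SO_ERROR tells us whether it succeeded (0) or failed.
            my $err = $sock->sockopt(SO_ERROR);
            $alive{ $host_of{$sock} } = 1 if defined $err && $err == 0;
            $sel->remove($sock);
            $sock->close;
        }
    }
    return \%alive;
}
```

With this shape, one unreachable host costs at most one shared timeout for the whole batch instead of one timeout per host. Note that name resolution inside `IO::Socket::INET->new` still blocks, so numeric addresses are the safer input.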
Re^4: Testing many devices - are threads the answer? by BrowserUk (Patriarch) on May 13, 2009 at 14:39 UTC
Re^5: Testing many devices - are threads the answer? by shmem (Chancellor) on May 13, 2009 at 20:36 UTC
Some notes below your chosen depth have not been shown here
