So, a while back, I wanted to pull stats from my cable modem, so I wrote Device::CableModem::Zoom5341. I want it to keep working in the future, so I wrote tests for it too. And even better, CPAN has that whole infrastructure to run the tests across all sorts of platforms and tell me what goes on. So far, so good.
What's not so good, though, is when tests fail. Well, when they fail because I broke something, that's good, because I find out about it. But when they fail because of external factors, it's just noise to me, and unhelpful to other people looking at the results.
That's happening in this case. On the plus side, I know why. Even better, I know how to fix it so it never happens again. But, I'd kinda rather not, so I'm hoping somebody has an idea for a third option.
The quick overview is that the module works by grabbing an HTML page via HTTP from the modem and scraping it. The test causing the problem fakes an HTTP fetch to test that part of the process.
Naturally, there's a test that actually runs the whole stack, all the way down to an actual cable modem. But I know people might want to run the test suite without having a device on their network, or without hitting the device if it is there, so that test isn't enabled by default. It only runs if you set an env variable (this is encapsulated in t/02-realfetch.t in the dist).
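For anyone unfamiliar with that kind of gate, here is a minimal sketch of the idea. The variable name CM_TEST_REAL_FETCH and the helper skip_reason are made up for illustration; the name t/02-realfetch.t actually checks may differ.

```perl
use strict;
use warnings;

# Hypothetical variable name, chosen only for this sketch.
my $ENV_NAME = 'CM_TEST_REAL_FETCH';

# Return a human-readable skip reason, or undef (in scalar context)
# when the live-device test should run.
sub skip_reason {
    my ($env) = @_;
    return if $env->{$ENV_NAME};
    return "set $ENV_NAME=1 to run tests against a real modem";
}

my $reason = skip_reason(\%ENV);
print defined $reason ? "would skip: $reason\n" : "would run live tests\n";
```

In a real test file the last two lines would become something like `plan skip_all => $reason if defined $reason;` using Test::More.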
However, I have another test that uses HTTP::Daemon (from LWP) to serve a (not actually HTML) file locally to test the fetching routine. See t/02-fetch.t (on CPAN). And in rare cases, this is failing for some CPAN testers. My guess is that it's because they're running in an environment that can't actually talk to itself via 127.0.0.1. Which is unusual perhaps, but still a perfectly valid system setup.
There were several failures with the 1.00 release. A main purpose of 1.01 was to add some extra instrumentation to the test to see where it fell down. There was only one failure with 1.01. From the "Timed out" message from the child side and the lack of any real data coming back, it sounds like it just can't talk to itself.
OTOH, it sounds like the earlier test (when it's supposed to get a 404) did succeed, which means it could talk to itself for a moment. And it should have failed earlier and skipped the tests if it couldn't bind() to 127.0.0.1 initially. So I'm not actually certain what's happening here. It's possible the alarm() killed it before it got a chance to make/answer the second request, but it would have to be a very loaded system for it to take 3 seconds to make two requests. Or maybe it just sent the signal way early. There's probably a reasonable chance I just did something stupid, so if anybody can point it out to me, I'd be grateful.
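One way to narrow down which request is hitting the alarm would be to give each step its own timeout and report the outcome instead of letting one alarm cover everything. This is only a sketch of the standard eval/alarm idiom from perlfunc, not what t/02-fetch.t currently does; with_timeout is a hypothetical helper name.

```perl
use strict;
use warnings;

# Run $code with its own alarm. Returns a hashref: { ok => 1, value => ... }
# on success, or { ok => 0, error => ... } if it died or timed out.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };   # the \n is required
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm is left pending if $code died
    return { ok => 1, value => $result } if $ok;
    return { ok => 0, error => $err };
}

# A fast step finishes normally; a slow one is reported as a timeout.
my $fast = with_timeout(1, sub { 42 });
my $slow = with_timeout(1, sub { select(undef, undef, undef, 5); 'late' });
print "fast ok=$fast->{ok}, slow ok=$slow->{ok}\n";
```

Wrapping the 404 request and the real request separately would at least show in the tester report which of the two ran out of time.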
Now, I could avoid all the noise easily enough just by adding a conditional like the one on the realfetch test, so this test isn't run by default. I'd rather not do that though; any test not run by default won't be run very often, even by me as the developer. And this is the only test that checks that the requests actually run correctly all the way down to the network, and that the results are stored correctly. And going the other way (env var to disable this test) is pointless, since nobody is going to set that either...
So, the failure above may or may not actually be a case of "can't see myself at 127.0.0.1". If it's not, I could use some help with exactly what it is. But either way, the "can't see" case is sure to exist, and if it's due to firewalling or similar network configuration rather than something like "address not assigned", it may not be caught by HTTP::Daemon->new failing.
However it breaks down, I'm seeking advice on how I can robustly keep running this sort of test if at all possible, but cleanly skip things if something in the system environment makes it impossible. I know that it's impossible to really ensure it both ways, but surely it can be done better than what I have now.
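To make the "cleanly skip" half concrete, one possible shape is a preflight check that does a full TCP round trip over 127.0.0.1 before the real test starts, using only core modules. This is a sketch under assumptions, not a known fix: loopback_problem is a hypothetical helper, the 2 second default is arbitrary, and a firewall that treats the test's own port differently would still slip past it.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Try a complete loopback round trip: listen, connect, accept, send, receive.
# Returns undef (in scalar context) if it works, else a string saying which
# step failed.
sub loopback_problem {
    my ($timeout) = @_;
    $timeout //= 2;

    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => 0,             # let the OS pick a free port
        Proto     => 'tcp',
        Listen    => 1,
    ) or return "cannot listen on 127.0.0.1: $@";

    my $client = IO::Socket::INET->new(
        PeerAddr => '127.0.0.1',
        PeerPort => $listener->sockport,
        Proto    => 'tcp',
        Timeout  => $timeout,
    ) or return "cannot connect to 127.0.0.1: $@";

    IO::Select->new($listener)->can_read($timeout)
        or return "connection never reached the listener";
    my $server = $listener->accept
        or return "accept failed: $!";

    defined $client->syswrite("ping\n")
        or return "write failed: $!";
    IO::Select->new($server)->can_read($timeout)
        or return "no data arrived within ${timeout}s";

    my $got = $server->sysread(my $buf, 5);
    return "unexpected data on loopback"
        unless defined $got && $buf eq "ping\n";
    return;
}

my $problem = loopback_problem();
print defined $problem ? "would skip: $problem\n" : "loopback looks usable\n";
```

The test file could then call `plan skip_all => $problem if defined $problem;` before starting HTTP::Daemon, so a broken loopback shows up as a skip with a reason rather than a FAIL report. It can't prove the later requests will succeed, but it separates "this environment can't do loopback at all" from "something went wrong mid-test".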