PerlMonks  

I/O Watchdog Daemon

by IdleResonance (Acolyte)
on Aug 22, 2012 at 03:07 UTC (#988937=perlquestion)
IdleResonance has asked for the wisdom of the Perl Monks concerning the following question:

This is by far the simplest yet most terrifying Perl solution that I have ever developed. I have written entire mod_perl websites and associated toolsets used by several Fortune 100 companies, but I can't bring myself to implement this on one of my own personal systems.

This system is booted (/boot, /, /var, /usr, etc...) off of an OCZ PCI-Express SSD card. Unfortunately, every 1-3 months the card locks up, resulting in I/O errors. When that happens, even if I have an open shell on the system, I am only able to run programs that are currently in the disk cache in memory. Attempting to exec anything else results in I/O errors until a reset. There are usually no hands near the box, and unfortunately I do not have access to out-of-band management of any kind to reset it (IPMI, BMC, iLO, RSA, DRAC, PDU, eRIC, ATEN IP8000, etc...). I feel that my only option is to attempt to read a few KB from one of the ~4K files in /usr/bin and do an immediate (or postponed) reset if I get an EIO error back from sysread. I need to be able to reset the box immediately if I cannot read from one of the logical volumes on the SSD.

Keep in mind this is a very preliminary rough draft, and I do not take posting to PerlMonks lightly. However, I would appreciate any input on the logic.

  • Specific Concerns
    • Is it possible that the "next" statement after the open statement could result in going to the start of the loop in the case of an I/O error? My gut says that this will be cached in the VFS and there are no errors from open for I/O that I could find in the open(2) man page.
    • I have *NO* way to test this script/daemon. How can one emulate EIO?
    • This is a desperate situation. I cannot afford to replace the SSD with a proper RAID-1 but I do have good backups.

    #!/usr/bin/perl
    use strict;
    use Errno;

    # 3740 files, 3739 max index
    my @files = glob("/usr/bin/*");
    my $range = $#files; #scalar(@files)-1;

    while (1) {
        my $random = int(rand($range));
        my $filename = $files[$random];
        my $buffer;

        print STDERR "Checking $filename - ( $random / $range )\n";
        #my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = stat($filename);

        open(FILE, "< $filename") or next;
        my $retval = sysread(FILE, $buffer, 8192);
        if (!defined $retval) {
            if ($!{EIO}) {
                print STDERR "I/O Error on $filename. Resetting.\n";
                #system("/var/tmpfs/reboot -nf"); -- MUST BE IN MEMORY FS AND STATICALLY LINKED. system calls /bin/sh but should be cached in memory
                #exec {'/var/tmpfs/reboot'} '-nf'; -- MUST BE IN MEMORY FS AND STATICALLY LINKED
            }
        }
        close(FILE);
        sleep(15);
    }
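
On the question of emulating EIO: one way to exercise the decision logic without a failing disk is to pass the read step in as a code reference, so that a test can substitute a reader that fails with $! set to EIO. The following is only a minimal sketch under that assumption. The names check_file, real_reader, and fake_eio_reader are hypothetical and not part of the draft above, and the sketch tests the branch logic only, not how the kernel behaves when the SSD really locks up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EIO);    # also makes %! available

# Hypothetical helper: run one probe through an injectable reader.
# The reader returns the number of bytes read, or undef with $! set.
sub check_file {
    my ($filename, $reader) = @_;
    my $retval = $reader->($filename);
    return 'ok' if defined $retval;
    return $!{EIO} ? 'reset' : 'ignore';
}

# Real reader: the same open/sysread pair as the draft script.
sub real_reader {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or return undef;
    my $retval = sysread($fh, my $buffer, 8192);
    my $saved_errno = $! + 0;    # close() may overwrite $!
    close($fh);
    $! = $saved_errno;
    return $retval;
}

# Fake reader for tests: always fails as if the device returned EIO.
sub fake_eio_reader {
    $! = EIO;
    return undef;
}

print "simulated EIO -> ", check_file('/usr/bin/anything', \&fake_eio_reader), "\n";
```

For an end-to-end test on a scratch Linux machine, the device-mapper "error" target is one possible way to get real I/O errors from a block device, though whether it reproduces this particular card's failure mode is an open question.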

Re: I/O Watchdog Daemon
by aitap (Deacon) on Aug 22, 2012 at 07:35 UTC

    Not perl-related, but you can use memlockd to store any files you need (including their dynamic library dependencies) in memory. You can use busybox for rebooting and other useful manipulations.

    No idea how to detect I/O lockup properly, though.

    Sorry if my advice was wrong.
      Looks like a very useful tool to have. I've already installed it on the server and you can bet your ass that /etc/memlockd.cfg is about to triple in size. Thanks for the tip!
Re: I/O Watchdog Daemon
by GlitchMr (Sexton) on Aug 22, 2012 at 09:25 UTC

    I would start from the assumption that the "system is out to get you": assume that trying to read files will fail, and that exactly how it fails doesn't matter.

    For example, where you have "open ... or next", I would instead rerun glob and try again (to protect against files being removed from /usr/bin; you wouldn't want that to reboot the system, right?). As for the basic idea, I have something similar to your script, except more... paranoid (when anything goes wrong, it reboots). You don't need to know what will happen when EIO appears: some condition in the code (or the autodie pragma) will catch that something is wrong (what exactly is wrong doesn't matter).

      Interesting script. I'll dig through it a bit later on this evening. As per your concerns, if a file were removed it would not cause a reboot. Instead, the open would fail and it would try the next file. I will be adding a test to ensure that it is a file, but even in the case that it is a directory, the sysread will return EISDIR instead of EIO and would just sleep for 15 seconds and try the next file.
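
The two ideas in this exchange (rerun the glob so that removed files drop out, and test that the candidate is a regular file) could be sketched roughly as follows. pick_candidate is a hypothetical helper name, and the directory is passed in only so the sketch can be run against something other than /usr/bin.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the candidate list on every call, so files
# removed since the last pass drop out, and keep only regular files.
# Assumes $dir contains no whitespace, since glob splits patterns on spaces.
sub pick_candidate {
    my ($dir) = @_;
    my @files = grep { -f $_ } glob("$dir/*");
    return undef unless @files;
    return $files[ rand @files ];    # index is truncated to an integer
}

my $candidate = pick_candidate('/usr/bin');
print defined $candidate ? "Would check $candidate\n" : "No regular files found\n";
```

One caveat with this approach: glob and -f touch the filesystem themselves, so on a wedged device they may be answered from cache or may fail quietly, in which case a file is simply skipped rather than reported.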

Re: I/O Watchdog Daemon
by flexvault (Parson) on Aug 22, 2012 at 13:37 UTC

    IdleResonance,

    I don't have any of your hardware, but I'd like to give you some thoughts:

    • You're using the SSD for performance, but if it hangs, performance is worthless. Have you tried using cron to reboot every Saturday night (or whenever usage is low)? Stress is reduced immediately. If it needs to be done daily, so be it. I have done this successfully on an AIX box, so it should work with Linux.

    • Have you tried Knoppix, the in-memory Linux? Use the SSD for performance, but reboot when the SSD fails. The operating system is in memory and on the read-only DVD/CD, so the reboot should work without needing anything from the SSD.

    Perl depends on the operating system, so if the OS is failing, Perl may also fail and your no better off.

    If at the moment this is a 'desperate situation', make it easier on yourself by using the system to ease your life.

    Good Luck!

    "Well done is better than well said." - Benjamin Franklin

      Perl may also fail and your no better off.

      "his" no better off? "his" what?

      You are looking for "you're" meaning "you are" not "your" meaning "belonging to you"

      This is like the third or fouth time I noticed you're typo and duty called :)

        Dear Monks,

        But you understood!

        Personally, I find programming and grammar to use opposite sides of my brain, so today I'm programming.

        I also find that I type 'perl' for 'per' all the time now.

        Thank you

        "Well done is better than well said." - Benjamin Franklin

      I have considered doing weekly reboots. It would cause unwanted downtime, but on the other hand it is far cleaner and safer to do "shutdown -r now" than "reboot -nf" :) I still might do this -- or both.

      Not sure what you mean about using Knoppix. I have no hands on the box. It's 2000 miles away and I don't have out of band management.

      I wouldn't say that the OS is failing... In this state, the kernel is fine and processes are still responding (so long as they're not accessing the SSD). Since the daemon is running in memory, then it should be fine. It's the potential EIO failures that I want to detect that are the primary issue and if I can trigger the reboot -nf without any disk I/O then I think it will be an acceptable band-aid until the situation can be resolved permanently.

      Thanks!

        IdleResonance,

        I wasn't trying to solve your problem, but to help you start 'thinking outside the box'.

        You said you do not know how to test the Perl solution. For me, if I can't test a solution, then I wouldn't depend on it working, but that's me.

        I have never used an SSD, but I have been told that they can reboot in less than a minute. Since you have the equipment, you know that answer. Only you can evaluate and weigh the value of a minute of downtime versus unpredictable downtime.

        But I think you're on your way! ( No giggling AM! )

        Good Luck!

        "Well done is better than well said." - Benjamin Franklin

Re: I/O Watchdog Daemon
by CountZero (Bishop) on Aug 22, 2012 at 16:13 UTC
    Is it possible that the I/O errors cause the open to fail? If so, you should not use next as you will then not catch the EIO condition.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      I asked that very question under the "Specific Concerns" section and haven't seen an answer yet. open(2) does not have any I/O error conditions to check against. As I said, this is a rough draft and I'd appreciate any insight as to that logic.
        So, unless you are certain open cannot fail on an EIO, don't test if the open succeeded and directly go to the sysread and catch your EIO there.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
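
A variation on CountZero's point, sketched here as an assumption rather than as anyone's posted code: instead of skipping the open check entirely, the probe can test for EIO after both open and sysread, so it does not matter which call reports the error. The helper name probe and its return values are hypothetical.

```perl
use strict;
use warnings;
use Errno;    # loading Errno makes %! available

# Hypothetical helper: returns 'eio' if either open or sysread fails
# with EIO, 'skip' for any other failure, and 'ok' on a successful read.
sub probe {
    my ($filename) = @_;
    my $fh;
    unless (open($fh, '<', $filename)) {
        return $!{EIO} ? 'eio' : 'skip';
    }
    my $retval = sysread($fh, my $buffer, 8192);
    my $status = defined $retval ? 'ok'
               : $!{EIO}         ? 'eio'
               :                   'skip';
    close($fh);
    return $status;
}

print probe('/usr/bin/perl'), "\n";
```

With this shape, the main loop would trigger the reset only on 'eio' and sleep on anything else, which keeps a removed file or a directory from causing a reboot.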
Re: I/O Watchdog Daemon
by runrig (Abbot) on Aug 22, 2012 at 17:18 UTC
    This won't help you at all, but just to be a pedant and provide a correct example for anyone else wanting to select a random file:
        my @files = glob("/usr/bin/*");
        my $range = $#files; #scalar(@files)-1;
        while (1) {
            my $random = int(rand($range));
    should be (if you want to include the possibility of getting the last file in /usr/bin):
        my @files = glob("/usr/bin/*");
        while (1) {
            my $random = int(rand(@files));
    But GlitchMr has it correct in his answer.
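
To make runrig's off-by-one point concrete, here is a small sketch that counts which indices actually come up. The four-element list is a stand-in for the glob result, and random_index is a hypothetical helper name.

```perl
use strict;
use warnings;

# Stand-in list for the sketch; the thread uses glob("/usr/bin/*").
my @files = qw(a b c d);

# rand($count) returns a value in [0, $count), so int() yields an index
# from 0 through $count - 1. Passing $#files instead of the element
# count would make the last index unreachable.
sub random_index {
    my ($count) = @_;
    return int(rand($count));
}

my %seen;
$seen{ random_index(scalar @files) }++ for 1 .. 10_000;
print "indices drawn: ", join(',', sort { $a <=> $b } keys %seen), "\n";
```

Over that many draws every index from 0 to $#files should show up, whereas the same loop with random_index($#files) would never produce the last one.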

Node Type: perlquestion [id://988937]
Front-paged by BrowserUk