http://www.perlmonks.org?node_id=1005673

Dave Howorth has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that is giving me headaches. It stops in an obscure way and I haven't managed to figure out a way to identify the fault. I'd welcome any thoughts on how to narrow down the cause.

I have a Perl program that runs for a couple of days and then crashes. It scans a database and creates various files for each row in some tables. One of the files is a graph, created using the GraphViz2 module. GraphViz2 runs a binary via IPC::Run to do the heavy lifting. The binary is /usr/bin/dot from the Graphviz release, and IPC::Run reports that it failed with the message "Argument list too long". I don't understand where that error message comes from, because I didn't think IPC::Run invoked a shell.

But if I restart the program from that point in the database, it creates the graph just fine, runs on for another couple of days or so and then crashes with the same error message on another seemingly random record.

I can't find any evidence of running out of memory or similar problems.

I'm really not sure where I should be looking to localise the problem, and would welcome any suggestions.

Replies are listed 'Best First'.
Re: how to localise a problem?
by choroba (Cardinal) on Nov 26, 2012 at 14:34 UTC
    As far as my memory and Googling skills reach, "Argument list too long" is not a shell error, it is reported by the kernel. The argument space limit is set to 1/4 of the stack, so you might postpone the crash by increasing the stack size - but there is probably something somewhere leaking in the process, so try to detect the source. I have used valgrind several times for similar tasks, but it slows the process significantly.
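For reference, the kernel returns this error (E2BIG) from execve() when the combined size of the argument list and the environment exceeds the limit, so the environment counts as well as the arguments. A minimal Perl sketch for checking both sizes before the call might look like this. The helper names `e2big_message`, `env_bytes` and `argv_bytes` are made up for illustration, and the byte counts only approximate what the kernel measures.

```perl
use strict;
use warnings;
use Errno qw(E2BIG);

# Return the text the C library associates with E2BIG.
sub e2big_message {
    local $! = E2BIG;
    return "$!";
}

# Approximate the bytes needed for the environment:
# each entry is "KEY=VALUE" plus a terminating NUL byte.
sub env_bytes {
    my (%env) = @_;
    my $total = 0;
    for my $key (keys %env) {
        my $value = defined $env{$key} ? $env{$key} : '';
        $total += length($key) + 1 + length($value) + 1;
    }
    return $total;
}

# Same estimate for an argument vector: each argument plus its NUL.
sub argv_bytes {
    my (@argv) = @_;
    my $total = 0;
    $total += length($_) + 1 for @argv;
    return $total;
}

printf "E2BIG means: %s\n", e2big_message();
printf "environment: %d bytes, argv: %d bytes\n",
    env_bytes(%ENV), argv_bytes('/usr/bin/dot', '-Tpng');
```

Putting the environment figure into the die message is cheap, and would show whether %ENV has grown over the run.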
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks. That removes one mystery if it's a kernel message! I'm not seeing any evidence of leaks, although sometimes these things flare up suddenly and disappear almost before you can catch them.

      I think any kind of trace, monitor or logging is pretty much a non-starter given how long the process runs before it crashes. The monitoring will slow it down and/or create huge logs.

      What I have done is put some extra variables into the die message from IPC::Run so hopefully in another couple of days I should know more about the argument data. Whether that will help, I'm not sure, since the program happily recalculates the particular graph when it is restarted.
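One way to keep some history without huge logs is a bounded in-memory buffer that is only written out when the failure happens. The following is a rough sketch, not anything GraphViz2 or IPC::Run provides; `make_ring_log` and its `add`/`dump` entries are hypothetical names.

```perl
use strict;
use warnings;

# Keep only the most recent $max entries; older ones are discarded.
sub make_ring_log {
    my ($max) = @_;
    my @entries;
    return {
        add => sub {
            push @entries, join('', @_);
            shift @entries while @entries > $max;
            return scalar @entries;
        },
        dump => sub { return join("\n", @entries) },
    };
}

my $log = make_ring_log(3);
$log->{add}->("record $_") for 1 .. 5;

# On failure, the retained tail of the history can go into the die message:
#   die "dot failed: $err\n" . $log->{dump}->() . "\n";
print $log->{dump}->(), "\n";
```

Memory use stays constant however long the program runs, and nothing touches the disk until something goes wrong.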

        Hi,

        I do see the problem of logging everything. But in your case I'm pretty sure that you have to see the whole process up to the point where the argument list is too long. I guess that there is some weird case of accumulation, as I said before.

        Anyway, IMHO you don't have to change the code of IPC::Run, as I'm sure you could catch the argument list this way (only an example):

        use Carp qw(confess);    # confess() comes from Carp

        my $gv = GraphViz2->new(%args);
        eval { $gv->run(); };
        if (my $rc = $@) {
            print STDERR "argument string: " . $gv->dot_input . "\n";
            confess $rc;
        }

        I'm really curious whether you catch your bug. Perhaps you have the time to tell us.

        Happy digging
        McA

Re: how to localise a problem?
by blue_cowdawg (Monsignor) on Nov 26, 2012 at 14:57 UTC

    Where I work we have a script running in production that is key to the operation of one of our products. Many moons ago an issue was discovered that caused it to grow in memory (I am not privy to what that was) until it eventually SIGSEGV-ed and died. The solution put in place was for the script to inspect how much memory it had consumed and, when it went beyond a certain limit, it spawned and died.

    Just a thought...
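A self-check of this kind might be sketched as follows. This is a guess at the general shape, not the production script described above. It assumes Linux (it reads /proc/self/status), it uses exec to replace the process rather than spawning a child, and `rss_kb_from_status`, `current_rss_kb`, `restart_if_bloated` and the resume argument are all hypothetical.

```perl
use strict;
use warnings;

# Extract the resident set size in kB from the text of /proc/<pid>/status.
# Returns undef if no VmRSS line is present.
sub rss_kb_from_status {
    my ($status_text) = @_;
    return $status_text =~ /^VmRSS:\s+(\d+)\s+kB/m ? $1 : undef;
}

# Read this process's own status file (Linux-specific).
sub current_rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    local $/;    # slurp mode
    my $text = <$fh>;
    close $fh;
    return rss_kb_from_status($text);
}

# Replace the running process with a fresh copy once it grows too large.
# $resume_from is a hypothetical "start at this record" argument.
sub restart_if_bloated {
    my ($limit_kb, $resume_from) = @_;
    my $rss = current_rss_kb();
    return 0 unless defined $rss && $rss > $limit_kb;
    exec $^X, $0, $resume_from or die "re-exec failed: $!";
}

# In the main loop, after finishing each row (names are illustrative):
#   restart_if_bloated(2_000_000, $row_id);
```

Since the original program already restarts cleanly from a given database record, a periodic restart like this would at least work around the crash while the root cause is being hunted.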


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: how to localise a problem?
by McA (Priest) on Nov 26, 2012 at 14:40 UTC

    Hi,

    That looks like something is accumulating over time. So I would look at objects and variables which are used more than once over that time. Probably they are not initialized properly on every run. Another guess is that you expect an object to be shut down correctly when it goes out of scope, but it isn't. Check the manual of the "bigger" objects you use.

    UPDATE: I just looked at the GraphViz2 (2.06) package. It seems that there is exactly one line where IPC::Run is used:

    $self -> dot_input(join('', @{$self -> command -> print} ) . "}\n");
    $self -> log(debug => $self -> dot_input);

    my($stdout, $stderr);

    IPC::Run::run([$driver, "-T$format"], \$self -> dot_input, \$stdout, \$stderr);

    So, the dot_input is logged if logging is enabled. Probably the right point to catch the culprit.

    Best regards
    McA

      Hi McA, thanks for your input. You were quite right about hacking Graphviz vs IPC::Run. I made a change almost identical to what you suggested. Now I'm just waiting for another crash.

      Uninitialised or left-over objects is usually my first thought with this kind of problem (don't ask me how I know!) but I'm not seeing any signs of steady process growth. It's also a rather odd method of crashing for an OOM fault, although it is still a possibility.

      So I'll see what the next lot of diagnostics shows me. Hopefully I won't have to iterate too many times until I find the root cause.

        Have you found the bug? I'm really curious.

        Best regards
        McA

Re: how to localise a problem?
by greengaroo (Hermit) on Nov 26, 2012 at 15:10 UTC

    I suggest you create Unit Test scripts using Test::More. You could create different scenarios and test the functions from your modules.

    The error "Argument list too long" sounds like you pass too many arguments to a function. Have you tried a grep to find where this exact error message comes from? Maybe it comes from one of the CPAN modules you are using!

    If you find the source of the message, then look in your code where you call that function and put some extra validation around it.

    Testing never proves the absence of faults, it only shows their presence.
Re: how to localise a problem?
by pvaldes (Chaplain) on Nov 27, 2012 at 10:56 UTC

    Graphviz is normally unable to manage very big plots, and crashes in this situation with a similar error message. This software does a great deal of repeated computation, eating a lot of memory.

    Can we see some code? You probably are using the wrong graphviz option

    The binary is /usr/bin/dot

    OK, enough: wrong option. If your plot has more than 100 nodes, try sfdp instead.

      Thanks for that. Do you have any links to reports about Graphviz crashes with similar messages? That could be very helpful.

      Most of my graphs are pretty small (less than ten nodes). And which one fails seems to be random, and more importantly, it works perfectly and is not unusual when it is retried. So I don't think it is a data-dependent Graphviz error. The dot files look sensible.

      FWIW, here's where I mess with Graphviz

          $self->{graph} = GraphViz2->new(
                          global  => {
                              name    => 'fold_graph_map',
                          },
                          node    => {
                              fontsize => 10,
                              shape    => 'box',
                          },
                      );
      
      stuff adding nodes and edges omitted
      
          eval {
              $self->{graph}->run(format  => 'png');
              $png   = $self->{graph}->dot_output();
          };
          die "Failed to create png for node #$id:\n$@\n"
              . $self->{graph}->dot_input()
            if $@;
      
Re: how to localise a problem?
by Anonymous Monk on Nov 26, 2012 at 14:54 UTC
Re: how to localise a problem?
by pvaldes (Chaplain) on Nov 27, 2012 at 19:52 UTC

    Dot should be perfectly happy with ten nodes... and also with forty

    In order to isolate the problem change this:

        eval {
            $self->{graph}->run(format => 'png');
            $png = $self->{graph}->dot_output();
        };

    To this:

        eval {
            $self->{graph}->run(format => 'svg');
            $png = $self->{graph}->dot_output();
        };

    And show us the result (or at least the first and last lines)