
panic: COND_DESTROY(6)

by menth0l (Monk)
on Jan 26, 2012 at 14:06 UTC (#950091, perlquestion)
menth0l has asked for the wisdom of the Perl Monks concerning the following question:

I have a multi-threaded application which terminates unexpectedly with the following message:
panic: COND_DESTROY(6)
Today it occurred after approximately 3 hours of work, but sometimes it takes 3 days to crash. I don't know how to trace the source of this error. What should I look for? When does this sort of message occur? Google says it's related to the threads C code, but that's as far as I can get.

Any ideas how to get rid of it? Or maybe some clues where to start searching?

Am I thinking right that the "6" is the TID of the thread that crashed? This would make it a little easier to find...

UPDATE:
OS: Windows Server 2008
Perl: ActiveState Perl 5.12.4
threads 1.86
threads::shared 1.40

Re: panic: COND_DESTROY(6)
by BrowserUk (Pope) on Jan 26, 2012 at 14:35 UTC

    You could provide a little more info perhaps?

    1. OS?
    2. Perl version?
    3. threads version?
    4. threads::shared version?

        Am I thinking right that the "6" is the TID of the thread that crashed?

    More likely the number is the numeric error code. On Windows that would be "invalid handle" returned from the attempt to close the semaphore associated with a threads::shared condition variable:

    #define COND_DESTROY(c) \
        STMT_START { \
            (c)->waiters = 0; \
            if (CloseHandle((c)->sem) == 0) \
                Perl_croak_nocontext("panic: COND_DESTROY (%ld)",GetLastError()); \
        } STMT_END

    Of course it might mean something different on other OSes.

    Your best bet would be to post the code, assuming it isn't too large or proprietary or require too much in the way of unique set-up.

    If it is, then try to reduce as much as possible whilst still having the error occur. (I appreciate that can be difficult with transient errors like this.) But it will be very hard to advise without sight of the code in question.

    If it is the invalid handle problem, the most likely cause is the handle being closed twice, but working out how that might occur will require sight of the code.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I've just updated my post with OS info and such. Tomorrow I'll respond to the rest of your post.
        From root update:
        Windows 2007 Server

        Um. Did you mean Windows Server 2008?

      I can't really put the code here since I'm bound by my company's policy.

      But maybe there is another way around this. Someone suggested that this may be related to semaphores in my code. But I don't use semaphores, only locks (I'm locking an Object::InsideOut type object). I assume that perlish locks are implemented using low-level semaphores?
        But I don't use semaphores, only locks (I'm locking an Object::InsideOut type object). I assume that perlish locks are implemented using low-level semaphores?

        Yes. A condition variable is a C struct containing a count of the threads waiting, and a semaphore handle:

        typedef struct win32_cond {
            LONG   waiters;
            HANDLE sem;
        } perl_cond;

        When a condition variable is garbage collected (DESTROYed), the semaphore handle is closed, then the memory for the struct is freed. The panic you are seeing is occurring when the attempt to close the semaphore handle fails. The only way I can see this happening is if there is a second attempt to DESTROY a condition variable that has previously been destroyed.

        That would put the root cause of the problem outside the realm of your code and firmly within Perl/threads::shared. But that doesn't help you solve or work around your problem; nor does it give the maintainers any clue as to the circumstances under which the bug is occurring.

        The only long-term viable way forward that I see, is for you to remove as much of the proprietary code and dependencies from the code as you can, whilst retaining the flow that causes the bug to occur, and then post that. Odds are that this would allow us to find a workaround that you could fold back into your proprietary code; and give the maintainers a testcase on which to base a future fix.

        Looking at the change history for threads::shared, there were changes relating to shared object destruction in the latest build (which you are using), and earlier in version 1.33. My first step would be to downgrade threads::shared on your installation to version 1.32 and see if that 'fixes' the problem.

        But for a long term fix, you should really consider trying to come up with a cut-down testcase for the problem, that you have permission to publish. (The smaller the better!).
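To confirm which versions are actually loaded before and after a downgrade like the one suggested above, a short check along these lines may help. It is only a sketch: it assumes a perl built with thread support, and it reports whichever copies of the modules that perl finds first in @INC.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Report the interpreter and module versions actually in use.
my $report = sprintf "perl %vd, threads %s, threads::shared %s",
    $^V, threads->VERSION, threads::shared->VERSION;
print "$report\n";
```

Running it with the same perl binary that runs the application matters, since a machine can have several installations with different module versions.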



        I can't really put the code here since I'm bound by my company's policy.

        Could you reduce your code to a minimal version that demonstrates the problem, is short enough to post, and contains no proprietary information?

        For example, about a week ago, I also posted a question relating to threads. The initial problem I saw was in a big and secret perl script, that I would definitely not be allowed to post, but I reduced the script by removing & commenting out big blocks of code until I was left with a 25 line script that demonstrated the problem.

        That script contains nothing secret so there is no problem posting it, and also it is much shorter so it is easy for our fellow monks to understand the problem.

Re: panic: COND_DESTROY(6)
by choroba (Abbot) on Jan 26, 2012 at 14:37 UTC
    Is the application written in Perl? If yes, can you show how threads are handled in the code?
Re: panic: COND_DESTROY(6)
by sundialsvc4 (Monsignor) on Jan 26, 2012 at 14:53 UTC

    Superficial googling suggests that cond_destroy is a Unix system call which, per the documentation, is expected to return zero, with any nonzero value indicating some kind of error. We may presume (guess...) that it is instead returning 6. The documentation unfortunately does not go on to list the possible values. Perhaps you can chase down the OS source code of the call handler or otherwise find a detailed explanation of each return code, but perhaps the discussion in the man page alone will provoke some worthy ideas. The fact that the problem occurs rarely-but-consistently pretty much establishes that what you have is some kind of rare-but-possible race condition bug in your code, which is “a one-in-a-million chance, but it is executed millions of times.” Sux, but that’s always the way that such things are. You didn’t need all that extra hair, anyway . . .

      The fact that the problem occurs rarely-but-consistently pretty much establishes that what you have is some kind of rare-but-possible race condition bug in your code, ...

      More guff. Panics are usually indicative of a bug in perl. User code should not be able to cause them.

      When are you going to get bored of spouting nonsense?



        I admit that it is becoming quite entertaining to provoke you.   It’s always so consistently successful...   :->

        I ring the little bell, you jump.   Every time.

Re: panic: COND_DESTROY(6)
by sundialsvc4 (Monsignor) on Jan 31, 2012 at 15:02 UTC

    Addressing now the point of the original poster, and speaking to the original poster not the esteemed monk, BrowserUK, I suggest that enough evidence has been gleaned from the various (useful...) responses to this thread to point the way to sleuthing out the deficiency in the design of the original program. There is a timing hole, somewhere, in the application, and the objective is to solve the damn thing and make the application work properly. (The Perl Gods can do their magic on their own time; meanwhile, one must work with what one has.)

    In my experience, the most probable cause of such a “once in a million” issue has to do with the order in which condition-variables are asserted and released.   In much the same way that the Linux kernel stipulates that you must obtain this lock before you may obtain that one, the mutual-exclusion controls within the application should be arranged in a definite hierarchy.   Each of the alternative paths through the application which employ mutual-exclusion must be hand-examined in this way.   Furthermore, if you find yourself releasing one condition and in the very next statements grabbing another one, this practice probably should be avoided:   devise some exclusion control that covers both.
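The lock-hierarchy idea described above can be sketched with threads::shared as follows. This only illustrates consistent lock ordering in general; it is not a diagnosis of the panic in this thread. The names %accounts, %audit and transfer() are hypothetical, and the sketch assumes a perl built with thread support.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical shared resources with a fixed rank:
# %accounts (rank 1) is always locked before %audit (rank 2).
my %accounts :shared;
my %audit    :shared;
%accounts = (a => 1000, b => 0);
%audit    = (count => 0);

sub transfer {
    my ($from, $to, $amount) = @_;
    lock(%accounts);    # rank 1: always taken first
    lock(%audit);       # rank 2: only taken while holding rank 1
    $accounts{$from} -= $amount;
    $accounts{$to}   += $amount;
    $audit{count}++;
    return;             # both locks are released when this scope ends
}

# Four workers, each performing 100 transfers of 1 unit.
my @workers = map {
    threads->create(sub { transfer('a', 'b', 1) for 1 .. 100 })
} 1 .. 4;
$_->join for @workers;

print "a=$accounts{a} b=$accounts{b} transfers=$audit{count}\n";
```

Because every code path takes the two locks in the same order and holds both for the whole update, no thread can hold one lock while waiting on a thread that holds the other.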

    Mutual exclusion mechanisms cover two distinct but useful purposes:   not only do they regulate simultaneous access to a single atomic resource, but they also and more usefully can be employed to compel programs that need to shuffle between several resources to do that “shuffling” in only certain selected code-paths and therefore only in a known-in-advance timing sequence.   If x code-paths are manipulating y resources, then you can wind up with x^y possible combinations between them, and that’s just too many possibilities to manage.   Select a handful of reasonable sequences for the work, and oblige the programs who are doing it to grab some kind of semaphore to serialize their passage through it.

    Even if the mutual-exclusion tools are “buried” within nice, safe, well-tested (as they certainly are...) perlguts, the essential principle remains: you have an application to write. Even if some bizarre, not yet found bug still exists in those “guts,” you have to devise this application so as to stay well clear of any of them. HTH.

      Maybe you're right. I added tons of logging next to each lock occurrence to see at which point my app fails. I found that each time it was crashing near a call to this function:
      sub UnshareHash {
          my $reference = shift;
          lock $reference if is_shared($reference);
          given (ref $reference) {
              when ('HASH')  { return { map UnshareHash($_), %{$reference} } }
              when ('ARRAY') { return [ map UnshareHash($_), @{$reference} ] }
              when ('REF')   { return \UnshareHash($$reference) }
              default        { return $reference }
          }
      }
      I have a configuration object shared between threads which sometimes need to clone/unshare some part of it using the function above. I've changed the function to lock only the top-level structure:
      sub UnshareHash {
          my $reference = shift;
          my $deep = shift;
          lock $reference if is_shared($reference) and not $deep;
          given (ref $reference) {
              when ('HASH')  { return { map UnshareHash($_, 1), %{$reference} } }
              when ('ARRAY') { return [ map UnshareHash($_, 1), @{$reference} ] }
              when ('REF')   { return \UnshareHash($$reference, 1) }
              default        { return $reference }
          }
      }
      For now it looks promising: my app has been running for about 40 hours straight. Before that, the crash happened after a few hours at most, sometimes after a few minutes. But that may be just a coincidence; I'll have to wait some more time.

      But if it happens to be true (i.e. UnshareHash() is the culprit) then I assume that recursive locking is the problem? That would be a bug in threads::shared, wouldn't it?
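For readers who want to try the same top-level-only locking approach without given/when, a minimal sketch follows. The helper name unshare_copy() and the sample configuration are hypothetical, not taken from the original application. The sketch assumes a perl built with thread support and assumes the caller takes a single lock on the outermost structure.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: recursively copy a (possibly shared) structure
# into plain, unshared data. It takes no locks itself; the caller is
# expected to hold one lock on the top-level structure.
sub unshare_copy {
    my ($data) = @_;
    my $type = ref $data;
    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = unshare_copy($data->{$_}) for keys %$data;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { unshare_copy($_) } @$data ];
    }
    if ($type eq 'REF' or $type eq 'SCALAR') {
        my $inner = unshare_copy($$data);
        return \$inner;
    }
    return $data;    # plain scalars (and anything else) are returned as-is
}

# Hypothetical shared configuration.
my $config = shared_clone({
    name  => 'demo',
    ports => [ 80, 443 ],
    db    => { host => 'localhost' },
});

my $plain;
{
    lock($config);                     # one lock, on the top level only
    $plain = unshare_copy($config);
}                                      # lock released here

print is_shared($plain) ? "still shared\n" : "unshared copy\n";
```

Holding only the outer lock protects the copy as long as every writer also locks the top-level structure before changing any nested part of it; that is an assumption about the rest of the application, not something this sketch enforces.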
