Bugs like this one ... ahem ... suck large. But bugs they are, and your application as written will never work reliably in production.
Maddeningly, it will be “tantalizingly close.” You’ll spend a lot of time trying to convince yourself that the existing design can be made to work. But these flaws are structural. Patches and workarounds, like disabling signals and whatever else you might now be doing, only (I am very sorry to say) postpone the inevitable.
“Because I have been there too, my son... And a more unpleasant place to be, might never be found on this world.”
Perhaps you can isolate the necessary disciplines into a few base classes. Or maybe you can just change the routine that hands off the work to the child threads. (Ideally, each child would simply pull data off one queue and put its results onto another.) So the code changes might not be as dramatic as you fear. But until the changes are made, the code will continue to throw sporadic errors under production loads (and maybe nowhere else).
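For what it's worth, here is a rough sketch of that two-queue hand-off. It is in Python purely for illustration, since I don't know what your code is written in, and every name in it (`worker`, `run_jobs`, `STOP`, `handler`) is made up by me rather than taken from your application.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel meaning "no more work for you"

def worker(jobs, results, handler):
    """Pull items off `jobs`, process them, push outcomes onto `results`."""
    while True:
        item = jobs.get()
        if item is STOP:
            break
        try:
            results.put((item, handler(item), None))
        except Exception as exc:
            # Report the failure as data instead of letting the thread die.
            results.put((item, None, exc))

def run_jobs(items, handler, n_workers=4):
    """Fan `items` out to worker threads; return (item, value, error) tuples."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results, handler))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(STOP)  # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    out = []
    while not results.empty():  # safe here: all producers have finished
        out.append(results.get())
    return out
```

The point is that the children share nothing except the two queues, so nothing else in the program needs a lock. Results come back in whatever order the workers finish, which is why each tuple carries its original item.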
You definitely want to implement the solution so that it is nearly invisible: encapsulated in a class, so that every other part of the application can just ignore it... It Just Works™. And I’d say that you can do that fairly easily. Once it is done, the app should be rock-steady under production load.
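To give an idea of what “nearly invisible” could look like, here is a small sketch of such a class, again in Python only as an illustration. `WorkerPool`, `submit`, `drain`, and `close` are names I invented for this example; the assumption is that one owning thread submits work and collects results, while the queues and threads stay private.

```python
import queue
import threading

class WorkerPool:
    """Hypothetical facade: callers see submit() and drain(), never the queues."""

    _STOP = object()  # private sentinel that tells a worker to exit

    def __init__(self, handler, n_workers=4):
        self._jobs = queue.Queue()
        self._results = queue.Queue()
        self._handler = handler
        self._pending = 0  # touched only by the owning thread
        self._threads = [threading.Thread(target=self._run, daemon=True)
                         for _ in range(n_workers)]
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            item = self._jobs.get()
            if item is self._STOP:
                return
            try:
                self._results.put((item, self._handler(item), None))
            except Exception as exc:
                # Hand the error back as data so the worker keeps running.
                self._results.put((item, None, exc))

    def submit(self, item):
        self._pending += 1
        self._jobs.put(item)

    def drain(self):
        """Block until every submitted item has produced a result."""
        out = []
        while self._pending:
            out.append(self._results.get())
            self._pending -= 1
        return out

    def close(self):
        for _ in self._threads:
            self._jobs.put(self._STOP)
        for t in self._threads:
            t.join()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

The rest of the application would then write something like `with WorkerPool(process_record) as pool:`, call `pool.submit(...)` and `pool.drain()`, and never touch a thread or a lock directly (`process_record` being whatever your per-item routine is). In Python specifically you would probably just use the standard `concurrent.futures.ThreadPoolExecutor`, which already does this; the hand-rolled version is only there to show the shape.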