My admittedly dim understanding of this requirement is that it would present a nearly unsolvable synchronization problem, one that would more or less defeat the purpose of using threads in the first place – and that would likely get worse as the number of threads increased.
I suggest structuring the problem so that each worker thread constructs some part of the final structure, freezes it, and sends the frozen result to the parent through a thread-safe queue. The parent – which is the only thread with access to the eventual data structure – then thaws each message and applies it to the structure. There should be no need for locking or thread synchronization, and the workers should never refer in any way to the structure they are contributing to. The threads simply burn the available CPU resources, and each one does so without contention.
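A minimal sketch of that design, in Python (names and the workload are hypothetical): workers build their pieces in isolation and "freeze" them with `pickle`; the parent, sole owner of the final structure, "thaws" and merges each one.

```python
import pickle
import queue
import threading

NUM_WORKERS = 4          # easily adjustable; tune empirically
results = queue.Queue()  # thread-safe channel from workers to parent

def worker(chunk):
    # Build this worker's piece in isolation -- no shared state.
    part = {n: n * n for n in chunk}
    # "Freeze" it and hand it to the parent.
    results.put(pickle.dumps(part))

chunks = [range(i, 100, NUM_WORKERS) for i in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Only the parent touches the final structure, so it needs no lock.
final = {}
while not results.empty():
    final.update(pickle.loads(results.get()))
```

The key property is that `final` is touched by exactly one thread; the queue is the only shared object, and it is already thread-safe.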
Finally, the total number of threads should (a) be easily adjustable, and (b) be appropriate to the hardware capabilities of the underlying system. (Instead of 40 threads, how about 4?) Threading divides the per-core CPU resource. The only way to find the “sweet spot” is through empirical measurement, bearing in mind that a “thrash point” lies just beyond it. The “thrash point” elbow is not pretty: the performance curve abruptly goes from mostly linear to exponentially awful. I feel fairly certain that 40 threads will run smack-dab into it, even on modern hardware. Or they will simply give the disk drive a very inefficient workout, whether or not it is solid-state.
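One way to make the count both adjustable and hardware-aware (a sketch; the cap of 8 is an arbitrary starting point, not a recommendation) is to derive the default from the machine itself and then measure from there:

```python
import os

# Start near the hardware's parallelism, then tune empirically.
# os.cpu_count() can return None, so fall back to a small default.
num_workers = min(os.cpu_count() or 2, 8)
```

The point is only to start the empirical search near the core count rather than at an arbitrary figure like 40.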