|Problems? Is your data what you think it is?|
Re: dynamic number of threads based on CPU utilizationby BrowserUk (Pope)
|on Sep 26, 2012 at 15:44 UTC||Need Help??|
In addition to trying to figure out why the CPU utilization is not staying constant, I would like to find a way to dynamically add threads based on CPU utilization such that more work can be done when the CPU load falls below a certain threshold (arbitrarily, let's say 25%).
Forget the idea of throwing more threads at the problem in order to increase your cpu utilisation.
trying to figure out why the CPU utilization is not staying constant ...
In the first half of your program, less than 100% utilisation is to be expected as a) you are doing disk IO, reading the XML files from disk; b) writing to your shared %similar hash which will involve locking -- whether you are doing it; which you do not show -- or just using the implicit locking that perl does internally for its own protection.
In the second half of your program, you are a) iterating over that same shared hash, which means that internal locking comes into play. (But that's probably not a big problem for that part of your program).
Much worse is that each of your second pool of threads is performing the same bi-gramming process on each of the keys of the shared hash, for each work item it receives. AND EACH OF THOSE WORK ITEMS IS ITSELF, ONE OF THE KEYS OF THAT HASH!
That means you are doing a O(n^2) process of comparing each key against each key, and bi-graming both keys again every time. That is monumentally wasteful.
If you fix that algorithm -- ie. bi-gram each of the keys in the hash once first -- and then compare them each against the next, a single thread will probably finish much faster than your current mechanism. At which point, all the talk of dynamically spawning threads according to cpu load becomes moot.
Once you've you have made your similarity algorithm implementation efficient -- single-threaded -- it might then be worth considering whether if can be sped up using threading. But for now, ditch your second thread pool, use a single threaded loop to construct bi-gram arrays for each of your keys; and then use a second loop to compare one against the other.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.