Re^11: statistics of a large text
by BrowserUk (Pope) on Feb 10, 2011 at 18:27 UTC
Hm. You're doing rather more than just creating two big hashes, aren't you?
There are also some pretty iffy programming practices in your code that mean you're using far more memory than you need to.
For example, in the routine Intersection(), you build two arrays, @intersection and @difference. But you never use the latter, so why construct it?
And then, when you return @intersection to the caller, you return it as a list and assign it to an array in the caller. It's not possible to tell by inspection how big that array is, but by returning it this way you are consuming at least 3 times as much memory as necessary: the memory for the array inside the subroutine; as much again (and more) to flatten it into a list on the stack; and the same again when you assign it to the final array. If you returned a reference to the array instead, you would avoid all that duplication, for what is a very minor change in syntax at the point of use.
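A minimal sketch of the difference (the subroutine names and data here are stand-ins, not your actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Returning the array as a list: the data is flattened onto the
## stack and then copied again into the caller's array.
sub intersection_by_list {
    my @intersection = ( 1 .. 5 );   # stand-in for the real computation
    return @intersection;            # flattened to a list: extra copies
}

## Returning a reference: the array is built once and only a single
## scalar (the reference) travels back to the caller.
sub intersection_by_ref {
    my @intersection = ( 1 .. 5 );
    return \@intersection;
}

my @copy = intersection_by_list();   # duplicates all the data
my $ref  = intersection_by_ref();    # shares it

print scalar @copy, "\n";            # 5
print scalar @$ref, "\n";            # 5 -- @$ref dereferences the array
```

At the point of use, the only change is writing @$ref (or $ref->[ $i ] for an element) instead of a plain array name.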
Another problem with Intersection() is the way you pass data into it. You pass it two arrays and assign them to two arrays inside, but all the data from both arrays will end up in the first of them. Ie:
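A small sketch of the problem and the fix (hypothetical names; your Intersection() will differ in detail):

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Broken: the first array in the assignment slurps the entire
## flattened argument list; the second is always left empty.
sub broken {
    my ( @first, @second ) = @_;
    return ( scalar @first, scalar @second );
}

## Fixed: pass references instead, so each array keeps its identity
## and nothing is copied.
sub fixed {
    my ( $first, $second ) = @_;     # two array refs
    return ( scalar @$first, scalar @$second );
}

my @a = ( 1, 2, 3 );
my @b = ( 4, 5 );

my ( $n1, $n2 ) = broken( @a, @b );
print "$n1 $n2\n";                   # 5 0 -- all five elements landed in @first

my ( $m1, $m2 ) = fixed( \@a, \@b );
print "$m1 $m2\n";                   # 3 2 -- each array arrives intact
```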
This is especially wasteful as all you do with the returned array is use its size! Even more wasteful when you realise that the two arrays you pass into Intersection() are derived from the keys in two hashes.
All you really want from all that expensive (time & space) processing is a count of the keys that are common to both hashes. And that can be done far more economically and simply. This refactoring of MI() will save time and space and allow you to discard Intersection() completely:
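Since I don't have your MI() in front of me, here is a sketch of the key idea only: counting the common keys directly from two hash references, with no intermediate arrays at all (the name common_key_count() is my invention):

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Count the keys common to two hashes without building any
## intermediate arrays of keys or an intersection array.
sub common_key_count {
    my ( $h1, $h2 ) = @_;                        # two hash refs
    return scalar grep { exists $h2->{ $_ } } keys %$h1;
}

my %x = ( a => 1, b => 2, c => 3 );
my %y = ( b => 9, c => 8, d => 7 );

print common_key_count( \%x, \%y ), "\n";        # 2 -- 'b' and 'c' are shared
```

One pass over the smaller hash's keys is all the "intersection" you need when only the count matters.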
And don't you find your code more readable with the extra horizontal white-space?
There are similar problems with to_hash().
You build a huge hash inside a subroutine and then return it to the caller by flattening it to a (huge) list, from which you then have to rebuild the huge hash. Not only is this very expensive in terms of CPU, it requires at least 3 times as much memory as necessary to hold the hash.
The simplest way to avoid that with minimal changes to your existing code is to pass a reference into your subroutine and have it populate it:
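Along these lines (the body here is illustrative; to_hash() stands in for your routine, whatever it actually builds):

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Instead of building a hash in the subroutine and returning it as a
## flattened list, let the caller own the hash and pass a reference
## in for the subroutine to populate.
sub to_hash {
    my ( $href, @words ) = @_;
    ++$href->{ $_ } for @words;      # populate in place; nothing copied back
    return;                          # nothing needs to be returned
}

my %counts;
to_hash( \%counts, qw( the cat sat on the mat ) );

print "$counts{ the }\n";            # 2
```

Only one copy of the hash ever exists, and the subroutine's interface barely changes.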
I suspect that if you made those changes to your subroutines, you might find that it would run without blowing your memory.
If not, then you are processing what I suspect is a very large XML document using a parser that stores everything in memory. I've no personal experience of XML::LibXML, but DOM-style parsers are notorious for consuming prodigious amounts of memory. Switching to the excellent XML::Twig, which is specifically designed for handling huge XML documents in a modest amount of memory, might help.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.