Re^3: to thread or fork or ? by BrowserUk (Pope)
on Oct 19, 2012 at 04:59 UTC
Split your bigfile, by size, across N machines.
Have a process on each machine that processes a filesize/N chunk of the bigfile. (Say, 32 machines each reading a different 32GB chunk of your 1TB file.)
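A rough sketch of how each reader might work out its own chunk. The names chunk_range and read_chunk_lines are made up for illustration. It also assumes the file is line-oriented, so a reader that starts mid-line skips ahead to the next newline and the reader before it finishes that line. None of that is spelled out above, so treat it as one possible way to do it:

```perl
use strict;
use warnings;

# Byte range [start, end) for reader $i (0-based) of $n over $size bytes.
# The last reader absorbs any remainder so the whole file is covered.
sub chunk_range {
    my ($size, $n, $i) = @_;
    my $chunk = int($size / $n);
    my $start = $i * $chunk;
    my $end   = $i == $n - 1 ? $size : $start + $chunk;
    return ($start, $end);
}

# Call $on_line for every line that *begins* inside [start, end).
# Assumption: lines belong to the reader whose range holds their first byte.
sub read_chunk_lines {
    my ($path, $start, $end, $on_line) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    if ($start > 0) {
        # Back up one byte and discard through the next newline. If a line
        # begins exactly at $start, only the previous newline is discarded.
        seek($fh, $start - 1, 0) or die "seek: $!";
        my $partial = <$fh>;
    }
    while (tell($fh) < $end) {
        my $line = <$fh>;
        last unless defined $line;
        $on_line->($line);
    }
    close $fh;
}

# Reader 5 of 32 over a 1TB (2**40 byte) file:
printf "reader 5 reads bytes %.0f up to %.0f\n", chunk_range(2**40, 32, 5);
```

Each reader only needs to know the file size, N and its own index, so there is nothing to coordinate between machines at this stage.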
Each reader accumulates word counts in a hash until the hash size approaches its memory limit.
(Assume ~1.5GB/10 million words/keys on a 64-bit Perl; somewhat less on a 32-bit.)
When that limit is reached, it posts (probably via UDP) the word/count pairs out to the appropriate accumulator machines, frees the hash, and continues reading the file from where it left off.
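Something along these lines would do for the reader loop. It is a sketch only: count_and_flush and accumulator_for are hypothetical names, words are taken to be whitespace-separated, and the limit is expressed as a number of keys rather than bytes. How a word maps to "the appropriate accumulator" is not specified above, so the MD5-modulo routing here is purely an assumption. The actual send is left as a callback so the example runs without a network:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Assumed routing rule: the same word always goes to the same accumulator.
sub accumulator_for {
    my ($word, $n_acc) = @_;
    return unpack('N', md5($word)) % $n_acc;
}

# Count words from $fh. Whenever the hash reaches $max_keys distinct words,
# pass the partial counts to $send (one call per accumulator) and empty it.
sub count_and_flush {
    my ($fh, $max_keys, $n_acc, $send) = @_;
    my %count;
    my $flush = sub {
        my @batch;
        while (my ($word, $n) = each %count) {
            push @{ $batch[ accumulator_for($word, $n_acc) ] }, "$word\t$n";
        }
        for my $acc (0 .. $#batch) {
            $send->($acc, $batch[$acc]) if $batch[$acc];
        }
        %count = ();    # free the hash before reading on
    };
    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            $count{$word}++;
            $flush->() if keys(%count) >= $max_keys;
        }
    }
    $flush->();         # whatever is left at the end of the chunk
}

# Tiny demonstration on an in-memory file, with a limit of 3 keys.
my $text = "the cat and the hat and the bat\n";
open my $fh, '<', \$text or die "open: $!";
count_and_flush($fh, 3, 2, sub {
    my ($acc, $pairs) = @_;
    print "to accumulator $acc: ", join(', ', @$pairs), "\n";
});
```

In a real reader the callback would presumably wrap a socket (for instance IO::Socket::INET opened with Proto => 'udp') and pack the pairs into datagram-sized messages. The accumulators just add each count they receive to their running totals, so a word that is flushed several times still ends up with the right sum.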
No threading, shared memory or locking required. Simple to set up and efficient to process.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.