in reply to Re^2: to thread or fork or ?
in thread to thread or fork or ?
An architecture:
Split your bigfile, by size, across N machines.
Have a process on each machine that processes a filesize/N chunk of the bigfile. (Say, 32 machines each reading a different 32GB chunk of your 1TB file.)
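For what it's worth, the split itself is just arithmetic on the file size. A minimal sketch, assuming 0-based reader numbers and half-open byte ranges; `chunk_range` is a made-up name, not anything from a module. A reader whose range starts or ends mid-word would also need to skip forward to the next whitespace at the start and read past its end to finish its last word, which is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: byte range [start, end) that reader $i (0-based)
# of $n readers should process. The last reader's end is the file size.
sub chunk_range {
    my ($file_size, $n, $i) = @_;
    my $start = int($file_size * $i / $n);
    my $end   = int($file_size * ($i + 1) / $n);
    return ($start, $end);
}

# Example: list the ranges for 32 readers over a 1TB (2**40 byte) file.
# for my $i (0 .. 31) {
#     my ($start, $end) = chunk_range(2**40, 32, $i);
#     print "reader $i: seek to $start, stop at $end\n";
# }
```

Each reader would then `seek` its handle to the start offset and stop once `tell` passes the end offset.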
Each reader accumulates word counts in a hash until the hash size approaches its memory limit.
(Assume ~1.5GB per 10 million words/keys on a 64-bit Perl; somewhat less on a 32-bit.)
When that limit is reached, it posts (probably via UDP) the word/count pairs out to the appropriate accumulator machines, frees the hash, and continues reading the file from where it left off.
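The reader loop described above might look something like this. It is only a sketch under a few assumptions of mine: the limit is expressed as a key count rather than bytes, words are whitespace-separated, the handle is read to EOF rather than to a chunk boundary, and the "appropriate" accumulator is chosen with a cheap byte-sum checksum. `accumulator_for`, `count_words` and the `$flush` callback are hypothetical names; the callback stands in for whatever actually does the sending.

```perl
use strict;
use warnings;

# Hypothetical routing rule: a byte-sum checksum modulo the number of
# accumulators, so a given word always goes to the same box.
sub accumulator_for {
    my ($word, $n_accumulators) = @_;
    return unpack('%32C*', $word) % $n_accumulators;
}

# Count words from $fh. Whenever the hash holds $max_keys distinct words,
# hand one batch per accumulator to $flush, empty the hash, and carry on.
sub count_words {
    my ($fh, $max_keys, $n_accumulators, $flush) = @_;
    my %count;
    my $emit = sub {
        return unless %count;
        my @batches = map { +{} } 1 .. $n_accumulators;
        while (my ($word, $n) = each %count) {
            $batches[ accumulator_for($word, $n_accumulators) ]{$word} = $n;
        }
        $flush->(\@batches);   # assumption: the caller sends these out (e.g. over UDP)
        %count = ();           # free the hash before reading on
    };
    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            $count{$word}++;
            $emit->() if keys(%count) >= $max_keys;
        }
    }
    $emit->();                 # post whatever is left at the end
    return;
}
```

Routing by a function of the word itself means each accumulator ends up with the complete count for its share of the words, so nothing has to be merged between accumulators afterwards.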
- The file is only processed once.
- It can be processed in parallel by as many reader processes/machines as your IO/Disk bandwidth will allow.
- Each reader process only uses whatever amount of memory you decide is the limit.
- You can split the accumulations across as many (or few) boxes as are required.
- As only word/count pairs are exchanged, IO (other than reading the file) will be minimal.
No threading, shared memory or locking required. Simple to set up and efficient to process.
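To show how little goes over the wire, here is one possible encoding of the word/count pairs and the matching merge on the accumulator side. The tab/newline format, the payload size cap, and the names `encode_pairs` and `merge_payload` are all assumptions of mine for illustration. The socket code is left out; each payload would be one datagram.

```perl
use strict;
use warnings;

# Reader side: pack pairs as "word\tcount\n" lines, starting a new payload
# whenever the next line would push the current one past $max_bytes.
sub encode_pairs {
    my ($counts, $max_bytes) = @_;
    my @payloads = ('');
    for my $word (sort keys %$counts) {
        my $line = "$word\t$counts->{$word}\n";
        push @payloads, ''
            if length($payloads[-1])
            && length($payloads[-1]) + length($line) > $max_bytes;
        $payloads[-1] .= $line;
    }
    return grep { length } @payloads;
}

# Accumulator side: add every pair in one payload to the running totals.
sub merge_payload {
    my ($totals, $payload) = @_;
    for my $line (split /\n/, $payload) {
        my ($word, $n) = split /\t/, $line;
        $totals->{$word} += $n;
    }
    return;
}
```

Because the accumulator only ever adds, it does not matter in what order the payloads from different readers arrive.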
Replies are listed 'Best First'.
Re^4: to thread or fork or ?
by sundialsvc4 (Abbot) on Oct 19, 2012 at 14:21 UTC
by BrowserUk (Patriarch) on Oct 19, 2012 at 15:54 UTC