Re: Using threads to process multiple files

by sundialsvc4 (Abbot)
on Jan 30, 2015 at 20:38 UTC (#1115133)

in reply to Using threads to process multiple files

Upvotes all around, sirs ...

Another question that comes to my mind is this: by loading all those files into hashes with a dummy value, it seems clear that your ultimate purpose must be to detect whether a particular value read from the file exists(). Very well then, where exactly are you going with this approach? Is it likely or not so likely that most, if not all, of the keys you have loaded into the hash will actually be searched for and touched? Or is this a requirement that could be met with an SQL JOIN, which is disk-based and index-based? If you are reading enough data into memory right now that the time spent inefficiently copying that data (as BrowserUK excellently describes) is “human noticeable,” and especially if either file is not particularly volatile from run to run, then perhaps a more disk-based solution should be explored as an alternative. Even a tied hash backed by a simple indexed-file structure (BerkeleyDB) would do: it can be referenced, and separately loaded and updated from its “flat file” source, but it is ultimately a file rather than an in-memory data structure.
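As a rough sketch of the JOIN idea, assuming (hypothetically) that each file boils down to a list of keys: the keys go into two indexed tables and the database does the existence test. Python's standard `sqlite3` module is used here purely for illustration; the table names `a` and `b` and the function `common_keys` are made up for this example, and the same thing could be done from Perl through DBI.

```python
import sqlite3

def common_keys(keys_a, keys_b, db_path=":memory:"):
    """Return the keys present in both inputs, found by a SQL JOIN
    rather than by probing an in-memory hash."""
    conn = sqlite3.connect(db_path)
    try:
        # PRIMARY KEY gives each table an index on k, so the JOIN is an index lookup.
        conn.execute("CREATE TABLE IF NOT EXISTS a (k TEXT PRIMARY KEY)")
        conn.execute("CREATE TABLE IF NOT EXISTS b (k TEXT PRIMARY KEY)")
        conn.executemany("INSERT OR IGNORE INTO a (k) VALUES (?)",
                         ((k,) for k in keys_a))
        conn.executemany("INSERT OR IGNORE INTO b (k) VALUES (?)",
                         ((k,) for k in keys_b))
        conn.commit()
        rows = conn.execute(
            "SELECT a.k FROM a JOIN b ON a.k = b.k ORDER BY a.k"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Passing a real file path as `db_path` keeps the tables on disk between runs, which is where the "not particularly volatile from run to run" case might pay off; whether it beats the in-memory hashes depends entirely on the data and the hit pattern.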

Or maybe a hybrid approach: two hashes, one tied to a file, the other a real hash. The algorithm first checks the real hash. If the key is not there yet, it pings the tied file to see whether the key exists. If it does, a value of 1 is inserted into the in-memory hash; if not, a value of 0 is inserted. The in-memory structure therefore acts as a “lookaside buffer,” so that any given key is searched for on disk only once; thereafter the question of whether the key exists (Yes or No) is answered by the in-memory hash.
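A minimal sketch of that lookaside idea, under stated assumptions: Python's standard `dbm.dumb` module stands in for the tied, file-backed hash (in Perl this would be a `tie` to something like DB_File), and the names `LookasideExists`, `disk_probes` and `demo` are hypothetical, invented for this illustration.

```python
import dbm.dumb
import os
import tempfile

class LookasideExists:
    """Answer 'does this key exist?' from memory when possible,
    falling back to the disk-backed index at most once per key."""

    def __init__(self, disk_index):
        self._disk = disk_index   # anything that supports `key in disk_index`
        self._seen = {}           # the in-memory hash: key -> 1 (exists) or 0 (absent)
        self.disk_probes = 0      # how often we actually had to go to disk

    def exists(self, key):
        if key in self._seen:             # check the real hash first
            return self._seen[key] == 1
        self.disk_probes += 1             # not seen yet: ping the file
        found = key in self._disk
        self._seen[key] = 1 if found else 0   # remember misses as well as hits
        return found

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "keys")
        # Build the on-disk index once (stand-in for loading the flat file).
        with dbm.dumb.open(path, "c") as db:
            db["apple"] = "1"
            db["pear"] = "1"
        # Reopen read-only and query through the lookaside buffer.
        with dbm.dumb.open(path, "r") as db:
            lookup = LookasideExists(db)
            answers = [lookup.exists(k) for k in ("apple", "kiwi", "apple", "kiwi")]
            return answers, lookup.disk_probes
```

In `demo()`, four lookups over two distinct keys cost two disk probes, because the repeats are answered from memory. Note that the in-memory hash grows with the number of distinct keys queried, so this only helps when the queried set is much smaller than the full key set or when keys repeat often.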

Many alternatives.   The actual “hit pattern” that these hashes will experience (hit count, probability of hit vs. miss, etc.) during any particular run is definitely an important consideration as you weigh your alternative approaches, as is the cost of loading the data structures before use at the start of each and every run vs. not having to do so.

Replies are listed 'Best First'.
Re^2: Using threads to process multiple files
by BrowserUk (Pope) on Jan 30, 2015 at 22:11 UTC

    Downvoted: Because the OP stated that his non-threaded version works fine; and he's trying to use threading to speed it up.

    How could moving from his two memory-based hashes to disk-based tied hashes "speed things up", when they are at least 1000 times slower?

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^2: Using threads to process multiple files
by eyepopslikeamosquito (Chancellor) on Feb 05, 2015 at 23:11 UTC

    Upvotes all around, sirs ... (as BrowserUK excellently describes)

    Downvotes all around you, sir ... for being obsequious.   Eew'd.    “--”
