Upvotes all around, sirs ...

Another question that comes to mind: by loading all those files into hashes with a dummy value, it seems clear that your ultimate purpose is to detect whether a particular value read from the file exists().   Very well then, where exactly are you going with this approach?   Is it likely or not that most, if not all, of the keys you have loaded into the hash will actually be searched for?   Or is this a requirement that could be met with an SQL JOIN, which is disk-based and index-driven?   If you are reading enough data into memory that the time spent inefficiently copying it (as BrowserUK excellently describes) is “human noticeable,” and especially if either file is not particularly volatile from run to run, then a more disk-based solution should be explored as an alternative.   Even a tied hash backed by a simple indexed file (Berkeley DB, for instance) would do: it can be loaded and updated separately from its “flat file” source, but it is ultimately a file rather than an in-memory data structure.

Or maybe a hybrid approach:   two hashes, one tied to a file, the other a real hash.   The algorithm first checks the real hash.   If the key is not there yet, it probes the tied file to see whether the key exists.   If it does, a value of 1 is inserted into the in-memory hash; if not, a value of 0.   The in-memory structure therefore acts as a “lookaside buffer”: any given key is searched for on disk only once, and thereafter the question of whether the key exists (Yes or No) is answered from memory.
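
Here is a rough sketch of that two-hash idea, not a tuned implementation. To keep it runnable with nothing but core Perl it ties to SDBM_File rather than Berkeley DB (DB_File ties in much the same way); the sample keys, the `key_exists` helper, and the `$disk_probes` counter are all made up for illustration.

```perl
use strict;
use warnings;
use Fcntl;                        # O_RDWR, O_CREAT
use SDBM_File;                    # stand-in for a Berkeley DB file
use File::Temp qw(tempdir);

# Hypothetical setup: build a tiny on-disk index in a temp directory.
my $dir = tempdir(CLEANUP => 1);
my %on_disk;
tie(%on_disk, 'SDBM_File', "$dir/keys", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie SDBM file: $!";
$on_disk{$_} = 1 for qw(alpha beta gamma);

my %seen;             # in-memory lookaside: key => 1 (on disk) or 0 (not)
my $disk_probes = 0;  # how many times we really had to go to the file

sub key_exists {
    my ($key) = @_;
    # Only the first question about a key reaches the tied file;
    # both "yes" and "no" answers are remembered in %seen.
    unless (exists $seen{$key}) {
        $disk_probes++;
        $seen{$key} = exists $on_disk{$key} ? 1 : 0;
    }
    return $seen{$key};
}

my @answers = map { key_exists($_) } qw(alpha delta alpha delta beta);
print "@answers (disk probes: $disk_probes)\n";

untie %on_disk;
```

Note that the caller must test the stored value, not exists() on %seen, since misses are recorded too. Counting the probes against the total number of lookups gives a quick feel for the hit pattern of a real run.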

Many alternatives.   The actual “hit pattern” that these hashes will experience (hit count, probability of hit vs. miss, etc.) during any particular run is definitely an important consideration as you weigh your alternative approaches, as is the cost of loading the data structures before use at the start of each and every run vs. not having to do so.


In reply to Re: Using threads to process multiple files by sundialsvc4
in thread Using threads to process multiple files by anli_
