PerlMonks  

Re^3: Faster grep in a huge file(10 million)

by BrowserUk (Pope)
on May 12, 2013 at 15:04 UTC (#1033207)


in reply to Re^2: Faster grep in a huge file(10 million)
in thread Faster grep in a huge file(10 million)

I'd draw your attention to the first word of both of the sentences you quoted, and also to both the "i.e." and the contraction that follows it.

If the OP's circumstances do not comply with either of those two criteria, then *I* wouldn't use this approach.

But his records might only be 80 characters in size (i.e. < 1GB of data); and if I were purchasing my next machine right now, I wouldn't consider anything with less than 8GB of RAM, preferably 16GB. I'd also be looking at putting in an SSD configured to hold my swap partition, effectively giving me 64GB (or 128GB or 256GB) of extended memory that is a couple of orders of magnitude faster than disk.

So then you are trading 2x O(N log N) processes + merge at disk speeds against a single O(N^2) process at RAM speed. Without the OP clarifying the actual volumes of data involved, there is no way to make a valid assessment of the trade-offs.
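To make the two sides of that trade-off concrete, here is a rough sketch of both approaches. It is in Python purely for illustration; the function names, the `key_of` extractor and the record layout are all assumptions of mine, not anything the OP has described. It runs entirely in memory, so it shows the shape of each algorithm, not the disk-versus-RAM timing.

```python
# Hypothetical sketch: function names and record layout are assumptions,
# not the OP's actual data.

def scan_in_memory(records, keys):
    """Brute force: for every key, scan every record held in RAM.
    Roughly len(keys) * len(records) substring searches."""
    hits = []
    for key in keys:
        for rec in records:
            if key in rec:  # substring search; key position need not be known
                hits.append((key, rec))
    return hits


def sort_merge(records, keys, key_of):
    """Sort both inputs (2 x O(N log N)), then make one merge pass.
    Only works if key_of(rec) can extract exactly one key per record."""
    recs = sorted(records, key=key_of)
    wanted = sorted(keys)
    hits, i = [], 0
    for key in wanted:
        # Skip records whose key sorts before the current wanted key.
        while i < len(recs) and key_of(recs[i]) < key:
            i += 1
        # Collect every record carrying exactly this key.
        j = i
        while j < len(recs) and key_of(recs[j]) == key:
            hits.append((key, recs[j]))
            j += 1
    return hits
```

Note that `sort_merge` needs a working `key_of`, whereas `scan_in_memory` does not; which one wins on time depends entirely on the data volumes and on whether everything fits in memory.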

Also, if they are free-format text records -- i.e. the key is not in a fixed position, or there might be multiple or no keys per record -- sorting them may not even be an option.

Equally, the OP mentioned 'patterns'; if they are patterns in the regex sense of the word, that would exclude using a hash. And if you had to search the records to locate the embedded keys in order to build a hash, you've done 90% of the work of the in-memory method before you've started to actually use the hash.
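A small sketch of why regex patterns rule out the hash approach. Again this is Python for illustration only, and `match_literal`, `match_patterns` and `key_of` are hypothetical names of mine. A hash (here a set) can only answer "is this exact string present?", so it needs a literal key already extracted from each record; a regex has to be run against the record text itself.

```python
import re

# Hypothetical sketch: names and record format are assumptions.

def match_literal(records, keys, key_of):
    """Literal keys in a known place: one hash lookup per record."""
    wanted = set(keys)
    return [rec for rec in records if key_of(rec) in wanted]


def match_patterns(records, patterns):
    """Regex patterns: nothing to hash on, so each pattern is tried
    against each record until one matches."""
    compiled = [re.compile(p) for p in patterns]
    return [rec for rec in records if any(c.search(rec) for c in compiled)]
```

The first function depends on `key_of` being cheap and reliable. If the keys are embedded at arbitrary positions, writing `key_of` means searching each record anyway, which is most of the work of the plain scan.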

The bottom line is, I offered just one more alternative that might make sense -- or not -- given the OP's actual data; and it is up to them to decide which best fits.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

