
Re^2: Faster grep in a huge file(10 million)

by educated_foo (Vicar)
on May 12, 2013 at 12:50 UTC

in reply to Re: Faster grep in a huge file(10 million)
in thread Faster grep in a huge file(10 million)

"If your 12 million records average less than a couple of kbytes each (i.e. if the size of the records file is less than your available memory)"
Is 12GB a normal amount of memory for a single process to use these days? My sense is that 4GB is standard on an entry-level desktop or a mid-level laptop. Even if you have a super-machine with 16GB, you may not want a single process to suck all of that up to run an O(n^2) program. A hash containing the smaller file, or two on-disk sorts, would be a much better option, and neither is that hard to do.
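A minimal sketch of the hash alternative mentioned above: load the smaller file of keys into a hash, then stream the big file once. The filenames and the assumption that the key is the first whitespace-separated field are hypothetical; the OP never specified the record layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: %want holds every key from the smaller file,
# then the big file is read a line at a time, so only the key set
# needs to fit in memory.
sub grep_with_hash {
    my ($keys_file, $big_file) = @_;
    my %want;

    open my $kfh, '<', $keys_file or die "open $keys_file: $!";
    while (my $key = <$kfh>) {
        chomp $key;
        $want{$key} = 1;
    }
    close $kfh;

    my @hits;
    open my $bfh, '<', $big_file or die "open $big_file: $!";
    while (my $line = <$bfh>) {
        chomp $line;
        # Assumes the key is the first whitespace-separated field.
        my ($key) = split ' ', $line;
        push @hits, $line if defined $key && $want{$key};
    }
    close $bfh;
    return @hits;
}
```

Each record in the big file costs one hash lookup, so the scan is linear in the size of the data rather than quadratic in the number of patterns times records.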
Just another Perler interested in Algol Programming.

Re^3: Faster grep in a huge file(10 million)
by BrowserUk (Pope) on May 12, 2013 at 15:04 UTC

    I'd draw your attention to the first word of both of the sentences you quoted, and also to the "i.e." and the clarification that follows it.

    If the OP's circumstances don't meet either of those two criteria, then *I* wouldn't use this approach.

    But his records might only be 80 characters each (i.e. <1GB of data). And if I were purchasing my next machine right now, I wouldn't consider anything with less than 8GB of RAM, preferably 16GB; I'd also look at putting in an SSD configured to hold my swap partition, effectively giving me 64GB (or 128GB, or 256GB) of extended memory that is a couple of orders of magnitude faster than disk.

    So you are trading two O(N log N) sorts plus a merge at disk speed against a single O(N^2) process at RAM speed. Without the OP clarifying the actual volumes of data involved, there is no way to make a valid assessment of the trade-offs.
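The "two sorts plus a merge" half of that trade-off can be sketched as a merge-join over two files that have already been sorted on their leading key (e.g. by the system sort). Filenames, the first-field key convention, and unique keys per file are all assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: read two key-sorted files in lockstep,
# emitting the keys present in both. Each file is read exactly
# once, so this runs at sequential-disk speed in O(N) after the
# two O(N log N) sorts.
sub merge_join {
    my ($a_file, $b_file) = @_;
    open my $afh, '<', $a_file or die "open $a_file: $!";
    open my $bfh, '<', $b_file or die "open $b_file: $!";

    my $ra = <$afh>;
    my $rb = <$bfh>;
    my @matches;
    while (defined $ra && defined $rb) {
        chomp(my $ka = $ra);
        chomp(my $kb = $rb);
        ($ka) = split ' ', $ka;    # key assumed to be the first field
        ($kb) = split ' ', $kb;
        if    ($ka lt $kb) { $ra = <$afh> }
        elsif ($ka gt $kb) { $rb = <$bfh> }
        else {
            push @matches, $ka;
            $ra = <$afh>;
            $rb = <$bfh>;
        }
    }
    return @matches;
}
```

Only one line from each file is in memory at a time, which is what makes this attractive when neither file fits in RAM.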

    Also, if they are free-format text records -- i.e. the key is not in a fixed position, or there might be multiple keys, or none, per record -- sorting them may not even be an option.

    Equally, the OP mentioned 'patterns'; if they are patterns in the regex sense of the word, that would rule out using a hash. And if you had to search the records to locate the embedded keys in order to build a hash, you would have done 90% of the work of the in-memory method before you started actually using the hash.
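To illustrate why regex-style patterns rule out a hash: a hash lookup only matches exact keys, so true patterns force a scan of every record. One common mitigation, sketched here with hypothetical patterns, is to precompile the patterns into a single alternation with qr// so each record is scanned only once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: combine a list of regex patterns into one
# compiled alternation, then stream the big file against it.
# Unlike a hash lookup, every line must still be scanned.
sub grep_with_patterns {
    my ($patterns, $big_file) = @_;
    my $combined = join '|', map { "(?:$_)" } @$patterns;
    my $re = qr/$combined/;

    my @hits;
    open my $fh, '<', $big_file or die "open $big_file: $!";
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ $re;
    }
    close $fh;
    return @hits;
}
```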

    The bottom line is, I offered just one more alternative that might make sense -- or not -- given the OP's actual data; it is up to them to decide which fits best.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
