PerlMonks  

Re: Faster grep in a huge file(10 million)

by BrowserUk (Pope)
on May 10, 2013 at 23:32 UTC ( #1033061 )


in reply to Faster grep in a huge file(10 million)

If your 12 million records average less than a couple of kbytes each (ie. if the size of the records file is less than your available memory), I'd just load the entire file into memory as a single string, then read the circuits file one line at a time and use index to see if each line is in the records:

#! perl -slw
use strict;

my $records;
{   # slurp the entire records file (named on the command line) into one string
    local( @ARGV, $/ ) = $ARGV[0];
    $records = <>;
}

open CIRCUITS, '<', 'circuits' or die $!;

while( <CIRCUITS> ) {
    # index returns -1 on a miss; 1 + index is false only when not found
    unless( 1 + index $records, $_ ) {
        print;
    }
}
__END__
C:\test>1033014 records circuits >notfound
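A side note on the "1 + index" test, as a minimal self-contained sketch (the sample circuit names are invented): index returns -1 when the substring is absent, so adding 1 yields 0 (false) exactly on a miss.

```perl
use strict;
use warnings;

# index returns the position of the substring, or -1 when it is absent,
# so 1 + index(...) is 0 (false) exactly when the line is not found.
# The sample data here is invented for illustration.
my $records = "circuit-A\ncircuit-B\n";

my $hit  = 1 + index $records, "circuit-A\n";   # 1 (position 0, plus 1)
my $miss = 1 + index $records, "circuit-Z\n";   # 0 (-1, plus 1)

print "circuit-Z not found\n" unless $miss;
```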

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Faster grep in a huge file(10 million)
by educated_foo (Vicar) on May 12, 2013 at 12:50 UTC
    If your 12 million records average less than a couple of kbytes each (ie. if the size of the records file is less than your available memory)
    Is 12GB a normal amount of memory for a single process to use these days? My sense was that 4GB was standard on an entry-level desktop or a mid-level laptop. Even if you have a super-machine with 16GB, you may not want to have a single process suck that all up to run an O(n^2) program. A hash containing the smaller file or two on-disk sorts would be a much better option, and not that hard to do.
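    A minimal sketch of the hash option described above, assuming each circuit is an exact line that would appear verbatim as a line of the records file (the filenames and the helper name not_found are placeholders, not the OP's):

```perl
use strict;
use warnings;

# Load the smaller file into a hash, stream the big one past it, and
# report the circuits never seen: O(C + R) time overall, with memory
# proportional to the circuits file only.
sub not_found {
    my ( $circuits_file, $records_file ) = @_;

    my %wanted;
    open my $circ, '<', $circuits_file or die $!;
    $wanted{$_} = 1 while <$circ>;
    close $circ;

    open my $rec, '<', $records_file or die $!;
    delete $wanted{$_} while <$rec>;    # seen, so no longer missing
    close $rec;

    return sort keys %wanted;
}

# usage: print for not_found( 'circuits', 'records' );
```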
    Just another Perler interested in Algol Programming.

      I'd draw your attention to the first word of both of the sentences you quoted; and also to the 'ie.' and the qualification that follows it.

      If the OP's circumstances do not meet either of those two criteria, then *I* wouldn't use this approach.

      But his records might only be 80 characters in size (ie. <1GB of data); and if I were purchasing my next machine right now, I wouldn't consider anything with less than 8GB, preferably 16; and I'd also be looking at putting in an SSD configured to hold my swap partition, effectively giving me 64GB (or 128GB or 256GB) of extended memory that is a couple of orders of magnitude faster than disk.

      So you are trading two O(N log N) sorts plus a merge at disk speed against a single O(N^2) process at RAM speed. Without the OP clarifying the actual volumes of data involved, there is no way to make a valid assessment of the trade-offs.

      Also, if they are free-format text records -- ie. the key is not in a fixed position, or there might be multiple or no keys per record -- sorting them may not even be an option.

      Equally, the OP mentioned 'patterns'; if they are patterns in the regex sense of the word, that would exclude using a hash. And, if you had to search the records to locate the embedded keys in order to build a hash, you've done 90% of the work of the in-memory method, before you've started to actually use the hash.
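      If the circuits really are regex patterns -- an assumption about the OP's data, not something the thread confirms -- one common alternative is to join them into a single compiled alternation and make one pass over the records; the patterns and records below are invented for illustration:

```perl
use strict;
use warnings;

# Compile all patterns into one alternation so each record is scanned
# once, rather than once per pattern.
my @patterns = ( 'circ-00\d+', 'span-[A-F]+' );
my $alt      = join '|', map { "(?:$_)" } @patterns;
my $any      = qr/$alt/;

my @records = ( 'node circ-0042 up', 'node link-9 down' );
my @matched = grep { /$any/ } @records;

print "$_\n" for @matched;    # prints only the circ-0042 record
```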

      The bottom line is, I offered just one more alternative that might make sense -- or not -- given the OP's actual data; and it is up to them to decide which fits best.


