PerlMonks  

Faster grep in a huge file(10 million)

by Thomas Kennll (Acolyte)
on May 10, 2013 at 19:46 UTC (#1033014=perlquestion)
Thomas Kennll has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file1 which has 5 million patterns or circuit names, and a file2 which has 12 million records. I need to find the circuits or patterns not present in file2, something like fgrep -f file1 -v file2. I would be grateful if someone could help me here. Thank you.

Re: Faster grep in a huge file(10 million)
by educated_foo (Vicar) on May 10, 2013 at 19:52 UTC
    Does file1 contain patterns or literal strings? If the latter, you might want to read it into a hash, or sort both files and use a modified merge.
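A minimal sketch of the sort-and-merge idea, assuming both files hold one literal name per line and have already been sorted with the same ordering. The sub name missing_from_sorted and the sample data are hypothetical; small in-memory lists stand in for the two sorted files, which real code would read line by line instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Walk two sorted lists in step and return the entries of @$wanted
# that never appear in @$seen. Both inputs must already be sorted
# with the same string ordering (here: Perl's default "cmp").
sub missing_from_sorted {
    my ( $wanted, $seen ) = @_;
    my @missing;
    my $j = 0;
    for my $name (@$wanted) {
        # Skip entries of @$seen that sort before the current name.
        $j++ while $j < @$seen && ( $seen->[$j] cmp $name ) < 0;
        push @missing, $name
            unless $j < @$seen && $seen->[$j] eq $name;
    }
    return \@missing;
}

# Tiny in-memory stand-ins for the two files (hypothetical data).
my @circuits = sort qw( CKT-001 CKT-002 CKT-007 CKT-009 );
my @records  = sort qw( CKT-002 CKT-003 CKT-009 CKT-009 );

print "$_\n" for @{ missing_from_sorted( \@circuits, \@records ) };
```

Each list is walked only once, so the merge itself is linear and the overall cost is dominated by the two sorts.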
    Just another Perler interested in Algol Programming.
Re: Faster grep in a huge file(10 million)
by thewebsi (Scribe) on May 10, 2013 at 19:56 UTC
      Thanks for the reply. I tried this, but it doesn't seem to help. :(
      #!/usr/bin/perl
      use strict;
      use warnings;

      my %file2;
      open my $file2, '<', '/home/match_miss' or die "Couldn't open file2: $!";
      while ( my $line = <$file2> ) {
          ++$file2{$line};
      }

      open my $file1, '<', '/home/BIG_FILE' or die "Couldn't open file1: $!";
      while ( my $line = <$file1> ) {
          print $line if defined $file2{$line};
      }
        print $line if defined $file2{$line};
        I think that finds lines in file 1 that are also in file 2. To find lines in file 1 not in file 2, maybe change that to:

        print $line unless defined $file2{$line};
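A variant of the hash approach that keeps only the smaller file in memory, sketched under the assumption that circuit names and records match as whole lines. The sub name circuits_not_in_records and the sample data are hypothetical. Each name is deleted from the hash as soon as it is seen in the records, so whatever remains was never found:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the circuit names that never occur as a whole line in the
# records. Exact literal matches only (no patterns).
sub circuits_not_in_records {
    my ( $circuits_fh, $records_fh ) = @_;

    my %missing;
    while ( my $name = <$circuits_fh> ) {
        $name =~ s/\r?\n\z//;    # drop LF or CRLF line endings
        $missing{$name} = 1;
    }

    # Every record that matches a name removes it from the hash.
    while ( my $record = <$records_fh> ) {
        $record =~ s/\r?\n\z//;
        delete $missing{$record};
    }

    my @left = sort keys %missing;
    return @left;
}

# In-memory stand-ins for the two files (hypothetical data); real
# code would open the files by name instead.
my $circuits = "CKT-001\nCKT-002\nCKT-007\n";
my $records  = "CKT-002\r\nCKT-003\r\n";

open my $circuits_fh, '<', \$circuits or die "Couldn't open circuits: $!";
open my $records_fh,  '<', \$records  or die "Couldn't open records: $!";

print "$_\n" for circuits_not_in_records( $circuits_fh, $records_fh );
```

Stripping the line endings before the lookup matters: if one file uses CRLF and the other LF, otherwise identical lines would not compare equal as hash keys.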

Re: Faster grep in a huge file(10 million)
by InfiniteSilence (Curate) on May 10, 2013 at 21:05 UTC

    Is there a problem with writing all of the data to a relational database and simply doing something like,

    SELECT tbl2.* FROM tbl2 WHERE tbl2.id NOT IN (SELECT tbl1.id FROM tbl1);
    ?

    Celebrate Intellectual Diversity

Re: Faster grep in a huge file(10 million)
by Laurent_R (Parson) on May 10, 2013 at 21:56 UTC

    5 million circuit names is not that huge (at least not by my standards); it is just big. I think it should fit in a hash in memory, and if it does, this is a very simple problem.

Re: Faster grep in a huge file(10 million)
by BrowserUk (Pope) on May 10, 2013 at 23:32 UTC

    If your 12 million records average less than a couple of kbytes each (ie. if the size of the records file is less than your available memory), I'd just load the entire file into memory as a single string and then read the circuits file one line at a time and use index to see if it is in the records:

    #! perl -slw
    use strict;

    my $records;
    {
        local( @ARGV, $/ ) = $ARGV[0];
        $records = <>;
    }

    open CIRCUITS, '<', 'circuits' or die $!;
    while( <CIRCUITS> ) {
        unless( 1+ index $records, $_ ) {
            print;
        }
    }
    __END__
    C:\test>1033014 records circuits >notfound
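A small sketch of how the index test behaves, using hypothetical sample data and a hypothetical helper named found_in. index returns -1 when the substring is absent, which is why adding 1 to it gives a false value exactly on a miss. Two things seem worth keeping in mind: a name read with <FH> still carries its newline unless it is chomped, so it would only match at the end of a record line, and a plain substring test also matches inside longer names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if $needle occurs anywhere in $haystack as a literal substring.
sub found_in {
    my ( $haystack, $needle ) = @_;
    my $pos = index( $haystack, $needle );    # -1 when not found
    return $pos >= 0 ? 1 : 0;
}

# The whole records file held as one string (hypothetical data).
my $records = "2013-05-10 CKT-002 up\n2013-05-10 CKT-0091 down\n";

for my $name (qw( CKT-002 CKT-007 CKT-009 )) {
    printf "%-8s %s\n", $name,
        found_in( $records, $name ) ? 'found' : 'missing';
}
```

Here CKT-009 is reported as found because it is a prefix of CKT-0091; if that kind of false positive matters for the real data, the names would need delimiters around them or a different approach.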

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If your 12 million records average less than a couple of kbytes each (ie. if the size of the records file is less than your available memory)
      Is 12GB a normal amount of memory for a single process to use these days? My sense was that 4GB was standard on an entry-level desktop or a mid-level laptop. Even if you have a super-machine with 16GB, you may not want to have a single process suck that all up to run an O(n^2) program. A hash containing the smaller file or two on-disk sorts would be a much better option, and not that hard to do.
      Just another Perler interested in Algol Programming.

        I'd draw your attention to the first word of both of the sentences you quoted; and also to both the id est; and contraction that follows it.

        If the OP's circumstances do not comply with either of those two criteria; then *I* wouldn't use this approach.

        But, his records might only be 80 characters in size (ie. <1GB of data); and if I were purchasing my next machine right now, I wouldn't consider anything with less than 8GB, preferably 16; and I'd also be looking at putting in an SSD configured to hold my swap partition, effectively giving me 64GB (or 128GB or 256GB) of extended memory that is a couple of orders of magnitude faster than disk.

        So then you are trading 2x O(N log N) processes + merge at disk speeds; against a single O(N^2) process at RAM speed. Without the OP clarifying the actual volumes of data involved; there is no way to make a valid assessment of the trade-offs.

        Also, if they are free-format text records -- ie. the key is not in a fixed position; or there might be multiple or no keys per record -- sorting them may not even be an option.

        Equally, the OP mentioned 'patterns'; if they are patterns in the regex sense of the word, that would exclude using a hash. And, if you had to search the records to locate the embedded keys in order to build a hash, you've done 90% of the work of the in-memory method, before you've started to actually use the hash.

        The bottom line is, I offered just one more alternative that might make sense -- or not -- given the OP's actual data; and it is up to them to decide which best fits.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Node Type: perlquestion [id://1033014]
Approved by tobyink