
Pattern matching across two files, Need something better than grep -f!

by anasuya (Novice)
on Apr 10, 2012 at 17:49 UTC
anasuya has asked for the wisdom of the Perl Monks concerning the following question:

I have a pattern.txt file which looks like this:
    2gqt+FAD+A+601 2i0z+FAD+A+501
    1n1e+NDE+A+400 2qzl+IXS+A+449
    1llf+F23+A+800 1y0g+8PP+A+320
    1ewf+PC1+A+577 2a94+AP0+A+336
    2ydx+TXP+E+1339 3g8i+RO7+A+1
    1gvh+HEM+A+1398 1v9y+HEM+A+1140
    2i0z+FAD+A+501 3m2r+F43+A+1
    1h6d+NDP+A+500 3rt4+LP5+C+501
    1w07+FAD+A+1660 2pgn+FAD+A+612
    2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx. 8 GB in size) which has lines like this:
    2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
    1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
    1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
    1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
    2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
    1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
    2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
    1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
    1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
    2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as the sample above suggests. It is so large because each distinct string in the first column is shared by roughly 18,000 lines, i.e. 18,000 lines beginning with 2gqt+FAD+A+601, followed by 18,000 lines beginning with 1n1e+NDE+A+400, and so on. Only one line in each such block matches a line in pattern.txt. I am trying to match the lines in pattern.txt against data and want to print out:
    2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
    1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
    1llf+F23+A+800 1y0g+8PP+A+320 0.341786
    1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
    2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
    1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
    2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
    1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
    1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
    2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
    use warnings;
    open AS, "data";
    open AQ, "pattern.txt";
    @arr  = <AS>;
    @arr1 = <AQ>;
    foreach $line(@arr) {
        @split = split(' ', $line);
        foreach $line1(@arr1) {
            @split1 = split(' ', $line1);
            if($split[0] eq $split1[0] && $split[1] eq $split1[1]) {
                print $split1[0], "\t", $split1[1], "\t", $split1[3], "\n";
            }
        }
    }
    close AQ;
    close AS;
I have tried using grep -f, but it takes a very long time to do this job. How do I modify the existing code to use something like this:
    while ($line = <AQ>) {    # file handle for the pattern file
        while ($line_data = <AS>) {
            # do the matching here?
        }
    }
I want to reduce the runtime of this code as much as possible. Please help.

Re: Pattern matching across two files, Need something better than grep -f!
by kennethk (Monsignor) on Apr 10, 2012 at 18:26 UTC
    So anytime you want to "minimise the runtime... code to as small as possible", you need to study the code to determine where you are spending your time. This means profiling and benchmarking. I would recommend you check out Devel::NYTProf and Benchmark.
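
    As a rough illustration of the benchmarking advice, the sketch below uses the core Benchmark module to compare a nested-loop scan with a hash lookup. The sample data and the sub names (nested_scan, hash_scan) are made up for the example and are not taken from the question's files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical in-memory sample: 200 pattern pairs, 2000 data lines.
my @patterns = map { "id$_ partner$_" } 1 .. 200;
my @data     = map { "id$_ partner$_ 0.1 0.2" } 1 .. 2000;

# Nested-loop approach: every data line is compared against every pattern line.
sub nested_scan {
    my $hits = 0;
    for my $line (@data) {
        my @d = split ' ', $line;
        for my $pat (@patterns) {
            my @p = split ' ', $pat;
            $hits++ if $d[0] eq $p[0] && $d[1] eq $p[1];
        }
    }
    return $hits;
}

# Hash approach: parse the patterns once, then do one lookup per data line.
sub hash_scan {
    my %want;
    $want{ join(' ', split(' ', $_)) } = 1 for @patterns;
    my $hits = 0;
    for my $line (@data) {
        my @d = split ' ', $line;
        $hits++ if $want{"$d[0] $d[1]"};
    }
    return $hits;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, { nested => \&nested_scan, hash => \&hash_scan });
```

    For a line-by-line profile of a real script, Devel::NYTProf is typically run as perl -d:NYTProf script.pl followed by nytprofhtml. The numbers from a toy benchmark like this one only indicate a trend; the real files should be measured too.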

    A couple of things you are doing that you could address, some of which would likely improve performance and others of which are simply good coding practice:

    1. Don't slurp the whole files into memory. If you operate on one line at a time, you won't chew up huge amounts of memory (8 GB + data overhead).
    2. You reparse the entirety of your pattern file on each loop. Instead, parse it once and store it in a hash. Then you can use the fast look-up a hash offers you.
    3. You should probably also get in the habit of testing whether your opens succeed, a la open AS, "data" or die $!;, or even better open my $as, '<', "data" or die "data open failed: $!";
    4. Consider adding use strict; give "Use strict warnings and diagnostics or die" a read.

    Implementing all this might result in something like (untested):

        use strict;
        use warnings;

        open my $as, '<', "data" or die "data open failed: $!\n";
        open my $aq, '<', "pattern.txt" or die "pattern.txt open failed: $!\n";

        # Parse the pattern file once; key on both ID columns.
        my %pattern;
        while (<$aq>) {
            my @split = split;
            $pattern{"$split[0] $split[1]"} = 1;
        }

        # Stream the data file line by line; one hash lookup per line.
        while (<$as>) {
            my @split = split;
            if ($pattern{"$split[0] $split[1]"}) {
                print "$split[0]\t$split[1]\t$split[3]\n";
            }
        }

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      A little bit faster:
          use strict;
          use warnings;

          open my $data_fh,    '<', "data"        or die $!;
          open my $pattern_fh, '<', "pattern.txt" or die $!;

          my %patterns;
          while (<$pattern_fh>) {
              $patterns{join $;, split} = ();
          }
          close $pattern_fh;

          {
              local $, = "\t";
              local $\ = "\n";
              while (<$data_fh>) {
                  my @line = split;
                  if (exists $patterns{$line[0] . $; . $line[1]}) {
                      print @line[0, 1, 3];
                  }
              }
          }
          close $data_fh;

      I suggest that even before going to measure the bottlenecks, the OP needs to define what his goal is. If the process is taking 10 seconds and he wants it to run in two, then mere tweaking may not be enough.

      xoxo,
      Andy

Re: Pattern matching across two files, Need something better than grep -f!
by BrowserUk (Pope) on Apr 10, 2012 at 18:41 UTC

    How many lines are there in the pattern file?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Pattern matching across two files, Need something better than grep -f!
by BrowserUk (Pope) on Apr 10, 2012 at 20:23 UTC

    NOTE: The following code assumes that the whitespace in both files consists of single tabs.

    Assuming the patterns file has fewer than, say, 15 million records, this should process the entire data file in less than 5 minutes:

        #! perl -slw
        use strict;

        # Load every pattern line as a hash key (the question calls this file pattern.txt).
        my %patterns;
        open PAT, '<', 'patterns.txt' or die $!;
        chomp, undef $patterns{ $_ } while <PAT>;
        close PAT;

        # Stream the data file; the key is the first two columns exactly as written.
        open DAT, '<', 'data' or die $!;
        while( <DAT> ) {
            my( $key, $v1, $v2 ) = m[(\S+\s+\S+)\s+(\S+)\s+(\S+)];
            exists $patterns{ $key } and print "$key\t$v2";
        }
        close DAT;

    If the pattern file is a lot bigger than that -- i.e. too big to build the hash in memory -- then you would need to run multiple passes.
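
    A minimal sketch of what such a multi-pass run might look like, assuming the pattern file can be read in chunks of a fixed number of lines. The sub name multipass_match and the chunk size are hypothetical; the demo uses in-memory strings in place of the real files.

```perl
use strict;
use warnings;

# Scan the data once per chunk of patterns, so that at most $chunk_size keys
# are held in memory at a time. $pattern_src and $data_src may be file paths
# or references to in-memory strings (as in the demo below).
sub multipass_match {
    my ($pattern_src, $data_src, $chunk_size, $out) = @_;
    open my $pat, '<', $pattern_src or die "cannot open patterns: $!";
    until (eof($pat)) {
        # Load the next chunk of pattern keys.
        my %want;
        while (keys(%want) < $chunk_size and defined(my $line = <$pat>)) {
            my @f = split ' ', $line;
            $want{"$f[0]\t$f[1]"} = 1 if @f >= 2;
        }
        last unless %want;

        # One full pass over the data for this chunk.
        open my $dat, '<', $data_src or die "cannot open data: $!";
        while (my $line = <$dat>) {
            my @f = split ' ', $line;
            next unless @f >= 4;
            print {$out} join("\t", @f[0, 1, 3]), "\n" if $want{"$f[0]\t$f[1]"};
        }
        close $dat;
    }
    close $pat;
}

# Demo with made-up data and a chunk size of 2, which forces two passes.
my $pattern_text = "a1 b1\na2 b2\na3 b3\n";
my $data_text    = "a1 b1 0.1 0.2\na1 zz 0.3 0.4\na3 b3 0.5 0.6\na2 b2 0.7 0.8\n";
my $result = '';
open my $out, '>', \$result or die $!;
multipass_match(\$pattern_text, \$data_text, 2, $out);
close $out;
print $result;
```

    Note that the output comes out grouped by pass rather than in data-file order, and that every extra pass means another full read of the data file, so the chunk size should be as large as memory allows.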

    If you are seeking to reduce the time to much less than the above code takes, you'll need to look at parallelising the operation.
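
    One possible way to parallelise, sketched under the assumption of a Unix-like system where fork is cheap: split the data file into byte ranges, let one child process scan each range against the same hash, and write one part file per child. The names scan_range and parallel_match and the part.N file naming are hypothetical, and whether this helps at all depends on whether the job is CPU-bound or disk-bound.

```perl
use strict;
use warnings;
use POSIX ();
use File::Temp qw(tempdir);

# Scan the byte range [$start, $end) of $data_file. A line belongs to the
# range in which its first byte falls.
sub scan_range {
    my ($data_file, $start, $end, $want, $out) = @_;
    open my $dat, '<', $data_file or die "$data_file: $!";
    if ($start > 0) {
        seek $dat, $start - 1, 0 or die "seek: $!";
        my $skipped = <$dat>;    # finish the line that began before $start
    }
    while (tell($dat) < $end) {
        my $line = <$dat>;
        last unless defined $line;
        my @f = split ' ', $line;
        next unless @f >= 4;
        print {$out} join("\t", @f[0, 1, 3]), "\n" if exists $want->{"$f[0]\t$f[1]"};
    }
    close $dat;
}

# Fork $workers children; each scans one slice and writes its own part file.
sub parallel_match {
    my ($data_file, $want, $workers, $out_dir) = @_;
    my $size  = -s $data_file;
    my $slice = int($size / $workers) + 1;
    my @pids;
    for my $i (0 .. $workers - 1) {
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            my $start = $i * $slice;
            my $end   = $start + $slice;
            $end = $size if $end > $size;
            open my $out, '>', "$out_dir/part.$i" or die $!;
            scan_range($data_file, $start, $end, $want, $out);
            close $out;
            POSIX::_exit(0);     # leave without running the parent's cleanup code
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;
    return map { "$out_dir/part.$_" } 0 .. $workers - 1;
}

# Demo on a small temporary file with made-up contents.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/data";
open my $fh, '>', $file or die $!;
print {$fh} "a$_ b$_ 0.$_ 1.$_\n" for 1 .. 50;
close $fh;

my %want  = ("a7\tb7" => 1, "a42\tb42" => 1);
my @parts = parallel_match($file, \%want, 4, $dir);

my $result = '';
for my $part (@parts) {
    open my $in, '<', $part or die $!;
    local $/;
    my $chunk = <$in>;
    $result .= $chunk if defined $chunk;
}
print $result;
```

    The hash is built before forking, so the children share it without reloading the pattern file. Concatenating the part files in slice order preserves the order of the data file.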



Re: Pattern matching across two files, Need something better than grep -f!
by JavaFan (Canon) on Apr 10, 2012 at 23:48 UTC
    I have tried using grep -f, but it is taking a very long time to do this job
    grep is optimized to do one job well. Unless you can take shortcuts based on something special you know about the input, it's unlikely that any Perl solution is going to beat grep.

    For instance, if the blocks of 18000 lines which share the first "token" are in the same order as the entries in pattern.txt, you can use this fact and make a much faster solution than just trying to match every line with every other.
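
    A sketch of how that ordering could be exploited, assuming (and this is only an assumption) that the blocks in the data file appear in the same order as the lines of pattern.txt. The sub name ordered_match is hypothetical, and the demo runs on made-up in-memory data.

```perl
use strict;
use warnings;

# Walk both inputs in step. Each data line is compared against the current
# pattern only; no hash and no rescanning are needed.
sub ordered_match {
    my ($pat, $dat, $out) = @_;    # three open filehandles
    my $found = 0;
    PATTERN: while (my $p = <$pat>) {
        my @want = split ' ', $p;
        next unless @want >= 2;
        while (my $line = <$dat>) {
            my @f = split ' ', $line;
            next unless @f >= 4 && $f[0] eq $want[0] && $f[1] eq $want[1];
            print {$out} join("\t", @f[0, 1, 3]), "\n";
            $found++;
            next PATTERN;          # move on to the next pattern line
        }
        last;    # data exhausted: the ordering assumption did not hold
    }
    return $found;
}

# Demo with made-up data: two blocks, one wanted line in each.
my $pattern_text = "a1 b2\na2 b1\n";
my $data_text    = join '',
    "a1 b1 0.1 0.2\n", "a1 b2 0.3 0.4\n", "a1 b3 0.5 0.6\n",
    "a2 b1 0.7 0.8\n", "a2 b2 0.9 1.0\n";
open my $pat, '<', \$pattern_text or die $!;
open my $dat, '<', \$data_text    or die $!;
my $result = '';
open my $out, '>', \$result or die $!;
my $found = ordered_match($pat, $dat, $out);
close $out;
print $result;
```

    Reading stops as soon as the last pattern has been found, so the tail of the data file is never touched. If the return value is smaller than the number of pattern lines, the files were not in the assumed order and a hash-based approach is the safer choice.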

Re: Pattern matching across two files, Need something better than grep -f!
by pvaldes (Chaplain) on Apr 11, 2012 at 00:12 UTC

    Sorry, but I'm afraid I don't fully understand the problem here. I would just suggest that you choose your variable names carefully:

    if($split[0] eq $split1[0] && $split[1] eq $split1[1])

    Lines like this are artificially hard to read. It is really easy to miss the "one" before the "[". You will also find that naming a variable after a common function was not such a great idea when you need to debug or expand your code some months later.
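
    For illustration, the same comparison with more descriptive names might read as follows. The names are only suggestions and same_pair is a hypothetical helper.

```perl
use strict;
use warnings;

# True when the first two columns of a data line equal those of a pattern line.
sub same_pair {
    my ($data_line, $pattern_line) = @_;
    my ($data_id,    $data_partner)    = split ' ', $data_line;
    my ($pattern_id, $pattern_partner) = split ' ', $pattern_line;
    return $data_id eq $pattern_id && $data_partner eq $pattern_partner;
}

my $verdict = same_pair(
    '2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412',
    '2gqt+FAD+A+601 2i0z+FAD+A+501',
) ? 'match' : 'no match';
print "$verdict\n";
```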

Re: Pattern matching across two files, Need something better than grep -f!
by snape (Pilgrim) on Apr 11, 2012 at 08:20 UTC

    Try a hash table for comparing the two files.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %data1;    ## Hash for Pattern File
        open IN1, '<', 'patterns.txt' or die $!;
        while (<IN1>) {
            chomp $_;
            my @line = split('\t', $_);
            $data1{$line[0]} = $line[1];    # first column => second column
        }
        close(IN1);

        open IN2, '<', 'data.txt' or die $!;
        while (<IN2>) {
            chomp $_;    # otherwise $line[3] keeps its newline
            my @line = split('\t', $_);
            # Both columns must agree, since many data lines share a first column.
            if (exists $data1{$line[0]} && $data1{$line[0]} eq $line[1]) {
                print $line[0], "\t", $line[1], "\t", $line[2], "\t", $line[3], "\n";
            }
        }
        close IN2;
      How can I write code for files that look like this?

      pattern.txt
          AT1G48210
          AT1G48240
          AT1G48260
          AT1G48330
          AT1G48370
          AT1G48440
          AT1G48450
          AT1G01073

      data.txt
          AT1G01010 Bra033296 . .
          AT1G01020 Bra033295 . .
          AT1G01030 Bra033294 . .
          AT1G01040 Bra033293 . .
          AT1G01046 . . .
          AT1G01050 Bra033292 Bra032616 .
          AT1G01060 Bra033291 . .
          AT1G01070 Bra033290 Bra032617 .
          AT1G01073 . . .
          AT1G01080 Bra033287 . .
          AT1G01090 Bra033286 Bra032619 .
          AT1G01100 Bra033285 Bra032620 .
          AT1G01110 Bra033284 . .
          AT1G01115 . . .
          AT1G01120 Bra033283 Bra032621 .
          AT1G01130 Bra033282 Bra032622 .
          AT1G01140 Bra033282 Bra032622 .
          AT1G01150 . . .
          AT1G01160 Bra033281 Bra032623 .
          AT1G01170 Bra033280 . .
          AT1G01180 Bra033279 . .
          AT1G01183 . . .
          AT1G01190 Bra033278 Bra032624 .

      I need to find the lines that contain only dots and, for each of them, pull out the closest lines above and below that have at least one "Bra..." entry rather than a dot. There can be more than one dots-only line close together in the file. Please help, I'm so stuck!

        how can i make a code ...

        SMOP
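
        One possible reading of that request, sketched with hypothetical names: for every line whose value columns are all dots, report the nearest line above and the nearest line below that carries at least one "Bra..." entry. The sketch assumes the file fits in memory and ignores pattern.txt, since the post does not say how it should be used.

```perl
use strict;
use warnings;

# Returns a list of [dots_only_id, nearest_id_above, nearest_id_below];
# a missing neighbour is left undefined.
sub nearest_annotated {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        my ($id, @values) = split ' ', $line;
        next unless defined $id;
        push @rows, { id => $id, annotated => scalar(grep { /^Bra/ } @values) };
    }
    my @report;
    for my $i (0 .. $#rows) {
        next if $rows[$i]{annotated};
        my ($above, $below);
        for (my $j = $i - 1; $j >= 0; $j--) {
            if ($rows[$j]{annotated}) { $above = $rows[$j]{id}; last }
        }
        for my $j ($i + 1 .. $#rows) {
            if ($rows[$j]{annotated}) { $below = $rows[$j]{id}; last }
        }
        push @report, [ $rows[$i]{id}, $above, $below ];
    }
    return @report;
}

# Demo on an excerpt of the data.txt sample from the post.
my $sample = join "\n",
    'AT1G01040 Bra033293 . .',
    'AT1G01046 . . .',
    'AT1G01050 Bra033292 Bra032616 .',
    'AT1G01070 Bra033290 Bra032617 .',
    'AT1G01073 . . .',
    'AT1G01080 Bra033287 . .';
open my $in, '<', \$sample or die $!;
my @report = nearest_annotated($in);
for my $r (@report) {
    my ($id, $above, $below) = map { defined $_ ? $_ : 'none' } @$r;
    print "$id\tabove: $above\tbelow: $below\n";
}
```

        If only the IDs listed in pattern.txt are of interest, loading them into a hash and skipping report rows whose ID is not in it would be a small extension.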

Re: Pattern matching across two files, Need something better than grep -f!
by olus (Curate) on Apr 11, 2012 at 15:16 UTC

    Depending on the size of the files, there may be some gain or loss in time from sorting both files and then iterating through the source patterns while working on successive manageable chunks of the data file. This will add complexity to the script, but the issue is speed.
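
    A sketch of the sort-then-merge idea, assuming both files have already been sorted on their first two columns in plain byte order (for example with LC_ALL=C sort -k1,1 -k2,2). The sub names next_record and merge_match are hypothetical, and the cost of sorting an 8 GB file is not included here.

```perl
use strict;
use warnings;

# Return the next "col0\tcol1" key and its fields from $fh, or nothing at EOF.
sub next_record {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        my @f = split ' ', $line;
        return ("$f[0]\t$f[1]", \@f) if @f >= 2;
    }
    return;
}

# Merge join over two sorted inputs: always advance whichever side is behind.
sub merge_match {
    my ($pat, $dat, $out) = @_;
    my ($pkey)      = next_record($pat);
    my ($dkey, $df) = next_record($dat);
    while (defined $pkey && defined $dkey) {
        my $cmp = $pkey cmp $dkey;
        if    ($cmp < 0) { ($pkey) = next_record($pat) }
        elsif ($cmp > 0) { ($dkey, $df) = next_record($dat) }
        else {
            print {$out} join("\t", @{$df}[0, 1, 3]), "\n" if @$df >= 4;
            ($dkey, $df) = next_record($dat);
        }
    }
}

# Demo with made-up, already sorted data.
my $pattern_text = "a1 b2\na3 b1\n";
my $data_text    = "a1 b1 0.1 0.2\na1 b2 0.3 0.4\na2 b1 0.5 0.6\na3 b1 0.7 0.8\n";
open my $pat, '<', \$pattern_text or die $!;
open my $dat, '<', \$data_text    or die $!;
my $result = '';
open my $out, '>', \$result or die $!;
merge_match($pat, $dat, $out);
close $out;
print $result;
```

    Memory use stays constant regardless of the size of either file, which is the main attraction over a hash when the pattern file is very large.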

Node Type: perlquestion [id://964371]
Approved by kennethk