Pattern matching across two files, Need something better than grep -f!

anasuya has asked for the wisdom of the Perl Monks concerning the following question:

I have a pattern.txt file which looks like this:

2gqt+FAD+A+601   2i0z+FAD+A+501
1n1e+NDE+A+400   2qzl+IXS+A+449
1llf+F23+A+800   1y0g+8PP+A+320
1ewf+PC1+A+577   2a94+AP0+A+336
2ydx+TXP+E+1339   3g8i+RO7+A+1
1gvh+HEM+A+1398   1v9y+HEM+A+1140
2i0z+FAD+A+501   3m2r+F43+A+1
1h6d+NDP+A+500   3rt4+LP5+C+501
1w07+FAD+A+1660   2pgn+FAD+A+612
2qd1+PP9+A+701   3gsi+FAD+A+902

There is another file called data (approx. 8 GB in size) which has lines like this:

2gqt+FAD+A+601   2i0z+FAD+A+501    0.874585  0.785412
1n1e+NDE+A+400   2qzl+IXS+A+449    0.145278  0.589452
1llf+F23+A+800   1y0g+8PP+A+320    0.784512  0.341786
1ewf+PC1+A+577   2a94+AP0+A+336    0.362542  0.784785
2ydx+TXP+E+1339   3g8i+RO7+A+1     0.251452  0.365298
1gvh+HEM+A+1398   1v9y+HEM+A+1140  0.784521  0.625893
2i0z+FAD+A+501   3m2r+F43+A+1      0.369856  0.354842
1h6d+NDP+A+500   3rt4+LP5+C+501    0.925478  0.365895
1w07+FAD+A+1660   2pgn+FAD+A+612   0.584785  0.325863
2qd1+PP9+A+701   3gsi+FAD+A+902    0.874526  0.125453

However, the data file is not as simple as the sample above suggests. It is so large because each distinct string in the first column begins approx. 18000 lines, i.e. there are 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400, and so on. Only one line in each such block matches a given pair in pattern.txt. I am trying to match the lines in pattern.txt against data and want to print out:

2gqt+FAD+A+601   2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400   2qzl+IXS+A+449 0.589452
1llf+F23+A+800   1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577   2a94+AP0+A+336 0.784785  
2ydx+TXP+E+1339   3g8i+RO7+A+1  0.365298
1gvh+HEM+A+1398   1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501   3m2r+F43+A+1 0.354842
1h6d+NDP+A+500   3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660   2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701   3gsi+FAD+A+902 0.125453

As of now I am using something in Perl, like this:

use warnings;
open AS, "data";
open AQ, "pattern.txt";
@arr=<AS>;
@arr1=<AQ>;
foreach $line(@arr)
{
    @split=split(' ',$line);
    foreach $line1(@arr1)
    {
     @split1=split(' ',$line1);
     if($split[0] eq $split1[0] && $split[1] eq $split1[1])
     { print $split[0],"\t",$split[1],"\t",$split[3],"\n";}  # the value column comes from the data line
   }

}
close AQ;
close AS;

I have tried using grep -f, but it is taking a very long time to do this job. How do I modify this existing code using something like:

while ($line = <AQ>) #file handler for pattern
{
   while ($line_data = <AS>)
    { #do the matching here.?
    }
}

I want to make the runtime of this code as small as possible. Please help.


Replies are listed 'Best First'.
Re: Pattern matching across two files, Need something better than grep -f! by kennethk (Abbot) on Apr 10, 2012 at 18:26 UTC
So anytime you want to "minimise the runtime... code to as small as possible", you need to study the code to determine where you are spending your time. This means profiling and benchmarking. I would recommend you check out Devel::NYTProf and Benchmark.

A couple of things you are doing that you could address, some of which would likely improve performance and others of which are simply good coding practice:

- Don't slurp the whole files into memory. If you operate on one line at a time, you won't chew up huge amounts of memory (8GB + data overhead).
- You reparse the entirety of your pattern file on each loop. Instead, parse it once and store it in a hash. Then you can use the fast look-up a hash offers you.
- You should probably also get in the habit of testing whether your opens succeed, a la `open AS, "data" or die $!;`, or even better `open my $as, '<', "data" or die "data open failed: $!";`
- Consider adding strict; give "Use strict warnings and diagnostics or die" a read.

Implementing all this might result in something like (untested):

use strict;
use warnings;

open my $as, '<', "data" or die "data open failed: $!\n";
open my $aq, '<', "pattern.txt" or die "pattern.txt open failed: $!\n";

my %pattern;
while (<$aq>) {
    my @split = split;
    $pattern{"$split[0] $split[1]"} = 1;
}

while (<$as>) {
    my @split = split;
    if ($pattern{"$split[0] $split[1]"}) {
        print "$split[0]\t$split[1]\t$split[3]\n";
    }
}

First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
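The benchmarking advice above could be tried out along the following lines. This is only a minimal sketch using the core Benchmark module's `cmpthese`; the synthetic `@patterns` and `@data` arrays and the sub names `nested_scan` and `hash_lookup` are made up for illustration and are not the OP's real files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-ins for the two files (hypothetical data, not the OP's).
my @patterns = map { ["id$_", "partner$_"] } 1 .. 500;
my @data     = map { ["id$_", "partner$_", rand(), rand()] } 1 .. 500;

# Nested-loop matching, as in the original script: patterns x data comparisons.
sub nested_scan {
    my $hits = 0;
    for my $row (@data) {
        for my $pat (@patterns) {
            $hits++ if $row->[0] eq $pat->[0] && $row->[1] eq $pat->[1];
        }
    }
    return $hits;
}

# Hash matching: build the lookup once, then one probe per data row.
sub hash_lookup {
    my %want;
    $want{"$_->[0] $_->[1]"} = 1 for @patterns;
    my $hits = 0;
    for my $row (@data) {
        $hits++ if $want{"$row->[0] $row->[1]"};
    }
    return $hits;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, { nested => \&nested_scan, hash => \&hash_lookup });
```

The absolute rates will differ from machine to machine; the point is only to measure both approaches on the same input before deciding where to spend effort.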
Re^2: Pattern matching across two files, Need something better than grep -f! by petdance (Parson) on Apr 10, 2012 at 19:59 UTC
I suggest that even before going to measure the bottlenecks, the OP needs to define what his goal is. If the process is taking 10 seconds and he wants it to run in two, then mere tweaking may not be enough.

xoxo, Andy
Re^2: Pattern matching across two files, Need something better than grep -f! by Anonymous Monk on Apr 10, 2012 at 18:52 UTC
A little bit faster:

use strict;
use warnings;

open my $data_fh, '<', "data" or die $!;
open my $patern_fh, '<', "pattern.txt" or die $!;

my %patterns;
while (<$patern_fh>) {
    $patterns{join $;, split} = ();
}
close $patern_fh;

{
    local $, = "\t";
    local $\ = "\n";
    while (<$data_fh>) {
        my @line = split;
        if (exists $patterns{$line[0] . $; . $line[1]}) {
            print @line[0, 1, 3];
        }
    }
}
close $data_fh;
Re: Pattern matching across two files, Need something better than grep -f! by BrowserUk (Patriarch) on Apr 10, 2012 at 20:23 UTC
NOTE: The following code assumes that the whitespace in both files consists of single tabs.

Assuming the patterns file has less than, say, 15 million records, this should process the entire data file in less than 5 minutes:

#! perl -slw
use strict;

my %patterns;
open PAT, '<', 'pattern.txt' or die $!;
chomp, undef $patterns{ $_ } while <PAT>;
close PAT;

open DAT, '<', 'data' or die $!;
while( <DAT> ) {
    my( $key, $v1, $v2 ) = m[(\S+\s+\S+)\s+(\S+)\s+(\S+)];
    exists $patterns{ $key } and print "$key\t$v2";
}
close DAT;

If the pattern file is a lot bigger than that -- i.e. too big to build the hash in your memory -- then you would need to run multiple passes.

If you are seeking to reduce the time to much less than the above code takes, you'll need to look at parallelising the operation.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The start of some sanity?
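The "multiple passes" idea mentioned above might be sketched roughly as follows. This is an illustration rather than BrowserUk's code: the sub name `multipass_match` and its `$chunk_size` parameter are hypothetical, fields are split on any whitespace, and the demo uses in-memory filehandles in place of the real files.

```perl
use strict;
use warnings;

# Scan the data once per chunk of patterns, so that at most $chunk_size
# keys are held in memory at a time. All names here are illustrative.
sub multipass_match {
    my ($pat_fh, $data_fh, $out_fh, $chunk_size) = @_;
    until (eof $pat_fh) {
        # Load the next chunk of pattern keys.
        my %want;
        while (keys(%want) < $chunk_size and defined(my $line = <$pat_fh>)) {
            my @f = split ' ', $line;
            $want{"$f[0] $f[1]"} = 1 if @f >= 2;
        }
        next unless %want;

        # Rewind the data and look for this chunk's keys only.
        seek $data_fh, 0, 0 or die "seek failed: $!";
        while (defined(my $line = <$data_fh>)) {
            my @f = split ' ', $line;
            next unless @f >= 4;
            print {$out_fh} join("\t", @f[0, 1, 3]), "\n"
                if $want{"$f[0] $f[1]"};
        }
    }
    return;
}

# Tiny demo with in-memory "files" (made-up sample data).
my $mp_patterns = "a1 b1\na2 b2\na3 b3\n";
my $mp_data     = "a1 b1 0.1 0.2\na1 zz 0.3 0.4\na2 b2 0.5 0.6\na3 b3 0.7 0.8\n";
open my $mp_pat_fh,  '<', \$mp_patterns or die $!;
open my $mp_data_fh, '<', \$mp_data     or die $!;
multipass_match($mp_pat_fh, $mp_data_fh, \*STDOUT, 2);
```

Note that the output comes out grouped by pattern chunk rather than in data-file order, and that each extra pass rereads the whole data file, so the chunk size should be as large as memory allows.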
Re: Pattern matching across two files, Need something better than grep -f! by JavaFan (Canon) on Apr 10, 2012 at 23:48 UTC
"I have tried using grep -f, but it is taking a very long time to do this job"

`grep` is optimized to do one job well. Unless you can take shortcuts because you know something special about the input, it's unlikely that any Perl solution is going to beat the `grep` one. For instance, if the blocks of 18000 lines which share the first "token" are in the same order as the entries in `pattern.txt`, you can use this fact and make a much faster solution than just trying to match every line with every other.
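The shortcut described above could look something like this sketch. It assumes, and this is an assumption rather than something the thread confirms, that the data blocks appear in exactly the same order as the lines of `pattern.txt` and that every pattern has exactly one matching data line; if a pattern has no match, this simple version never advances past it. The sub name `ordered_match` is hypothetical, and the demo uses in-memory filehandles.

```perl
use strict;
use warnings;

# Walk both inputs in step: only the current pattern is compared against
# each data line, so no hash of all patterns is needed.
sub ordered_match {
    my ($pat_fh, $data_fh, $out_fh) = @_;

    # Fetch the next (col1, col2) pair from the pattern file, or nothing at EOF.
    my $next_pattern = sub {
        while (defined(my $line = <$pat_fh>)) {
            my @f = split ' ', $line;
            return @f[0, 1] if @f >= 2;
        }
        return;
    };

    my ($want1, $want2) = $next_pattern->() or return;
    while (defined(my $line = <$data_fh>)) {
        my @f = split ' ', $line;
        next unless @f >= 4 && $f[0] eq $want1 && $f[1] eq $want2;
        print {$out_fh} join("\t", @f[0, 1, 3]), "\n";
        # Move on to the next pattern; stop early once all are found.
        ($want1, $want2) = $next_pattern->() or last;
    }
    return;
}

# Tiny demo with in-memory "files" (made-up sample data).
my $om_patterns = "a1 b1\na2 b2\n";
my $om_data     = "a1 zz 0.1 0.2\na1 b1 0.3 0.4\na2 b2 0.5 0.6\na2 qq 0.7 0.8\n";
open my $om_pat_fh,  '<', \$om_patterns or die $!;
open my $om_data_fh, '<', \$om_data     or die $!;
ordered_match($om_pat_fh, $om_data_fh, \*STDOUT);
```

A side benefit of the ordering assumption is that the scan can stop as soon as the last pattern has been matched, instead of reading the data file to the end.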
Re: Pattern matching across two files, Need something better than grep -f! by pvaldes (Chaplain) on Apr 11, 2012 at 00:12 UTC
Sorry, but I'm afraid I don't understand the problem here. I would just suggest that you choose your variable names carefully:

`if($split[0] eq $split1[0] && $split[1] eq $split1[1])`

Lines like this are artificially hard to read. It's really easy to miss the "1" before the "[". You'll also realize that giving a variable the same name as a common function was not such a great idea when you need to debug or expand your code some months later.
Re: Pattern matching across two files, Need something better than grep -f! by snape (Pilgrim) on Apr 11, 2012 at 08:20 UTC
Try a hash table for comparing the two files.

#!/usr/bin/perl
use strict;
use warnings;

my %data1; ## Hash for Pattern File
open IN1, 'pattern.txt' or die $!;
while (<IN1>){
    chomp $_;
    my @line = split('\t',$_);
    $data1{$line[0]} = $line[1];
}
close (IN1);

open IN2, 'data' or die $!;
while( <IN2> ) {
    chomp $_;
    my @line = split('\t',$_);
    ## match on both columns, not just the first
    if (exists $data1{$line[0]} && $data1{$line[0]} eq $line[1]){
        print $line[0],"\t", $line[1],"\t",$line[2],"\t",$line[3], "\n";
    }
}
close IN2;
Re^2: Pattern matching across two files, Need something better than grep -f! by Anonymous Monk on Feb 18, 2013 at 13:13 UTC
How can I write code for files that look like this?

pattern.txt:

AT1G48210
AT1G48240
AT1G48260
AT1G48330
AT1G48370
AT1G48440
AT1G48450
AT1G01073

data.txt:

AT1G01010 Bra033296 . .
AT1G01020 Bra033295 . .
AT1G01030 Bra033294 . .
AT1G01040 Bra033293 . .
AT1G01046 . . .
AT1G01050 Bra033292 Bra032616 .
AT1G01060 Bra033291 . .
AT1G01070 Bra033290 Bra032617 .
AT1G01073 . . .
AT1G01080 Bra033287 . .
AT1G01090 Bra033286 Bra032619 .
AT1G01100 Bra033285 Bra032620 .
AT1G01110 Bra033284 . .
AT1G01115 . . .
AT1G01120 Bra033283 Bra032621 .
AT1G01130 Bra033282 Bra032622 .
AT1G01140 Bra033282 Bra032622 .
AT1G01150 . . .
AT1G01160 Bra033281 Bra032623 .
AT1G01170 Bra033280 . .
AT1G01180 Bra033279 . .
AT1G01183 . . .
AT1G01190 Bra033278 Bra032624 .

I need to find the lines that contain only "dots" and, for each one, pull out the closest lines above and below it that have at least one "Bra..." entry rather than a "dot". The file can contain more than one dots-only line close together. Please help, I'm so stuck!
Re^3: Pattern matching across two files, Need something better than grep -f! by Anonymous Monk on Feb 21, 2013 at 04:58 UTC
"how can i make a code ..."

SMOP
Re: Pattern matching across two files, Need something better than grep -f! by BrowserUk (Patriarch) on Apr 10, 2012 at 18:41 UTC
How many lines are there in the pattern file?
Re: Pattern matching across two files, Need something better than grep -f! by olus (Curate) on Apr 11, 2012 at 15:16 UTC
Depending on the size of the files, there may be some gain or loss in time from sorting both files and then iterating through the source patterns while working on successive manageable chunks of the data file. This will bring extra complexity to the script, but the issue is speed.
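One way to read the sort-then-iterate suggestion above is as a merge join over two pre-sorted inputs, sketched below. It assumes both inputs have already been sorted by the same "col1 col2" key in plain string (`cmp`) order, which for an 8 GB file would itself be a costly external step not shown here. The sub name `merge_join` is hypothetical, and the demo uses small pre-sorted in-memory "files".

```perl
use strict;
use warnings;

# Merge join: advance whichever side has the smaller key, print on equality.
sub merge_join {
    my ($pat_fh, $data_fh, $out_fh) = @_;

    # Return ("col1 col2", \@fields) for the next usable line, or nothing at EOF.
    my $read_row = sub {
        my ($fh, $min_fields) = @_;
        while (defined(my $line = <$fh>)) {
            my @f = split ' ', $line;
            return ("$f[0] $f[1]", \@f) if @f >= $min_fields;
        }
        return;
    };

    my ($pkey)        = $read_row->($pat_fh, 2);
    my ($dkey, $drow) = $read_row->($data_fh, 4);
    while (defined $pkey && defined $dkey) {
        my $cmp = $pkey cmp $dkey;
        if ($cmp < 0) {
            ($pkey) = $read_row->($pat_fh, 2);           # pattern is behind
        }
        elsif ($cmp > 0) {
            ($dkey, $drow) = $read_row->($data_fh, 4);   # data is behind
        }
        else {
            print {$out_fh} join("\t", @{$drow}[0, 1, 3]), "\n";
            ($dkey, $drow) = $read_row->($data_fh, 4);
        }
    }
    return;
}

# Tiny demo: both inputs are already sorted by key (made-up sample data).
my $mj_patterns = "a1 x1\nb2 y2\n";
my $mj_data     = "a1 w0 0.1 0.2\na1 x1 0.3 0.4\nb2 y2 0.5 0.6\nc3 z3 0.7 0.8\n";
open my $mj_pat_fh,  '<', \$mj_patterns or die $!;
open my $mj_data_fh, '<', \$mj_data     or die $!;
merge_join($mj_pat_fh, $mj_data_fh, \*STDOUT);
```

Whether this beats a single hash-lookup pass depends heavily on the cost of the sort, which is the trade-off the reply above points to.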