http://www.perlmonks.org?node_id=964371

anasuya has asked for the wisdom of the Perl Monks concerning the following question:

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx. 8 GB in size) which has lines like this:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as the sample above suggests. It is so large because roughly 18000 consecutive lines share the same string in the first column: 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400, and so on. Only one line in each such group matches a given pattern from pattern.txt. I am trying to match the lines in pattern.txt against data, and for each match I want to print the first, second and fourth columns of the data line:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
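At the moment I am using something like this in Perl:
use warnings;

open AS, "data"        or die "Cannot open data: $!";
open AQ, "pattern.txt" or die "Cannot open pattern.txt: $!";
@arr  = <AS>;    # reads the whole data file into memory
@arr1 = <AQ>;    # reads the whole pattern file into memory
foreach $line (@arr) {
    @split = split(' ', $line);
    foreach $line1 (@arr1) {
        @split1 = split(' ', $line1);
        if ($split[0] eq $split1[0] && $split[1] eq $split1[1]) {
            # the value to print is the fourth column of the data line
            print $split[0], "\t", $split[1], "\t", $split[3], "\n";
        }
    }
}
close AQ;
close AS;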
I have tried using grep -f, but it takes a very long time to do this job. How do I modify the existing code to use something like this:
while ($line = <AQ>) {              # filehandle for the pattern file
    while ($line_data = <AS>) {
        # do the matching here?
    }
}
I want this code to run as fast as possible. Please help.
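
For illustration, one common way to avoid the nested loop is to load the small pattern file into a hash and then read the large data file only once, line by line. The following is a minimal sketch of that idea, not a tested solution for the real files. It assumes that pattern.txt fits comfortably in memory, that columns are separated by whitespace, and that output in the order of the data file (rather than the order of pattern.txt) is acceptable. The sub names load_patterns and filter_data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read every "id1 id2" pair from the pattern file
# into a hash so that each later lookup takes constant time.
sub load_patterns {
    my ($pattern_file) = @_;
    my %wanted;
    open my $pat_fh, '<', $pattern_file
        or die "Cannot open $pattern_file: $!";
    while (my $line = <$pat_fh>) {
        my ($id1, $id2) = split ' ', $line;
        next unless defined $id2;          # skip blank or malformed lines
        $wanted{"$id1 $id2"} = 1;
    }
    close $pat_fh;
    return \%wanted;
}

# Hypothetical helper: stream the data file once and print columns
# 1, 2 and 4 (tab-separated) of every line whose first two columns
# form a pair found in the pattern hash.
sub filter_data {
    my ($data_file, $wanted, $out_fh) = @_;
    open my $data_fh, '<', $data_file
        or die "Cannot open $data_file: $!";
    while (my $line = <$data_fh>) {
        my @col = split ' ', $line;
        next unless @col >= 4;             # assumption: four columns per line
        next unless exists $wanted->{"$col[0] $col[1]"};
        print {$out_fh} join("\t", @col[0, 1, 3]), "\n";
    }
    close $data_fh;
}

# Run only when called as: perl script.pl pattern.txt data
if (@ARGV == 2) {
    filter_data($ARGV[1], load_patterns($ARGV[0]), \*STDOUT);
}
```

Because the data file is read with a while loop instead of being slurped into an array, only one line of the 8 GB file is held in memory at a time, and each line costs a single hash lookup instead of a pass over every pattern.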