Hello Monks, I need a little help with searching and matching the contents of huge files. I have two files. The first (the snp file) contains ids and a score and is about 200,000 lines long. The second (the map file) contains ids, genes and a score and is 3 to 4 times larger than the first. I need to look up each id from the first file in the second and, if there is a match, print that line of the second file into a new file. I wrote a program to do it, but it takes days to complete, so I need your help optimizing it to make it reasonably fast.
My files are (both are tab-delimited):
First file:
#snp file
snp_rs log_1_pval
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs7837688 9.85374546064131
rs4794737 9.41576680248523
rs10033399 9.36407447191822
rs4242382 9.22809709356544
rs4242384 8.91767075801336
rs9656816 8.61480602028324
rs982354 8.40833878650415
rs31226 8.38047936810042
......... .........
Second file:
#Map file
rs10904494 NP_817124 17881
rs7837688 NP_817124 39800
rs4881551 ZMYND11 21567
rs7909028 ZMYND11 5335
rs10499549 ZMYND11 0
rs12779173 ZMYND11 0
rs2448370 ZMYND11 0
rs2448366 ZMYND11 0
rs2379078 ZMYND11 0
rs3749375 ZMYND11 0
......... ....... .
My new file should look like this:
rs3749375 ZMYND11
rs10499549 ZMYND11
rs7837688 NP_817124
I can't use binary search, as the files are not sorted, and I also don't know how to do that.
Have a look at my code here:
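# This program gets snps and their genes.
use strict;
use warnings;

open(my $snp_fh, '<', "D:\\gsea.chi2") or die("File can't be opened: $!");
my $header = <$snp_fh>;    # skip the header line
while (my $line = <$snp_fh>) {
    my @snps = split(/\t/, $line);
    pop(@snps);    # drop the score column, leaving only the id
    foreach my $id (@snps) {
        #print "$id \n";
        search($id);
    }
}
close($snp_fh);

# Print the first map line whose id matches the given snp id.
# Note: this rereads the whole map file for every snp id.
sub search {
    my $snpid = $_[0];
    #print "$snpid \n";
    open(my $map_fh, '<', "D:\\gsea1.SNPGENEMAP") or die("File can't be opened: $!");
    my @map = <$map_fh>;
    close($map_fh);
    foreach my $mapid (@map) {
        # \Q..\E quotes the id and \t ends it, so an id cannot match
        # a longer id that merely starts with the same characters.
        if ($mapid =~ m/^\Q$snpid\E\t/i) {
            print $mapid;
            last;
        }
    }
}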
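One direction I have wondered about, but have not tried on the real files, is to read the map file only once into a hash keyed by id and then look each snp id up in that hash. Below is only a rough sketch of that idea, run on a few sample lines held in memory instead of the real files. The sub names index_map and match_snps are made up for illustration, and it assumes the ids match exactly (including case), that the first column is always the id, and that the whole map file fits in memory.

```perl
use strict;
use warnings;

# Build a lookup table from the map lines: id => first line with that id.
# Takes a reference to an array of tab-delimited lines.
sub index_map {
    my ($map_lines) = @_;
    my %line_for;
    for my $line (@$map_lines) {
        chomp(my $copy = $line);
        my ($id) = split /\t/, $copy;
        next unless defined $id && length $id;
        # Keep only the first line per id, like the `last` in my loop.
        $line_for{$id} = $copy unless exists $line_for{$id};
    }
    return \%line_for;
}

# Return the matching map lines for the snp ids, in snp-file order.
sub match_snps {
    my ($snp_lines, $line_for) = @_;
    my @found;
    for my $line (@$snp_lines) {
        chomp(my $copy = $line);
        my ($id) = split /\t/, $copy;
        next unless defined $id && exists $line_for->{$id};
        push @found, $line_for->{$id};
    }
    return @found;
}

# Tiny in-memory sample taken from the files above (header already skipped).
my @snp = (
    "rs3749375\t11.7268615355335\n",
    "rs10499549\t10.4656064706897\n",
    "rs7837688\t9.85374546064131\n",
    "rs31226\t8.38047936810042\n",
);
my @map = (
    "rs10904494\tNP_817124\t17881\n",
    "rs7837688\tNP_817124\t39800\n",
    "rs4881551\tZMYND11\t21567\n",
    "rs10499549\tZMYND11\t0\n",
    "rs3749375\tZMYND11\t0\n",
);

print "$_\n" for match_snps(\@snp, index_map(\@map));
```

On this sample it prints the map lines for rs3749375, rs10499549 and rs7837688, in that order, and skips rs31226 because it is not in the sample map lines. I don't know whether this is the right approach for the full files, so corrections are welcome.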
Thank you very much in advance.