Hello all,
I rarely post these days, but I'm a bit stumped currently and need more pairs of eyes to examine my code to see where the problem lies. My code "works" to some degree, in that it gives me an output, but the program doesn't end. Like with anything I do, I really need to make sure that I'm getting the correct data.

Background: I have two files, one in excess of 300k lines long. I need to split the lines, compare the values, and look for matches so that I can pull the gene name from the annotation file and add it to the original values from the data file. Essentially, its a glorified V-lookup. The catch is that I'm not using pattern matching, but equality based on the "window" values, or numbers that tell me where the feature would be found on the genome.

My guess is that I have the flow of the operations mixed up somewhere, possibly in the main loop. If you have any suggestions, I would be grateful. I also realize that I could have gone about this using a binary search or something similar, but I figured that this would be the simplest and quickest way to get at the results. oops :)
Thanks so much!!!
#!/usr/bin/perl -w use strict; our (@window, @probe)=(); our @main = &open_file_main(); our @annot = &open_file_annot(); &outfile; # loop through the main data file OLC: foreach my $md (@main) { # remove newlines chomp $md; # pull out chromosome #, window start, end my ($main_chrom, $winl, $winr) = split(/\t/, $md); # put the window start, end into array for further processing @window = ($winl, $winr); # loop through the "annotation" file ILC: foreach my $ad (@annot) { # pull out the chromosome #, window start, end my ($an_chrom, undef, undef, $prol, $pror, undef, undef, undef +, $mess) = split(/\t/, $ad); # make sure the chromosomes match to save processor time, skip + if no match if ($main_chrom ne $an_chrom) {next ILC;} # get the gene name fromt the $mess variable my (undef, undef, $name) = split(/\;/, $mess); # load the window start sites for the individual probes @probe = ($prol, $pror); # call the range_finding sub to look for matches my $return = range_find(); if ($return eq 1) { # upon matching, print out the name of the gene along with + the original values # and print OUTPUT "$name\t $md\n"; next OLC; } else {next ILC;} } } close OUTPUT; exit; sub open_file_main { # open the file, pull in the data print " What is the name of the ChIPOTle file\?"; chomp (my $file = <STDIN>); open (FILE, $file) || die "Cannot open $!"; my @data = <FILE>; close FILE; return @data; } sub open_file_annot { # open the file, pull in the data print " What is the name of the annotation file\?"; chomp (my $file = <STDIN>); open (FILE, $file) || die "Cannot open $!"; my @data = <FILE>; close FILE; return @data; } sub outfile { # open the outputfile open (OUTPUT, ">output.txt")|| die "Cannot open $!"; } sub calc_range { # simple sub to create the full window from the start + and end sites my @peaks = @_; my @peak_range = ($peaks[0] .. $peaks[1]); return @peak_range; } sub range_find { # loop in loop to look for ANY overlap of the values; + will be a true/false return my @range1 = &calc_range(@window); my @range2 = &calc_range(@probe); my $test = pop @range2; my $test2 = pop @range1; OLC2: foreach my $value1 (@range1) { # look to see if the windows don't overlap if ($test lt $window[0]) {return 0;} # look to see if the windows don't overlap elsif ($test2 lt $probe[0]) {return 0;} ILC2: foreach my $value2 (@range2) { if ( $value1 eq $value2) {return 1;} else {next ILC2;} } } return 0; }
Main data file:
chr10 726178 726428 1.121753867297440e-012 1.1607063976283 +87e-015 6 4.81000 chr10 4922028 4922428 2.163402534569952e-012 7.06701849934 +5696e-014 9 4.43000 chr10 5126478 5127178 8.348017333255820e-013 8.24324954124 +5406e-021 15 4.42000 chr10 14649778 14650028 2.090598548899472e-013 6.000728326 +245207e-018 6 5.04000 chr10 14651328 14651428 2.915017653920210e-012 1.890051416 +361198e-014 3 4.20000
Annotation Data:
chr1 NGS primary_transcript 4224 7502 -1 - . I +D=00003; accession=BC063682; Name=FLJ25222; ncbi_gene_id=374666; syno +nyms=-; description=CXYorf1-related protein; url=http://www.ncbi.nlm. +nih.gov/entrez/query.fcgi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC0 +63682 chr1 NGS transcription_start_site 7502 7502 -1 - +. Parent=00003; accession=BC063682; Name=FLJ25222; ncbi_gene_id=37 +4666; synonyms=-; description=CXYorf1-related protein chr1 NGS primary_transcript 4268 7438 -2 - . I +D=00005; accession=BC073913; Name=MGC52000; ncbi_gene_id=375260; syno +nyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXYo +rf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fcg +i?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC073913 chr1 NGS transcription_start_site 7438 7438 -2 - +. Parent=00005; accession=BC073913; Name=MGC52000; ncbi_gene_id=37 +5260; synonyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; descri +ption=CXYorf1-related protein chr1 NGS primary_transcript 4268 14754 -3 - . +ID=00004; accession=BC048328; Name=MGC52000; ncbi_gene_id=375260; syn +onyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXY +orf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fc +gi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC048328 chr1 NGS transcription_start_site 14754 14754 -3 - + . Parent=00004; accession=BC048328; Name=MGC52000; ncbi_gene_id= +375260; synonyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; desc +ription=CXYorf1-related protein chr1 NGS primary_transcript 4268 19697 -4 - . +ID=00006; accession=BC110996; Name=MGC52000; ncbi_gene_id=375260; syn +onyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXY +orf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fc +gi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC110996
Bioinformatics

In reply to Faulty Control Structures? by bioinformatics

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.