split a file into records and process it

biohisham has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am really stuck trying to work out the best data structure for extracting information from a generic-format file. The text file was a mess, but I got it to look like what follows:

#### The file has the following headers for each record###
Exon #
Gene id
Nm_id
snoRNA Key
text Sequence  Query, subject  
Gene name and weblink

                      ##Start the records###
3
GI:91982771
NM_001040105.1  
snoRNA 10
Query  4     TGGAGTCAAT  13
             ||||||||||
Sbjct  4854  TGGAGTCAAT  4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1  
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
31
4 different hits
GI:153945877
NM_002458.1  
snoRNA 25, 26 and 27
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  6818  CTGGAGTCGAGTG  6806
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  8489  CTGGAGTCGAGTG  8477
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  10589  CTGGAGTCGAGTG  10577
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  12260  CTGGAGTCGAGTG  12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877
4
GI:150418008
NM_206862.2  
snoRNA 25, 26 and 27
Query  1     ACCTGGAGTCGAG  13
             |||||||||||||
Sbjct  4775  ACCTGGAGTCGAG  4763
Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008

Each record starts with a number on a line of its own (the number is not unique) and ends with a weblink. I tried to use that attribute as a record separator, but splitting on the number with a regexp does not work for me: "split /^\d+$/;" consumes the number and leaves an empty gap where it was, so I cannot use it any further. Instead I tried range matching.
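For reference, here is a minimal sketch of one way around the split problem. It assumes the whole file fits in memory and that only exon-number lines consist solely of digits. A zero-width lookahead lets split break before each number line without consuming it. The sample text is two records from the listing above with the wrapped URLs rejoined; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two records copied from the sample data (wrapped URLs rejoined).
my $text = <<'END';
3
GI:91982771
NM_001040105.1
snoRNA 10
Query  4     TGGAGTCAAT  13
             ||||||||||
Sbjct  4854  TGGAGTCAAT  4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
END

# Split *before* every digits-only line. The lookahead matches without
# consuming anything, so the exon number stays at the top of its record.
my @records = split /^(?=\d+[ \t]*$)/m, $text;

printf "%d records\n", scalar @records;
for my $record (@records) {
    my ($exon) = $record =~ /\A(\d+)/;               # first line of the record
    my ($key)  = $record =~ /^(snoRNA .*?)\s*$/m;    # the snoRNA line
    print "exon $exon, key '$key'\n";
}
```

Each element of @records is then one complete multi-line record, which can be picked apart with further regexes or split again on newlines.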

I read this data from a file and processed it into the above form. Now I want to use each line that starts with snoRNA as a hash key that maps to the rest of its record, organized by the headers. Each key can have more than one instance of the following associated with it:

Exon #
Gene id
Nm_id
text Sequence Query subject
Gene name and weblink
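One structure that fits this description is a hash of arrays of hashes: the snoRNA line is the key, and every record that shares it is pushed onto that key's list. The sketch below is only an illustration under assumptions: the field names (exon, gene_id, nm_id, alignment, gene_name, url) and the variable %by_snorna are made up for the example, and lines that match none of the patterns (such as "4 different hits" or the row of pipes) are simply skipped. The two sample records come from the listing above and share one key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two records from the sample data that share the same snoRNA key.
my $text = <<'END';
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
4
GI:150418008
NM_206862.2
snoRNA 25, 26 and 27
Query  1     ACCTGGAGTCGAG  13
             |||||||||||||
Sbjct  4775  ACCTGGAGTCGAG  4763
Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008
END

my %by_snorna;    # snoRNA line => [ { one record's fields }, ... ]

# Break the text before each digits-only line, then classify every line.
for my $record ( split /^(?=\d+[ \t]*$)/m, $text ) {
    my ( %entry, $key );
    for my $line ( split /\n/, $record ) {
        $line =~ s/\s+$//;    # drop trailing blanks
        if    ( $line =~ /^(\d+)$/ )             { $entry{exon}      = $1 }
        elsif ( $line =~ /^GI:(\d+)/ )           { $entry{gene_id}   = $1 }
        elsif ( $line =~ /^(NM_\d+(?:\.\d+)?)/ ) { $entry{nm_id}     = $1 }
        elsif ( $line =~ /^snoRNA\b/ )           { $key              = $line }
        elsif ( $line =~ /^(?:Query|Sbjct)\s/ )  { push @{ $entry{alignment} }, $line }
        elsif ( $line =~ /^Homo sapiens/ )       { $entry{gene_name} = $line }
        elsif ( $line =~ /^http:/ )              { $entry{url}       = $line }
    }
    push @{ $by_snorna{$key} }, \%entry if defined $key;
}

# Walk the structure: one key, several records.
for my $key ( sort keys %by_snorna ) {
    print "$key\n";
    print "  exon $_->{exon}  GI:$_->{gene_id}  $_->{nm_id}\n"
        for @{ $by_snorna{$key} };
}
```

Keeping the records in an array under each key preserves their file order, and a non-unique exon number is harmless because it is stored as a field rather than used as a key.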

My approach to solving this problem may itself be the reason I am stuck, so I implore you to help me break this into records and to give me ideas on how to improve my approach.

Here is my code so far:

#!/usr/local/bin/perl
use strict;
use warnings;
open(FH, '<', "F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/test.txt") or die("$!\n");
open(FO, '>', "F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/testOut.txt") or die("$!\n");   #TESTING
my (@snoRNA, @exonNumbers, @geneID, @productID, @geneNames, @references, @queries, @subjects);
while (<FH>) {
    chomp;
    # Range matching: from an exon-number line through the weblink line.
    # The line is already chomped, so the end pattern must not expect "\n".
    if (/^\d+$/ .. /^http:/) {
        # s/\W+\n+!\W+//;
        next unless /(\w+ |\| | \n+)/x;   #except for words | pipes | \n
        print FO $_, "\n";
    }
    if (/snoRNA(\s+|\d+)[\s|-|\d]/) {     #snoRNA
        push @snoRNA, $_;
    }
    if (/^\d+$/) {                        #exon numbers
        push @exonNumbers, $_;
    }
    if (/^GI:\d+/) {                      #gene IDs
        push @geneID, $_;
    }
    if (/^NM_\d+(?:\.\d+)?/) {            #gene product ID
        my $name = $_;
        $name =~ s/\s+$//;                #strip the trailing blanks
        push @productID, $name;
    }
    if (/homo sapiens/i) {                #gene name, needs multiline support
        my $name = $_;
        push @geneNames, $name;
    }
    if (/http:.*/) {                      #web refs, need multiline support
        my $name = $_;
        push @references, $name;
    }
    if (/^Query(\s+)\d+\s+[agtc]/i) {     #prepare the query and subject arrays
        my $queryName = $_;
        $queryName =~ s/$1//;
        push @queries, $queryName;
    }
    if (/^sbjct(\s+)\d+\s+[agtc]/i) {
        my $sbjctName = $_;
        $sbjctName =~ s/$1//;
        push @subjects, $sbjctName;
    }
}
close FH;
close FO;

UPDATE: BrowserUK, I am obliged, and thanks to jethro and wfsp too. I was running in circles...

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
