biohisham has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks, I am really stuck trying to work out the best data structure for extracting information from a generically formatted text file. The original file was a mess, but I have cleaned it up to look like what follows.
Each record starts with a number on a separate line, and this number is not unique; each record also ends in a weblink. I tried to use the number as a record separator, but splitting on it with a regexp does not work: "split /^\d+$/;" removes the number itself, leaving a gap where it was, so I cannot use it any further. Instead I tried range matching.
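For illustration, here is a minimal sketch of one possible way to split on the number without losing it: a zero-width lookahead makes split cut just before each digits-only line, so the number stays at the head of its record. The sample text and the example.org URLs are shortened placeholders, not the real file, and the sketch assumes the number lines carry no trailing whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample: two shortened records in the layout described above.
my $text = join "\n",
    '3', 'GI:91982771',  'NM_001040105.1', 'snoRNA 10',            'http://example.org/a',
    '4', 'GI:150418008', 'NM_206862.2',    'snoRNA 25, 26 and 27', 'http://example.org/b';
$text .= "\n";

# Split *before* each line that holds only digits. The lookahead matches
# without consuming anything, so the exon number is kept in the record.
my @records = grep { length } split /^(?=\d+$)/m, $text;

for my $record (@records) {
    my ($exon) = $record =~ /\A(\d+)$/m;     # first line of the record
    my @lines  = split /\n/, $record;
    print "exon $exon: ", scalar(@lines), " lines\n";
}
```

A line such as "4 different hits" does not trigger a split here, because the pattern requires the digits to be the whole line.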
#### The file has the following headers for each record###
Exon # Gene id Nm_id snoRNA Key text Sequence Query, subject Gene name and weblink
##Start the records###
3
GI:91982771
NM_001040105.1
snoRNA 10
Query 4 TGGAGTCAAT 13
||||||||||
Sbjct 4854 TGGAGTCAAT 4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query 2 CCTGGAGTCGAGTG 15
||||||||||||||
Sbjct 146 CCTGGAGTCGAGTG 133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
31
4 different hits
GI:153945877
NM_002458.1
snoRNA 25, 26 and 27
Query 3 CTGGAGTCGAGTG 15
|||||||||||||
Sbjct 6818 CTGGAGTCGAGTG 6806
Query 3 CTGGAGTCGAGTG 15
|||||||||||||
Sbjct 8489 CTGGAGTCGAGTG 8477
Query 3 CTGGAGTCGAGTG 15
|||||||||||||
Sbjct 10589 CTGGAGTCGAGTG 10577
Query 3 CTGGAGTCGAGTG 15
|||||||||||||
Sbjct 12260 CTGGAGTCGAGTG 12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877
4
GI:150418008
NM_206862.2
snoRNA 25, 26 and 27
Query 1 ACCTGGAGTCGAG 13
|||||||||||||
Sbjct 4775 ACCTGGAGTCGAG 4763
Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008
I read this data from a file and processed it into the above form. However, I want to use each line that starts with snoRNA as a hash key that corresponds to a record entity relative to the headers. Each key can have more than one instance of the following associated with it:
- Exon #
- Gene id
- Nm_id
- text Sequence Query subject
- Gene name and weblink
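One data structure that fits this description is a hash of arrays of hashes: the snoRNA line is the key, and each key holds a list of records, one hash per record. The following is only a sketch under assumptions: the records are already split into strings, the field names (exon, gene_id, nm_id, queries, subjects, gene_name, link) are made up for illustration, and the sample data and example.org URLs are shortened placeholders.

```perl
use strict;
use warnings;

# Hypothetical, shortened records; one string per record.
my @records = (
    "3\nGI:91982771\nNM_001040105.1\nsnoRNA 10\n"
  . "Query 4 TGGAGTCAAT 13\nSbjct 4854 TGGAGTCAAT 4845\n"
  . "Homo sapiens mucin 17 (MUC17), mRNA.\nhttp://example.org/1\n",
    "3\nGI:154448895\nNM_001100162.1\nsnoRNA 25, 26 and 27\n"
  . "Query 2 CCTGGAGTCGAGTG 15\nSbjct 146 CCTGGAGTCGAGTG 133\n"
  . "Homo sapiens exportin 7 (XPO7), mRNA.\nhttp://example.org/2\n",
    "4\nGI:150418008\nNM_206862.2\nsnoRNA 25, 26 and 27\n"
  . "Query 1 ACCTGGAGTCGAG 13\nSbjct 4775 ACCTGGAGTCGAG 4763\n"
  . "Homo sapiens TACC2, mRNA.\nhttp://example.org/3\n",
);

my %by_snorna;    # snoRNA line => [ { record }, { record }, ... ]

for my $record (@records) {
    my %entry = (queries => [], subjects => []);
    my $key;
    for my $line (split /\n/, $record) {
        if    ($line =~ /^(\d+)$/)             { $entry{exon}      = $1 }
        elsif ($line =~ /^(GI:\d+)/)           { $entry{gene_id}   = $1 }
        elsif ($line =~ /^(NM_[\d.]+)/)        { $entry{nm_id}     = $1 }
        elsif ($line =~ /^snoRNA\s+(.+?)\s*$/) { $key = "snoRNA $1" }
        elsif ($line =~ /^Query\s+(.*)/)       { push @{ $entry{queries} },  $1 }
        elsif ($line =~ /^Sbjct\s+(.*)/)       { push @{ $entry{subjects} }, $1 }
        elsif ($line =~ /^Homo sapiens/i)      { $entry{gene_name} = $line }
        elsif ($line =~ /^http:/)              { $entry{link}      = $line }
    }
    # Records without a snoRNA line are skipped in this sketch.
    push @{ $by_snorna{$key} }, \%entry if defined $key;
}

for my $key (sort keys %by_snorna) {
    printf "%s: %d record(s)\n", $key, scalar @{ $by_snorna{$key} };
}
```

Because queries and subjects are arrays, a record with several hits (like the one marked "4 different hits") keeps all of its Query/Sbjct pairs in order.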
My approach to solving this problem may be the reason I am stuck, so I implore you to help me break this into records and to give me ideas on how I can improve my approach.
Here is my code so far:
#!/usr/local/bin/perl
use strict;
use warnings;

open(FH, '<', "F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/test.txt") or die("$!\n");
open(FO, '>', "F:/Bioinformatics_NCBI/20MARCH_10/PERL Analysis/testOut.txt") or die("$!\n"); #TESTING

my (@snoRNA, @exonNumbers, @geneID, @productID, @geneNames, @references, @queries, @subjects);

while (<FH>) {
    chomp;
    if (/(?=^\d+$)/ .. /(?=http:.*)/) {      #range matching: exon number line through weblink line
        # s/\W+\n+!\W+//;
        next unless /(\w+ |\| | \n+)/x;      #except for words | pipes | \n
        print FO $_, "\n";
    }
    if (/snoRNA(\s+|\d+)[\s|-|\d]/) {        #snoRNA
        push @snoRNA, $_;
    }
    if (/^\d+$/) {                           #exon Numbers
        push @exonNumbers, $_;
    }
    if (/^GI:\d+[\.\d+]/) {                  #gene IDs
        push @geneID, $_;
    }
    if (/^NM_\d+[\.\d+]/) {                  #gene product ID
        my $name = $_;
        $name =~ s/\s+$//;                   #strip the trailing blanks..
        push @productID, $name;
    }
    if (/homo sapiens[\w+\W+]/i) {           #gene name, Need MultiLine support..
        my $name = $_;
        push @geneNames, $name;
    }
    if (/http:.*/) {                         #web refs, need multiline support..
        my $name = $_;
        push @references, $name;
    }
    if (/^Query(\s+)\d+\s+[agtc]/i) {        #Prepare the query and subject arrays
        my $queryName = $_;
        $queryName =~ s/$1//;                #remove the whitespace captured after "Query"
        push @queries, $queryName;
    }
    if (/^sbjct(\s+)\d+\s+[agtc]/i) {
        my $sbjctName = $_;
        $sbjctName =~ s/$1//;                #remove the whitespace captured after "Sbjct"
        push @subjects, $sbjctName;
    }
}

close(FH);
close(FO);
UPDATE: BrowserUK, I am obliged, and thanks to jethro and wfsp too. I was running in circles...
Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.