Dear All,
I am trying to parse a file in a while loop and print some parameters captured by regular expressions.
Below are my code and my data file.
my $filename = "test.summary";
open (IN, "<", $filename) or die "Check the summary file. $!\n";
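my ($gene_length, $definition, $accession, $gi_number, $gene_id);
while (my $line = <IN>) {
    chomp $line;
    if ($line =~ /^LOCUS\s+\w+\d+\s+(\d+)\sbp/) {
        $gene_length = $1;
    }
    if ($line =~ /^DEFINITION\s+(.*)/s) {
        $definition = $1;    # only the first line of the definition
    }
    # \S+ takes the first accession even when nothing follows it on the line
    if ($line =~ /^ACCESSION\s+(\S+)/) {
        $accession = $1;
    }
    if ($line =~ /\s+\/db_xref="GI:(\d+)"/) {
        $gi_number = $1;
    }
    if ($line =~ /\s+\/db_xref="GeneID:(\d+)"/) {
        $gene_id = $1;
    }
}
close IN;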
Data file:
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//
My questions:
a) How can I parse the multi-line DEFINITION inside the while loop? My regular expression captures only its first line.
b) Could I get some help capturing the content of the CDS block and then parsing its individual entries one by one (GI, GeneID, etc.)?
I am trying to learn plain Perl, so I am not using the BioPerl module for this.
Regards
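
One possible way to approach both questions, offered as a minimal sketch rather than a complete GenBank parser: remember which keyword or qualifier was seen last, and append every indented continuation line to it. The sketch assumes the usual GenBank column layout (keywords in column 1, feature keys such as CDS indented by 5 spaces, qualifiers and continuation lines indented further) and that each /db_xref fits on one line. The names parse_record and $sample are made up for this illustration; the sample is a shortened copy of the record above.

```perl
use strict;
use warnings;

# Parse one GenBank-style record given as a list of lines.
# Returns a hash ref of top-level fields plus a 'CDS' sub-hash.
sub parse_record {
    my @lines = @_;
    my (%rec, %cds);
    my ($mode, $key) = ('', '');    # $mode is 'field', 'cds' or 'skip'

    for my $line (@lines) {
        chomp $line;
        last if $line =~ m{^//};                       # end of record

        if ($line =~ /^([A-Z]+)\s*(.*)/) {             # keyword in column 1
            ($mode, $key) = ('field', $1);
            $rec{$key} = $2;
        }
        elsif ($line =~ /^ {2,3}([A-Z]+)\s+(.*)/) {    # sub-keyword, e.g. ORGANISM
            ($mode, $key) = ('field', $1);
            $rec{$key} = $2;
        }
        elsif ($line =~ /^ {5}(\S+)\s+(\S+)/) {        # feature key and location
            ($mode, $key) = ($1 eq 'CDS' ? 'cds' : 'skip', '');
            $cds{location} = $2 if $mode eq 'cds';
        }
        elsif ($mode eq 'cds' && $line =~ m{^\s+/(\w+)=(.*)}) {   # new qualifier
            my ($name, $value) = ($1, $2);
            if ($name eq 'db_xref') {                  # "GI:148233338" -> GI => 148233338
                $value =~ tr/"//d;
                my ($db, $id) = split /:/, $value, 2;
                $cds{db_xref}{$db} = $id;
                $key = '';
            }
            else {
                ($key, $cds{$name}) = ($name, $value);
            }
        }
        elsif ($line =~ /^\s+(\S.*)/ && $key ne '') {  # continuation line
            if ($mode eq 'field') {
                $rec{$key} .= " $1";                   # e.g. second DEFINITION line
            }
            elsif ($mode eq 'cds') {
                # sequence chunks are joined directly, other text with a space
                $cds{$key} .= ($key eq 'translation' ? '' : ' ') . $1;
            }
        }
    }

    # strip the surrounding double quotes from plain qualifier values
    for my $name (grep { !ref $cds{$_} } keys %cds) {
        $cds{$name} =~ s/^"|"$//g;
    }
    $rec{CDS} = \%cds;
    return \%rec;
}

# Shortened sample record (translation truncated, as in the post)
my $sample = <<'END';
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
     CDS             269..2614
                     /gene="CTNNB1"
                     /codon_start=1
                     /product="catenin beta-1"
                     /db_xref="GI:148233338"
                     /db_xref="GeneID:1499"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
//
END

my $rec = parse_record(split /^/, $sample);
my $cds = $rec->{CDS};
my ($gene_length) = $rec->{LOCUS}     =~ /(\d+)\s+bp/;
my ($accession)   = $rec->{ACCESSION} =~ /^(\S+)/;

print "Length:     $gene_length\n";
print "Accession:  $accession\n";
print "Definition: $rec->{DEFINITION}\n";
print "GI:         $cds->{db_xref}{GI}\n";
print "GeneID:     $cds->{db_xref}{GeneID}\n";
print "Product:    $cds->{product}\n";
```

To run this against the real file, the lines read from the filehandle can be passed in instead of the sample, for example parse_record(<IN>) for a file holding a single record. Collecting the whole record first and extracting values afterwards keeps the per-line logic small, at the cost of holding one record in memory.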