multiline in while loop and regular expression

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

I am trying to parse a file in a while loop and print the values captured by some regular expressions.

Below are my code and data file:

my $filename =  test.summary";
open (IN, "<", $filename) or die "Check the summary file. $!\n";
while (my $line = <IN>) {
    chomp $line;
    if ($line =~/^LOCUS\s+\w+\d+\s+(\d+)\sbp/) {
        $gene_length = $1;
    }
    if ($line =~/^DEFINITION\s+(.*)/s) {
        $definition = $1;
    }
    if ($line =~/^ACCESSION\s+(.*?)\s+/) {
        $accession = $1;
    }
    if ($line =~ /\s+\/db_xref="GI\:(\d+)\"/) {
        $gi_number = $1; 
    }
    if ($line =~ /\s+\/db_xref=\"GeneID\:(\d+)\"/) {
        $gene_id = $1;
    }
}

 
Data file:
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

My questions:

a) How can I parse the multiline DEFINITION in the while loop? The regular expression captures only the first line.

b) Could I get some help in capturing the content of the CDS block and then parsing the individual entries one by one (like GI, GeneID, etc.)?

I am trying to learn by using plain Perl only, so I am not using the BioPerl module for the above purpose.

Regards

Re: multiline in while loop and regular expression by GrandFather (Saint) on Nov 24, 2014 at 22:21 UTC
Is that the entire contents of the file, or is that one record of many? If there are many records, what does a record separator look like? Maybe you need to show us two records? Are ORGANISM and CDS really indented like that? Are there other field types you haven't told us about? Is the // actually part of the file, or did it just happen to "slip in" while you weren't looking?

Maybe the following parsing code will get you started:

use strict;
use warnings;

my @records;
my $currTail;
my $currField;

while (defined(my $line = <DATA>) or defined $currField) {
    my $field;
    my $tail;

    ($field, $tail) = $line =~ /^(.{10}) (.*)/ if defined $line;
    next if !defined $tail && !defined $currField;
    $field =~ tr/ //d if defined $field;
    $currField //= $field;

    if (! defined $field or (length $field && $currField ne $field)) {
        push @records, {} if $currField eq 'LOCUS';
        $records[-1]{$currField} = $currTail;
        $currField = undef;
        $currTail = undef;
        last if !defined $tail;
    }

    $currField = $field if length $field;
    push @$currTail, $tail if defined $tail;
}

for my $record (@records) {
    print "$_:\n", map{" $_\n"} @{$record->{$_}} for sort keys %$record;
}

__DATA__
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

Prints:

ACCESSION:
    NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
CDS:
    269..2614
    /gene="CTNNB1"
    /gene_synonym="armadillo; CTNNB; MRD19"
    /codon_start=1
    /product="catenin beta-1"
    /protein_id="NP_001091679.1"
    /db_xref="GI:148233338"
    /db_xref="CCDS:CCDS2694.1"
    /db_xref="GeneID:1499"
    /db_xref="HGNC:HGNC:2514"
    /db_xref="MIM:116806"
    /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
    SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
    LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
DEFINITION:
    Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
    (CTNNB1), transcript variant 2, mRNA.
KEYWORDS:
    RefSeq.
LOCUS:
    NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
ORGANISM:
    Homo sapiens
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo.
SOURCE:
    Homo sapiens (human)
VERSION:
    NM_001098209.1 GI:148233337

Perl is the programming world's equivalent of English
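The structure built above (one hash per record, each field holding an array of its lines) can then be post-processed to answer both of the original questions. What follows is only a minimal sketch, not part of GrandFather's code: the helper name summarise and the hard-coded %record are assumptions made for illustration, and the /r substitution modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical record in the shape described above:
# field name => array ref holding that field's lines.
my %record = (
    DEFINITION => [
        'Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa',
        '  (CTNNB1), transcript variant 2, mRNA.',
    ],
    CDS => [
        '269..2614',
        '  /gene="CTNNB1"',
        '  /db_xref="GI:148233338"',
        '  /db_xref="CCDS:CCDS2694.1"',
        '  /db_xref="GeneID:1499"',
    ],
);

# summarise() is a made-up helper name used only in this sketch.
sub summarise {
    my ($rec) = @_;

    # Trim each DEFINITION line and join the pieces with single spaces.
    my $definition = join ' ',
        map { s/^\s+|\s+$//gr } @{ $rec->{DEFINITION} // [] };

    # Collect every /db_xref="Name:value" qualifier from the CDS lines.
    my %xref;
    for my $line (@{ $rec->{CDS} // [] }) {
        $xref{$1} = $2 if $line =~ m{/db_xref="([^:"]+):([^"]+)"};
    }
    return ($definition, \%xref);
}

my ($definition, $xref) = summarise(\%record);
print "Definition: $definition\n";
print "$_: $xref->{$_}\n" for sort keys %$xref;
```

Keeping the qualifiers in a hash means GI, GeneID and any other db_xref entry can be looked up by name without writing one regular expression per entry.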
Re^2: multiline in while loop and regular expression by newtoperlprog (Sexton) on Nov 24, 2014 at 22:33 UTC
Dear GrandFather,

Thank you for your reply.

(a) The data file was truncated because it is very long, so only one record is shown. I am parsing one file only, although you are correct that multiple records are separated by the // delimiter.

(b) The other fields that I parsed using the loop (and which were straightforward) were LOCUS, ORGANISM, etc.

Many thanks for your reply. I will study the code and suggestions and will follow up in this discussion.

Regards
Re^3: multiline in while loop and regular expression by GrandFather (Saint) on Nov 24, 2014 at 23:58 UTC
Knowing the record separator we can do a little "better":

use strict;
use warnings;

my @records = {};

$/ = "\n//";

while (defined(my $rec = <DATA>)) {
    my %fields = $rec =~ /^(?:(?! {10}) (\S{1,10}))? (.*?(?=\n(?! {10})|\Z))/gms;

    $fields{$_} = [map {s/^\s*//; $_} split "\n", $fields{$_}] for keys %fields;
    push @records, \%fields;
}

for my $record (@records) {
    print "$_:\n", map{" $_\n"} @{$record->{$_}} for sort keys %$record;
    print "\n\n";
}

__DATA__
LOCUS       NM_001098210
DEFINITION  Homo sapiens catenin
ACCESSION   NM_001098210
VERSION     NM_001098210.1
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
     CDS             269..2614
                     /gene="CTNNB2"
//
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

Prints:

ACCESSION:
    NM_001098210
CDS:
    269..2614
    /gene="CTNNB2"
DEFINITION:
    Homo sapiens catenin
KEYWORDS:
    RefSeq.
LOCUS:
    NM_001098210
ORGANISM:
    Homo sapiens
SOURCE:
    Homo sapiens (human)
VERSION:
    NM_001098210.1


ACCESSION:
    NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
CDS:
    269..2614
    /gene="CTNNB1"
    /gene_synonym="armadillo; CTNNB; MRD19"
    /codon_start=1
    /product="catenin beta-1"
    /protein_id="NP_001091679.1"
    /db_xref="GI:148233338"
    /db_xref="CCDS:CCDS2694.1"
    /db_xref="GeneID:1499"
    /db_xref="HGNC:HGNC:2514"
    /db_xref="MIM:116806"
    /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
    SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
    LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
DEFINITION:
    Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
    (CTNNB1), transcript variant 2, mRNA.
KEYWORDS:
    RefSeq.
LOCUS:
    NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
ORGANISM:
    Homo sapiens
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo.
SOURCE:
    Homo sapiens (human)
VERSION:
    NM_001098209.1 GI:148233337
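The record-at-a-time idea can also be combined with a few targeted regular expressions instead of one general field pattern. The following is a rough sketch under stated assumptions, not GrandFather's method: the helper names parse_record and read_records are invented, the two sample records are made up and shortened, the separator is assumed to be a line containing only //, and CDS qualifier lines are assumed to be indented by exactly 21 spaces as in the data file shown in the question.

```perl
use strict;
use warnings;

# Two shortened, made-up records laid out like the data file in the question.
my $text = <<'END';
LOCUS       NM_000001               100 bp    mRNA
DEFINITION  Example gene one,
            transcript variant 1, mRNA.
ACCESSION   NM_000001
     CDS             1..90
                     /gene="EX1"
                     /db_xref="GI:111"
                     /db_xref="GeneID:11"
//
LOCUS       NM_000002               200 bp    mRNA
DEFINITION  Example gene two, mRNA.
ACCESSION   NM_000002
     CDS             5..150
                     /gene="EX2"
                     /db_xref="GeneID:22"
//
END

# parse_record() is a hypothetical helper: it takes one whole record as a string.
sub parse_record {
    my ($rec) = @_;
    my %info;

    # DEFINITION runs until the next line that starts in column 1.
    if ($rec =~ /^DEFINITION\s+(.*?)\n(?=\S)/ms) {
        ($info{definition} = $1) =~ s/\s+/ /g;   # fold continuation lines
    }

    # The CDS block is the CDS line plus all following lines indented 21 spaces.
    if ($rec =~ /^ {5}CDS\s+(\S+)\n((?:[ ]{21}.*\n?)*)/m) {
        $info{cds_location} = $1;
        my $block = $2;
        while ($block =~ m{/db_xref="([^:"]+):([^"]+)"}g) {
            $info{xref}{$1} = $2;
        }
    }
    return \%info;
}

# read_records() is also hypothetical: it reads "//"-terminated records.
sub read_records {
    my ($fh) = @_;
    local $/ = "//\n";               # one read returns one whole record
    my @records;
    while (defined(my $rec = <$fh>)) {
        chomp $rec;                  # removes the trailing "//\n"
        next unless $rec =~ /\S/;    # skip empty trailing chunks
        push @records, parse_record($rec);
    }
    return @records;
}

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

for my $r (@records) {
    print "$r->{definition}\n";
    print "  $_ => $r->{xref}{$_}\n" for sort keys %{ $r->{xref} // {} };
}
```

Using local inside the sub keeps the change to $/ from leaking into the rest of the program, which matters once other files are read line by line elsewhere.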
Re: multiline in while loop and regular expression by ww (Archbishop) on Nov 24, 2014 at 23:14 UTC
I see no strict nor warnings. I do see a (single) double quote in Ln 1, about which `perl -c yourscript.pl` would have complained and which might have prompted strict or warnings to toss out some admonitions. I assume the missing quote before the file name is a typo introduced while posting. If you're going to post code (and you should!) cut'n'paste is your best bet. Check Ln42!
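For illustration, here is roughly what the loop from the question might look like with strict and warnings enabled, a lexical filehandle, and declared variables. This is a sketch only: the helper name scan_summary, the %found hash, and the $in_definition flag (which also collects the DEFINITION continuation lines) are assumptions added here, and the sample text is a trimmed copy of the posted data.

```perl
use strict;
use warnings;

# scan_summary() is a hypothetical helper name; it reads from any filehandle.
sub scan_summary {
    my ($fh) = @_;
    my %found;               # under strict, every variable must be declared
    my $in_definition = 0;   # true while reading DEFINITION continuation lines

    while (my $line = <$fh>) {
        chomp $line;

        if ($line =~ /^DEFINITION\s+(.*)/) {
            $found{definition} = $1;
            $in_definition = 1;
            next;
        }
        if ($in_definition && $line =~ /^\s+(\S.*)/) {
            $found{definition} .= " $1";   # indented line: still the DEFINITION
            next;
        }
        $in_definition = 0;                # a new keyword ends the DEFINITION

        $found{gene_length} = $1 if $line =~ /^LOCUS\s+\w+\d+\s+(\d+)\sbp/;
        $found{accession}   = $1 if $line =~ /^ACCESSION\s+(\S+)/;
        $found{gi_number}   = $1 if $line =~ m{/db_xref="GI:(\d+)"};
        $found{gene_id}     = $1 if $line =~ m{/db_xref="GeneID:(\d+)"};
    }
    return \%found;
}

# For a real file one would instead write something like:
#   open my $in, '<', 'test.summary' or die "Check the summary file. $!\n";
my $sample = <<'END';
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660
     CDS             269..2614
                     /db_xref="GI:148233338"
                     /db_xref="GeneID:1499"
END

open my $in, '<', \$sample or die "Cannot open sample: $!\n";
my $found = scan_summary($in);
close $in;

print "$_ = $found->{$_}\n" for sort keys %$found;
```

With strict in force, a misspelled variable name becomes a compile-time error rather than a silently empty value.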
Re: multiline in while loop and regular expression by Anonymous Monk on Nov 24, 2014 at 22:32 UTC
a) b) Do it like C programmers do... use a preprocessor :) Join all these lines into one big happy line. Save output to a new file and work with that. Like that (very briefly tested, seems to work):

use strict;
use warnings;
use feature 'say';

my $headers = qr{
    \A \s* (?: LOCUS | DEFINITION | ACCESSION | VERSION | KEYWORDS | SOURCE | ORGANISM | CDS )
}x;
my $skip = qr{
    \A \s* (?: Data\s+file: | \/\/ | \z )
}x;

my @lines;
while (<>) {
    next if /$skip/;
    s/\R//;
    if (/$headers/ and @lines) {
        say @lines;
        @lines = ();
    }
    push @lines, $_;
}
say @lines if @lines;
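Once each block has been joined into a single line as described above, the individual CDS entries can be pulled out of that line with one global match. This is a small sketch under assumptions: the helper name parse_qualifiers is invented, and the sample line is a shortened, hand-written stand-in for what the joining step would produce.

```perl
use strict;
use warnings;

# parse_qualifiers() is a hypothetical helper name used only in this sketch.
sub parse_qualifiers {
    my ($line) = @_;
    my %qual;

    # Each qualifier looks like /name="quoted value" or /name=bare_value.
    while ($line =~ m{/(\w+)=(?:"([^"]*)"|(\S+))}g) {
        my $value = defined $2 ? $2 : $3;
        push @{ $qual{$1} }, $value;   # a name such as db_xref can repeat
    }
    return \%qual;
}

# Shortened stand-in for a joined CDS line.
my $cds = '     CDS             269..2614    /gene="CTNNB1"'
        . '    /gene_synonym="armadillo; CTNNB; MRD19"    /codon_start=1'
        . '    /db_xref="GI:148233338"    /db_xref="GeneID:1499"';

my $q = parse_qualifiers($cds);

# Turn the "Name:value" cross references into a lookup table.
my %xref = map { split /:/, $_, 2 } @{ $q->{db_xref} };

print "gene: $q->{gene}[0]\n";
print "GI: $xref{GI}, GeneID: $xref{GeneID}\n";
```

Storing each qualifier as an array keeps repeated names such as db_xref from overwriting one another.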

