PerlMonks  

multiline in while loop and regular expression

by newtoperlprog (Sexton)
on Nov 24, 2014 at 21:21 UTC ( [id://1108286]=perlquestion )

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

I am trying to parse a file in a while loop and print some of the values captured by regular expressions.

Below are my code and data file:

my $filename = test.summary";
open (IN, "<", $filename) or die "Check the summary file. $!\n";
while (my $line = <IN>) {
    chomp $line;
    if ($line =~/^LOCUS\s+\w+\d+\s+(\d+)\sbp/) {
        $gene_length = $1;
    }
    if ($line =~/^DEFINITION\s+(.*)/s) {
        $definition = $1;
    }
    if ($line =~/^ACCESSION\s+(.*?)\s+/) {
        $accession = $1;
    }
    if ($line =~ /\s+\/db_xref="GI\:(\d+)\"/) {
        $gi_number = $1;
    }
    if ($line =~ /\s+\/db_xref=\"GeneID\:(\d+)\"/) {
        $gene_id = $1;
    }
}

Data file:

LOCUS       NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1 GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

My questions:

a) How can I parse the multiline DEFINITION in the while loop? The regular expression captures only the first line.

b) Could I get some help in capturing the content of the CDS block and then parsing the individual entries one by one (like GI, GeneID, etc.)?

I am trying to learn using plain Perl, so I am not using the BioPerl module for this purpose.

Regards

Replies are listed 'Best First'.
Re: multiline in while loop and regular expression
by GrandFather (Saint) on Nov 24, 2014 at 22:21 UTC

    Is that the entire contents of the file, or is that one record of many? If there are many records, what does a record separator look like? Maybe you need to show us two records?

    Are ORGANISM and CDS really indented like that? Are there other field types you haven't told us about? Is the // actually part of the file, or did it just happen to "slip in" while you weren't looking?

    Maybe the following parsing code will get you started:

    use strict;
    use warnings;

    my @records;
    my $currTail;
    my $currField;

    while (defined(my $line = <DATA>) or defined $currField) {
        my $field;
        my $tail;

        ($field, $tail) = $line =~ /^(.{10}) (.*)/ if defined $line;
        next if !defined $tail && !defined $currField;
        $field =~ tr/ //d if defined $field;
        $currField //= $field;

        if (! defined $field or (length $field && $currField ne $field)) {
            push @records, {} if $currField eq 'LOCUS';
            $records[-1]{$currField} = $currTail;
            $currField = undef;
            $currTail = undef;
            last if !defined $tail;
        }

        $currField = $field if length $field;
        push @$currTail, $tail if defined $tail;
    }

    for my $record (@records) {
        print "$_:\n", map{" $_\n"} @{$record->{$_}} for sort keys %$record;
    }

    __DATA__
    LOCUS       NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
    DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
                (CTNNB1), transcript variant 2, mRNA.
    ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
    VERSION     NM_001098209.1 GI:148233337
    KEYWORDS    RefSeq.
    SOURCE      Homo sapiens (human)
      ORGANISM  Homo sapiens
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                Catarrhini; Hominidae; Homo.
         CDS             269..2614
                         /gene="CTNNB1"
                         /gene_synonym="armadillo; CTNNB; MRD19"
                         /codon_start=1
                         /product="catenin beta-1"
                         /protein_id="NP_001091679.1"
                         /db_xref="GI:148233338"
                         /db_xref="CCDS:CCDS2694.1"
                         /db_xref="GeneID:1499"
                         /db_xref="HGNC:HGNC:2514"
                         /db_xref="MIM:116806"
                         /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                         SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                         LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
    //

    Prints:

    ACCESSION:
      NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
    CDS:
      269..2614
      /gene="CTNNB1"
      /gene_synonym="armadillo; CTNNB; MRD19"
      /codon_start=1
      /product="catenin beta-1"
      /protein_id="NP_001091679.1"
      /db_xref="GI:148233338"
      /db_xref="CCDS:CCDS2694.1"
      /db_xref="GeneID:1499"
      /db_xref="HGNC:HGNC:2514"
      /db_xref="MIM:116806"
      /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
      SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
      LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
    DEFINITION:
      Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
      (CTNNB1), transcript variant 2, mRNA.
    KEYWORDS:
      RefSeq.
    LOCUS:
      NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
    ORGANISM:
      Homo sapiens
      Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
      Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
      Catarrhini; Hominidae; Homo.
    SOURCE:
      Homo sapiens (human)
    VERSION:
      NM_001098209.1 GI:148233337

    Perl is the programming world's equivalent of English

      Dear GrandFather,

      Thank you for your reply.

      (a) The data file was truncated because it is very long, so only one record is shown. I am parsing only one file, although you are correct that multiple records are separated by the // delimiter.

      (b) The other fields I parsed in the loop (which were straightforward) were LOCUS, ORGANISM, etc.

      Many thanks for your reply. I will study the code and suggestions and will follow up in this discussion.

      Regards

        Knowing the record separator we can do a little "better":

        use strict;
        use warnings;

        my @records = {};

        $/ = "\n//";

        while (defined(my $rec = <DATA>)) {
            my %fields = $rec =~ /^(?:(?! {10}) *(\S{1,10}))? (.*?(?=\n(?! {10})|\Z))/gms;

            $fields{$_} = [map {s/^\s*//; $_} split "\n", $fields{$_}] for keys %fields;
            push @records, \%fields;
        }

        for my $record (@records) {
            print "$_:\n", map{" $_\n"} @{$record->{$_}} for sort keys %$record;
            print "\n\n";
        }

        __DATA__
        LOCUS       NM_001098210
        DEFINITION  Homo sapiens catenin
        ACCESSION   NM_001098210
        VERSION     NM_001098210.1
        KEYWORDS    RefSeq.
        SOURCE      Homo sapiens (human)
          ORGANISM  Homo sapiens
             CDS             269..2614
                             /gene="CTNNB2"
        //
        LOCUS       NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
        DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
                    (CTNNB1), transcript variant 2, mRNA.
        ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
        VERSION     NM_001098209.1 GI:148233337
        KEYWORDS    RefSeq.
        SOURCE      Homo sapiens (human)
          ORGANISM  Homo sapiens
                    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                    Catarrhini; Hominidae; Homo.
             CDS             269..2614
                             /gene="CTNNB1"
                             /gene_synonym="armadillo; CTNNB; MRD19"
                             /codon_start=1
                             /product="catenin beta-1"
                             /protein_id="NP_001091679.1"
                             /db_xref="GI:148233338"
                             /db_xref="CCDS:CCDS2694.1"
                             /db_xref="GeneID:1499"
                             /db_xref="HGNC:HGNC:2514"
                             /db_xref="MIM:116806"
                             /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                             SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                             LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
        //

        Prints:

        ACCESSION:
          NM_001098210
        CDS:
          269..2614
          /gene="CTNNB2"
        DEFINITION:
          Homo sapiens catenin
        KEYWORDS:
          RefSeq.
        LOCUS:
          NM_001098210
        ORGANISM:
          Homo sapiens
        SOURCE:
          Homo sapiens (human)
        VERSION:
          NM_001098210.1


        ACCESSION:
          NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
        CDS:
          269..2614
          /gene="CTNNB1"
          /gene_synonym="armadillo; CTNNB; MRD19"
          /codon_start=1
          /product="catenin beta-1"
          /protein_id="NP_001091679.1"
          /db_xref="GI:148233338"
          /db_xref="CCDS:CCDS2694.1"
          /db_xref="GeneID:1499"
          /db_xref="HGNC:HGNC:2514"
          /db_xref="MIM:116806"
          /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
          SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
          LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
        DEFINITION:
          Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
          (CTNNB1), transcript variant 2, mRNA.
        KEYWORDS:
          RefSeq.
        LOCUS:
          NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
        ORGANISM:
          Homo sapiens
          Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
          Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
          Catarrhini; Hominidae; Homo.
        SOURCE:
          Homo sapiens (human)
        VERSION:
          NM_001098209.1 GI:148233337

        Perl is the programming world's equivalent of English
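
Question (b) asked how to pick out individual entries such as GI and GeneID once the CDS block has been captured. A minimal sketch, assuming the CDS lines of one record are available as an array reference (as in the record hashes built by the parsing code in this thread); the helper name `cds_xrefs` and the sample array are hypothetical and used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn the CDS lines of one
# record into a hash mapping db_xref database name => identifier.
sub cds_xrefs {
    my ($cds_lines) = @_;
    my %xref;
    for my $line (@$cds_lines) {
        # Matches e.g. /db_xref="GeneID:1499". Only the first colon separates
        # name from value, so "HGNC:HGNC:2514" yields HGNC => "HGNC:2514".
        if ($line =~ m{/db_xref="([^:"]+):([^"]+)"}) {
            $xref{$1} = $2;
        }
    }
    return \%xref;
}

# Sample lines copied from the CDS block in the question.
my @cds = (
    '269..2614',
    '/gene="CTNNB1"',
    '/db_xref="GI:148233338"',
    '/db_xref="CCDS:CCDS2694.1"',
    '/db_xref="GeneID:1499"',
    '/db_xref="HGNC:HGNC:2514"',
);

my $xref = cds_xrefs(\@cds);
print "GI: $xref->{GI}\n";
print "GeneID: $xref->{GeneID}\n";
```

With a record hash from the code above, the call would presumably look like `cds_xrefs($record->{CDS})`. The pattern is left unanchored so that leading whitespace on the lines does not matter.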
Re: multiline in while loop and regular expression
by ww (Archbishop) on Nov 24, 2014 at 23:14 UTC
    I see no strict nor warnings. I do see a (single) double quote in Ln 1, about which perl -c yourscript.pl would have complained and which might have prompted strict or warnings to toss out some admonitions. I assume the missing quote before the file name is a typo introduced while posting.

    If you're going to post code (and you should!) cut'n'paste is your best bet.
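
For illustration, here is roughly what the start of the posted script might look like with strict and warnings enabled, the opening quote restored, and a lexical filehandle. This is only a sketch: the wrapper `parse_summary` is a hypothetical name, just two of the original patterns are shown, and a failed open warns instead of dying so the snippet can be run without the file present.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop from the question; it takes an
# already opened filehandle and returns two of the captured values.
sub parse_summary {
    my ($in) = @_;
    my ($gene_length, $accession);    # declared, as strict requires
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^LOCUS\s+\w+\d+\s+(\d+)\sbp/) { $gene_length = $1; }
        if ($line =~ /^ACCESSION\s+(.*?)\s+/)       { $accession   = $1; }
    }
    return ($gene_length, $accession);
}

my $filename = "test.summary";        # opening quote restored
if (open(my $in, "<", $filename)) {
    my ($len, $acc) = parse_summary($in);
    close $in;
    print "length=", $len // 'n/a', " accession=", $acc // 'n/a', "\n";
}
else {
    # The original used die here; warn keeps this sketch runnable anywhere.
    warn "Check the summary file '$filename': $!\n";
}
```

Running `perl -c` on the script before posting catches an unbalanced quote like the one above.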



    check Ln42!

Re: multiline in while loop and regular expression
by Anonymous Monk on Nov 24, 2014 at 22:32 UTC
    a) b) Do it like C programmers do... use a preprocessor :) Join all these lines into one big happy line. Save output to a new file and work with that.

    Like that (very briefly tested, seems to work):

    use strict;
    use warnings;
    use feature 'say';

    my $headers = qr{
        \A \s* (?: LOCUS | DEFINITION | ACCESSION | VERSION | KEYWORDS | SOURCE | ORGANISM | CDS )
    }x;
    my $skip = qr{ \A \s* (?: Data\s+file: | \/\/ | \z ) }x;

    my @lines;
    while (<>) {
        next if /$skip/;
        s/\R//;
        if (/$headers/ and @lines) {
            say @lines;
            @lines = ();
        }
        push @lines, $_;
    }
    say @lines if @lines;
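
Once every field sits on a single line, the DEFINITION pattern from the question can capture the whole text in one match. A small sketch under that assumption; `definition_from_joined` is a hypothetical helper, and the sample string assumes the continuation line was joined with its indentation intact:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the DEFINITION text out of one joined line and
# squeeze the runs of whitespace left over from the continuation indent.
sub definition_from_joined {
    my ($joined) = @_;
    return unless $joined =~ /^DEFINITION\s+(.*)/s;
    (my $text = $1) =~ s/\s+/ /g;
    return $text;
}

# Assumed shape of a joined line: the second physical line keeps its indent.
my $joined = 'DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa'
           . '            (CTNNB1), transcript variant 2, mRNA.';

print definition_from_joined($joined), "\n";
```

The same idea should carry over to the other multiline fields; only the keyword in the pattern changes.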

Node Type: perlquestion [id://1108286]
Approved by GrandFather