PerlMonks  

Genbank file parsing

by Anonymous Monk
on Jan 11, 2005 at 12:08 UTC (#421253=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, My question is related to file parsing within while loops. I have a large file containing many individual files all in the same format.

I want to extract certain features from each file, which is mostly easy. My problem comes when I need to extract everything between two markers in the file. I am struggling to get all the nice info between ORIGIN and // in each file.

This is how the file looks (extract below). How can I extract all the info between ORIGIN and // for each record?

    # extract of file:
    #===============

    FEATURES             Location/Qualifiers
                         /note="blah blah"
    COUNT                200
    ORIGIN
            1 lots of nice info
           61 lots of nice info
          121 lots of nice info

    //

This is my code so far:

    open (FILE, $ARGV[0]) or die "unable to open FILE\n";

    my ($note, $info, $features, $count);
    my ($c, $d);
    my @seq;
    my $in_seq = 0;
    my $seq = '';

    while (<FILE>) {
        my $line = $_;
        if ($line =~ /(FEATURES)\s+(\w+)/) {
            $features = $2;
        }
        if ($line =~ /(COUNT)\s+(\d+)/) {
            $count = $2;
        }
        if ($line =~ /^ORIGIN/) {
            # print "$line\n";
            push @seq, "$line\n";
        }
        # This is where I am stuck: the loop only re-tests the current
        # line and never reads the following ones up to //.
        until ($line =~ /\/\//) {};
    }

    print "@seq\n";

Retitled by davido.

Re: Genbank file parsing
by insaniac (Friar) on Jan 11, 2005 at 12:47 UTC
    Maybe something like:

        my $found_origin = 0;
        while (my $line = <FILE>) {
            if ($line =~ /(FEATURES)\s+(\w+)/) {
                $found_origin = 0;
                $features = $2;
            }
            elsif ($line =~ /(COUNT)\s+(\d+)/) {
                $count = $2;
            }
            elsif ($line =~ /^ORIGIN/) {
                # print "$line\n";
                $found_origin = 1;
            }
            # $line still carries its own newline, so push it as-is
            push @seq, $line if $found_origin and not $line =~ m!//!;
        }

    It's a very simplistic solution; there are probably better ones.

    --
    to ask a question is a moment of shame
    to remain ignorant is a lifelong shame
Re: Genbank file parsing
by gube (Parson) on Jan 11, 2005 at 12:51 UTC

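    If you want to extract everything from ORIGIN to // throughout the text file, try this:

        undef $/;    # slurp mode: read the whole file in one go
        open (FILE, $ARGV[0]) or die "unable to open FILE\n";
        my $input = <FILE>;
        close(FILE);

        my @final = ();
        # /g finds every record, /s lets . match newlines, /i ignores case
        while ($input =~ m#ORIGIN(.*?)//#gsi) {
            push(@final, $1);
        }
        print @final;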

    The input file "text.txt" contains:

    **********************************************
    FEATURES Location/Qualifiers
    /note="blah blah"
    COUNT 200
    ORIGIN
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    //
    ORIGIN
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    //
    ORIGIN
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    //
    ORIGIN
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    //
    ***********************************

    The output looks like this:

    **********************
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info
    1 lots of nice info
    61 lots of nice info
    121 lots of nice info

    Regards,

    Senthi Kumar.k

      gube,

      A useful feature for code snippets and experiments is the special __DATA__ section that can be added at the end of your program file. You can read from the DATA file handle with no need to create and open an input file:

          #!/usr/bin/perl
          while (<DATA>) {
              print;
          }
          __DATA__
          foo
          goo
          hoo

      __DataFoo__

Re: Genbank file parsing
by EdwardG (Vicar) on Jan 11, 2005 at 13:04 UTC

    You might find it easier if you consider the file as a long string, one that happens to contain embedded newline characters.

        my $data = do { local $/; <DATA> };

        # Loop over the matches so that $1 and $2 belong to the current record
        my @items;
        while ($data =~ /^FEATURES(.+?)^ORIGIN(.+?)^\/\//msg) {
            push @items, { 'name' => $1, 'niceinfo' => $2 };
        }

     

      You might find it easier if you consider the file as a long string, one that happens to contain embedded newline characters.
      IMHO this qualifies as a particularly bad answer since he specifically pointed out that his file is large and it is always recommended not to slurp large files all at once if possible. Now I don't see anything here that suggests this to be necessary...

        It depends on just how large the file is, how much memory is available, and various other trade-offs. But you are probably right in general, although I would quibble about "particularly bad" :-/
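    One possible middle ground between slurping the whole file and reading it line by line is to read one record at a time by setting $/ to the record terminator. The following is only a sketch: the helper name origin_blocks and the in-memory sample are made up for illustration, and it assumes that "//" followed directly by a newline occurs only as the end-of-record marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one "//"-terminated record at a time from an
# open filehandle and return the text between ORIGIN and // for each.
sub origin_blocks {
    my ($fh) = @_;
    local $/ = "//\n";    # assumption: "//\n" only ever ends a record
    my @blocks;
    while (my $record = <$fh>) {
        # Capture everything after the ORIGIN line up to the // line.
        if ($record =~ m{^ORIGIN[^\n]*\n(.*?)^//}ms) {
            push @blocks, $1;
        }
    }
    return @blocks;
}

# Small in-memory sample standing in for the real file.
my $sample = join '',
    "FEATURES             Location/Qualifiers\n",
    "COUNT                200\n",
    "ORIGIN\n",
    "        1 lots of nice info\n",
    "       61 lots of nice info\n",
    "//\n",
    "FEATURES             Location/Qualifiers\n",
    "COUNT                300\n",
    "ORIGIN\n",
    "        1 more nice info\n",
    "//\n";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @blocks = origin_blocks($fh);
close $fh;

print scalar(@blocks), " records\n";
print $blocks[0];
```

    Only one record is held in memory at a time while reading, so the size of the whole file matters much less; the returned list still grows with the number of records, so for very large inputs each block could be processed inside the loop instead of being collected.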

         

Re: Genbank file parsing
by Anonymous Monk on Jan 11, 2005 at 13:43 UTC
    Untested:

        while (<FILE>) {
            $features = $1, next if /FEATURES\s+(\w+)/;
            $count    = $1, next if /COUNT\s+(\w+)/;
            # the flip-flop test captures nothing, so push the line itself
            push @seq, $_ if /^ORIGIN/ .. m!//!;
        }
Re: Genbank file parsing
by Hena (Friar) on Jan 11, 2005 at 13:57 UTC
    This is not exactly an answer to your question, but it may help with parsing sequence formats. It seems that you are parsing EMBL files. An easier way to get sequences from those might be EMBOSS seqret or perhaps BioPerl.
Re: Genbank file parsing
by blazar (Canon) on Jan 11, 2005 at 14:00 UTC
    Dear monks, My question is related to file parsing within while loops. I have a large file containing many individual files all in the same format.
    I don't think so. I suppose you have a large file containing info related to many individual files.
    This is how the file looks: How can I extract all the info between ORIGIN and // for each record??
    # extract of file:  
    #===============
    
    FEATURES             Location/Qualifiers    
                         /note="blah blah"   
    COUNT                200
    ORIGIN
            1 lots of nice info
           61 lots of nice info
          121 lots of nice info
       
    //
    
    If you can rely on this format, here's how I'd do it:
        #!/usr/bin/perl -ln
        use strict;
        use warnings;

        if ($_ eq 'ORIGIN') {
            local $/ = '//';
            # scalar() reads a single record; a bare <> here would be in
            # list context and read everything up to end of file
            print scalar(<>);
        }

    or

        #!/usr/bin/perl -ln
        use strict;
        use warnings;

        print if $_ eq 'ORIGIN' .. $_ eq '//';

        __END__

    Of course these are intended as minimal examples: adapt the techniques shown here to your needs.
    This is my code so far:
    open (FILE, $ARGV[0]) or die "unable to open FILE\n";
    Why not use <> in the first place? Also, you would do better to:
    • use lexical FHs,
    • use the three args form of open(),
    • put relevant info in the error message (i.e. at least include $!).
    Note: I skipped the rest.
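    The three suggestions about open() could be combined roughly as follows. This is a sketch rather than anyone's actual code from the thread: the helper name open_for_reading is invented for illustration, and the loop simply prints each record from its ORIGIN line through its // line, markers included.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading with a lexical filehandle,
# the three-argument form of open(), and an error message that includes $!.
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Cannot open '$path' for reading: $!\n";
    return $fh;
}

# Usage: print every line from ORIGIN through // (markers included) in the
# file named on the command line, if one was given.
if (@ARGV) {
    my $fh = open_for_reading($ARGV[0]);
    while (my $line = <$fh>) {
        print $line if $line =~ /^ORIGIN/ .. $line =~ m{^//};
    }
    close $fh;
}
```

    With a lexical filehandle the file is closed automatically when $fh goes out of scope, and including $! in the message tells you why the open failed (missing file, permissions, and so on) rather than only that it did.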
Re: Genbank file parsing
by perlsen (Chaplain) on Jan 11, 2005 at 14:08 UTC

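    Hi, just try this simple approach:

        undef $/;    # slurp mode
        open (IN, "$ARGV[0]") or die "unable to open $ARGV[0]: $!\n";
        my $string = <IN>;
        close(IN);
        # in list context, /g returns every captured ORIGIN ... // block
        my (@arr) = $string =~ m#ORIGIN(.*?)//#gsi;
        print @arr;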
Re: file parsing - use Parse::RecDescent
by tphyahoo (Vicar) on Jan 11, 2005 at 16:55 UTC
    This might be overkill, but if regexes are feeling unwieldy or hard to maintain for your task, you might want to try it with a grammar and Parse::RecDescent.

    I haven't used P::RD myself yet, but I am learning it because it seems like it would come in handy in a variety of situations where regexes won't quite get the job done, or get the job done kludgily. Also, Perl 6 "rules" (the new word for the concept formerly known as regex) are shaping up to be a sort of amalgamation of regexes and formal grammar, with the formal grammar aspect closely related to the way grammar parsing works in P::RD. (Damian Conway, who did P::RD, is also in charge of Perl 6 rules.)

    thomas.

Re: Genbank file parsing
by stajich (Chaplain) on Jan 11, 2005 at 16:56 UTC
    You can also try not to reinvent the wheel. Bio::SeqIO can parse genbank files.
        use Bio::SeqIO;
        use strict;

        my $file = $ARGV[0];    # assumption: GenBank file named on the command line
        my $in = Bio::SeqIO->new(-format => 'genbank', -file => $file);
        # print the sequence from the genbank file
        while ( my $seq = $in->next_seq ) {
            print $seq->seq(), "\n";
        }
    Also see Ian Korf's lightweight GenBank parser: GBlite.pm
Re: Genbank file parsing
by TedPride (Priest) on Jan 12, 2005 at 00:54 UTC
    There's no need to use a regex for most of that, since you aren't searching case-insensitively anyway.

        use strict;
        use warnings;

        my ($features, $count, @seq);

        while (<DATA>) {
            if (index($_, 'FEATURES') != -1) {
                ($features) = m/\s(\S+)/;
            }
            elsif (index($_, 'COUNT') != -1) {
                ($count) = m/\s(\d+)/;
            }
            elsif (index($_, 'ORIGIN') != -1) {
                # collect lines up to the // terminator; also stop at end of data
                while (defined($_ = <DATA>) && index($_, '//') == -1) {
                    push @seq, $_;
                }
            }
        }
        print "$features\n$count\n", @seq;
        __DATA__
        # extract of file:
        #===============

        FEATURES             Location/Qualifiers
                             /note="blah blah"
        COUNT                200
        ORIGIN
                1 lots of nice info
               61 lots of nice info
              121 lots of nice info

        //
Re: Genbank file parsing
by ercparker (Hermit) on Jan 12, 2005 at 07:20 UTC
        #!/usr/bin/perl -w
        use strict;

        while (<DATA>) {
            if (/ORIGIN/ .. /\/\//) {
                print;
            }
        }
        __DATA__
        # extract of file:
        #===============

        FEATURES             Location/Qualifiers
                             /note="blah blah"
        COUNT                200
        ORIGIN
                1 lots of nice info
               61 lots of nice info
              121 lots of nice info

        //
Re: Genbank file parsing
by Anonymous Monk on Jan 24, 2008 at 15:18 UTC
    I've written a Parse::RecDescent-based GenBank parser that you might find useful/extensible: http://search.cpan.org/~kclark/Bio-GenBankParser-0.01/lib/Bio/GenBankParser.pm

    Humbly, ky
