Re: Genbank file parsing
by insaniac (Friar) on Jan 11, 2005 at 12:47 UTC
|
my ($features, $count, @seq);
my $found_origin = 0;
while (my $line = <FILE>) {
    if ($line =~ /(FEATURES)\s+(\w+)/) {
        $found_origin = 0;
        $features = $2;
    }
    elsif ($line =~ /(COUNT)\s+(\d+)/) {
        $count = $2;
    }
    elsif ($line =~ /^ORIGIN/) {
        $found_origin = 1;
    }
    elsif ($line =~ m!^//!) {
        $found_origin = 0;    # end of record: stop collecting
    }
    # $line still carries its own newline, so none is appended here
    push @seq, $line if $found_origin;
}
It's a very simplistic solution; there are probably better ones.
--
to ask a question is a moment of shame
to remain ignorant is a lifelong shame
Re: Genbank file parsing
by stajich (Chaplain) on Jan 11, 2005 at 16:56 UTC
|
You can also try not to reinvent the wheel: Bio::SeqIO can parse GenBank files.
use strict;
use Bio::SeqIO;
my $file = shift @ARGV;    # assumed here: path given on the command line
my $in = Bio::SeqIO->new(-format => 'genbank', -file => $file);
# print the sequence from the genbank file
while ( my $seq = $in->next_seq ) {
    print $seq->seq(), "\n";
}
Also see Ian Korf's lightweight GenBank parser: GBlite.pm
Re: Genbank file parsing
by gube (Parson) on Jan 11, 2005 at 12:51 UTC
|
undef $/;    # slurp mode: read the whole file in one go
open (my $fh, '<', $ARGV[0]) or die "unable to open $ARGV[0]: $!\n";
my $input=<$fh>;
close($fh);
my @final=();
while ($input=~m#ORIGIN(.*?)//#gsi)
{
    push(@final,$1);
}
print @final;
The input file "text.txt" contains:
**********************************************
FEATURES Location/Qualifiers
/note="blah blah"
COUNT 200
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
***********************************
The output looks like this:
**********************
1 lots of nice info
61 lots of nice info
121 lots of nice info
1 lots of nice info
61 lots of nice info
121 lots of nice info
1 lots of nice info
61 lots of nice info
121 lots of nice info
1 lots of nice info
61 lots of nice info
121 lots of nice info
Regards,
Senthi Kumar.k
|
#!/usr/bin/perl
while (<DATA>) {
print;
}
__DATA__
foo
goo
hoo
__DataFoo__
Re: Genbank file parsing
by Hena (Friar) on Jan 11, 2005 at 13:57 UTC
|
This is not exactly an answer to your question, but it may help with parsing sequence formats. It seems that you are parsing EMBL files. An easier way to get sequences from those might be EMBOSS seqret or perhaps BioPerl.
Re: Genbank file parsing
by blazar (Canon) on Jan 11, 2005 at 14:00 UTC
|
#!/usr/bin/perl -ln
use strict;
use warnings;
if ($_ eq 'ORIGIN') {
    local $/='//';
    print scalar <>;    # scalar context: read only up to the next '//'
}
or
#!/usr/bin/perl -ln
use strict;
use warnings;
print if $_ eq 'ORIGIN' .. $_ eq '//'
__END__
Of course these are intended as minimal examples: adapt the techniques shown here to your needs.
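One possible adaptation, as a rough sketch only: the flip-flop operator returns a sequence number in scalar context, which can be used to drop the ORIGIN and // delimiter lines themselves. The sample text and the origin_lines() name are made up for illustration, and the sketch assumes the delimiters stand alone on their lines, as in the examples above.

```perl
use strict;
use warnings;

# Hypothetical sample record, standing in for a real file.
my $sample = <<'END';
COUNT 200
ORIGIN
1 lots of nice info
61 lots of nice info
//
END

# Return the lines strictly between ORIGIN and //.
sub origin_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @seq;
    while (my $line = <$fh>) {
        chomp $line;
        # In scalar context the flip-flop yields 1 on the opening line
        # and a value ending in "E0" on the closing line.
        my $pos = ($line eq 'ORIGIN') .. ($line eq '//');
        next unless $pos;          # outside the block
        next if $pos == 1;         # the ORIGIN line itself
        next if $pos =~ /E0$/;     # the terminating // line
        push @seq, $line;
    }
    return @seq;
}

print "$_\n" for origin_lines($sample);
```

Note that the flip-flop keeps its state between calls, so this only behaves as described when every ORIGIN has its closing //.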
This is my code so far:
open (FILE, $ARGV[0]) or die "unable to open FILE\n";
Why not use <> in the first place? Also, you'd better:
- use lexical FHs,
- use the three args form of open(),
- put relevant info in the error message (i.e. at least include $!).
Note: I skipped the rest.
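For illustration, the three suggestions in the list above could be combined roughly as follows. This is a minimal sketch, not the original poster's code; count_origin_lines() is a hypothetical helper name chosen only for this example.

```perl
use strict;
use warnings;

# Count the lines that start with ORIGIN, reading one line at a time.
sub count_origin_lines {
    my ($path) = @_;
    # Lexical filehandle, three-argument open, and $! in the error message.
    open my $fh, '<', $path
        or die "Cannot open '$path' for reading: $!\n";
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^ORIGIN/;
    }
    close $fh;
    return $n;
}

# Usage: perl script.pl file.gb
print count_origin_lines($ARGV[0]), "\n" if @ARGV;
```

The lexical handle is closed automatically when it goes out of scope, and the error message names both the file and the reason the open failed.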
Re: file parsing - use Parse::RecDescent
by tphyahoo (Vicar) on Jan 11, 2005 at 16:55 UTC
|
This might be overkill, but if regexes are feeling unwieldy or hard to maintain for your task, you might want to try it with a grammar and Parse::RecDescent.
I haven't used P::RD myself yet, but am learning it because it seems like it would come in handy in a variety of situations where regexes won't quite get the job done, or get the job done kludgily. Also, Perl 6 "rules" (the new word for the concept formerly known as regex) are shaping up to be sort of an amalgamation of regexes and formal grammar, with the formal grammar aspect closely related to the way grammar parsing works in P::RD. (Damian Conway, who did P::RD, is also in charge of Perl 6 rules.)
thomas.
Re: Genbank file parsing
by ercparker (Hermit) on Jan 12, 2005 at 07:20 UTC
|
#!/usr/bin/perl -w
use strict;
while (<DATA>) {
    if (/ORIGIN/ .. /\/\//) {
        print;
    }
}
__DATA__
# extract of file:
#===============
FEATURES Location/Qualifiers
/note="blah blah"
COUNT 200
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
Re: Genbank file parsing
by perlsen (Chaplain) on Jan 11, 2005 at 14:08 UTC
|
undef $/;    # slurp mode
open (IN, '<', $ARGV[0]) or die "unable to open $ARGV[0]: $!\n";
my $string=<IN>;
close(IN);
my @arr = $string =~ m#ORIGIN(.*?)//#gsi;
print @arr;
Re: Genbank file parsing
by EdwardG (Vicar) on Jan 11, 2005 at 13:04 UTC
|
You might find it easier if you consider the file as a long string, one that happens to contain embedded newline characters.
my $data = do { local $/; <DATA> };
my @items;
# $1 and $2 must be read inside the match loop, once per record
while ($data =~ /^FEATURES(.+?)^ORIGIN(.+?)^\/\//msg) {
    push @items, { 'name' => $1, 'niceinfo' => $2 };
}
|
You might find it easier if you consider the file as a long string, one that happens to contain embedded newline characters.
IMHO this qualifies as a particularly bad answer since he specifically pointed out that his file is large and it is always recommended not to slurp large files all at once if possible. Now I don't see anything here that suggests this to be necessary...
|
It depends just how large the file is, how much memory is available, and other various trade-offs. But you are probably right in general, although I would quibble about "particularly bad" :-/
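One middle ground between slurping the whole file and reading line by line is to read one record at a time by setting the input record separator to the // terminator line. The following is only a sketch under that assumption: each_record() is a hypothetical helper, the sample data is invented, and it presumes Unix line endings and that "//" followed by a newline occurs only at the end of a record.

```perl
use strict;
use warnings;

# Call $callback once per record, so that only one record
# is held in memory at a time.
sub each_record {
    my ($fh, $callback) = @_;
    local $/ = "//\n";    # a "line" now ends at the record terminator
    while (my $record = <$fh>) {
        $callback->($record);
    }
    return;
}

# Hypothetical two-record sample, read from an in-memory handle.
my $sample = "ORIGIN\n1 aaa\n//\nORIGIN\n1 ccc\n//\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
each_record($fh, sub {
    my ($record) = @_;
    # Capture everything between the ORIGIN line and the // line.
    my ($seq) = $record =~ m{^ORIGIN[^\n]*\n(.*?)^//}ms;
    print $seq if defined $seq;
});
```

Memory use then scales with the largest single record rather than with the whole file.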
Re: Genbank file parsing
by TedPride (Priest) on Jan 12, 2005 at 00:54 UTC
|
There's no need to use a regex for most of that, since you aren't searching case-insensitively anyway.
use strict;
use warnings;
my ($features, $count, @seq);
while (<DATA>) {
    if (index($_, 'FEATURES') != -1) {
        ($features) = m/\s(\S+)/;
    }
    elsif (index($_, 'COUNT') != -1) {
        ($count) = m/\s(\d+)/;
    }
    elsif (index($_, 'ORIGIN') != -1) {
        # stop at the // terminator, or at end of file if it is missing
        while (defined($_ = <DATA>) && index($_, '//') == -1) {
            push @seq, $_;
        }
    }
}
print "$features\n$count\n",@seq;
__DATA__
# extract of file:
#===============
FEATURES Location/Qualifiers
/note="blah blah"
COUNT 200
ORIGIN
1 lots of nice info
61 lots of nice info
121 lots of nice info
//
Re: Genbank file parsing
by Anonymous Monk on Jan 11, 2005 at 13:43 UTC
|
while (<FILE>) {
    $features = $1, next if /FEATURES\s+(\w+)/;
    $count = $1, next if /COUNT\s+(\w+)/;
    # push the current line; $1 is not set by the flip-flop patterns
    push @seq, $_ if /^ORIGIN/ .. m!//!;
}
Re: Genbank file parsing
by Anonymous Monk on Jan 24, 2008 at 15:18 UTC
|
I've written a Parse::RecDescent-based GenBank parser that you might find useful/extensible:
http://search.cpan.org/~kclark/Bio-GenBankParser-0.01/lib/Bio/GenBankParser.pm
Humbly,
ky