heidi has asked for the wisdom of the Perl Monks concerning the following question:
Hi all, I need a small clarification about the following program.
I have a data file like this:
ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
      Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
ENTRY CCMQR #type complete
TITLE cytochrome c - rhesus macaque (tentative sequence)
      Macaca mulatta
ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
ENTRY CCMKP #type complete
TITLE cytochrome c - spider monkey
ORGANISM #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR
I have written a program to save each line in a separate array. This is the program:
open (PIR,'/home/guest/sampir.txt');
my @arr = ();
while (<PIR>)
{
    chomp;
    if( /^ENTRY/ ) { $entry = $_ }
    elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
    elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
    elsif ( /^ACCESSIONS/ ) { $acc = $_ }
    else {
        push @se, $_;
    }
}
But the TITLE field does not pick up the second line of its data; it gets only the first line.
E.g., when I print the title of the first entry it prints only
"cytochrome c [validated] - human" and not the second line, "Homo sapiens".
How do I print the second line on the same line as the first?
Please help.
Thanks.
Re: doubt in storing a data of 2 lines in an array.
by davorg (Chancellor) on Oct 30, 2006 at 13:42 UTC
You are reading the file a line at a time. So when you process the TITLE line, the data in $_ only contains the line that begins with TITLE. So that is all that ends up in $title. The rest of the TITLE data is processed on the next iteration of the loop (and, probably ends up as the first element in @se).
There are a couple of ways to solve this:
- Write a more complex parser. Keep a note of the _previous_ line's tag and if the current line starts with whitespace then append this data to the end of the previous tag.
- Read the whole file into memory and parse it using more complex regular expressions.
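A minimal sketch of the first option, for illustration only. It assumes continuation lines start with whitespace and that the sequence is a single line with no blanks; the sample text (with shortened sequences), the `parse_records` name and the `SEQUENCE` key are hypothetical, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical sample data with shortened sequences.
# Continuation lines are assumed to start with whitespace.
my $text = <<'END';
ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
      Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVE
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEK
END

sub parse_records {
    my ($text) = @_;
    my (@records, $rec, $last_tag);
    for my $line (split /\n/, $text) {
        if ($line =~ /^(ENTRY|TITLE|ORGANISM|ACCESSIONS)\s+(.*)/) {
            my ($tag, $value) = ($1, $2);
            if ($tag eq 'ENTRY') {          # ENTRY opens a new record
                $rec = {};
                push @records, $rec;
            }
            next unless $rec;               # skip tagged lines before the first ENTRY
            $rec->{$tag} = $value;
            $last_tag = $tag;               # remember which field a continuation belongs to
        }
        elsif ($line =~ /^\s+(\S.*)/ && $rec && defined $last_tag) {
            $rec->{$last_tag} .= " $1";     # continuation: append to the previous field
        }
        elsif ($line =~ /^(\S+)$/ && $rec) {
            $rec->{SEQUENCE} .= $1;         # untagged single-word line: the sequence
            undef $last_tag;
        }
    }
    return @records;
}

my @records = parse_records($text);
print "$_->{TITLE}\n" for @records;
```

Each record ends up as a hash keyed by tag, so a two-line TITLE is already joined into one string by the time it is printed.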
--
< http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
Re: doubt in storing a data of 2 lines in an array.
by johngg (Canon) on Oct 30, 2006 at 14:30 UTC
davorg's reply suggested one way to approach the problem would be to read the whole file into memory then parse with regular expressions. The following script shows one possible way of doing this using two stages, the first to break into records and the second to break each record into fields. Here it is
use strict;
use warnings;

# Stage one: a record runs from ENTRY up to the next ENTRY or end of text
my $rxRecord = qr
    {(?xs)
        (ENTRY.*?\n)
        (?=ENTRY|\z)
    };
my $rxFieldHdrs = qr{(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)};
# Stage two: a field runs from one header up to the next header or end of record
my $rxField = qr
    {(?xs)
        ($rxFieldHdrs.*?\n)
        (?=$rxFieldHdrs|\z)
    };

my $fileText;
{
    local $/;               # slurp mode: read everything in one go
    $fileText = <DATA>;
}

my @records = $fileText =~ m{$rxRecord}g;
foreach my $record (@records)
{
    print qq{$record}, q{+} x 50, qq{\n};
    my @fields = $record =~ m{$rxField}g;
    foreach my $field (@fields)
    {
        print qq{$field}, q{-} x 50, qq{\n};
    }
    print q{*} x 50, qq{\n};
}
__END__
ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
      Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
ENTRY CCMQR #type complete
TITLE cytochrome c - rhesus macaque (tentative sequence)
      Macaca mulatta
ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
ENTRY CCMKP #type complete
TITLE cytochrome c - spider monkey
ORGANISM #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR
and here is the output showing for each record the whole record then each individual field. As you can see, your two-line title is preserved.
I hope this is of use
Cheers, JohnGG
Hi John, thank you for the reply. The program works very well, and the code is really smart. I just need to clear up one last doubt: I am not able to print the TITLE's content on a single line; it prints on two lines whatever I do. Please reply.
Thank you once again.
You could either do a global substitution, something like $field =~ s{\n}{ }g, to replace any newline with a space, or you could achieve the same thing with split and join, something like $field = join q{ }, split m{\n}, $field;. In each case you are going to have to handle a big gap in your line because of the indentation of the second line of the title. However, this post should give you enough clues about s{this}{the other} to solve that for yourself. Big hint, \s+ means one or more white-space characters.
Best of luck, JohnGG
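For readers who want to check their answer against the hint, here is one possible sketch. The `$field` value is a hypothetical example of a TITLE field with an indented second line, and `$one_line` is a name made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical field text: a title whose second line is indented.
my $field = "TITLE cytochrome c [validated] - human\n      Homo sapiens\n";

# Copy the field, then turn every run of white space (newlines included)
# into a single space, which also closes the gap left by the indentation.
(my $one_line = $field) =~ s{\s+}{ }g;

# The final newline has become a trailing space; remove it.
$one_line =~ s{\s+\z}{};

print "$one_line\n";
```

Using \s+ rather than \n handles the newline and the indentation in one substitution, so no second pass is needed to squeeze the gap.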
Re: doubt in storing a data of 2 lines in an array.
by Hofmator (Curate) on Oct 30, 2006 at 13:57 UTC
Re: doubt in storing a data of 2 lines in an array.
by Fletch (Bishop) on Oct 30, 2006 at 14:08 UTC
Not to mention that if this is a common format BioPerl may already have an interface to read it off the shelf.
Good point! This page contains the available formats ...
-- Hofmator
Code written by Hofmator and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
Re: doubt in storing a data of 2 lines in an array.
by shmem (Chancellor) on Oct 30, 2006 at 15:10 UTC
i have written a program to save each and every line in a seperate array. this is the program
which does not meet its purpose, since you are saving all lines of your data file which don't begin with either ENTRY, TITLE, ORGANISM or ACCESSIONS into a single array which you name @se.
Let's look at your data file. It seems to be composed of multi-line records, in which each field begins on a separate line. Each field has an identifier up front (except the last record field which is just a sequence of chars with no blank in it), and some fields appear to be multi-line as well.
Since there is no record separator, you can only tell that all fields of a record are read when all field contents are read. Since your records appear to be ordered, I assume that is the case when that single-word line appears. All fields are stored in an anonymous array, which is pushed onto an array when done reading.
After storing each record, a new anonymous array is initialized for the next record:
my $file = '/home/guest/sampir.txt';
open (PIR, '<', $file) or die "Can't read '$file': $!\n";
my @arr = ();
my $se = [];                 # anonymous record array
while(<PIR>)
{
    chomp;
    if (/^(\w+)\s+/)         # new field identifier, followed by blanks
    {
        push @$se, $_;
    }
    elsif (s/^\s+/ /)        # if we can strip leading blanks,
                             # it's a continuation line
    {
        $se->[-1] .= $_;     # append to last field of this record
    }
    elsif(/^\w+$/)           # must be the last field of the record
    {
        push @$se, $_;       # save the last field
        push @arr, $se;      # save the record array reference
        $se = [];            # and make a new array reference for the next record
    }
    else
    {
        die "Unknown line type at line $. of '$file'\n";
    }
}
Now you have all records in an array of arrays. See perldsc.
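As a rough illustration of working with that structure, the sketch below walks an array of arrays shaped like the one the loop above builds. The contents of `@arr` here are a hypothetical stand-in with shortened sequences, and `@titles` is a name introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for @arr: one array reference per record,
# one string per field, with the sequence as the last element.
my @arr = (
    [ 'ENTRY CCHU #type complete',
      'TITLE cytochrome c [validated] - human Homo sapiens',
      'ORGANISM #formal_name Homo sapiens #common_name man',
      'ACCESSIONS A31764; A05676; I55192; A00001',
      'MGDVEKGKKIFIMKCSQCHTVE' ],
    [ 'ENTRY CCCZ #type complete',
      'TITLE cytochrome c - chimpanzee (tentative sequence)',
      'ORGANISM #formal_name Pan troglodytes #common_name chimpanzee',
      'ACCESSIONS A00002',
      'GDVEKGKKIFIMKCSQCHTSEK' ],
);

my @titles;
for my $record (@arr) {                           # $record is an array reference
    my ($title) = grep { /^TITLE\b/ } @$record;   # first field that starts with TITLE
    push @titles, $title if defined $title;
    print "Record has ", scalar(@$record), " fields; sequence is $record->[-1]\n";
}
print "$_\n" for @titles;
```

Individual fields can also be reached directly, e.g. $arr[0][1] for the second field of the first record.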
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Thank you very much...
i learnt how to do it, i can manage such problems myself later. thanks again