doubt in storing a data of 2 lines in an array.

heidi has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need a small clarification about the following program. I have a data file like this:

ENTRY            CCHU       #type complete
TITLE            cytochrome c [validated] - human
         Homo sapiens
ORGANISM         #formal_name Homo sapiens #common_name man
ACCESSIONS       A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
ENTRY            CCCZ       #type complete
TITLE            cytochrome c - chimpanzee (tentative sequence)
ORGANISM         #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS       A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
ENTRY            CCMQR      #type complete
TITLE            cytochrome c - rhesus macaque (tentative sequence)
         Macaca mulatta 
ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS       A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
ENTRY            CCMKP      #type complete
TITLE            cytochrome c - spider monkey
ORGANISM         #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS       A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR

I have written a program to save each and every line in a separate array. This is the program:

open (PIR,'/home/guest/sampir.txt');
my @arr = ();
while (<PIR>)
{
    chomp;
    if( /^ENTRY/ ) { $entry = $_ }
    elsif ( /^(TITLE)\s+(\S.*)/ ) { $title = "$1\n\t $2" }
    elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org = "$1\n\t $2" }
    elsif ( /^ACCESSIONS/ ) { $acc = $_ }
    else {
        push @se, $_;
    }
}

But the TITLE field is missing the 2nd line of its data; it holds only the first line. E.g., when I print the title of the first entry, it prints only "cytochrome c [validated] - human" and not the second line, "Homo sapiens". How do I get the second line printed on the same line as the first? Please help. Thanks.


Re: doubt in storing a data of 2 lines in an array. by davorg (Chancellor) on Oct 30, 2006 at 13:42 UTC
You are reading the file a line at a time. So when you process the TITLE line, $_ contains only the line that begins with TITLE, and that is all that ends up in $title. The rest of the TITLE data is processed on the next iteration of the loop (and probably ends up as the first element in @se).

There are a couple of ways to solve this:

- Write a more complex parser. Keep a note of the previous line's tag and, if the current line starts with whitespace, append its data to the end of the previous tag's data.
- Read the whole file into memory and parse it using more complex regular expressions.

-- <http://dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
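The first suggestion (remember the previous tag and append continuation lines to it) could look roughly like the sketch below. This is only an illustration, not davorg's code: the names `%field` and `$last_tag` are made up here, the sample lines are copied from the question instead of being read from the file, and the untagged sequence line is left out.

```perl
use strict;
use warnings;

# One record copied from the question; a real script would read the file.
my @lines = (
    'ENTRY            CCHU       #type complete',
    'TITLE            cytochrome c [validated] - human',
    '         Homo sapiens',
    'ORGANISM         #formal_name Homo sapiens #common_name man',
    'ACCESSIONS       A31764; A05676; I55192; A00001',
);

my %field;       # tag => accumulated text (hypothetical name)
my $last_tag;    # tag seen on the most recent tagged line

for my $line (@lines) {
    if ( $line =~ /^([A-Z]+)\s+(.*)/ ) {
        # An upper-case tag followed by blanks opens a new field.
        $last_tag = $1;
        $field{$last_tag} = $2;
    }
    elsif ( defined $last_tag and $line =~ /^\s+(\S.*)/ ) {
        # Leading whitespace marks a continuation of the previous field.
        $field{$last_tag} .= " $1";
    }
}

print "$field{TITLE}\n";
```

With this, the TITLE text and its continuation line end up in one string, joined by a single space.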
Re: doubt in storing a data of 2 lines in an array. by johngg (Canon) on Oct 30, 2006 at 14:30 UTC
davorg's reply suggested one way to approach the problem would be to read the whole file into memory then parse with regular expressions. The following script shows one possible way of doing this using two stages, the first to break into records and the second to break each record into fields. Here it is:

use strict;
use warnings;

my $rxRecord = qr
    {(?xs)
        (ENTRY.*?\n)
        (?=ENTRY|\z)
    };
my $rxFieldHdrs = qr{(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)};
my $rxField = qr
    {(?xs)
        ($rxFieldHdrs.*?\n)
        (?=$rxFieldHdrs|\z)
    };

my $fileText;
{
    local $/;
    $fileText = <DATA>;
}

my @records = $fileText =~ m{$rxRecord}g;
foreach my $record (@records)
{
    print qq{$record}, q{+} x 50, qq{\n};
    my @fields = $record =~ m{$rxField}g;
    foreach my $field (@fields)
    {
        print qq{$field}, q{-} x 50, qq{\n};
    }
    print q{*} x 50, qq{\n};
}

__END__
ENTRY            CCHU       #type complete
TITLE            cytochrome c [validated] - human
         Homo sapiens
ORGANISM         #formal_name Homo sapiens #common_name man
ACCESSIONS       A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
ENTRY            CCCZ       #type complete
TITLE            cytochrome c - chimpanzee (tentative sequence)
ORGANISM         #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS       A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
ENTRY            CCMQR      #type complete
TITLE            cytochrome c - rhesus macaque (tentative sequence)
         Macaca mulatta
ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS       A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
ENTRY            CCMKP      #type complete
TITLE            cytochrome c - spider monkey
ORGANISM         #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS       A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR

The output shows, for each record, the whole record and then each individual field. As you can see, your two-line title is preserved.

I hope this is of use.

Cheers,
JohnGG
Re^2: doubt in storing a data of 2 lines in an array. by Anonymous Monk on Oct 30, 2006 at 14:53 UTC
Hi John, thank you for the reply. The program works very well, and the coding was really smart. I just need to clarify one last doubt: I am not able to print the TITLE's content on one line; it prints on 2 lines whatever I do. Please reply. Thank you once again.
Re^3: doubt in storing a data of 2 lines in an array. by johngg (Canon) on Oct 30, 2006 at 15:09 UTC
You could either do a global substitution, something like `$field =~ s{\n}{ }g`, to replace any newline with a space, or you could achieve the same thing with `split` and `join`, something like `$field = join q{ }, split m{\n}, $field;`. In each case you are going to have to handle a big gap in your line because of the indentation of the second line of the title. However, this post should give you enough clues about `s{this}{the other}` to solve that for yourself. Big hint: `\s+` means one or more white-space characters.

Best of luck,
JohnGG
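For readers who want to check their answer to the hint, one possible solution is sketched below. It is an assumption about what JohnGG had in mind, not his code: the newline is replaced together with any whitespace around it, so the indentation of the continuation line disappears along with the line break. The sample value of `$field` is copied from the data in the question.

```perl
use strict;
use warnings;

# A two-line TITLE field as the record-splitting script would capture it.
my $field = "TITLE            cytochrome c [validated] - human\n"
          . "         Homo sapiens\n";

# Replace each newline, plus any whitespace around it, with one space.
$field =~ s{\s*\n\s*}{ }g;

# The final newline became a trailing space; remove it.
$field =~ s{\s+\z}{};

print "$field\n";
```

The gap between the TITLE tag and its text is untouched, because only whitespace adjacent to a newline is collapsed.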
Re^4: doubt in storing a data of 2 lines in an array. by Anonymous Monk on Oct 31, 2006 at 10:22 UTC
Re: doubt in storing a data of 2 lines in an array. by Hofmator (Curate) on Oct 30, 2006 at 13:57 UTC
heidi, as in your last posts, your program is a bit confusing, and your description and program do not really fit together. Here are a couple of pointers; maybe then you can clarify what you want to achieve.

- Your program declares a `@arr` variable but never uses it. Please try to post the smallest code possible that exhibits your problem.
- You are saying that you are saving each line of the file in a separate array. That is not the case. You are overwriting scalar variables (like $entry or $title) each time you encounter a matching line. So, for example, at the end of your loop you have the last ORGANISM line in the scalar variable $org.
- You are reading through your file line by line. How do you expect the 2nd line of a title to end up in the $title variable? None of your regular expressions match this 2nd line, so the else branch is executed and the line is pushed onto the array @se.

-- Hofmator
Code written by Hofmator and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
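The last point is easy to verify. The sketch below runs the same `if`/`elsif` chain from the question over the first record, with the lines hard-coded instead of read from the file (an assumption made so the snippet is self-contained) and the variables declared so it runs under `strict`.

```perl
use strict;
use warnings;

# First record from the question, without the sequence line.
my @lines = (
    'ENTRY            CCHU       #type complete',
    'TITLE            cytochrome c [validated] - human',
    '         Homo sapiens',
    'ORGANISM         #formal_name Homo sapiens #common_name man',
    'ACCESSIONS       A31764; A05676; I55192; A00001',
);

my ( $entry, $title, $org, $acc, @se );

for (@lines) {
    if    ( /^ENTRY/ )               { $entry = $_ }
    elsif ( /^(TITLE)\s+(\S.*)/ )    { $title = "$1\n\t $2" }
    elsif ( /^(ORGANISM)\s+(\S.*)/ ) { $org   = "$1\n\t $2" }
    elsif ( /^ACCESSIONS/ )          { $acc   = $_ }
    else                             { push @se, $_ }
}

print "title: $title\n";
print "first element of \@se: '$se[0]'\n";
```

Running it shows that `$title` holds only the first title line, while the indented "Homo sapiens" line is the first element of `@se`.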
Re: doubt in storing a data of 2 lines in an array. by Fletch (Bishop) on Oct 30, 2006 at 14:08 UTC
Not to mention that if this is a common format BioPerl may already have an interface to read it off the shelf.
Re^2: doubt in storing a data of 2 lines in an array. by Hofmator (Curate) on Oct 30, 2006 at 14:24 UTC
Good point! This page contains the available formats ...

-- Hofmator
Re: doubt in storing a data of 2 lines in an array. by shmem (Chancellor) on Oct 30, 2006 at 15:10 UTC
"i have written a program to save each and every line in a seperate array. this is the program"

... which does not meet its purpose, since you are saving all lines of your data file which don't begin with either `ENTRY`, `TITLE`, `ORGANISM` or `ACCESSIONS` into a single array which you name `@se`.

Let's look at your data file. It seems to be composed of multi-line records, in which each field begins on a separate line. Each field has an identifier up front (except the last field, which is just a sequence of chars with no blank in it), and some fields appear to be multi-line as well.

Since there is no record separator, you can only tell that all fields of a record are read when all field contents are read. Since your records appear to be ordered, I assume that is the case when that single-word line appears.

All fields are stored in an anonymous array, which is pushed onto an array when done reading. After storing each record, a new anonymous array is initialized for the next record:

my $file = '/home/guest/sampir.txt';
open (PIR, '<', $file) or die "Can't read '$file': $!\n";
my @arr = ();
my $se  = [];                # anonymous record array
while(<PIR>) {
    chomp;
    if (/^(\w+)\s+/)         # new field identifier, followed by blanks
    {
        push @$se, $_;
    }
    elsif (s/^\s+/ /)        # if we can strip leading blanks,
                             # it's a continuation line
    {
        $se->[-1] .= $_;     # append to last field of this record
    }
    elsif(/^\w+$/)           # must be the last field of the record
    {
        push @$se, $_;       # save the last field
        push @arr, $se;      # save the record array reference
        $se = [];            # and make a new array reference for the next record
    }
    else {
        die "Unknown line type at line $. of '$file'\n";
    }
}

Now you have all records in an array of arrays. See perldsc.

--shmem
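Once the records are in an array of arrays, getting at one field is a nested loop. The sketch below is an illustration only: `@arr` is built by hand here, with two abbreviated records standing in for what the loop above would produce, and `@titles` is a name invented for the example.

```perl
use strict;
use warnings;

# Hand-built stand-in for the array of arrays (records abbreviated,
# sequences shortened for the example).
my @arr = (
    [
        'ENTRY            CCHU       #type complete',
        'TITLE            cytochrome c [validated] - human Homo sapiens',
        'MGDVEKGKKIFIMKCSQCHTVE',
    ],
    [
        'ENTRY            CCCZ       #type complete',
        'TITLE            cytochrome c - chimpanzee (tentative sequence)',
        'GDVEKGKKIFIMKCSQCHTSEK',
    ],
);

my @titles;
for my $record (@arr) {          # each $record is an array reference
    for my $field (@$record) {
        # Split "TAG   value"; the bare sequence line has no blanks
        # and therefore does not match.
        if ( $field =~ /^(\w+)\s+(.*)/ and $1 eq 'TITLE' ) {
            push @titles, $2;
        }
    }
}

print "$_\n" for @titles;
```

Because the continuation line was already appended to its field while reading, each title comes out on a single line.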
Re^2: doubt in storing a data of 2 lines in an array. by Anonymous Monk on Oct 31, 2006 at 10:20 UTC
Thank you very much... I learnt how to do it, and I can manage such problems myself from now on. Thanks again.

