Finding first block of contiguous elements in an array

FamousLongAgo has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monks!

I have been writing a parser for some Protein Data Bank files, for a bioinformatics project. I have no problem extracting the sequences I need, but I am stumped by the titles. Here's the problem:

The files start out in this format:

HEADER    METAL BINDING PROTEIN                   31-AUG-98   1BSW
TITLE     ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
TITLE    2 7.5
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ACUTOLYSIN A;
...

The lines beginning with TITLE are the ones I'm interested in grabbing. There's a little caveat in that after the first line, the line number gets prepended to the title fragment. So in this example, the actual title is "Acutolysin A from snake venom of agkistrodon acutus at pH 7.5".
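The continuation rule described above can be sketched for a single line as follows. This is only an illustration: `title_fragment` is a hypothetical helper name, and the regex assumes that a first TITLE line never begins with a bare number, which the sample does not guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text fragment of one TITLE line,
# or nothing if the line is not a TITLE line.
sub title_fragment {
    my ($line) = @_;
    return unless $line =~ /^TITLE\s+(.*)$/;
    my $text = $1;
    $text =~ s/^\d+\s+//;    # drop a leading continuation number, if any
    $text =~ s/\s+$//;       # drop trailing padding
    return $text;
}

my @lines = (
    "TITLE     ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH   ",
    "TITLE    2 7.5                                                        ",
);

# Join the fragments with single spaces to rebuild the title.
print join( ' ', map { title_fragment($_) } @lines ), "\n";
```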

So far so dull. But later in the file, sometimes much later, there may be lines that also begin with TITLE. We want to ignore those.
Assuming the following constraints:

We treat the file as an array ( no slurping into a scalar )
There is no way to distinguish the later TITLE elements by pattern matching.

Can anyone think of an elegant way to grab the first block of 1+ contiguous TITLE lines, and stop?

I know how to do this with regular expressions on a scalar, and how to do it in a very inelegant way by setting flags in a loop, but I suspect there is greater wisdom out there and can't wait to learn.

Special bonus to anyone who can tell me what an agkistrodon acutus is, and how deadly is its bite.


Replies are listed 'Best First'.
Re: Finding first block of contiguous elements in an array by dws (Chancellor) on Dec 21, 2002 at 06:10 UTC
"Special bonus to anyone who can tell me what an agkistrodon acutus is, and how deadly is its bite."

That would be the Formosan Hundred-pace (or conehead) snake, whose bite causes skin and muscles to darken and deteriorate, accompanied by blistering and blood-tinted discharge and a slight burning sensation.

Assuming you've opened the file using the filehandle FILE, the following should work:

    my $title;
    while ( <FILE> ) {
        # Save the capture inside the loop; $1 does not survive past the enclosing block.
        if ( /^TITLE\s+(.*?)\s*$/ ) { $title = $1; last; }
    }
    while ( <FILE> ) {
        # Continuation lines carry a line number before the text.
        last if not /^TITLE\s+\d+\s+(.*?)\s*$/;
        $title .= ' ' . $1;
    }
Re: Finding first block of contiguous elements in an array by hossman (Prior) on Dec 21, 2002 at 06:05 UTC
It's not clear if the only thing you want is the first occurrence of TITLE, but assuming it is, the simplest thing to do is to stop processing your input stream once you are done with the first set of title lines. And if you want to avoid "flag" variables, you can always use a nested loop over the input handle. Something like this (pseudo-Perl) perhaps...

    while (<STDIN>) {
        next unless /^TITLE\s+(.*?)\s*$/;
        my $title = $1;
        # The inner loop consumes the numbered continuation lines.
        while (<STDIN>) {
            last unless /^TITLE\s+\d+\s+(.*?)\s*$/;
            $title .= " $1";
        }
        print "Here is the title you wanted: $title\n";
        last;
    }
Re: Finding first block of contiguous elements in an array by tachyon (Chancellor) on Dec 21, 2002 at 06:10 UTC
If your files contain multiple headers, just flag it and get on with it. If not, just exit once the initial run of TITLE tokens runs out.

    my $header = 0;
    my $title  = 0;
    my $string = '';
    while (<DATA>) {
        $header = 1 if /^HEADER/;
        $title  = 1 if /^TITLE/ and $header;
        if ( $header and $title ) {
            # Drop the optional continuation number; keep fragments space-separated.
            if ( /^TITLE\s+(?:\d+\s+)?(.*)/ ) {
                $string .= "$1 ";
            }
            else {
                # First non-TITLE line after the block: print and reset.
                $header = $title = 0;
                $string =~ s/\s+/ /g;
                $string =~ s/\s+$//;
                print "$string\n";
                $string = '';
            }
        }
    }
    __DATA__
    HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
    TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
    TITLE 2 7.5
    COMPND MOL_ID: 1;
    COMPND 2 MOLECULE: ACUTOLYSIN A;
    TITLE NO 2 7.5
    TITLE NO 2 7.5
    HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
    TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
    TITLE 1 2 3 4 5
    COMPND MOL_ID: 1;
    COMPND 2 MOLECULE: ACUTOLYSIN A;
    TITLE NO 2 7.5
    TITLE NO 2 7.5

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Finding first block of contiguous elements in an array by Aristotle (Chancellor) on Dec 21, 2002 at 12:21 UTC
To propose a really fitting solution, we need to see some code. How do you extract the title? Do you read it in a separate run through the file? Do you have a loop that does one of several things depending on what the current line starts with? What else is your code doing? The solution will differ depending on your existing implementation.

I am guessing that: the information is all located in a single file, you're only doing one iteration over it, and all pieces of information follow the format you already showed (i.e., if broken across multiple lines, the following lines start with the same tag followed by a line number).

In that case, the way I'd handle this is to read the lines batchwise, reconstruct them into a single line, then hand it off to the appropriate handler.

    my %handler = (
        HEADER => sub { ... },
        TITLE  => sub { ... },
        COMPND => sub { ... },
    );

    my ($tag, $text) = ("") x 2;
    while (<>) {
        chomp;
        my ($curr_tag, $curr_text) = split /\s+/, $_, 2;
        if ($curr_tag ne $tag) {
            # The tag changed, so the previous record is complete.
            $handler{$tag}->($tag, $text) if exists $handler{$tag};
            # complain_about_unknown() if not exists $handler{$tag}; ?
            ($tag, $text) = ($curr_tag, "");
        }
        else {
            my $curr_linenr;
            ($curr_linenr, $curr_text) = split /\s+/, $curr_text, 2;
            # perform validation on line nr here?
        }
        $text .= " " . $curr_text;
    }
    # Hand off the final record, which no tag change follows.
    $handler{$tag}->($tag, $text) if exists $handler{$tag};

So now we have a parser that lets us write handlers for the tags that don't individually need to worry about multiple line text. And then the distinction is painless:

    my %record;
    my %handler = (
        # ...
        TITLE => sub { $record{TITLE} = $_[1] unless exists $record{TITLE} },
        # ...
    );

Or if there are multiple records per file:

    my $curr_rec = 0;
    my @record;
    my %handler = (
        # ...
        HEADER => sub { ++$curr_rec },
        TITLE  => sub {
            $record[$curr_rec]->{TITLE} = $_[1]
                unless exists $record[$curr_rec]->{TITLE}
        },
        # ...
    );

You get the idea.

Makeshifts last the longest.
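For readers who want to try the handler approach end to end, here is a minimal runnable sketch of it. It is not Aristotle's exact code: the `dispatch` helper, the in-memory `@lines` array (used instead of `<>`) and the sample data are assumptions made for illustration, and only a TITLE handler is filled in.

```perl
use strict;
use warnings;

my %record;
my %handler = (
    # Keep only the first TITLE record seen; later ones are ignored.
    TITLE => sub { $record{TITLE} = $_[1] unless exists $record{TITLE} },
);

# Hypothetical helper: pass one reassembled record to its handler, if any.
sub dispatch {
    my ($tag, $text) = @_;
    $handler{$tag}->($tag, $text) if exists $handler{$tag};
}

# Sample data (assumed): the first block plus a later stray TITLE line.
my @lines = (
    "HEADER    METAL BINDING PROTEIN                   31-AUG-98   1BSW",
    "TITLE     ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH",
    "TITLE    2 7.5",
    "COMPND    MOL_ID: 1;",
    "TITLE     A LATER TITLE THAT SHOULD BE IGNORED",
);

my ($tag, $text) = ('', '');
for my $line (@lines) {
    next unless $line =~ /\S/;                  # skip blank lines
    my ($curr_tag, $curr_text) = split /\s+/, $line, 2;
    $curr_text = '' unless defined $curr_text;
    if ($curr_tag ne $tag) {
        dispatch($tag, $text);                  # previous record is complete
        ($tag, $text) = ($curr_tag, $curr_text);
    }
    else {
        # Continuation line: discard the leading line number.
        my (undef, $rest) = split /\s+/, $curr_text, 2;
        $text .= ' ' . (defined $rest ? $rest : '');
    }
}
dispatch($tag, $text);                          # flush the final record

print "$record{TITLE}\n";
```

The `exists` check in the TITLE handler is what makes later TITLE blocks harmless: they are still reassembled and dispatched, but the handler declines to overwrite the first one.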
Re: Finding first block of contiguous elements in an array by Arien (Pilgrim) on Dec 21, 2002 at 16:29 UTC
"Can anyone think of an elegant way to grab the first block of 1+ contiguous TITLE lines, and stop?"

    my $title;
    for (@data) {
        if (/^TITLE\s+\d*(.*)/) {
            $title .= $1;
        }
        else {
            last if defined $title;
        }
    }

The elegance would be in not adding a flag but rather reusing `$title` for that purpose (since it already contains the state the flag would save).

-- Arien
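A self-contained variation on the same idea, for anyone who wants to run it against the sample from the question. The `first_title` name and the `@data` contents are assumptions for illustration; here the list of collected fragments doubles as the "have we started yet" state, and fragments are trimmed and joined with single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first contiguous run of TITLE lines,
# joined into one string, or undef if there is no TITLE line at all.
sub first_title {
    my @fragments;
    for my $line (@_) {
        if ( $line =~ /^TITLE\s+(?:\d+\s+)?(.*?)\s*$/ ) {
            push @fragments, $1;
        }
        elsif (@fragments) {
            last;    # the first TITLE block has just ended
        }
    }
    return @fragments ? join( ' ', @fragments ) : undef;
}

my @data = (
    "HEADER    METAL BINDING PROTEIN                   31-AUG-98   1BSW",
    "TITLE     ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH",
    "TITLE    2 7.5",
    "COMPND    MOL_ID: 1;",
    "TITLE     A LATER TITLE THAT SHOULD BE IGNORED",
);

# prints: ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH 7.5
print first_title(@data), "\n";
```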
Re: Re: Finding first block of contiguous elements in an array by BrowserUk (Patriarch) on Dec 21, 2002 at 16:36 UTC
~~'cept that that will bottle out too early unless the first line is a title line (which it isn't)?~~

Update: Talking That which begins with B and ends in ollocks again. Arien++

Examine what is said, not who speaks.

