PerlMonks  

Finding first block of contiguous elements in an array

by FamousLongAgo (Friar)
on Dec 21, 2002 at 05:16 UTC ( [id://221570]=perlquestion )

FamousLongAgo has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monks!

I have been writing a parser for some Protein Data Bank files, for a bioinformatics project. I have no problem extracting the sequences I need, but I am stumped by the titles. Here's the problem:

The files start out in this format:
HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
TITLE 2 7.5
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ACUTOLYSIN A;
...
The lines beginning with TITLE are the ones I'm interested in grabbing. There's a little caveat in that after the first line, the line number gets prepended to the title fragment. So in this example, the actual title is "Acutolysin A from snake venom of agkistrodon acutus at pH 7.5".

So far so dull. But later in the file, sometimes much later, there may be lines that also begin with TITLE. We want to ignore those.
Assuming the following constraints:
  1. We treat the file as an array ( no slurping into a scalar )
  2. There is no way to distinguish the later TITLE elements by pattern matching.
Can anyone think of an elegant way to grab the first block of 1+ contiguous TITLE lines, and stop?

I know how to do this with regular expressions on a scalar, and how to do it in a very inelegant way by setting flags in a loop, but I suspect there is greater wisdom out there and can't wait to learn.

Special bonus to anyone who can tell me what an agkistrodon acutus is, and how deadly is its bite.

Replies are listed 'Best First'.
Re: Finding first block of contiguous elements in an array
by dws (Chancellor) on Dec 21, 2002 at 06:10 UTC
    Special bonus to anyone who can tell me what an agkistrodon acutus is, and how deadly is its bite.

    That would be the Formosan Hundred-pace (or conehead) snake, whose bite causes skin and muscles to darken and deteriorate, accompanied by blistering and blood-tinted discharge and a slight burning sensation.

    Assuming you've opened the file using the filehandle FILE, the following should work:

    my $title;
    while ( <FILE> ) {
        # capture inside the loop: $1 does not survive past the enclosing block
        if ( /^TITLE\s+(.*)$/ ) { $title = $1; last; }
    }
    while ( <FILE> ) {
        last if not /^TITLE\s+(\d+)\s+(.*)/;
        $title .= ' ' . $2;
    }
Re: Finding first block of contiguous elements in an array
by hossman (Prior) on Dec 21, 2002 at 06:05 UTC

    It's not clear if the only thing you want is the first occurrence of TITLE, but assuming it is, the simplest thing to do is to just stop processing your input stream once you are done with the first set of title lines. And if you want to avoid "flag" variables, you can always use a nested loop over the input handle.

    something like this (pseudo-perl) perhaps...

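    while (<STDIN>) {
        next unless /^TITLE\s+(.*)$/;
        my $title = $1;
        # inner loop eats the numbered continuation lines from the same handle
        while (<STDIN>) {
            last unless /^TITLE\s+\d+\s+(.*)$/;
            $title .= " $1";    # space keeps the fragments from running together
        }
        print "Here is the title you wanted: $title\n";
        last;
    }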
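To try the nested-loop idea without a real PDB file, here is a minimal sketch that reads from an in-memory filehandle. The sub name first_title and the sample lines (including the late TITLE line) are made up for illustration, and it assumes continuation lines carry a numeric counter right after TITLE, as described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first contiguous TITLE block read from
# a filehandle, joined into one string, or nothing if no TITLE is found.
sub first_title {
    my ($fh) = @_;
    while ( my $line = <$fh> ) {
        next unless $line =~ /^TITLE\s+(.*?)\s*$/;
        my $title = $1;
        # Inner loop consumes numbered continuation lines from the same handle.
        while ( my $next = <$fh> ) {
            last unless $next =~ /^TITLE\s+\d+\s+(.*?)\s*$/;
            $title .= " $1";
        }
        return $title;    # stop reading as soon as the block ends
    }
    return;               # no TITLE line at all
}

# Sample data modelled on the question; the final TITLE line is an
# invented "later" occurrence that should be ignored.
my $sample = <<'END';
HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
TITLE 2 7.5
COMPND MOL_ID: 1;
TITLE 2 SOMETHING LATER
END

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print first_title($fh), "\n";
close $fh;
```

Because the sub returns from inside the outer loop, nothing after the first block is ever read, so no flag variable is needed.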

Re: Finding first block of contiguous elements in an array
by tachyon (Chancellor) on Dec 21, 2002 at 06:10 UTC

    If your files contain multiple headers, just flag it and get on with it. If not, just exit after the initial run of TITLE tokens runs out.

    my $header = 0;
    my $title  = 0;
    my $string = '';
    while (<DATA>) {
        $header = 1 if /^HEADER/;
        $title  = 1 if /^TITLE/ and $header;
        if ( $header and $title ) {
            if ( /^TITLE\s+(.*)/ ) {
                # join fragments with a space so words do not run together
                $string .= ( length $string ? ' ' : '' ) . $1;
            }
            else {
                # first non-TITLE line ends the block: print and reset
                $header = $title = 0;
                $string =~ s/\s+/ /g;
                print "$string\n";
                $string = '';
            }
        }
    }
    __DATA__
    HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
    TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
    TITLE 2 7.5
    COMPND MOL_ID: 1;
    COMPND 2 MOLECULE: ACUTOLYSIN A;
    TITLE NO 2 7.5
    TITLE NO 2 7.5
    HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW
    TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH
    TITLE 1 2 3 4 5
    COMPND MOL_ID: 1;
    COMPND 2 MOLECULE: ACUTOLYSIN A;
    TITLE NO 2 7.5
    TITLE NO 2 7.5

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Finding first block of contiguous elements in an array
by Aristotle (Chancellor) on Dec 21, 2002 at 12:21 UTC

    To propose a really fitting solution, we need to see some code.

    How do you extract the title? Do you read it in a separate run through the file? Do you have a loop that does one of several things depending on what the current line starts with? What else is your code doing? The solution will differ depending on your existing implementation.

    I am guessing that: the information is all located in a single file, you're only doing one iteration over it, and all pieces of information follow the format you already showed (i.e. if broken across multiple lines, the following lines start with the same tag followed by a line number).

    In that case, the way I'd handle this is to read the lines batchwise, reconstruct them into a single line, then hand it off to the appropriate handler.

    my %handler = (
        HEADER => sub { ... },
        TITLE  => sub { ... },
        COMPND => sub { ... },
    );

    my ($tag, $text) = ("") x 2;

    while (<>) {
        chomp;
        my ($curr_tag, $curr_text) = split /\s+/, $_, 2;
        if ($curr_tag ne $tag) {
            # tag changed: hand the completed block to its handler
            $handler{$tag}->($tag, $text) if exists $handler{$tag};
            # complain_about_unknown() if not exists $handler{$tag}; ?
            ($tag, $text) = ($curr_tag, "");
        }
        else {
            my $curr_linenr;
            ($curr_linenr, $curr_text) = split /\s+/, $curr_text, 2;
            # perform validation on line nr here?
        }
        $text .= " " . $curr_text;
    }
    # flush the final block, which the loop itself never hands off
    $handler{$tag}->($tag, $text) if exists $handler{$tag};
    So now we have a parser that lets us write handlers for the tags that don't individually need to worry about multiple line text. And then the distinction is painless:
    my %record;
    my %handler = (
        # ...
        TITLE => sub {
            $record{TITLE} = $_[1] unless exists $record{TITLE}
        },
        # ...
    );
    Or if there are multiple records per file:
    my $curr_rec = 0;
    my @record;
    my %handler = (
        # ...
        HEADER => sub { ++$curr_rec },
        TITLE  => sub {
            $record[$curr_rec]->{TITLE} = $_[1]
                unless exists $record[$curr_rec]->{TITLE}
        },
        # ...
    );
    You get the idea.
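For anyone who wants to run the batch-and-dispatch idea end to end, here is a rough sketch for the single-record case. The sub name parse_record, the shared keep-first handler, and the sample lines are all assumptions made for the example; only the overall shape (collect lines with the same tag, strip continuation numbers, hand the joined text to a per-tag handler) comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: group consecutive lines by tag, join each group,
# and keep only the first joined value seen for each handled tag.
sub parse_record {
    my @lines = @_;
    my %record;

    # Handlers receive (tag, joined text); this one ignores repeats.
    my $keep_first = sub {
        my ($tag, $text) = @_;
        $record{$tag} = $text unless exists $record{$tag};
    };
    my %handler = ( TITLE => $keep_first, COMPND => $keep_first );

    my ($tag, @parts) = ('');
    my $flush = sub {
        $handler{$tag}->( $tag, join ' ', @parts ) if exists $handler{$tag};
    };

    for my $line (@lines) {
        my ($curr_tag, $curr_text) = split ' ', $line, 2;
        next unless defined $curr_tag;            # skip blank lines
        $curr_text = '' unless defined $curr_text;
        if ( $curr_tag ne $tag ) {
            $flush->();                           # previous block is complete
            ($tag, @parts) = ($curr_tag);
        }
        else {
            $curr_text =~ s/^\d+\s+//;            # drop the continuation number
        }
        $curr_text =~ s/\s+$//;                   # trim newline and padding
        push @parts, $curr_text;
    }
    $flush->();                                   # hand off the final block
    return \%record;
}

# Sample lines modelled on the question; the last one is an invented
# later TITLE that the keep-first handler should ignore.
my @lines = (
    "HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW\n",
    "TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH\n",
    "TITLE 2 7.5\n",
    "COMPND MOL_ID: 1;\n",
    "COMPND 2 MOLECULE: ACUTOLYSIN A;\n",
    "TITLE A LATER TITLE LINE TO IGNORE\n",
);
my $record = parse_record(@lines);
print "$record->{TITLE}\n";
```

Collecting fragments in an array and joining once at flush time avoids the leading space that repeated string concatenation would leave behind.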

    Makeshifts last the longest.

Re: Finding first block of contiguous elements in an array
by Arien (Pilgrim) on Dec 21, 2002 at 16:29 UTC
    Can anyone think of an elegant way to grab the first block of 1+ contiguous TITLE lines, and stop?
    my $title;
    for (@data) {
        if (/^TITLE\s+\d*(.*)/) {
            $title .= $1;
        }
        else {
            last if defined $title;
        }
    }

    The elegance would be in not adding a flag but rather reusing $title for that purpose (since it already contains the state the flag would save).
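A self-contained variation on the same trick, for anyone who wants to test it against an array with a late TITLE line. The sub name first_title_block and the sample array are invented for the example, and unlike the snippet above it only strips a leading number from continuation lines, which is an assumption about how the counter is laid out.

```perl
use strict;
use warnings;

# Hypothetical helper: scan an array of lines and return the first
# contiguous run of TITLE lines joined into one string (undef if none).
sub first_title_block {
    my @lines = @_;
    my $title;    # doubles as the "have we started?" state
    for my $line (@lines) {
        if ( $line =~ /^TITLE\s+(.*?)\s*$/ ) {
            my $fragment = $1;
            # Only continuation lines carry a counter, so strip it there.
            $fragment =~ s/^\d+\s+// if defined $title;
            $title = defined $title ? "$title $fragment" : $fragment;
        }
        elsif ( defined $title ) {
            last;    # block ended; later TITLE lines are never examined
        }
    }
    return $title;
}

# Sample array modelled on the question, plus an invented late TITLE line.
my @data = (
    "HEADER METAL BINDING PROTEIN 31-AUG-98 1BSW\n",
    "TITLE ACUTOLYSIN A FROM SNAKE VENOM OF AGKISTRODON ACUTUS AT PH\n",
    "TITLE 2 7.5\n",
    "COMPND MOL_ID: 1;\n",
    "TITLE 2 SOMETHING LATER\n",
);
print first_title_block(@data), "\n";
```

Lines before the first TITLE fall through the elsif without ending the loop, because $title is still undefined at that point.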

    — Arien

      'cept that that will bottle out too early unless the first line is a title line (which it isn't)?

      Update: Talking That which begins with B and ends in ollocks again. Arien++


      Examine what is said, not who speaks.
