Iterating through file to find specific subsets of lines

by thegirlm0nkey (Initiate)
on Dec 04, 2013 at 15:04 UTC (#1065602)
thegirlm0nkey has asked for the wisdom of the Perl Monks concerning the following question:

Hi - getting extremely stuck and would love some insight! I have a file which contains sequential numbers (actually genomic co-ordinates, I'm an amateur bioinformatician!) and an associated score. I need to extract regions where the score dips below a certain level. The file looks something like this:
1 50
2 50
3 1
4 10
5 49
6 8
7 50
8 5
9 5
10 40
So in this example - the first number on each line is the co-ordinate, and the second is the score. I need all the regions scoring less than 50, so for the small example above, I would get something like:
3 6
8 10
Hope that makes sense - I'm basically looking for the first and last positions where the score is less than 50. So far I have slurped the file into an array like this:
foreach my $line (@lines) {
    chomp $line;
    my @columns = split(/\t/, $line);
    my $score = $columns[1];
    if ($score < 50) {
        #something here...
    }
}
But I'm stuck with the 'something here' - I need to keep track of the first time a score of less than 50 is seen, and the last time it is seen before it goes above 50, and capture the two corresponding $columns[0] numbers. Really hope I've explained this properly! TIA.

Replies are listed 'Best First'.
Re: Iterating through file to find specific subsets of lines
by toolic (Bishop) on Dec 04, 2013 at 15:19 UTC
    Keep track of the positions in an array:
    use warnings;
    use strict;

    my @lines = <DATA>;
    my @pos;
    foreach my $line (@lines) {
        chomp $line;
        my @columns = split( /\s+/, $line );
        my $score = $columns[1];
        if ($score < 50) {
            push @pos, $columns[0];
        }
        else {
            print "@pos[0, -1]\n" if @pos;
            @pos = ();
        }
    }
    print "@pos[0, -1]\n" if @pos;

    __DATA__
    1 50
    2 50
    3 1
    4 10
    5 49
    6 8
    7 50
    8 5
    9 5
    10 40

    Note: I changed \t to \s+ just to create a self-contained example.

      Thank you! Exactly what I needed!
Re: Iterating through file to find specific subsets of lines
by jethro (Monsignor) on Dec 04, 2013 at 15:23 UTC
    my $dipped=0;
    my $column;
    foreach ...
        ...
        if ($score < 50) {
            $dipped= $columns[0] if (not $dipped);
        } else {
            print "$dipped $column\n" if ($dipped);
            $dipped=0;
        }
        $column= $columns[0];
    }
    print "$dipped $column\n" if ($dipped);

    Untested. $dipped stores the first coordinate whose score dips under 50 in a "dip region"; by being non-zero it also signals that you are inside such a region. The $column variable is needed to get the coordinate of the last line out of the foreach loop if a dip region lasts until the end of the file. It could be avoided by declaring @columns outside the loop.

    UPDATE: Removed the off-by-one error found by toolic, using $column to store $columns[0] from the previous loop step.

      Using your scalar would be more efficient than my array solution... if you could get rid of your off-by-1 error (tested).
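For reference, the scalar-tracking idea described above can be sketched as a self-contained script. This is only an illustration, not code posted in the thread: the helper name find_dips and its threshold argument are hypothetical, and it uses undef rather than 0 as the "not in a region" marker, on the assumption that a coordinate of 0 could occur in real data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of [first, last]
# coordinate pairs, one per run of consecutive lines scoring below $threshold.
sub find_dips {
    my ($threshold, @lines) = @_;
    my @regions;
    my ($start, $prev);    # $start stays undef while outside a dip region

    for my $line (@lines) {
        my ($pos, $score) = split ' ', $line;
        next unless defined $score;    # skip blank or one-column lines

        if ($score < $threshold) {
            $start = $pos unless defined $start;    # first line of a new region
        }
        elsif (defined $start) {
            push @regions, [$start, $prev];    # region ended on the previous line
            undef $start;
        }
        $prev = $pos;    # remember this coordinate for the next iteration
    }

    # A region that runs to the end of the input is closed here.
    push @regions, [$start, $prev] if defined $start;
    return @regions;
}

my @sample = ("1 50", "2 50", "3 1", "4 10", "5 49",
              "6 8", "7 50", "8 5", "9 5", "10 40");
print "@$_\n" for find_dips(50, @sample);
```

With the sample data from the question this prints "3 6" and "8 10". Under the same reading of the question, an input in which every score is below the threshold (for example 1 1, 2 49, 3 1) yields the single region "1 3".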
Re: Iterating through file to find specific subsets of lines
by choroba (Bishop) on Dec 04, 2013 at 15:20 UTC
    What output do you expect for the following input?
    1 1
    2 49
    3 1

    Sorry, did not understand the question.

    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
