PerlMonks  

Re: Looking for series in consecutive lines of a file

by toolic (Bishop)
on Feb 17, 2015 at 02:00 UTC ( [id://1116938]=note )


in reply to Looking for series in consecutive lines of a file

This produces the desired output:
use warnings;
use strict;

my %data;
while (<DATA>) {
    my @cols = split;
    push @{ $data{$cols[0]} }, $cols[1] if $cols[3] > 4;
}

for my $k (sort keys %data) {
    my $i   = 1;
    my @ns  = @{ $data{$k} };
    my $n0  = shift @ns;
    my @ns2 = $n0;
    for my $n (@ns) {
        if ($n == $n0+1) {
            push @ns2, $n;
            $i++;
        }
        else {
            print "$k,$ns2[0],$ns2[-1],$i\n";
            @ns2 = $n;
            $i = 1;
        }
        $n0 = $n;
    }
    print "$k,$ns2[0],$ns2[-1],$i\n";
}

__DATA__
C10000035 12 C 4 ....^>. HHFCC
C10000035 13 C 6 .....^>. HHFFCC
C10000035 14 C 6 ...... JHFFCC
C10000035 15 C 6 ...... IHFFFC
C10000035 16 A 4 .GG...^>G JGHFFFC
C10000035 17 C 7 ....... JGHFFFC
C10000035 18 C 8 .......^]. JIHHFFC@
C10000035 19 A 8 ........ IJHHFFFC
C10000035 20 C 9 ..T...T.^]. JIHGHFF@C
C10000035 21 G 10 A........^]. AJJHHHFDCC
C10000040 30 C 5 ....^>. HHFCC
C10000040 31 C 6 .....^>. HHFFCC
C10000040 32 C 6 ...... JHFFCC
C10000040 33 C 6 ...... IHFFFC
C10000040 34 C 4 ...... IHFFFC
C10000040 35 C 4 ...... IHFFFC
C10000040 36 C 4 ...... IHFFFC
C10000040 37 C 6 ...... IHFFFC
C10000040 38 C 6 ...... IHFFFC

Re^2: Looking for series in consecutive lines of a file
by mbp (Novice) on Feb 17, 2015 at 05:27 UTC

    Hi toolic, thanks very much for that, I have it working now on my end also so that it accepts the data from an input file.

    As I understand it, your script reads all of the data into the hash 'data' before running the analysis. Is this correct? If so, is there a way to do the analysis piecemeal instead, in other words line by line? The reason I ask is that the files I have to deal with are quite large (up to 80G or so), so it may not be efficient, or even possible, to store them in memory rather than parsing them line by line.

    I apologize, I should have made this clear in my original post - but that is why I had my original strategy of reading line by line and storing the first 'qualifying' line of a set and comparing it to the subsequent lines, and then starting the process over for each new set of consecutive lines.

    Thanks again, and thanks in advance if you are able to give advice on a line-by-line version. Cheers. MBP

      Hello mbp,

      Since the input lines are known to be sorted, it is feasible to reduce memory requirements by reading the input file line-by-line. Here is one approach:

      #! perl
      use strict;
      use warnings;
      use constant MIN_DEPTH => 5;

      my ($chromosome, $position, undef, $coverage_depth) = split /\s+/, <DATA>;
      my %series = (
          name  => $chromosome,
          start => $position,
          end   => $position,
          depth => $coverage_depth,
      );

      while (<DATA>) {
          ($chromosome, $position, undef, $coverage_depth) = split /\s+/;

          if ($series{name}   eq $chromosome   &&
              $series{end}    == $position - 1 &&
              $series{depth}  >= MIN_DEPTH     &&
              $coverage_depth >= MIN_DEPTH)
          {
              $series{end} = $position;
          }
          else {
              display_series();
              %series = (
                  name  => $chromosome,
                  start => $position,
                  end   => $position,
                  depth => $coverage_depth,
              );
          }
      }

      display_series();

      sub display_series {
          if ($series{depth} >= MIN_DEPTH) {
              print join(',', $series{name}, $series{start}, $series{end},
                              $series{end} - $series{start} + 1), "\n";
          }
      }

      __DATA__
      C10000035 12 C 4 ....^>. HHFCC
      C10000035 13 C 6 .....^>. HHFFCC
      C10000035 14 C 6 ...... JHFFCC
      C10000035 15 C 6 ...... IHFFFC
      C10000035 16 A 4 .GG...^>G JGHFFFC
      C10000035 17 C 7 ....... JGHFFFC
      C10000035 18 C 8 .......^]. JIHHFFC@
      C10000035 19 A 8 ........ IJHHFFFC
      C10000035 20 C 9 ..T...T.^]. JIHGHFF@C
      C10000035 21 G 10 A........^]. AJJHHHFDCC
      C10000040 30 C 5 ....^>. HHFCC
      C10000040 31 C 6 .....^>. HHFFCC
      C10000040 32 C 6 ...... JHFFCC
      C10000040 33 C 6 ...... IHFFFC
      C10000040 34 C 4 ...... IHFFFC
      C10000040 35 C 4 ...... IHFFFC
      C10000040 36 C 4 ...... IHFFFC
      C10000040 37 C 6 ...... IHFFFC
      C10000040 38 C 6 ...... IHFFFC

      Output:

      16:48 >perl 1157_SoPW.pl
      C10000035,13,15,3
      C10000035,17,21,5
      C10000040,30,33,4
      C10000040,37,38,2

      16:50 >
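For input that lives in a file rather than under __DATA__, the same line-by-line logic could be wrapped in a subroutine that takes a filehandle. What follows is only a minimal sketch, not part of the script above: the name `scan_series` and the `$emit` callback are hypothetical, and it assumes whitespace-separated columns with the position in column 2 and the depth in column 4, sorted as in the sample data. Only the run currently being built is held in memory.

```perl
#! perl
use strict;
use warnings;
use constant MIN_DEPTH => 5;

# Hypothetical helper (not from the thread): read lines from $fh one at a
# time and call $emit->($name, $start, $end, $length) for every run of
# consecutive positions whose depth is at least MIN_DEPTH.
sub scan_series {
    my ($fh, $emit) = @_;
    my %series;    # empty hash means "no run in progress"

    while (my $line = <$fh>) {
        my ($chromosome, $position, undef, $coverage_depth) = split ' ', $line;
        next unless defined $coverage_depth;    # skip blank or short lines

        my $deep = $coverage_depth >= MIN_DEPTH;

        # Extend the current run if this line continues it.
        if (%series && $deep
            && $series{name} eq $chromosome
            && $series{end}  == $position - 1)
        {
            $series{end} = $position;
            next;
        }

        # Otherwise report the finished run (if any) and maybe start a new one.
        $emit->(@series{qw(name start end)}, $series{end} - $series{start} + 1)
            if %series;
        %series = $deep
            ? (name => $chromosome, start => $position, end => $position)
            : ();
    }

    # Report the run still open at end of input.
    $emit->(@series{qw(name start end)}, $series{end} - $series{start} + 1)
        if %series;
    return;
}

# Demonstration on the first few sample lines, read through an
# in-memory filehandle so the sketch runs without an input file.
my $sample = <<'END_SAMPLE';
C10000035 12 C 4 ....^>. HHFCC
C10000035 13 C 6 .....^>. HHFFCC
C10000035 14 C 6 ...... JHFFCC
C10000035 15 C 6 ...... IHFFFC
C10000035 16 A 4 .GG...^>G JGHFFFC
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
scan_series($fh, sub { print join(',', @_), "\n" });    # prints C10000035,13,15,3
close $fh;
```

To run it against a real file, open that file with the three-argument `open` and pass the resulting handle in place of the in-memory one. Using a callback rather than returning a list means results are printed as soon as each run ends, which seems the safer choice for very large inputs.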

      Hope that helps,

      Athanasius <°(((>< contra mundum
      Iustus alius egestas vitae, eros Piratica,

        I see you're trying not to overwhelm the new guy, but introducing "use constant" and not introducing subroutine arguments, references? Eeew :)

        ...
        display_series( \%series );
        ...
        display_series( \%series );
        ...
        sub display_series {
            my( $series ) = @_;
            if ($series->{depth} >= MIN_DEPTH) {
                print join(',', $series->{name}, $series->{start}, $series->{end},
                                $series->{end} - $series->{start} + 1), "\n";
            }
        }
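The fragment above elides the surrounding script. As a minimal self-contained sketch of the same idea (handing the series to the subroutine by reference instead of relying on a file-scoped %series), something like the following should work. The name `series_line` is hypothetical, and returning the line instead of printing it is an assumption made here so the result can be inspected; MIN_DEPTH is taken from Athanasius's script.

```perl
use strict;
use warnings;
use constant MIN_DEPTH => 5;

# Hypothetical variant of display_series: takes a hash reference and
# returns the output line, or undef (in scalar context) when the series
# is below MIN_DEPTH.
sub series_line {
    my ($series) = @_;
    return unless $series->{depth} >= MIN_DEPTH;
    return join(',', $series->{name}, $series->{start}, $series->{end},
                     $series->{end} - $series->{start} + 1);
}

# The caller passes \%series, so the sub no longer depends on a global.
my %series = (name => 'C10000035', start => 13, end => 15, depth => 6);
my $line = series_line(\%series);
print "$line\n" if defined $line;    # prints C10000035,13,15,3
```

Inside the sub, `$series->{end}` dereferences the hash reference, whereas `$series{end}` in the original script reads the file-scoped hash directly.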

        Hi Athanasius,

        Brilliant, that works a treat! Thank you very much for your time and help, I really appreciate it.

        Best,

        mbp
