http://www.perlmonks.org?node_id=965581

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have many different datasets, each separated from the next by a blank line. For each dataset I would like to sum the aligned reads (column 5) over all chromosomes (column 1), then find the percentage of mitochondrial chromosome reads (column 1 labeled “chrM”) within the total aligned reads. This part I can do; I have pasted my code below, and it works for a single dataset. However, I do not know how to loop over multiple datasets and print the same information for each one independently. Can anyone help? Thank you for your help.

#!/usr/bin/perl
use warnings;
use strict;
use 5.010;

my $counter = 0;
my $total_aligned_reads = 0;
my $mtDNA;

while (<DATA>) {
    my $line = $_;
    next if /Data_Set_\d+: /;
    next if /^NoCoordinateCount/;
    next if /^$/;
    my($chr, $lenLabel, $lengNumbers, $AlignedLabel, $alignedNumbers) = split /\s/, $line;
    $total_aligned_reads += $alignedNumbers;
    $mtDNA = $alignedNumbers if $chr eq "chrM";
}

say "total aligned reads = $total_aligned_reads";
say "mtDNA = $mtDNA";
say "mtDNA percentage = ", $mtDNA/$total_aligned_reads*100;

__DATA__
Data_Set_116: BAM Index Statistics_on_data 115.html
chr10 length= 135534747 Aligned= 435 Unaligned= 0
chr11 length= 135006516 Aligned= 553 Unaligned= 0
chr12 length= 133851895 Aligned= 482 Unaligned= 0
chr13 length= 115169878 Aligned= 367 Unaligned= 0
chr14 length= 107349540 Aligned= 341 Unaligned= 0
chr15 length= 102531392 Aligned= 243 Unaligned= 0
chr16 length= 90354753 Aligned= 258 Unaligned= 0
chr17 length= 81195210 Aligned= 210 Unaligned= 0
chr18 length= 78077248 Aligned= 326 Unaligned= 0
chr19 length= 59128983 Aligned= 115 Unaligned= 0
chr1 length= 249250621 Aligned= 1012 Unaligned= 0
chr20 length= 63025520 Aligned= 194 Unaligned= 0
chr21 length= 48129895 Aligned= 148 Unaligned= 0
chr22 length= 51304566 Aligned= 100 Unaligned= 0
chr2 length= 243199373 Aligned= 897 Unaligned= 0
chr3 length= 198022430 Aligned= 763 Unaligned= 0
chr4 length= 191154276 Aligned= 841 Unaligned= 0
chr5 length= 180915260 Aligned= 755 Unaligned= 0
chr6 length= 171115067 Aligned= 730 Unaligned= 0
chr7 length= 159138663 Aligned= 646 Unaligned= 0
chr8 length= 146364022 Aligned= 642 Unaligned= 0
chr9 length= 141213431 Aligned= 466 Unaligned= 0
chrM length= 16571 Aligned= 2650 Unaligned= 0
chrX length= 155270560 Aligned= 1068 Unaligned= 0
chrY length= 59373566 Aligned= 11 Unaligned= 0
NoCoordinateCount= 0

Data_Set_108: BAM Index Statistics_on_data 107.html
chr10 length= 135534747 Aligned= 45 Unaligned= 0
chr11 length= 135006516 Aligned= 49 Unaligned= 0
chr12 length= 133851895 Aligned= 31 Unaligned= 0
chr13 length= 115169878 Aligned= 47 Unaligned= 0
chr14 length= 107349540 Aligned= 24 Unaligned= 0
chr15 length= 102531392 Aligned= 26 Unaligned= 0
chr16 length= 90354753 Aligned= 22 Unaligned= 0
chr17 length= 81195210 Aligned= 23 Unaligned= 0
chr18 length= 78077248 Aligned= 20 Unaligned= 0
chr19 length= 59128983 Aligned= 9 Unaligned= 0
chr1 length= 249250621 Aligned= 89 Unaligned= 0
chr20 length= 63025520 Aligned= 19 Unaligned= 0
chr21 length= 48129895 Aligned= 5 Unaligned= 0
chr22 length= 51304566 Aligned= 13 Unaligned= 0
chr2 length= 243199373 Aligned= 81 Unaligned= 0
chr3 length= 198022430 Aligned= 53 Unaligned= 0
chr4 length= 191154276 Aligned= 55 Unaligned= 0
chr5 length= 180915260 Aligned= 56 Unaligned= 0
chr6 length= 171115067 Aligned= 55 Unaligned= 0
chr7 length= 159138663 Aligned= 44 Unaligned= 0
chr8 length= 146364022 Aligned= 52 Unaligned= 0
chr9 length= 141213431 Aligned= 32 Unaligned= 0
chrM length= 16571 Aligned= 1 Unaligned= 0
chrX length= 155270560 Aligned= 52 Unaligned= 0
chrY length= 59373566 Aligned= 3 Unaligned= 0
NoCoordinateCount= 0

Data_Set_100: BAM Index Statistics_on_data 99.html
chr10 length= 135534747 Aligned= 25340 Unaligned= 0
chr11 length= 135006516 Aligned= 24577 Unaligned= 0
chr12 length= 133851895 Aligned= 24335 Unaligned= 0
chr13 length= 115169878 Aligned= 17653 Unaligned= 0
chr14 length= 107349540 Aligned= 16826 Unaligned= 0
chr15 length= 102531392 Aligned= 15506 Unaligned= 0
chr16 length= 90354753 Aligned= 17098 Unaligned= 0
chr17 length= 81195210 Aligned= 14604 Unaligned= 0
chr18 length= 78077248 Aligned= 14139 Unaligned= 0
chr19 length= 59128983 Aligned= 10155 Unaligned= 0
chr1 length= 249250621 Aligned= 43427 Unaligned= 0
chr20 length= 63025520 Aligned= 11568 Unaligned= 0
chr21 length= 48129895 Aligned= 6897 Unaligned= 0
chr22 length= 51304566 Aligned= 6766 Unaligned= 0
chr2 length= 243199373 Aligned= 45536 Unaligned= 0
chr3 length= 198022430 Aligned= 36213 Unaligned= 0
chr4 length= 191154276 Aligned= 34693 Unaligned= 0
chr5 length= 180915260 Aligned= 33941 Unaligned= 0
chr6 length= 171115067 Aligned= 31529 Unaligned= 0
chr7 length= 159138663 Aligned= 29473 Unaligned= 0
chr8 length= 146364022 Aligned= 27419 Unaligned= 0
chr9 length= 141213431 Aligned= 22254 Unaligned= 0
chrM length= 16571 Aligned= 169 Unaligned= 0
chrX length= 155270560 Aligned= 28121 Unaligned= 0
chrY length= 59373566 Aligned= 534 Unaligned= 0
NoCoordinateCount= 0

Re: How to loop over multiple datasets (blocks of text)?
by jwkrahn (Abbot) on Apr 18, 2012 at 03:36 UTC

    This looks like it will do what you want:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use 5.010;

    local $/ = '';  # use paragraph mode

    while ( <DATA> ) {
        my $total_aligned_reads;
        $total_aligned_reads += $1 while /Aligned= (\d+)/g;
        my ( $mtDNA ) = /^chrM.+Aligned= (\d+)/m;

        say "total aligned reads = $total_aligned_reads";
        say "mtDNA = $mtDNA";
        say "mtDNA percentage = ", $mtDNA / $total_aligned_reads * 100;
    }

    __DATA__
    ...sample data as above ...

      Well done, jwkrahn.

Re: How to loop over multiple datasets (blocks of text)?
by BrowserUk (Patriarch) on Apr 18, 2012 at 01:55 UTC

    The simplest change is to wrap what you have in another while loop that reads the datafile in paragraph mode, and then open each multi-line dataset as an in-memory file.

    That way your existing code can operate directly on that ramfile without change:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use 5.010;

    $/ = '';  ## paragraph mode
    while( my $dataset = <DATA> ) {
        ## open each dataset as a ramfile
        open DATASET, '<', \$dataset or die $!;
        local $/ = "\n";  ## Return readline to by-line mode

        my $counter = 0;
        my $total_aligned_reads = 0;
        my $mtDNA;

        while (<DATASET>) {
            ## my $line = $_;  #### Unnecessary; see next comment
            next if /Data_Set_\d+: /;
            next if /^NoCoordinateCount/;
            next if /^$/;
            my($chr, $lenLabel, $lengNumbers, $AlignedLabel, $alignedNumbers)
                = split /\s/;  ####, $line; split operates on $_ by default
            $total_aligned_reads += $alignedNumbers;
            $mtDNA = $alignedNumbers if $chr eq "chrM";
        }

        say "total aligned reads = $total_aligned_reads";
        say "mtDNA = $mtDNA";
        say "mtDNA percentage = ", $mtDNA/$total_aligned_reads*100;
    }

    __DATA__
    ...sample data as above ...

    Produces:

    C:\test>junk
    total aligned reads = 14253
    mtDNA = 2650
    mtDNA percentage = 18.5925770013331
    total aligned reads = 906
    mtDNA = 1
    mtDNA percentage = 0.11037527593819
    total aligned reads = 538773
    mtDNA = 169
    mtDNA percentage = 0.0313675703867863


Re: How to loop over multiple datasets (blocks of text)?
by Kenosis (Priest) on Apr 18, 2012 at 03:18 UTC

    Here's a solution that combines/modifies both BrowserUk's and bobdabuilda's suggestions:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use 5.010;

    {   # Within this block because of the next line
        local $/ = undef;

        for(<DATA> =~ /(Data_Set.*?Count= 0)/gs) {
            my $counter = 0;
            my $total_aligned_reads = 0;
            my $mtDNA;

            for(split /\n/) {
                ## my $line = $_;  #### Unnecessary; see next comment
                next if /(Data_Set_\d+: |^NoCoordinateCount|^$)/;
                ## next if /^NoCoordinateCount/;
                ## next if /^$/;
                my($chr, $lenLabel, $lengNumbers, $AlignedLabel, $alignedNumbers)
                    = split /\s/;  ####, $line; split operates on $_ by default
                $total_aligned_reads += $alignedNumbers;
                $mtDNA = $alignedNumbers if $chr eq "chrM";
            }

            say "total aligned reads = $total_aligned_reads";
            say "mtDNA = $mtDNA";
            say "mtDNA percentage = ", $mtDNA/$total_aligned_reads*100;
        }
    }

    __DATA__

    Hope this helps!

Re: How to loop over multiple datasets (blocks of text)?
by bobdabuilda (Beadle) on Apr 18, 2012 at 01:59 UTC

    I'm sure some much more knowledgeable Perl folks will follow this suggestion with more technical ideas, but in the meantime, the one thing I notice about the data you're showing is that every dataset has a consistent first and last line.

    Could you not utilise that consistency to perform a loop to process each dataset until you either come to the end of the set, or the end of DATA?

    In other words, change these two lines:

    next if /Data_Set_\d+: /;
    next if /^NoCoordinateCount/;

    into a couple of loops looking for the start and end of your data set, so you can process each set individually.
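
    The idea above could be sketched roughly as follows. This is only one possible reading of the suggestion, not code posted in the thread: the helper name summarize_datasets, the result hash keys, and the two small datasets in @sample are all invented for illustration. It treats a line starting with Data_Set_ as the start of a dataset and a line starting with NoCoordinateCount as its end.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes a reference to an array
# of lines and returns one hash reference per dataset.
sub summarize_datasets {
    my ($lines) = @_;
    my @results;
    my $current;    # undef while we are between datasets

    for my $line (@$lines) {
        if ( $line =~ /^(Data_Set_\d+):/ ) {
            # First line of a dataset: start fresh totals.
            $current = { name => $1, total => 0, mtDNA => 0 };
        }
        elsif ( $line =~ /^NoCoordinateCount/ ) {
            # Last line of a dataset: store the result and reset.
            push @results, $current if $current;
            undef $current;
        }
        elsif ( $current && $line =~ /^(\S+)\s.*\bAligned=\s*(\d+)/ ) {
            # Chromosome line: $1 is the chromosome, $2 the aligned count.
            $current->{total} += $2;
            $current->{mtDNA} = $2 if $1 eq 'chrM';
        }
    }

    # A dataset with no closing line still counts when the input ends.
    push @results, $current if $current;
    return @results;
}

# Invented sample input, much smaller than the data in the question.
my @sample = (
    'Data_Set_1: example header',
    'chr1 length= 1000 Aligned= 30 Unaligned= 0',
    'chrM length= 100 Aligned= 10 Unaligned= 0',
    'NoCoordinateCount= 0',
    '',
    'Data_Set_2: example header',
    'chr1 length= 1000 Aligned= 99 Unaligned= 0',
    'chrM length= 100 Aligned= 1 Unaligned= 0',
    'NoCoordinateCount= 0',
);

for my $r ( summarize_datasets( \@sample ) ) {
    # Guard against dividing by zero for an empty dataset.
    my $pct = $r->{total} ? $r->{mtDNA} / $r->{total} * 100 : 0;
    printf "%s: total aligned reads = %d, mtDNA = %d, mtDNA percentage = %.2f\n",
        $r->{name}, $r->{total}, $r->{mtDNA}, $pct;
}
```

    Because the totals are reset at every start line and reported at every end line, this reads the input one line at a time and does not depend on the blank lines between datasets.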

Re: How to loop over multiple datasets (blocks of text)?
by micurley (Initiate) on Apr 20, 2012 at 04:30 UTC

    In the event you have a lot of data and don't want to suck the entire file in at once, you can loop over each paragraph.
    Shown with embedded data here to simplify:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use 5.010;

    sub ProcessChunk( $ );

    $/ = "\n\n";
    while (<DATA>) {
        chomp;
        ProcessChunk( $_ );
    }

    sub ProcessChunk( $ ){
        my @data = split /\n/, shift;
        my $counter = 0;
        my $total_aligned_reads = 0;
        my $mtDNA = 0;

        foreach my $line ( @data ) {
            next if $line =~ /^Data_Set/;
            next if $line =~ /^NoCoordinateCount/;
            my ($chr, $lenLabel, $lengNumbers, $AlignedLabel, $alignedNumbers) = split /\s/, $line;
            $total_aligned_reads += $alignedNumbers;
            $mtDNA = $alignedNumbers if $chr eq "chrM";
        }

        print "total aligned reads = $total_aligned_reads\n";
        print "mtDNA = $mtDNA\n";
        print "mtDNA percentage = ";
        if( $total_aligned_reads && $mtDNA ) {
            print $mtDNA/$total_aligned_reads*100;
        } else {
            print 'undefined';
        }
        print "\n";
        return;
    }

    __DATA__
    ...sample data as above ...