Strange output problem

iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with formatting a genetic data file that is to have the columns and rows transposed. Hopefully a wise monk can see what's wrong.

The problem is that the final row of the final output file gets truncated just before the end i.e. it ends up about 97 columns shorter than all the others. All the other rows above the final row appear to contain the correct data.

I have tried the code on several different input files, which all produce the same results. The intermediate output file appears to be okay (all rows have the same number of columns), so presumably there's a problem with the last part of the code rather than the transposition part, but I can't see what it could be.

Suggestions welcome, thanks

#!/usr/bin/perl
use strict;
use warnings;
use v5.14;

# Transpose columns to rows, in preparation for using BioPerl:

die "need two arguments (i.e. chr cont) at invocation" unless @ARGV ==
+ 2;
chomp( my $chr_num = shift );
chomp( my $cont    = shift );

open my $in_file, '<', "chr${chr_num}_exome_snps_only.vcf_$cont"
  or die "Cannot open input file: $!";

open my $out_file, '>', "chr${chr_num}_exome_snps_only_transposed_$con
+t"
  or die "Cannot open output file: $!";

my @rows       = ();
my @transposed = ();

while (<$in_file>) {
    chomp;
    push @rows, [split];
}

foreach my $row (@rows) {
    foreach my $column ( 0 .. $#{$row} ) {
        push( @{ $transposed[$column] }, $row->[$column] );
    }
}

foreach my $new_row (@transposed) {
    foreach my $new_col ( @{$new_row} ) {
        print $out_file $new_col, "\t";
    }
    print $out_file "\n";
}

# Remove unnecessary rows, clean up the SNPs and put into CSV format:

open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$
+cont"
  or die "Cannot open new input file: $!";
open my $new_out_file, '>', "chr${chr_num}_exome_snps_processed_$cont"
  or die "Cannot open new output file: $!";

while (<$new_in_file>) {
    chomp;
    next          if $. == 1;   # remove header line
    s/POS/SAMPLE/ if $. == 2;
    s/\s+/,/g     if $. == 2;   # replace whitespace with a comma on I
+D row
    next          if $. >= 3 and $. <= 9;   # remove unwanted lines
    s/\s+(\d)\|(\d)\S+/,$1 $2/g;    # clean up the SNPs & put into CSV
+ format
    print $new_out_file "$_\n";
}



=cut

sample $in_file (note: hundreds more columns i.e. individuals beginnin
+g with HG, and hundreds of thousands more rows):

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT 
+   HG00553    HG00554    HG00637
22    16287226    .    C    T    100    PASS    SNP;BAVGPOST=0.986;BRS
+Q=0.635;LDAF=0.0167;AVGPOST=0.9859;RSQ=0.6343;ERATE=0.0026;THETA=0.00
+22;AC=23;AN=2184    GT:DS:GL:BD    0|0:0.000:-0.48,-0.48,-0.48:0.021 
+   0|0:0.000:-0.00,-2.69,-5.00:0.0001    0|0:0.000:-0.00,-2.53,-5.00:
+0.0001
22    16287365    .    C    T    100    PASS    SNP;BAVGPOST=0.998;BRS
+Q=0.540;LDAF=0.0030;AVGPOST=0.9959;RSQ=0.4172;ERATE=0.0009;THETA=0.02
+01;AC=2;AN=2184    GT:DS:GL:BD    0|0:0.000:-0.48,-0.48,-0.48:0.0024 
+   0|0:0.000:-0.00,-4.40,-5.00:0    0|0:0.000:-0.02,-1.38,-5.00:0.000
+2
22    16287649    .    G    A    100    PASS    SNP;BAVGPOST=0.949;BRS
+Q=0.519;LDAF=0.0645;AVGPOST=0.9342;RSQ=0.5421;ERATE=0.0115;THETA=0.00
+95;AC=81;AN=2184    GT:DS:GL:BD    0|0:0.050:-0.17,-0.50,-2.67:0.072 
+   0|0:0.000:-0.00,-2.34,-5.00:0.0002    0|0:0.000:-0.01,-1.86,-5.00:
+0.0024
22    16287784    .    C    T    100    PASS    SNP;BAVGPOST=0.998;BRS
+Q=0.844;LDAF=0.0089;AVGPOST=0.9941;RSQ=0.7151;ERATE=0.0030;THETA=0.00
+43;AC=13;AN=2184    GT:DS:GL:BD    0|0:0.000:-0.48,-0.48,-0.48:0.0104
+    0|0:0.000:-0.00,-2.79,-5.00:0.0001    0|0:0.000:0.00,-5.00,-5.00:
+0
22    16287851    .    G    A    100    PASS    SNP;BAVGPOST=0.939;BRS
+Q=0.764;LDAF=0.1473;AVGPOST=0.9183;RSQ=0.7385;ERATE=0.0015;THETA=0.03
+09;AC=280;AN=2184    GT:DS:GL:BD    0|0:0.100:-0.08,-0.78,-5.00:0.040
+6    0|0:0.000:-0.00,-2.76,-5.00:0    0|0:0.000:-0.00,-3.49,-5.00:0
22    16287912    .    G    A    100    PASS    SNP;BAVGPOST=0.998;BRS
+Q=0.800;LDAF=0.0057;AVGPOST=0.9983;RSQ=0.8580;ERATE=0.0006;THETA=0.02
+10;AC=11;AN=2184    GT:DS:GL:BD    0|0:0.000:-0.12,-0.63,-4.70:0.0028
+    0|0:0.000:-0.01,-1.91,-5.00:0.0003    0|0:0.000:-0.00,-2.30,-5.00
+:0


sample $new_out_file (note: hundreds of thousands more columns, hundre
+ds more rows) 

SAMPLE,16287215,16287226,16287365,16287649,16287784
HG00553,0 0,0 0,0 0,0 0,0 0
HG00554,0 0,0 0,0 0,0 0,0 0
HG00637,0 0,0 0,0 0,0 0,0 0
HG00638,0 0,0 0,0 0,0 0,0 0
HG00640,0 0,0 0,0 0,0 0,0 0
HG00641,0 0,0 0,0 0,0 0,0 0
HG00731,0 0,0 0,0 0,0 0,0 0
[download]

Comment on Strange output problem Download Code

Replies are listed 'Best First'.
Re: Strange output problem by Lotus1 (Vicar) on Apr 12, 2012 at 17:59 UTC
I noticed that you re-open the same file from $out_file as $new_in_file without closing $out_file first. Seems like buffering could be an issue here. I tried your program with a smaller number of rows and columns and found that with the close statement before the reopen I got output in the final processed file and without it I got a blank file. The intermediary file looked good in both cases. Maybe the final data to the intermediary gets written when the program ends and that first file handle is closed. I don't want to troubleshoot much here until you have a look at using close. Update: I thought of an easy way to test my idea about buffering. I made a copy of the intermediary file then closed the intermediary file then made another copy. The first copy was empty the second had everything in it. # Remove unnecessary rows, clean up the SNPs and put into CSV format: open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$ +cont" or die "Cannot open new input file: $!"; ################################## # I added the following lines # open my $temp, '>', 'transposed_copy' or die "Cannot open transposed_copy: $!"; while(<$new_in_file>) { print $temp $_; } close $temp; close $out_file; open $temp, '>', 'transposed_copy2' or die "Cannot open transposed_copy2: $!"; while(<$new_in_file>) { print $temp $_; } close $temp; [download]	[reply] [d/l]
Re^2: Strange output problem by iangibson (Scribe) on Apr 12, 2012 at 18:46 UTC
Thanks! It looks like that was the problem. I simply added a `close $out_file;` between the two parts of the program, and the output is now correct. I'd never have thought of that .. better get into the habit of closing filehandles when I'm done with them! Thanks again.	[reply] [d/l]
Re: Strange output problem by Riales (Hermit) on Apr 12, 2012 at 17:24 UTC
Nothing jumps out at me, but I am inclined to agree with your guess about the problem being in the last part of the code. Have you tried adding print statements before the chomp, before the substitution, and after the substitution? `while (<$new_in_file>) { print "BEFORE CHOMP: $_\n"; chomp; next if $. == 1; # remove header line s/POS/SAMPLE/ if $. == 2; s/\s+/,/g if $. == 2; # replace whitespace with a comma on I +D row next if $. >= 3 and $. <= 9; # remove unwanted lines print "BEFORE SUBSTITUTION: $_\n"; s/\s+(\d)\\|(\d)\S+/,$1 $2/g; # clean up the SNPs & put into CSV + format print "AFTER SUBSTITUTION: $_\n"; print $new_out_file "$_\n"; }` [download]	[reply] [d/l]
Re: Strange output problem by RichardK (Parson) on Apr 12, 2012 at 17:30 UTC
Why are you removing header lines from the transposed output file when you didn't write a header to that file? AFAICT you only write the transposed data!	[reply]
Re^2: Strange output problem by Lotus1 (Vicar) on Apr 12, 2012 at 18:24 UTC
The OP doesn't show the sample transposed only file but when I ran it the first column was the headers.	[reply]

Back to Seekers of Perl Wisdom