http://www.perlmonks.org?node_id=964788

iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with formatting a genetic data file that is to have the columns and rows transposed. Hopefully a wise monk can see what's wrong.

The problem is that the final row of the final output file gets truncated just before the end i.e. it ends up about 97 columns shorter than all the others. All the other rows above the final row appear to contain the correct data.

I have tried the code on several different input files, which all produce the same results. The intermediate output file appears to be okay (all rows have the same number of columns), so presumably there's a problem with the last part of the code rather than the transposition part, but I can't see what it could be.

Suggestions welcome, thanks

#!/usr/bin/perl use strict; use warnings; use v5.14; # Transpose columns to rows, in preparation for using BioPerl: die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == + 2; chomp( my $chr_num = shift ); chomp( my $cont = shift ); open my $in_file, '<', "chr${chr_num}_exome_snps_only.vcf_$cont" or die "Cannot open input file: $!"; open my $out_file, '>', "chr${chr_num}_exome_snps_only_transposed_$con +t" or die "Cannot open output file: $!"; my @rows = (); my @transposed = (); while (<$in_file>) { chomp; push @rows, [split]; } foreach my $row (@rows) { foreach my $column ( 0 .. $#{$row} ) { push( @{ $transposed[$column] }, $row->[$column] ); } } foreach my $new_row (@transposed) { foreach my $new_col ( @{$new_row} ) { print $out_file $new_col, "\t"; } print $out_file "\n"; } # Remove unnecessary rows, clean up the SNPs and put into CSV format: open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$ +cont" or die "Cannot open new input file: $!"; open my $new_out_file, '>', "chr${chr_num}_exome_snps_processed_$cont" or die "Cannot open new output file: $!"; while (<$new_in_file>) { chomp; next if $. == 1; # remove header line s/POS/SAMPLE/ if $. == 2; s/\s+/,/g if $. == 2; # replace whitespace with a comma on I +D row next if $. >= 3 and $. <= 9; # remove unwanted lines s/\s+(\d)\|(\d)\S+/,$1 $2/g; # clean up the SNPs & put into CSV + format print $new_out_file "$_\n"; } =cut sample $in_file (note: hundreds more columns i.e. individuals beginnin +g with HG, and hundreds of thousands more rows): #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT + HG00553 HG00554 HG00637 22 16287226 . C T 100 PASS SNP;BAVGPOST=0.986;BRS +Q=0.635;LDAF=0.0167;AVGPOST=0.9859;RSQ=0.6343;ERATE=0.0026;THETA=0.00 +22;AC=23;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.021 + 0|0:0.000:-0.00,-2.69,-5.00:0.0001 0|0:0.000:-0.00,-2.53,-5.00: +0.0001 22 16287365 . C T 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.540;LDAF=0.0030;AVGPOST=0.9959;RSQ=0.4172;ERATE=0.0009;THETA=0.02 +01;AC=2;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.0024 + 0|0:0.000:-0.00,-4.40,-5.00:0 0|0:0.000:-0.02,-1.38,-5.00:0.000 +2 22 16287649 . G A 100 PASS SNP;BAVGPOST=0.949;BRS +Q=0.519;LDAF=0.0645;AVGPOST=0.9342;RSQ=0.5421;ERATE=0.0115;THETA=0.00 +95;AC=81;AN=2184 GT:DS:GL:BD 0|0:0.050:-0.17,-0.50,-2.67:0.072 + 0|0:0.000:-0.00,-2.34,-5.00:0.0002 0|0:0.000:-0.01,-1.86,-5.00: +0.0024 22 16287784 . C T 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.844;LDAF=0.0089;AVGPOST=0.9941;RSQ=0.7151;ERATE=0.0030;THETA=0.00 +43;AC=13;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.0104 + 0|0:0.000:-0.00,-2.79,-5.00:0.0001 0|0:0.000:0.00,-5.00,-5.00: +0 22 16287851 . G A 100 PASS SNP;BAVGPOST=0.939;BRS +Q=0.764;LDAF=0.1473;AVGPOST=0.9183;RSQ=0.7385;ERATE=0.0015;THETA=0.03 +09;AC=280;AN=2184 GT:DS:GL:BD 0|0:0.100:-0.08,-0.78,-5.00:0.040 +6 0|0:0.000:-0.00,-2.76,-5.00:0 0|0:0.000:-0.00,-3.49,-5.00:0 22 16287912 . G A 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.800;LDAF=0.0057;AVGPOST=0.9983;RSQ=0.8580;ERATE=0.0006;THETA=0.02 +10;AC=11;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.12,-0.63,-4.70:0.0028 + 0|0:0.000:-0.01,-1.91,-5.00:0.0003 0|0:0.000:-0.00,-2.30,-5.00 +:0 sample $new_out_file (note: hundreds of thousands more columns, hundre +ds more rows) SAMPLE,16287215,16287226,16287365,16287649,16287784 HG00553,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0 HG00640,0 0,0 0,0 0,0 0,0 0 HG00641,0 0,0 0,0 0,0 0,0 0 HG00731,0 0,0 0,0 0,0 0,0 0

Replies are listed 'Best First'.
Re: Strange output problem
by Lotus1 (Vicar) on Apr 12, 2012 at 17:59 UTC

    I noticed that you re-open the same file from $out_file as $new_in_file without closing $out_file first. Seems like buffering could be an issue here.

    I tried your program with a smaller number of rows and columns and found that with the close statement before the reopen I got output in the final processed file and without it I got a blank file. The intermediary file looked good in both cases. Maybe the final data to the intermediary gets written when the program ends and that first file handle is closed. I don't want to troubleshoot much here until you have a look at using close.

    Update: I thought of an easy way to test my idea about buffering. I made a copy of the intermediary file then closed the intermediary file then made another copy. The first copy was empty the second had everything in it.

    # Remove unnecessary rows, clean up the SNPs and put into CSV format: open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$ +cont" or die "Cannot open new input file: $!"; ################################## # I added the following lines # open my $temp, '>', 'transposed_copy' or die "Cannot open transposed_copy: $!"; while(<$new_in_file>) { print $temp $_; } close $temp; close $out_file; open $temp, '>', 'transposed_copy2' or die "Cannot open transposed_copy2: $!"; while(<$new_in_file>) { print $temp $_; } close $temp;

      Thanks! It looks like that was the problem. I simply added a

       close $out_file;

      between the two parts of the program, and the output is now correct. I'd never have thought of that .. better get into the habit of closing filehandles when I'm done with them!

      Thanks again.

Re: Strange output problem
by Riales (Hermit) on Apr 12, 2012 at 17:24 UTC

    Nothing jumps out at me, but I am inclined to agree with your guess about the problem being in the last part of the code. Have you tried adding print statements before the chomp, before the substitution, and after the substitution?

    while (<$new_in_file>) { print "BEFORE CHOMP: $_\n"; chomp; next if $. == 1; # remove header line s/POS/SAMPLE/ if $. == 2; s/\s+/,/g if $. == 2; # replace whitespace with a comma on I +D row next if $. >= 3 and $. <= 9; # remove unwanted lines print "BEFORE SUBSTITUTION: $_\n"; s/\s+(\d)\|(\d)\S+/,$1 $2/g; # clean up the SNPs & put into CSV + format print "AFTER SUBSTITUTION: $_\n"; print $new_out_file "$_\n"; }
Re: Strange output problem
by RichardK (Parson) on Apr 12, 2012 at 17:30 UTC

    Why are you removing header lines from the transposed output file when you didn't write a header to that file? AFAICT you only write the transposed data!

      The OP doesn't show the sample transposed only file but when I ran it the first column was the headers.