Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
Do you know where your variables are?
 
PerlMonks  

Strange output problem

by iangibson (Scribe)
on Apr 12, 2012 at 16:50 UTC ( #964788=perlquestion: print w/ replies, xml ) Need Help??
iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with formatting a genetic data file that is to have the columns and rows transposed. Hopefully a wise monk can see what's wrong.

The problem is that the final row of the final output file gets truncated just before the end i.e. it ends up about 97 columns shorter than all the others. All the other rows above the final row appear to contain the correct data.

I have tried the code on several different input files, which all produce the same results. The intermediate output file appears to be okay (all rows have the same number of columns), so presumably there's a problem with the last part of the code rather than the transposition part, but I can't see what it could be.

Suggestions welcome, thanks

#!/usr/bin/perl use strict; use warnings; use v5.14; # Transpose columns to rows, in preparation for using BioPerl: die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == + 2; chomp( my $chr_num = shift ); chomp( my $cont = shift ); open my $in_file, '<', "chr${chr_num}_exome_snps_only.vcf_$cont" or die "Cannot open input file: $!"; open my $out_file, '>', "chr${chr_num}_exome_snps_only_transposed_$con +t" or die "Cannot open output file: $!"; my @rows = (); my @transposed = (); while (<$in_file>) { chomp; push @rows, [split]; } foreach my $row (@rows) { foreach my $column ( 0 .. $#{$row} ) { push( @{ $transposed[$column] }, $row->[$column] ); } } foreach my $new_row (@transposed) { foreach my $new_col ( @{$new_row} ) { print $out_file $new_col, "\t"; } print $out_file "\n"; } # Remove unnecessary rows, clean up the SNPs and put into CSV format: open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$ +cont" or die "Cannot open new input file: $!"; open my $new_out_file, '>', "chr${chr_num}_exome_snps_processed_$cont" or die "Cannot open new output file: $!"; while (<$new_in_file>) { chomp; next if $. == 1; # remove header line s/POS/SAMPLE/ if $. == 2; s/\s+/,/g if $. == 2; # replace whitespace with a comma on I +D row next if $. >= 3 and $. <= 9; # remove unwanted lines s/\s+(\d)\|(\d)\S+/,$1 $2/g; # clean up the SNPs & put into CSV + format print $new_out_file "$_\n"; } =cut sample $in_file (note: hundreds more columns i.e. individuals beginnin +g with HG, and hundreds of thousands more rows): #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT + HG00553 HG00554 HG00637 22 16287226 . C T 100 PASS SNP;BAVGPOST=0.986;BRS +Q=0.635;LDAF=0.0167;AVGPOST=0.9859;RSQ=0.6343;ERATE=0.0026;THETA=0.00 +22;AC=23;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.021 + 0|0:0.000:-0.00,-2.69,-5.00:0.0001 0|0:0.000:-0.00,-2.53,-5.00: +0.0001 22 16287365 . C T 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.540;LDAF=0.0030;AVGPOST=0.9959;RSQ=0.4172;ERATE=0.0009;THETA=0.02 +01;AC=2;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.0024 + 0|0:0.000:-0.00,-4.40,-5.00:0 0|0:0.000:-0.02,-1.38,-5.00:0.000 +2 22 16287649 . G A 100 PASS SNP;BAVGPOST=0.949;BRS +Q=0.519;LDAF=0.0645;AVGPOST=0.9342;RSQ=0.5421;ERATE=0.0115;THETA=0.00 +95;AC=81;AN=2184 GT:DS:GL:BD 0|0:0.050:-0.17,-0.50,-2.67:0.072 + 0|0:0.000:-0.00,-2.34,-5.00:0.0002 0|0:0.000:-0.01,-1.86,-5.00: +0.0024 22 16287784 . C T 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.844;LDAF=0.0089;AVGPOST=0.9941;RSQ=0.7151;ERATE=0.0030;THETA=0.00 +43;AC=13;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.48,-0.48,-0.48:0.0104 + 0|0:0.000:-0.00,-2.79,-5.00:0.0001 0|0:0.000:0.00,-5.00,-5.00: +0 22 16287851 . G A 100 PASS SNP;BAVGPOST=0.939;BRS +Q=0.764;LDAF=0.1473;AVGPOST=0.9183;RSQ=0.7385;ERATE=0.0015;THETA=0.03 +09;AC=280;AN=2184 GT:DS:GL:BD 0|0:0.100:-0.08,-0.78,-5.00:0.040 +6 0|0:0.000:-0.00,-2.76,-5.00:0 0|0:0.000:-0.00,-3.49,-5.00:0 22 16287912 . G A 100 PASS SNP;BAVGPOST=0.998;BRS +Q=0.800;LDAF=0.0057;AVGPOST=0.9983;RSQ=0.8580;ERATE=0.0006;THETA=0.02 +10;AC=11;AN=2184 GT:DS:GL:BD 0|0:0.000:-0.12,-0.63,-4.70:0.0028 + 0|0:0.000:-0.01,-1.91,-5.00:0.0003 0|0:0.000:-0.00,-2.30,-5.00 +:0 sample $new_out_file (note: hundreds of thousands more columns, hundre +ds more rows) SAMPLE,16287215,16287226,16287365,16287649,16287784 HG00553,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0 HG00640,0 0,0 0,0 0,0 0,0 0 HG00641,0 0,0 0,0 0,0 0,0 0 HG00731,0 0,0 0,0 0,0 0,0 0

Comment on Strange output problem
Download Code
Re: Strange output problem
by Riales (Hermit) on Apr 12, 2012 at 17:24 UTC

    Nothing jumps out at me, but I am inclined to agree with your guess about the problem being in the last part of the code. Have you tried adding print statements before the chomp, before the substitution, and after the substitution?

    while (<$new_in_file>) { print "BEFORE CHOMP: $_\n"; chomp; next if $. == 1; # remove header line s/POS/SAMPLE/ if $. == 2; s/\s+/,/g if $. == 2; # replace whitespace with a comma on I +D row next if $. >= 3 and $. <= 9; # remove unwanted lines print "BEFORE SUBSTITUTION: $_\n"; s/\s+(\d)\|(\d)\S+/,$1 $2/g; # clean up the SNPs & put into CSV + format print "AFTER SUBSTITUTION: $_\n"; print $new_out_file "$_\n"; }
Re: Strange output problem
by RichardK (Priest) on Apr 12, 2012 at 17:30 UTC

    Why are you removing header lines from the transposed output file when you didn't write a header to that file? AFAICT you only write the transposed data!

      The OP doesn't show the sample transposed only file but when I ran it the first column was the headers.

Re: Strange output problem
by Lotus1 (Hermit) on Apr 12, 2012 at 17:59 UTC

    I noticed that you re-open the same file from $out_file as $new_in_file without closing $out_file first. Seems like buffering could be an issue here.

    I tried your program with a smaller number of rows and columns and found that with the close statement before the reopen I got output in the final processed file and without it I got a blank file. The intermediary file looked good in both cases. Maybe the final data to the intermediary gets written when the program ends and that first file handle is closed. I don't want to troubleshoot much here until you have a look at using close.

    Update: I thought of an easy way to test my idea about buffering. I made a copy of the intermediary file then closed the intermediary file then made another copy. The first copy was empty the second had everything in it.

    # Remove unnecessary rows, clean up the SNPs and put into CSV format: open my $new_in_file, '<', "chr${chr_num}_exome_snps_only_transposed_$ +cont" or die "Cannot open new input file: $!"; ################################## # I added the following lines # open my $temp, '>', 'transposed_copy' or die "Cannot open transposed_copy: $!"; while(<$new_in_file>) { print $temp $_; } close $temp; close $out_file; open $temp, '>', 'transposed_copy2' or die "Cannot open transposed_copy2: $!"; while(<$new_in_file>) { print $temp $_; } close $temp;

      Thanks! It looks like that was the problem. I simply added a

       close $out_file;

      between the two parts of the program, and the output is now correct. I'd never have thought of that .. better get into the habit of closing filehandles when I'm done with them!

      Thanks again.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://964788]
Approved by Eliya
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2014-04-19 16:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (483 votes), past polls