http://www.perlmonks.org?node_id=1006291

iangibson has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have the following code, whose purpose is to grab the data from the first two columns (chromosome and coordinate) of a series of files, then output to a new file these pairs ordered by chromosome and coordinate, having removed duplicates of any pairs occurring in more than one input file.

The code I've written so far outputs only one line per chromosome, with all the coordinates in the correct order.

What I'd like to have is each coordinate for a given chromosome to appear on its own next to the chromosome, e.g. where at present I have:

11   608502 1016988 1017495 1018088 3661585

I'd like to instead have:

11 608502 11 1016988 11 1017495 11 1018088 11 3661585

The other change I'd like to make is to output the chromosomes in numerical order, rather than in hash order as at present. Any help you can offer me would be much appreciated.

#!/usr/bin/perl use warnings; use strict; use v5.10; # Create program to read in a series of VCF files, outputting a two-c +olumn # file consisting of chromosome and coordinate for each site seen in +one or # more files ('master-site-list'). die "Give the names of each VCF" if @ARGV < 1; my $number_of_files = @ARGV; my $infile; my $current_file_number = 0; my %hash_of_chroms; open my $outfile, '>', "master-site-list" or die "Cannot open output f +ile: $!"; until ( $current_file_number >= $number_of_files ) { open $infile, '<', $ARGV[$current_file_number] or die "Cannot open VCF file: $!"; while (<$infile>) { chomp; next if /^#/; my ( $chr, $coord ) = split(/\s+/); push @{ $hash_of_chroms{$chr} }, $coord unless defined $hash_of_chroms{$chr}[$coord]; # don't add +duplicates } $current_file_number++; close $infile; } my %sorted_hash_of_chroms; foreach my $chrom ( keys %hash_of_chroms ) { push @{ $sorted_hash_of_chroms{$chrom} }, sort { $hash_of_chroms{$a}->[0] <=> $hash_of_chroms{$b}->[0] } keys %hash_of_chroms; } foreach my $chrom ( keys %hash_of_chroms ) { say $outfile "$chrom\t@{ $hash_of_chroms{$chrom} }"; } close $outfile;

Examples of input file:

##fileformat=VCFv4.1 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=RR,Number=1,Type=Integer,Description="Reference Read Dept +h"> ##FORMAT=<ID=VR,Number=.,Type=Integer,Description="Variant Read Depth( +s)"> ##FORMAT=<ID=PL,Number=.,Type=Integer,Description="Normalized, Phred-s +caled likelihoods for genotypes as defined in the VCF specification"> ##INFO=<ID=SF,Number=.,Type=String,Description="Source file(s)"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT + NA19238 1 874816 . C CT 10 FINAL_AMBG SF=bcm_illum,bi_ill +um;validation_ratio=52/1019;sets=bcm_illum,bi_illum;samples=NA19238 + GT 1/0 1 1647893 . C CTTTCTT 30 FINAL_NOT_CONFIRMED SF=b +cm_illum,bcm_solid;validation_ratio=0/3380;sets=bcm_illum,bcm_solid;s +amples=NA19238 GT 0/0 1 7889972 rs57875989 GAGAATCCATCCCATCCTACTGCCAGCGCTCTGTCCACAG +GATCGCCTCCCATGA G 33190 DESIGN_FAIL SF=bi_illum GT +./. 1 14106394 . A ACTC 100 FINAL_CONFIRMED SF=bcm_il +lum,bcm_solid,bc_illum;validation_ratio=864/1754;sets=bcm_illum,bcm_s +olid,bc_illum;samples=NA19238 GT 1/0

Replies are listed 'Best First'.
Re: Printing Hash of Arrays
by RichardK (Parson) on Nov 29, 2012 at 16:50 UTC

    BTW this bit of code doesn't do what you think it does

    push @{ $hash_of_chroms{$chr} }, $coord unless defined $hash_of_chroms{$chr}[$coord]; # don't add +duplicates

    push adds the item to the end of the list and you check some list element off in the far distance.

    i.e. if the array is empty and $coord = 50, then your code does this:-

    $array[0] = $coord unless $array[50];
Re: Printing Hash of Arrays
by Anonymous Monk on Nov 29, 2012 at 16:26 UTC
    So, you would like to make these changes? Go ahead :)
Re: Printing Hash of Arrays
by Anonymous Monk on Nov 29, 2012 at 16:36 UTC
    Can you identify where you need to make the changes?