
Adding missing values into a hash

by Biopolete (Initiate)
on Jun 18, 2014 at 19:05 UTC ( #1090340=perlquestion )
Biopolete has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to convert a column of a file to CSV.

My data looks like:

##INFO=<ID=AA,

##INFO=<ID=AB,

##INFO=<ID=AC,

Num Data

1 AA=1;AB=2;AC=3

2 AA=2;AB=2

3 AA=5;AB=1;AC=1

And I want a csv like this:

AA AB AC

1 2 3

2 2 NA

5 1 1

First of all I build a hash, taking the keys from the metadata lines (##):

open(I1,$ARGV[0]);
my %info;
while (my $line = <I1>) {
    if ($line =~ /##INFO=<ID=/) {
        my ($first,$second) = (split(/\,/, $line));
        my ($firstsecond,$secondsecond) = (split(/ID=/, $first));
        $info{$secondsecond}=();
    }
}

Now I have my hash “info” with my keys (AA, AB, AC)

Then I want to introduce the values.

I start with:

while (my $line = <I1>) {
    if ($line !~ /#/) {
        my ($numbers,$data) = (split(/\t/, $line));
        foreach my $dat ($data){
            my ($string, $int) = (split(/\;/, $dat));

That is because I want to eliminate “\t” and “;”

But I don't know how to introduce the missing values (NA)

I want something like this:

AA => 1,2,5

AB => 2,2,1

AC => 3,NA,1

Does anyone know how to insert the “NA” string in its correct position? (My real file is much bigger, with a lot of “NA” values.)

Thank you very much.

Re: Adding missing values into a hash
by poj (Priest) on Jun 18, 2014 at 19:57 UTC
    Try
    #!perl
    use strict;
    use Text::CSV;

    my %info=();
    my $line_count=0;
    while (my $line = <DATA>){
      chomp($line);
      if ($line =~ /##INFO=<ID=([^,]+)/){
        # metadata line: register the key with an empty column
        $info{$1}=[];
      } else {
        # data line: drop the row number, keep the KEY=VALUE pairs
        my (undef,%hash) = split /[\t;=]/,$line;
        for (keys %info){
          push @{$info{$_}},$hash{$_} || 'NA';
        }
        ++$line_count;
      }
    }

    my $csv = Text::CSV->new ( {binary=>1, eol=>"\012"} )
      or die "Cannot use CSV: ".Text::CSV->error_diag();
    open my $fh,'>','output.csv'
      or die "Could not open output.csv $!";

    my @col_head = sort keys %info;
    $csv->print($fh, \@col_head);
    for my $i (1..$line_count){
      my @row = map { $info{$_}[$i-1] } @col_head;
      $csv->print($fh, \@row);
    }
    # NOTE: in the rows below, the number and the data must be separated by a tab
    __DATA__
    ##INFO=<ID=AA,
    ##INFO=<ID=AB,
    ##INFO=<ID=AC,
    1 AA=1;AB=2;AC=3
    2 AA=2;AB=2
    3 AA=5;AB=1;AC=1
    poj

      Thank you very much for your answer :)

      Your answer looks very interesting, but for some reason I get too many "NA" values.

      With the first part of the script I get, for example,

      AA=> NA, NA,1,2,5,NA,NA,NA,NA,NA

      instead of

      AA=> 1,2,5

      Perhaps it is related to

      my (undef,%hash) = split /[\t;=]/,$line;

      because you are splitting three times, I don't know.

      The final csv is

      AA => NA,NA,NA

      AB => NA,NA,NA

      AC => NA,NA,NA

      Perhaps this is caused by the same problem with the "NA" values.
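
On the question of whether the split runs three times: it is a single call with a character class, and it returns one flat list. The sketch below is only an illustration of that behaviour; `parse_row` is a hypothetical helper name, and the sample line assumes a real tab between the row number and the data.

```perl
use strict;
use warnings;

# Hypothetical helper: split one data line into its row number
# and a hash of KEY => VALUE pairs.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    # One split call: the character class breaks the line at every tab,
    # ';' or '=', so "2\tAA=2;AB=2" becomes ("2", "AA", "2", "AB", "2").
    my ($num, %pairs) = split /[\t;=]/, $line;
    return ($num, \%pairs);
}

my ($num, $pairs) = parse_row("2\tAA=2;AB=2\n");
print "row $num: ", join(', ', map { "$_=$pairs->{$_}" } sort keys %$pairs), "\n";
```

For a well-formed data line the first element is the row number and the rest pair up as keys and values, so the split by itself is unlikely to be the source of the extra "NA" entries.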

        Do you have other lines in the file apart from those like
        ##INFO=<ID=AA, and 1 AA=1;AB=2;AC=3 ? Blank lines for example.

        poj
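
If the real file does contain other lines, one possible guard is to classify each line before using it, so that anything that is neither an ##INFO line nor a data row is ignored instead of adding a row of "NA". This is a sketch under assumptions: `classify_line` is a hypothetical name, the `##OTHER=` line is invented for the demonstration, and a data row is assumed to start with digits followed by whitespace and a KEY=VALUE pair.

```perl
use strict;
use warnings;

# Hypothetical classifier: returns 'meta', 'data' or 'skip' for one line.
sub classify_line {
    my ($line) = @_;
    return 'meta' if $line =~ /^##INFO=<ID=[^,]+/;
    # Assumed shape of a data row: digits, whitespace, then KEY=VALUE.
    return 'data' if $line =~ /^\d+\s+\w+=/;
    return 'skip';    # blank lines, the "Num Data" header, other ## lines
}

# Sample lines; "##OTHER=something" is made up for illustration.
my @sample = ("##INFO=<ID=AA,", "##OTHER=something", "", "Num\tData", "1\tAA=1;AB=2");
for my $line (@sample) {
    printf "%-20s -> %s\n", "'$line'", classify_line($line);
}
```

In the main loop, only lines classified as 'data' would be split and pushed onto the columns.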
Re: Adding missing values into a hash
by McA (Curate) on Jun 18, 2014 at 19:58 UTC

    Hi,

    in your inner loop you have to replace the last line

    my ($string, $int) = (split(/\;/, $dat));

    by the following

    my @elements = split /;/, $data;
    my %rowvalues;
    foreach my $element (@elements) {
        my ($key, $value) = split /=/, $element;
        $rowvalues{$key} = $value;
    }
    # every known key gets either the value from this row or 'NA'
    foreach my $key (keys %info) {
        if (exists $rowvalues{$key}) {
            push @{$info{$key}}, $rowvalues{$key};
        } else {
            push @{$info{$key}}, 'NA';
        }
    }

    I hope that is it. I haven't tested. Please put code tags around your sample data so we can see the structure better.

    Regards
    McA

      Thank you very much for your answer :)

      I was trying something similar, but the problem is that I end up with a hash of hashes and I don't know why. It is impossible to work with.

        Hi

        I was wondering about your answer, so I made this self-contained snippet, which should show the relevant elements.

        #!/bin/env perl
        use strict;
        use warnings;
        use 5.010;

        my %info;
        while (my $line = <DATA>) {
            chomp $line;
            if ($line =~ /##INFO=<ID=/) {
                my ($first, $second) = split /,/, $line;
                my ($firstsecond, $secondsecond) = split /ID=/, $first;
                $info{$secondsecond}=();
            } elsif ($line !~ /#/) {
                my ($numbers, $data) = split /\s+/, $line;
                foreach my $dat ($data){
                    my @elements = split /;/, $data;
                    my %rowvalues;
                    foreach my $element (@elements) {
                        my ($key, $value) = split /=/, $element;
                        $rowvalues{$key} = $value;
                    }
                    foreach my $key (keys %info) {
                        if(exists $rowvalues{$key}) {
                            push @{$info{$key}}, $rowvalues{$key};
                        } else {
                            push @{$info{$key}}, 'NA';
                        }
                    }
                }
            } else {
                next;
            }
        }

        foreach my $header (sort keys %info) {
            say $header, ' => ', join(',', @{$info{$header}});
        }
        __DATA__
        # First the headers
        ##INFO=<ID=AA,
        ##INFO=<ID=AB,
        ##INFO=<ID=AC,
        # then the data
        1 AA=1;AB=2;AC=3
        2 AA=2;AB=2
        3 AA=5;AB=1;AC=1

        I hope this clarifies what was said before. I changed one split from '\t' to '\s+' because pasting this code here would probably destroy the tab character.

        Regards
        McA

Re: Adding missing values into a hash
by Laurent_R (Parson) on Jun 18, 2014 at 20:07 UTC
    First, your output is not exactly a CSV (comma separated value) format.

    Second, you would probably be better off storing your metadata keys in an array rather than a hash, because an array preserves the order of the data (a hash does not).

    Third, a regex might be simpler than a split if you just want to remove the trailing comma:

    while (my $line = <I1>) {
        chomp $line;
        if ($line =~ /##INFO=<ID=/) {
            $line =~ s/,$//;
            # ...
        }
    }

    Fourth, I do not see any \t in your input.

    Fifth, reading the file twice does not seem to be a very good idea. Can't you decide, based on the content, that you have finished reading the metadata and started to read the data?
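
The second, third and fifth points can be combined into one pass over the input. The following is a minimal sketch, not a definitive implementation: `collect_columns` is a hypothetical name, the lines are passed in as a list rather than read from a file, and a data row is assumed to be a number, whitespace, then the KEY=VALUE field. The content of each line decides whether it is metadata or data, so nothing is read twice.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator //

# Hypothetical routine: one pass over the lines, deciding per line
# whether it is metadata or data.
sub collect_columns {
    my @lines = @_;
    my (@order, %info);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^##INFO=<ID=([^,]+)/) {
            push @order, $1;      # the array remembers the header order
            $info{$1} = [];       # the hash holds one column per key
        }
        elsif ($line =~ /^\d+\s+(\S+)/) {
            my $data = $1;
            my %row = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
                      split /;/, $data;
            # one entry per known key and row, 'NA' where the row lacks the key
            push @{ $info{$_} }, $row{$_} // 'NA' for @order;
        }
    }
    return (\@order, \%info);
}

my ($order, $info) = collect_columns(
    "##INFO=<ID=AA,", "##INFO=<ID=AB,", "##INFO=<ID=AC,",
    "1\tAA=1;AB=2;AC=3", "2\tAA=2;AB=2", "3\tAA=5;AB=1;AC=1",
);
print "$_ => ", join(',', @{ $info->{$_} }), "\n" for @$order;
```

The regex capture `([^,]+)` takes the key directly, which replaces both splits on the metadata line.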

      Thank you very much for your answer :)

      You are right about the CSV format; I was thinking of an Excel sheet.

      I think that storing my metadata keys in an array rather than a hash would be better. The problem is that I don't know how to associate the metadata in the array with the values without using a hash.

      Your third and fifth points look very interesting, but I am a "noob" and I don't know how to do it :(
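
One way to associate the two is to use both structures: the array only fixes the column order, while a small per-row hash is used for lookup. The sketch below assumes the keys have already been collected into `@order`; `row_to_csv` is a hypothetical helper name. It joins with plain commas, which is adequate only if no value contains a comma or a quote; Text::CSV, as used earlier in the thread, handles quoting properly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one "KEY=VAL;KEY=VAL" field into a CSV line,
# with columns in the order given by @$order and 'NA' for absent keys.
sub row_to_csv {
    my ($order, $data) = @_;
    my %row;
    for my $pair (split /;/, $data) {
        my ($key, $value) = split /=/, $pair, 2;
        $row{$key} = $value;
    }
    return join ',', map { exists $row{$_} ? $row{$_} : 'NA' } @$order;
}

my @order = qw(AA AB AC);          # would be filled from the ##INFO lines
print join(',', @order), "\n";     # header line
print row_to_csv(\@order, $_), "\n"
    for 'AA=1;AB=2;AC=3', 'AA=2;AB=2', 'AA=5;AB=1;AC=1';
```

Because each row is printed as soon as it is parsed, the columns never need to be stored, which may matter for a large file.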

Re: Adding missing values into a hash
by sundialsvc4 (Abbot) on Jun 19, 2014 at 13:40 UTC

    Also, in general, here’s a way to introduce missing keys into a hash:   (extemporaneous coding, might not compile)

    foreach my $key (qw/AA BB CC/) {
        $myrecord->{$key} = 'NA' unless exists($myrecord->{$key});
    }

    You would do this after the loop that built the hash from the inputs.   Two things to note here:

    1. The unless suffix might be new to you, but it’s very handy in situations exactly like this . . .
    2. The use of the function exists() is very important, because you really do want to know whether the key exists in the hash, regardless of whether its value (if present) would evaluate to True or to False.  

    This logic will loop through the list of keys (AA, BB, CC), and insert the value 'NA' for each of them if they’re not there.

    Obviously, this is not the only way to do it. For instance, you might not want to corrupt the values of the hash by sticking NA strings into it, so you might instead copy the values to another hash (initialized to an empty hash before the start of the loop). The code would use an if statement instead of unless, but it would still use exists() as shown. Such logic could be used, with a known-complete list of all keys that could occur, to copy only those keys into an output hash, inserting NA into that hash for any missing values. If for any reason you do not want to trust that the source hash contains only keys that you are interested in, that would be one way to accommodate that.
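
The copy-into-a-new-hash variant described above might look like the following sketch. `fill_missing` is a hypothetical name, and the sample record is invented to show two things: a value of 0 survives because the test is exists() rather than truthiness, and a key outside the wanted list (ZZ here) is not copied.

```perl
use strict;
use warnings;

# Hypothetical helper: copy only the wanted keys into a new hash,
# filling in 'NA' where the source record has no such key.
sub fill_missing {
    my ($record, @wanted) = @_;
    my %out;
    for my $key (@wanted) {
        $out{$key} = exists $record->{$key} ? $record->{$key} : 'NA';
    }
    return \%out;
}

# Invented sample record: AA is present but false, CC is absent, ZZ is unwanted.
my $filled = fill_missing({ AA => 0, BB => 7, ZZ => 9 }, qw(AA BB CC));
print "$_=$filled->{$_}\n" for sort keys %$filled;
```

Written with `$record->{$key} || 'NA'` instead, the AA value of 0 would be replaced by 'NA', which is the case the exists() test avoids.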
