PerlMonks  

Multidimensional hash help!

by ila14 (Initiate)
on Mar 04, 2014 at 11:38 UTC (#1076856)
ila14 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am sure you have heard this all before, but I shall say it again: I am a new user of Perl and would like help defining my hash. I have an input text file containing 11 columns of information that I have initialized. A snippet of the file and my code are shown below.

File:

    ID  Symbol  Taxon  Taxon Name  Evidence  GO ID  GO Name  Aspect  Reference  With  Source
    H1SXX9  Symbol1  12345  Homo Sapiens  IEA  GO:0015031  protein transport  Process  GO_REF:0000002  InterPro:IPR027282  InterPro
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0003824  catalytic activity  Function  GO_REF:0000002  InterPro:IPR003607  InterPro
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0008152  metabolic process  Process  GO_REF:0000002  InterPro:IPR002912  InterPro
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0008728  GTP diphosphokinase activity  Function  GO_REF:0000003  EC:2.7.6.5  UniProt
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0015969  guanosine tetraphosphate metabolic process  Process  GO_REF:0000002  InterPro:IPR004811|InterPro:IPR007685  InterPro
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0016301  kinase activity  Function  GO_REF:0000038  UniProtKB-KW:KW-0418  UniProt
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0016310  phosphorylation  Process  GO_REF:0000038  UniProtKB-KW:KW-0418  UniProt
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0016597  amino acid binding  Function  GO_REF:0000002  InterPro:IPR002912  InterPro
    H1SXZ5  Symbol2  12345  Homo Sapiens  IEA  GO:0016740  transferase activity  Function  GO_REF:0000038  UniProtKB-KW:KW-0808  UniProt
    H1SXZ8  Symbol3  12345  Homo Sapiens  IEA  GO:0006812  cation transport  Process  GO_REF:0000002  InterPro:IPR002524  InterPro
    H1SXZ8  Symbol3  12345  Homo Sapiens  IEA  GO:0008324  cation transmembrane transporter activity  Function  GO_REF:0000002  InterPro:IPR002524  InterPro
    H1SXZ8  Symbol3  12345  Homo Sapiens  IEA  GO:0030001  metal ion transport  Process  GO_REF:0000002  InterPro:IPR006121  InterPro
    H1SXZ8  Symbol3  12345  Homo Sapiens  IEA  GO:0046872  metal ion binding  Function  GO_REF:0000002  InterPro:IPR006121  InterPro
    H1SXZ8  Symbol3  12345  Homo Sapiens  IEA  GO:0055085  transmembrane transport  Process  GO_REF:0000002  InterPro:IPR002524  InterPro
    H1SY02  Symbol4  12345  Homo Sapiens  IEA  GO:0006810  transport  Process  GO_REF:0000002  InterPro:IPR002898  InterPro
    H1SY02  Symbol4  12345  Homo Sapiens  IEA  GO:0006810  transport  Process  GO_REF:0000038  UniProtKB-KW:KW-0813  UniProt
    H1SY02  Symbol4  12345  Homo Sapiens  IEA  GO:0008565  protein transporter activity  Function  GO_REF:0000002  InterPro:IPR002898  InterPro
    H1SY02  Symbol4  12345  Homo Sapiens  IEA  GO:0015031  protein transport  Process  GO_REF:0000038  UniProtKB-KW:KW-0653  UniProt
    H1SY06  Symbol5  12345  Homo Sapiens  IEA  GO:0004129  cytochrome-c oxidase activity  Function  GO_REF:0000002  InterPro:IPR000883|InterPro:IPR004677|InterPro:IPR023615|InterPro:IPR023616  InterPro
    H1SY06  Symbol5  12345  Homo Sapiens  IEA  GO:0004129  cytochrome-c oxidase activity  Function  GO_REF:0000003  EC:1.9.3.1  UniProt
    H1SY06  Symbol5  12345  Homo Sapiens  IEA  GO:0005506  iron ion binding  Function  GO_REF:0000002  InterPro:IPR000883  InterPro
Code:

    open(IN,$annotationfile) or die "Can't open $annotationfile\n";
    while(<IN>){
        chomp;
        @data = split(/\t/,$_);
        $Column1 = @data[0];
        $Column2 = @data[1];
        $Column3 = @data[2];
        $Column4 = @data[3];
        $Column5 = @data[4];
        $Column6 = @data[5];
        $Column7 = @data[6];
        $Column8 = @data[7];
        $Column9 = @data[8];
        $Column10 = @data[9];
        $Column11 = @data[10];
        print "$Column1\t$Column2....\t$Column11\n";
        foreach $_ (1..$#data){
            $GOHash{"$Symbols"}{"$GO_Names"} = "$IDs";
            foreach $Symbols (@data) {
                foreach my $name (sort {$a <=> $b} (keys %GOHash) ) {
                    foreach my $annotation (keys %{ $GOHash{$name} }) {
                        print "$name, $annotation: $GOHash{$name}{$annotation}\n";
                        close(IN);
                    }
                }
            }
        }
    }

I know that my columns are initialized because I see the correct information when I print. I am experiencing problems creating a multidimensional hash. Sorry that the syntax is not correct; I did it for simplicity's sake. However, this is what I would like to hash: % hash1 = Column1 => Column 2. %hash2 = %hash1 => Column3. %hash3 = %hash2 => Column 4.

I would be thankful for any help/advice.

Kind regards,
Ila14

Replies are listed 'Best First'.
Re: Multidimensional hash help!
by Tux (Abbot) on Mar 04, 2014 at 12:22 UTC
    • Why do you read into @data and then assign to new variables for every line?
    • Why do you use slices when reading from @data? (@data[1] is better written as $data[1], which Perl would have told you had you used strict and warnings.)
    • Why do you quote scalars? (no need to put "'s around $Symbols)
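
The three points above can be illustrated with a minimal sketch. The sample record is abbreviated from the posted data, and the variable names ($id, $symbol, %seen and so on) are illustrative assumptions rather than anything from the original script.

```perl
use strict;
use warnings;

# One record abbreviated from the posted data (first five columns only).
my $line = "H1SXX9\tSymbol1\t12345\tHomo Sapiens\tIEA";

# Assign the fields straight into named variables in one list assignment,
# instead of copying them out of @data one by one.
my ($id, $symbol, $taxon, $taxon_name, $evidence) = split /\t/, $line;

# A single element is written $data[1]. Writing @data[1] is a one-element
# slice and, under warnings, produces
# "Scalar value @data[1] better written as $data[1]".
my @data   = split /\t/, $line;
my $second = $data[1];

# A plain scalar used as a hash key needs no surrounding quotes.
my %seen;
$seen{$symbol} = $id;

print "$symbol => $seen{$symbol}\n";
```

With strict enabled, every variable has to be declared with my, which also catches misspelled or never-assigned names at compile time.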

    Read ahead and see if that makes some sense:

        use 5.16.2;
        use warnings;

        my $annotationfile = "file.tsv";
        open my $fh, "<", $annotationfile or die "$annotationfile: $!\n";

        # First read the header (chomped, so the last column name
        # does not keep its trailing newline)
        chomp (my $hdr = <$fh>);
        my @hdr = split m/\t/ => $hdr;

        my %GOHash;
        # Now read every line
        while (<$fh>) {
            chomp;
            # read as a hash
            my %hash;
            @hash{@hdr} = split m/\t/ => $_, 11;
            $GOHash{$hash{Symbol}}{$hash{"Taxon Name"}} = $hash{ID};
        }
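
The line that assigns to @hash{@hdr} in the script above is a hash slice: a list of values is assigned to a list of keys in one statement. As a small self-contained sketch of just that step (the four-column header and the record are abbreviated from the posted file, and %row is an illustrative name):

```perl
use strict;
use warnings;

# Header names and one record, abbreviated from the posted file.
my @hdr    = ("ID", "Symbol", "Taxon", "Taxon Name");
my $record = "H1SXX9\tSymbol1\t12345\tHomo Sapiens";

# Hash slice: the first value goes to the first key, the second value
# to the second key, and so on.
my %row;
@row{@hdr} = split /\t/, $record;

print "$row{Symbol} ($row{'Taxon Name'}) has ID $row{ID}\n";
```

Addressing fields by column name rather than by position keeps the code readable and means a reordered file only requires the header to be correct.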

    FWIW you can read the whole file into an array of hashes in one single statement using a recent Text::CSV_XS:

        use 5.16.2;
        use warnings;
        use Text::CSV_XS qw( csv );

        my $AoH = csv (in => "file.tsv", sep_char => "\t", headers => "auto");
        foreach my $row (@$AoH) {
            print $row->{"GO ID"}, $row->{ID};
        }

    Enjoy, Have FUN! H.Merijn
Re: Multidimensional hash help!
by kcott (Chancellor) on Mar 04, 2014 at 15:50 UTC

    G'day ila14,

    Welcome to the monastery.

    My biggest problem with this is determining the structure of the multidimensional hash you're trying to create.

    Your description, "I would like to hash: % hash1 = Column1 => Column 2. %hash2 = %hash1 => Column3. %hash3 = %hash2 => Column 4.", conveys no real meaning. Your code is not helpful either: given you've stated "the syntax is not correct", this isn't too surprising.

    From the code you've posted, I suspect you'd benefit from reading "perlintro -- a brief introduction and overview of Perl".

    For information on data structures, I suggest you read "perldsc - Perl Data Structures Cookbook"; paying particular attention to the "HASHES OF HASHES" section.

    The actual code you need may be as simple as this:

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        use constant {
            ID => 0,
            SYMBOL => 1,
            GO_ID => 5,
            GO_NAME => 6,
        };

        my $file = './pm_1076856.tsv';
        my %go_hash;

        open my $fh, '<', $file;

        while (<$fh>) {
            next if $. == 1;
            my @cols = split /\t/;
            $go_hash{$cols[SYMBOL]}{$cols[ID]}{$cols[GO_ID]} = $cols[GO_NAME];
        }

        use Data::Dump;
        dd \%go_hash;

    The file pm_1076856.tsv contains the input data you posted. Here's the output after running my example script:

        {
          Symbol1 => { H1SXX9 => { "GO:0015031" => "protein transport" } },
          Symbol2 => {
            H1SXZ5 => {
              "GO:0003824" => "catalytic activity",
              "GO:0008152" => "metabolic process",
              "GO:0008728" => "GTP diphosphokinase activity",
              "GO:0015969" => "guanosine tetraphosphate metabolic process",
              "GO:0016301" => "kinase activity",
              "GO:0016310" => "phosphorylation",
              "GO:0016597" => "amino acid binding",
              "GO:0016740" => "transferase activity",
            },
          },
          Symbol3 => {
            H1SXZ8 => {
              "GO:0006812" => "cation transport",
              "GO:0008324" => "cation transmembrane transporter activity",
              "GO:0030001" => "metal ion transport",
              "GO:0046872" => "metal ion binding",
              "GO:0055085" => "transmembrane transport",
            },
          },
          Symbol4 => {
            H1SY02 => {
              "GO:0006810" => "transport",
              "GO:0008565" => "protein transporter activity",
              "GO:0015031" => "protein transport",
            },
          },
          Symbol5 => {
            H1SY06 => {
              "GO:0004129" => "cytochrome-c oxidase activity",
              "GO:0005506" => "iron ion binding",
            },
          },
        }

    If that's close to what you want, try changing the hash depth and @cols indices to get whatever you require.
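
As one possible illustration of changing the hash depth, the sketch below builds a two-level hash (Symbol, then GO ID) instead of three levels. The records are held in an in-memory array abbreviated from the posted data rather than read from a file, and %by_symbol and @lines are illustrative names; which columns belong at which level is an assumption that depends on what you actually need.

```perl
use strict;
use warnings;

use constant { ID => 0, SYMBOL => 1, GO_ID => 5, GO_NAME => 6 };

# Three records abbreviated from the posted data (first seven columns).
my @lines = (
    "H1SXX9\tSymbol1\t12345\tHomo Sapiens\tIEA\tGO:0015031\tprotein transport",
    "H1SXZ5\tSymbol2\t12345\tHomo Sapiens\tIEA\tGO:0003824\tcatalytic activity",
    "H1SXZ5\tSymbol2\t12345\tHomo Sapiens\tIEA\tGO:0008152\tmetabolic process",
);

my %by_symbol;
for my $line (@lines) {
    my @cols = split /\t/, $line;
    # Two levels instead of three: Symbol -> GO ID -> GO Name.
    $by_symbol{ $cols[SYMBOL] }{ $cols[GO_ID] } = $cols[GO_NAME];
}

# One loop per level of the hash.
for my $symbol (sort keys %by_symbol) {
    for my $go_id (sort keys %{ $by_symbol{$symbol} }) {
        print "$symbol\t$go_id\t$by_symbol{$symbol}{$go_id}\n";
    }
}
```

Adding a level means adding one more subscript on the assignment and one more nested loop when printing; removing a level means dropping one of each.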

    If that's completely different from what you're after, and you still can't work out what code you need, reduce your example data to a more manageable size for demonstration purposes (maybe half a dozen records) and post the actual data structure you require (something along the lines of my posted output would be preferable).

    Also take a look at the guidelines in "How do I post a question effectively?" for hints and tips on what you can do to help us to help you.

    -- Ken

      I do have a further question. I tried to traverse my hash manually so that I can print the keys separated by newlines and tabs; however, because the hash has more than two dimensions, I am having trouble (http://perlmaven.com/multi-dimensional-hashes). Here is the piece of code that I wrote:

      #foreach my $symb (sort keys %go_hash) {
      #    foreach my $UniID (keys %{ $go_hash{$symb} }) {
      #        foreach my $TaX (keys %{ $go_hash{$symb}{$UniID} }) {
      #            print "$symb, $UniID, $go_hash{$symb}{$UniID}{$TaX}\n";
      #        }
      #    }
      #}

      and here is a sample output:

      Hey, HSWZH7, HASH(0x7fdeb30dc310)
      how, HSX0L1, HASH(0x7fdeb3169768)
      are, HSX1I1, HASH(0x7fdeb31784b0)
      you, HSX4J3, HASH(0x7fdeb31784b0)

      The "prettiest" I have made my output using Data::Dumper so far is to set the indent to 1 and the pair to "\t", as shown below.

      $Data::Dumper::Pair = " \t ";
      $Data::Dumper::Indent = 1;

      Thanks again.

        Thanks for fixing the formatting; however, instead of creating a new post you can just edit the original (see "How do I change/delete my post?"). Don't worry about the original (Re^2: Multidimensional hash help!): I've requested that it be reaped.

        When you post, please be specific about what you're doing. In this case, I've guessed $symb refers to the Symbol column. $UniID and $TaX are unclear (there are two columns with ID and two with Taxon): UniID and TaX may be standard abbreviations where you work (or generally in your industry) but I don't work in your industry, nor do most of the people here who could help you.

        The number of levels of the hash shouldn't be an issue: I'm guessing another nested for loop would've accessed all the data. In the script below, I've added another level (to what I had in my previous script) and shown how to print the fields. [For future reference, you'll find logically indenting your code makes it a lot easier to read and maintain (compare your code with mine).]

        Data::Dump (which I used in my previous script) is a CPAN module which you may need to install. Data::Dumper is a built-in module. I've shown usage examples of both for comparison; you'll need to click on "Reveal this spoiler" to see the output.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        use constant {
            ID => 0,
            SYMBOL => 1,
            TAXON_NAME => 3,
            GO_ID => 5,
            GO_NAME => 6,
        };

        my $file = './pm_1076856.tsv';
        my %go_hash;

        open my $fh, '<', $file;

        while (<$fh>) {
            next if $. == 1;
            my @cols = split /\t/;
            $go_hash{$cols[SYMBOL]}{$cols[ID]}{$cols[TAXON_NAME]}{$cols[GO_ID]} = $cols[GO_NAME];
        }

        close $fh;

        for my $symbol (sort keys %go_hash) {
            for my $id (sort keys %{$go_hash{$symbol}}) {
                for my $taxon_name (sort keys %{$go_hash{$symbol}{$id}}) {
                    for my $go_id (sort keys %{$go_hash{$symbol}{$id}{$taxon_name}}) {
                        print join("\t" => $symbol, $id, $taxon_name, $go_id,
                            $go_hash{$symbol}{$id}{$taxon_name}{$go_id}
                        ), "\n";
                    }
                }
            }
        }

        {
            print "\nData::Dumper Output:\n";
            use Data::Dumper;
            local $Data::Dumper::Indent = 1;
            print Dumper \%go_hash;
        }

        print "\nData::Dump Output:\n";
        use Data::Dump;
        dd \%go_hash;

        Output:

        Symbol1  H1SXX9  Homo Sapiens  GO:0015031  protein transport
        Symbol2  H1SXZ5  Homo Sapiens  GO:0003824  catalytic activity
        Symbol2  H1SXZ5  Homo Sapiens  GO:0008152  metabolic process
        Symbol2  H1SXZ5  Homo Sapiens  GO:0008728  GTP diphosphokinase activity
        Symbol2  H1SXZ5  Homo Sapiens  GO:0015969  guanosine tetraphosphate metabolic process
        Symbol2  H1SXZ5  Homo Sapiens  GO:0016301  kinase activity
        Symbol2  H1SXZ5  Homo Sapiens  GO:0016310  phosphorylation
        Symbol2  H1SXZ5  Homo Sapiens  GO:0016597  amino acid binding
        Symbol2  H1SXZ5  Homo Sapiens  GO:0016740  transferase activity
        Symbol3  H1SXZ8  Homo Sapiens  GO:0006812  cation transport
        Symbol3  H1SXZ8  Homo Sapiens  GO:0008324  cation transmembrane transporter activity
        Symbol3  H1SXZ8  Homo Sapiens  GO:0030001  metal ion transport
        Symbol3  H1SXZ8  Homo Sapiens  GO:0046872  metal ion binding
        Symbol3  H1SXZ8  Homo Sapiens  GO:0055085  transmembrane transport
        Symbol4  H1SY02  Homo Sapiens  GO:0006810  transport
        Symbol4  H1SY02  Homo Sapiens  GO:0008565  protein transporter activity
        Symbol4  H1SY02  Homo Sapiens  GO:0015031  protein transport
        Symbol5  H1SY06  Homo Sapiens  GO:0004129  cytochrome-c oxidase activity
        Symbol5  H1SY06  Homo Sapiens  GO:0005506  iron ion binding
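
A value printed as HASH(0x...) is a reference to a further hash, which means there is at least one more level to loop over. If the depth is not known in advance, one alternative to hard-coding a loop per level is a recursive walk. The following is only a sketch: walk() is a hypothetical helper, not part of any module, and the small %go_hash here is an assumed stand-in with the same shape as the four-level structure discussed in this thread.

```perl
use strict;
use warnings;

# Small stand-in structure: Symbol -> ID -> Taxon Name -> GO ID -> GO Name.
my %go_hash = (
    Symbol1 => { H1SXX9 => { 'Homo Sapiens' => {
        'GO:0015031' => 'protein transport',
    } } },
    Symbol5 => { H1SY06 => { 'Homo Sapiens' => {
        'GO:0004129' => 'cytochrome-c oxidase activity',
        'GO:0005506' => 'iron ion binding',
    } } },
);

# Walk a nested hash of any depth and return one tab-separated line per leaf.
sub walk {
    my ($node, @path) = @_;

    # Anything that is not a hash reference is a leaf value:
    # emit the keys that led to it, followed by the value itself.
    return join("\t", @path, $node) unless ref($node) eq 'HASH';

    # Otherwise descend one level for every key, in sorted order.
    return map { walk($node->{$_}, @path, $_) } sort keys %$node;
}

my @lines = walk(\%go_hash);
print "$_\n" for @lines;
```

The explicit nested loops remain easier to follow when the depth is fixed; the recursive form is mainly useful when the number of levels may change.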

        -- Ken

      Hello Ken,

      Thank you for your response. I am a biologist and have only started using perl over the past month so am feeling a little lost. Your code does help a lot and is similar to what I require. I would need to add an additional level to have it complete, and I shall read the references you posted.

      Thank you.
      ila

        "I am a biologist and have only started using perl over the past month so am feeling a little lost."

        I recommend you bookmark "perl - The Perl 5 language interpreter".

        From this page, you'll find links to the documentation for all the built-in functions, modules and other parts of the language as well as FAQs, tutorials and other resources.

        Rather than attempting to read everything at once (a particularly daunting endeavour), I suggest you familiarise yourself with the various sections and what they provide: this is a much simpler task and will allow you to quickly access information as and when you need it.

        Having said that, you'd probably benefit from reading "perlintro -- a brief introduction and overview of Perl" in its entirety.

        -- Ken

