http://www.perlmonks.org?node_id=1048406

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello brothers,

I'm trying to figure out the best way to load a probability table into Perl. Here's my code, which seems to work, but there has to be a better and easier way to do this!

#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
##########################
my %hash;
my %hash2;
my @a1;
chomp(my $header_line = <>);    # strip the newline so the last header is a clean key
my @headers = split(/\t/, $header_line);
my $index = 0;
my %col_header = map { $_ => $index++ } @headers[1..$#headers];
#
#print Dumper \%col_header;
while (<>) {
    chomp;
    @a1 = split(/\t/, $_);
    my $fromAA = shift(@a1);
    foreach my $toAA (keys %col_header) {
        $hash{$fromAA}{$toAA} = $a1[$col_header{$toAA}];
    }
}
print Dumper \%hash;

And here is an example set of data in tab separated format (c/p from excel sheet)

Amino Acid Switch Probabiities	AAA	AAC	AAG	AAT	ACA	ACC	ACG	ACT	AGA	AGC	AGG	AGT	ATA	ATC	ATG	ATT
AAA	0.40849	0.01506	0.26198	0.01904	0.01527	0.0065	0.01149	0.00774	0.09164	0.00886	0.06529	0.01076	0.0066	0.00143	0.00546	0.00199
AAC	0.011	0.41485	0.00959	0.25289	0.00916	0.01865	0.0096	0.01375	0.00594	0.06004	0.00586	0.04124	0.00198	0.00335	0.00212	0.00227
AAG	0.29591	0.01484	0.46315	0.01686	0.01135	0.00675	0.01565	0.00657	0.06807	0.00889	0.09736	0.00994	0.00415	0.00151	0.00852	0.00177
AAT	0.0123	0.22372	0.00965	0.3545	0.00971	0.01103	0.00955	0.01893	0.00611	0.03574	0.00553	0.06048	0.00225	0.00204	0.00231	0.00365
ACA	0.00913	0.0075	0.006	0.00899	0.25029	0.13459	0.19326	0.15274	0.00817	0.0142	0.00526	0.01523	0.01986	0.00584	0.01536	0.00737
ACC	0.00368	0.01445	0.00338	0.00966	0.12735	0.27128	0.15326	0.14524	0.00311	0.02754	0.00306	0.01691	0.00817	0.01213	0.00809	0.00755
ACG	0.0024	0.00274	0.00289	0.00309	0.06746	0.05654	0.0985	0.05631	0.0019	0.00596	0.00256	0.00568	0.00401	0.00227	0.00719	0.00266
ACT	0.00395	0.0096	0.00296	0.01494	0.13029	0.13094	0.1376	0.2089	0.00312	0.01788	0.00272	0.02573	0.00854	0.00653	0.00841	0.01267
AGA	0.04383	0.00389	0.02882	0.00452	0.00654	0.00263	0.00435	0.00293	0.28665	0.00702	0.17898	0.00863	0.00367	0.00076	0.00259	0.00093
AGC	0.00598	0.05551	0.00531	0.03735	0.01604	0.03288	0.01929	0.02368	0.00991	0.3642	0.01051	0.23871	0.00275	0.00449	0.0028	0.00327
AGG	0.0274	0.00337	0.03617	0.00359	0.00369	0.00227	0.00515	0.00224	0.15703	0.00653	0.25775	0.00725	0.00203	0.00066	0.00346	0.00076
AGT	0.00519	0.02725	0.00425	0.04518	0.0123	0.01443	0.01315	0.02436	0.00871	0.17062	0.00833	0.28774	0.00236	0.0019	0.00217	0.00399
ATA	0.00219	0.0009	0.00122	0.00115	0.01102	0.00479	0.00637	0.00556	0.00254	0.00135	0.0016	0.00162	0.19608	0.0846	0.02113	0.08774
ATC	0.00101	0.00324	0.00095	0.00224	0.00691	0.01517	0.0077	0.00905	0.00112	0.0047	0.00111	0.00279	0.18032	0.36846	0.02111	0.22364
ATG	0.00423	0.00224	0.00584	0.00277	0.01992	0.01108	0.02671	0.01278	0.00419	0.00321	0.00639	0.00348	0.04935	0.02313	0.58867	0.0273
ATT	0.00123	0.00193	0.00097	0.0035	0.00763	0.00826	0.00788	0.01538	0.0012	0.003	0.00113	0.00512	0.16368	0.19576	0.02181	0.31668

Re: Best way to read in an XbyX table into a Hash{Key}{Key2}[value] structure
by rjt (Curate) on Aug 07, 2013 at 19:01 UTC
    (c/p from excel sheet)

    Given that, you may want to consider Spreadsheet::ParseExcel to read the data in directly from your Excel spreadsheet.

    Edit: Thanks to the OP for helping me understand the requirements. I now believe this will be much more in line with what you're after:

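    use feature 'say';                  # needed for say()

    chomp(my $header = <>);             # strip the newline from the last column name
    my (undef, @col) = split /\t/, $header;    # Column names
    my %prob_map;
    while (my $line = <>) {
        chomp $line;
        my ($from, @values) = split /\t/, $line;
        @{$prob_map{$from}}{@col} = @values;   # hash slice: one row at a time
    }
    say "AAC to AAA = " . $prob_map{AAC}{AAA};
    say "AAA to AAC = " . $prob_map{AAA}{AAC};
    __END__
    AAC to AAA = 0.011
    AAA to AAC = 0.01506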

    Previous line-based (i.e., row major) suggestion is below.


    Otherwise, parsing the plain text you've provided is fairly straightforward as well:

    chomp(my @col = split /\t/, <>);    # Column headings
    my @lines = map { chomp; my %l; @l{@col} = split /\t/; \%l } <>;

    However, whether that's actually an improvement on your code or not is debatable. :-)

      Thank you for your help!

      So it's not usually from Excel sheets; I just mentioned that in case there was Excel markup left over, like \r returns or something.

      One question, though: I don't understand how I then access the information using your method. For example, if I wanted the probability of AAC to AAA (0.011), how do I get to that value? You've created an array where each element holds a ref to a hash, which in turn holds the key/value pairs of each row keyed by column, correct? How do I access the info for individual cases?

        Thanks for the follow-up. I believe I now understand your requirements. Since my original solution was almost certainly not what you were after, I added a better one to my original node, which will make it possible to do the from/to lookups you need. Mea culpa; it's been a while since I looked at a probability table. :-)
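For the row-major structure from the earlier suggestion, each element of @lines is a hash reference keyed by column heading, so the row label sits under the first heading. A minimal sketch of a from/to lookup under that layout follows. It assumes a three-codon excerpt of the sample data held in a string; the $tsv variable and the lookup() helper are illustrative and not part of the original node.

```perl
use strict;
use warnings;

# Illustrative excerpt of the sample table (tab-separated), not the full data.
my $tsv = join "\n",
    join("\t", 'Amino Acid Switch Probabiities', qw(AAA AAC AAG)),
    join("\t", qw(AAA 0.40849 0.01506 0.26198)),
    join("\t", qw(AAC 0.011 0.41485 0.00959)),
    join("\t", qw(AAG 0.29591 0.01484 0.46315));

my ($header, @rows) = split /\n/, $tsv;
my @col = split /\t/, $header;      # column headings, row-label heading first

# One hash ref per row, keyed by column heading (row-major layout).
my @lines = map { my %l; @l{@col} = split /\t/; \%l } @rows;

# Hypothetical helper: scan the rows for the one whose label matches $from.
sub lookup {
    my ($from, $to) = @_;
    my ($row) = grep { $_->{ $col[0] } eq $from } @lines;
    return defined $row ? $row->{$to} : undef;
}

print "AAC to AAA = ", lookup('AAC', 'AAA'), "\n";   # prints "AAC to AAA = 0.011"
```

Every lookup scans the whole array, which is why a hash keyed by the "from" codon is the better fit for repeated from/to queries.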

      my (undef, @col) = split /\s+/, <DATA>; # Column names
      When splitting on \s+ instead of \t, the first header value, 'Amino Acid Switch Probabiities', will be wrongly split into the @col array. You need to split on tabs to prevent this from happening.
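The difference is easy to see on the header line alone. The sketch below assumes a shortened three-codon header built with real tabs; the variable names are illustrative.

```perl
use strict;
use warnings;

# Header line as it would appear in the tab-separated file (three codons only).
my $header = join "\t", 'Amino Acid Switch Probabiities', qw(AAA AAC AAG);

my (undef, @by_space) = split /\s+/, $header;   # also splits inside the label
my (undef, @by_tab)   = split /\t/,  $header;   # splits only between fields

print "whitespace: @by_space\n";   # Acid Switch Probabiities AAA AAC AAG
print "tab:        @by_tab\n";     # AAA AAC AAG
```

With the whitespace split, three stray words end up as column names and every value in the row would be assigned to the wrong key.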

        Quite right. My local test copy of the sample data was based on the original post (which had the tabs squashed to spaces), and I forgot to update my split pattern when posting. Corrected now, thanks.

Re: Best way to read in an XbyX table into a Hash{Key}{Key2}[value] structure
by Cristoforo (Curate) on Aug 07, 2013 at 19:54 UTC
    Maybe not better, but a little shorter.
    #!/usr/bin/perl
    use strict;
    use warnings;

    chomp(my (undef, @headers) = split /\t/, <>);
    my %hash;
    while (<>) {
        chomp;
        my ($fromAA, @a1) = split /\t/;
        @{ $hash{$fromAA} }{@headers} = @a1;
    }
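If the file may carry leftovers from a spreadsheet export, such as the \r line endings mentioned earlier in the thread, the same hash-slice approach can be wrapped with a little checking. This is only a sketch under stated assumptions: load_table() is a hypothetical helper, the input is passed as a string rather than read from <>, and the sample is a two-codon excerpt with CRLF line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: build $table->{from}{to} from tab-separated text.
sub load_table {
    my ($text) = @_;
    my (@headers, %table, $seen_header);
    for my $line (split /\n/, $text) {
        $line =~ s/\r\z//;             # drop a trailing carriage return, if any
        next unless length $line;      # skip blank lines
        my ($label, @fields) = split /\t/, $line;
        if (!$seen_header) {           # first non-blank line holds the column names
            @headers     = @fields;
            $seen_header = 1;
            next;
        }
        die "Row '$label' has " . @fields . " values, expected " . @headers . "\n"
            unless @fields == @headers;
        @{ $table{$label} }{@headers} = @fields;   # hash slice, as in the reply above
    }
    return \%table;
}

# Two-codon excerpt of the sample data with Windows-style line endings.
my $text = join "\r\n",
    join("\t", 'Amino Acid Switch Probabiities', qw(AAA AAC)),
    join("\t", qw(AAA 0.40849 0.01506)),
    join("\t", qw(AAC 0.011 0.41485)),
    '';

my $table = load_table($text);
print "AAC to AAA = $table->{AAC}{AAA}\n";   # prints "AAC to AAA = 0.011"
print "AAA to AAC = $table->{AAA}{AAC}\n";   # prints "AAA to AAC = 0.01506"
```

The column-count check is a design choice rather than a requirement; it simply turns a ragged row into an immediate error instead of silently storing undef values.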