Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

Obtaining combinations of hash keys and values

by Anonymous Monk
on Apr 28, 2016 at 14:00 UTC ( #1161775=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a short script which splits a DNA-sequence (imported by BioPerl) by a user-provided pattern and subsequently populates a hash with modified versions of the resulting fragments along with a unique ID. Each fragment produced by split appears in the hash twice - in the original form and in a reverse complement form (F and R respectively).

#!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my %sequences; my $seqio = Bio::SeqIO->new(-file => $ARGV[0]); my $enz = $ARGV[1]; while(my $seqobj = $seqio->next_seq) { my $id = $seqobj->display_id; my $seq = $seqobj->seq; $sequences{$id} = $seq; } my @fragments; for my $value (values %sequences) { @fragments = split(/$enz/, $value); } my $Lfill = "TT"; my $Rfill = "AA"; my $ID = 0; my %bins; foreach my $RF (@fragments) { $ID++; $RF = $Lfill.$RF.$Rfill; $bins{$ID."F"} = $RF; (my $rev = $RF) =~ tr/ACGT/TGCA/; $bins{$ID."R"} = reverse($rev); }

Example Input:


Fragments Generated:


I now want to obtain unique combinations (of two) whilst excluding a fragment from combining with its own reverse cognate. Using the example above, it would produce:


But would not produce 1F1R or 2F2R. As shown above, both the keys of the involved fragments are combined as well as the values - and stored in a new hash.

I've tried using the CPAN modules Algorithm::Combinatorics and Math::Combinatorics but can't seem to adapt these to fit this task.

Does anybody have any code snippets, examples or suggestions that could help achieve this? If it helps: i'm very new to Perl.

Replies are listed 'Best First'.
Re: Obtaining combinations of hash keys and values
by choroba (Bishop) on Apr 28, 2016 at 16:39 UTC
    If the number of fragments you want to combine isn't dynamic (i.e. it's always 2), just write the nested loops:
    #!/usr/bin/perl use warnings; use strict; my %fragments = ( 1 => { F => 'TTAAGTAGCATCGATTTATAGCATCGACTAGTAA', R => 'TTACTAGTCGATGCTATAAATCGATGCTACTTAA', }, 2 => { F => 'TTAGCTACGATCAGCTACGATCGAGCGACTACGTAGCAA +', R => 'TTGCTACGTAGTCGCTCGATCGTAGCTGATCGTAGCTAA +', }, ); my %combinations; for my $fragment (keys %fragments) { for my $other_fragment (keys %fragments) { next if $fragment eq $other_fragment; for my $cognate (qw( F R )) { for my $other_cognate (qw( F R )) { $combinations{"$fragment$cognate$other_fragment$other_ +cognate"} = $fragments{$fragment}{$cognate} . $fragments{$other_fragment}{$other_cognate}; } } } } use Data::Dumper; print Dumper(\%combinations);

    This generates 8 entries, not four, as it creates 2F1F etc., too. You can modify the "next" condition to avoid it, i.e.

    next if $fragment ge $other_fragment;

    If you need to combine N fragments, see Algorithm::Loops.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thanks! How scalable would this prove to be if thousands of input fragments are used?
        For thousands of fragments, your result set could get large. Since you are obtaining 4 results for each combination, the total number of results would come to 4 * the total number of combinations.

        • 4 * C(1000,2) == 1,998,000
        • 4 * C(5000,2) == 12,497,500
        • 4 * C(10,000,2) == 199,980,000
        • 4 * C(20,000,2) == 799,960,000

        I guess you need to know how you want to use/analyze the results. Also if you would want to print the results to a file or store in an array or hash.

        Update: When I ran a test here against a sample fasta file, I generated 124 fragments and stored the combinations in a hash. I had a total memory use of 12,848,152 bytes for the 30,504 combinations (about 421 bytes per combination). So I'd guess that if you had 1000 or more fragments, you would probably exceed your memory.

        When I used an array instead of a hash, the memory used was slightly less, 10,888,016 bytes. (roughly 357 bytes per combination).

      I'm guessing I will need to print to an output file instead of storing combinations in a hash for large input amounts otherwise the RAM requirement is off the charts!
        You can't print before you've read the whole file, otherwise you don't know the counts. If this is really a problem (test it and see), you can store just line numbers in the hash instead of the full blocks, and then read the file for the second time and print the numbers (but only print the first block for each corresponding key!).

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Obtaining combinations of hash keys and values
by Anonymous Monk on Apr 28, 2016 at 17:38 UTC

    Why not keep the fragments as pairs. The pairs can then combine to produce four alternatives.


    Or, you could keep only the F fragments, but assemble the four FF FR RF RR variants right as you combine two fragments. For example, write a sub that accepts two keys and returns four pairs as a list. Breaking down the problem into subroutines is a smart idea in any case. Using Algorithm::Combinatorics combinations() is then possible since you have one key per fragment/cognate.

      Thank you. I had no considered passing it off to a subroutine - good excuse to learn how to use subs now!

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1161775]
Approved by Athanasius
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (2)
As of 2019-05-22 03:39 GMT
Find Nodes?
    Voting Booth?
    Do you enjoy 3D movies?

    Results (138 votes). Check out past polls.

    • (Sep 10, 2018 at 22:53 UTC) Welcome new users!