Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Speeding permutation counting

by albert (Monk)
on Jul 18, 2007 at 13:53 UTC ( #627253=perlquestion: print w/replies, xml ) Need Help??

albert has asked for the wisdom of the Perl Monks concerning the following question:

I have a series of strings (all of equal length) which contain only 0s and 1s, such as:
111011001010011110100010100001 111010000010010110000010000001 000101011100001000110101110000 000101111101001001111101111110 111011001010111110100010100001 000100010100000000010001010000
For each unique pairs of strings, I want to count the number of 00, 01, 10, and 11 as you move each character for the pair. (In the example above, for the first two strings, there are 15 of '00', 0 of '01', 5 of '10', and 10 of '11'.)

Since I want to look at all the pairs amongst 1000s of these strings, speed is of the essence. I am currently doing the following, but appreciate any suggestions on alternative strategies which might be faster.

In words:
I put the strings into an array of arrays, where the sub-arrays are composed of the strings split into single characters. Then, I iterate over every pair of elements in the array, taking the pair of sub-arrays, and iterating over those to count the 00, 01, 10, and 11.

In code:

my @strings = qw/111011001010011110100010100001 111010000010010110000010000001 000101011100001000110101110000 000101111101001001111101111110 111011001010111110100010100001 000100010100000000010001010000/; foreach my $string (@strings) { my @items = split //, $string; $string = \@items; } for ( my $i = 0 ; $i < @strings ; $i++ ) { for ( my $j = $i + 1 ; $j < @strings ; $j++ ) { my ( $c00, $c01, $c10, $c11 ) = ( 0, 0, 0, 0 ); for ( my $k = 0 ; $k < @{ $strings[$i] } ; $k++ ) { $c00++ if ${$strings[$i]}[$k] == 0 && ${$strings[$j]}[$k] == 0; $c01++ if ${$strings[$i]}[$k] == 0 && ${$strings[$j]}[$k] == 1; $c10++ if ${$strings[$i]}[$k] == 1 && ${$strings[$j]}[$k] == 0; $c11++ if ${$strings[$i]}[$k] == 1 && ${$strings[$j]}[$k] == 1; } print join( "\t", $i, $j, $c00, $c01, $c10, $c11 ), "\n"; } }
Since I have many 1000s of these strings to analyze in unique pairs, speed is of importance. (For reference, my real world strings are between 120 and 180 characters in length.) Therefore, does any wise monk have a suggestion on ways I might speed this up. Or, can someone reassure me that I can't do much better than this.

Thanks wise monks,
-albert

Replies are listed 'Best First'.
Re: Speeding permutation counting
by blokhead (Monsignor) on Jul 18, 2007 at 14:26 UTC
    I think the previous replies have misunderstood your question. If I understand correctly, you want to take a pair of strings, put them on top of each other, and read *down* the columns:
    string A: 0 1 0 1 string B: 1 1 0 1 | | | | | | | +--> 11 | | +----> 00 | +------> 11 +--------> 01
    .. i.e, not just look at all the pairs of characters in an individual string..

    Splitting each string into an array seems wasteful, as arrays are much more memory-heavy than strings. You can use substr to index into any position in the string to get individual characters.

    Also, instead of 4 different cases of logic involving 4 similarly-named variables $c00 through $c11, you can use a hash to really simplify the code:

    my @data = qw[ 111011001010011110100010100001 111010000010010110000010000001 000101011100001000110101110000 000101111101001001111101111110 111011001010111110100010100001 000100010100000000010001010000 ]; use Data::Dumper; for my $i (0 .. $#data) { for my $j ($i+1 .. $#data) { my %counts; $counts{ substr($data[$i],$_,1) . substr($data[$j],$_,1) }++ for 0 .. length($data[$i]) - 1; print Dumper \%counts; } }
    Another cute solution is to use chop to trim off characters at a time from both strings. It generates these character-pairs in reverse order, but since you only care about the final count, it's ok. Since it modifies the string, you have to do it on a copy.
    for my $i (0 .. $#data) { for my $j ($i+1 .. $#data) { my %counts; my ($str1, $str2) = @data[$i,$j]; $counts{ chop($str1) . chop($str2) }++ while length($str1) and length($str2); print Dumper \%counts; } }

    blokhead

      Yes, this is the correct interpretation of my question, and thanks for your suggestions. I benchmarked your two suggestions relative to my original, and things are definitely improved:
      Rate orig substring blokhead blokchop orig 1199/s -- -45% -62% -69% substring 2162/s 80% -- -31% -44% blokhead 3144/s 162% 45% -- -18% blokchop 3840/s 220% 78% 22% --
      "Substring" is closer to my original (too verbose) style, so some of the gains are also coming from the more succinct, easier to read code. Best so far is the "chop" based suggestion.

      Still wondering if there might be other improvements, or some vastly different way to solve this.

      Thanks for these suggestions, they really have helped.
      -a

        Do you have the benchmark code? I consider benchmark results useless without the accompanying benchmark code.

        It's also stopping us from comparing our own solutions without redoing all your work.

        If you're worried about the space it takes, use <readmore> or <spoiler> tags.

Re: Speeding permutation counting
by BrowserUk (Pope) on Jul 18, 2007 at 15:04 UTC

    Instead of breaking your strings up and then having to iterate the characters, you can use bitwise-string operations to process the characters in parallel and then use tr/// to count the results. This results in a 10x speed up over your original:

    #! perl -slw use strict; use Benchmark::Timer; my $T = new Benchmark::Timer; use Math::Random::MT qw[ rand srand ]; our $S ||= 1; our $B ||= 32; our $N ||= 1000; srand( 1 ); my @strings = map { unpack 'b'. $B, rand( 2**32 ) } 1 .. $N; my $label = "$N strings of $B bits (srand:$S)"; $T->start( $label ); for my $i ( 0 .. $#strings ) { for my $j ( $i+1 .. $#strings ) { print join "\t", $i, $j, ( $strings[ $i ] | $strings[ $j ] ) =~ tr[0][0], ## 0 +0 ( ~$strings[ $i ] & $strings[ $j ] ) =~ tr[\1][\1], ## 0 +1 ( $strings[ $i ] & ~$strings[ $j ] ) =~ tr[\1][\1], ## 1 +0 ( $strings[ $i ] & $strings[ $j ] ) =~ tr[1][1]; ## 1 +1 } } $T->stop( $label ); $T->report; __END__ ## Original foreach my $string (@strings) { my @items = split //, $string; $string = \@items; } for ( my $i = 0 ; $i < @strings ; $i++ ) { for ( my $j = $i + 1 ; $j < @strings ; $j++ ) { my ( $c00, $c01, $c10, $c11 ) = ( 0, 0, 0, 0 ); for ( my $k = 0 ; $k < @{ $strings[$i] } ; $k++ ) { $c00++ if ${$strings[$i]}[$k] == 0 && ${$strings[$j]}[$k] == 0; $c01++ if ${$strings[$i]}[$k] == 0 && ${$strings[$j]}[$k] == 1; $c10++ if ${$strings[$i]}[$k] == 1 && ${$strings[$j]}[$k] == 0; $c11++ if ${$strings[$i]}[$k] == 1 && ${$strings[$j]}[$k] == 1; } print join( "\t", $i, $j, $c00, $c01, $c10, $c11 ), "\n"; } } c:\test>627253 >nul 1 trial of 1000 strings of 32 bits (srand:1) (40.336s total) c:\test>627253 >nul 1 trial of 1000 strings of 32 bits (srand:1) (40.477s total) ## Bitwise + tr/// c:\test>627253 >nul 1 trial of 1000 strings of 32 bits (srand:1) (4.711s total) c:\test>627253 >nul 1 trial of 1000 strings of 32 bits (srand:1) (4.696s total)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      This is a significant improvement. I knew there would be something to do with bitwise operators, but I was not versed enough to come up with such a solution.

      This is giving the following benchmarks:

      Rate orig blokchop browserUK orig 1178/s -- -69% -94% blokchop 3829/s 225% -- -81% browserUK 20097/s 1606% 425% --
      In my testing, this is improving about 17x on my original, and about 5x on blokhead's chop based solution.

      Thanks for the excellent suggestion,
      -a

      BrowserUk,
      Once I understood the problem, I came up with a solution very similar to yours. When I logged in this morning, I noticed that you had already came up with the idea. Why does work have to get in the way of fun?

      I doubt there would be a speed improvement but I am pretty sure you only have to count '01' since '10' is just a reflection. Additionally, you should only need to calculate ('11' and '00') or ('11' and '10') or ('00' and '10'). The remaining two can be obtained with math.

      Cheers - L~R

        In my problem '01' is different than '10', and I need to count them separately. Still, your point of not needing to count the last is well taken, as I can count 3, and thereby know the 4th. However, by doing this, I'm not seeing any significant performance gains, and it may be a bit slower.
        Rate browser+LR browserUK browser+LR 19055/s -- -2% browserUK 19509/s 2% --
        Code for benchmarking...
Re: Speeding permutation counting (/../g)
by tye (Sage) on Jul 18, 2007 at 14:19 UTC
    $count{$_}++ for $string =~ /../g;

    Update: With the clarifications of the problem statement:

    while( @strings ) { my( $x, $y )= splice @strings, 0, 2; my %count; while( $x =~ /(.)/g ) { my $x1= $1; $y =~ /(.)/g or die "..."; $count{ $x1.$1 }++; # Or replace all of the above with: #$count{ "$1" . ( $y =~ /(.)/g ? "$1" : '' ) }++; } print join " ", %count, $/; }

    with the (simple) cleaning up of the output left as an excercise.

    - tye        

Re: Speeding permutation counting
by dsheroh (Monsignor) on Jul 18, 2007 at 15:22 UTC
    Other monks have already provided excellent optimizations for this particular case. I'd like to add a more modest optimization which applies to your original code and is more generally useful:

    Don't make unnecessary comparisons.

    In the original implementation, you're making eight comparisons on each pair of characters when you only need two.

    if (${$strings[$i]}[$k]) { if (${$strings[$j]}[$k]) { $c11++; } else { $c10++; } } else { if (${$strings[$j]}[$k]) { $c01++; } else { $c00++; } }

      I thougth a hash might be faster, but no.

      Rate hash if hash 9329/s -- -11% if 10473/s 12% -- Rate hash if hash 9436/s -- -11% if 10599/s 12% -- Rate hash if hash 9341/s -- -12% if 10662/s 14% --

      Benchmark code:

      A hash is definitely cleaner, though.

        Cool... Guess I should've benchmarked my suggestion instead of just assuming it would be slower than the hash-based ideas.

        Actually, unless I've severely mis-benchmarked it, it appears that my modest optimization (combined with blokhead's suggestion of using chop) even outdoes BrowserUK's bitwise version and still more so with fewer lines (based on BrowserUK's test code, I'm comparing every pair of lines):

        100 lines/100 iterations: Rate bitwise ifchop bitwise 33.1/s -- -41% ifchop 56.5/s 71% -- 10 lines/10000 iterations: Rate bitwise ifchop bitwise 3226/s -- -57% ifchop 7463/s 131% -- 1000 lines/5 iterations: s/iter bitwise ifchop bitwise 3.06 -- -23% ifchop 2.37 29% --
        (I actually suspect I'm doing something wrong in my test because I don't see why the number of lines would affect the relative performance at all. Update: I was doing something wrong. I forgot to pull the strings out of the array and was just comparing indexes in my version. With my version fixed to actually do the comparisons, the bitwise version is much faster, as I had initially expected:
        Rate ifchop bitwise ifchop 4.22/s -- -87% bitwise 33.3/s 690% --
        )

        Benchmark code:

Re: Speeding permutation counting
by ikegami (Pope) on Jul 18, 2007 at 15:56 UTC
    use strict; use warnings; my @strings = qw/ 111011001010011110100010100001 111010000010010110000010000001 000101011100001000110101110000 000101111101001001111101111110 111011001010111110100010100001 000100010100000000010001010000 /; s/(?<=.)/\0/g for my @strings1 = @strings; # "abc" => "a\0b\0c\0" s/(?=.)/\0/g for my @strings2 = @strings; # "abc" => "\0a\0b\0c" for my $i (0..$#strings) { for my $j ($i+1..$#strings) { my %c; ++$c{$_} for ($strings1[$i] | $strings2[$j]) =~ /../g; print(join("\t", $i, $j, $c{'00'}||0, $c{'01'}||0, $c{'10'}||0, +$c{'11'}||0), "\n"); } }

      Nevermind. It was an interesting idea, but it sucks.

      Rate ikegami albert browseruk ikegami 5.76/s -- -26% -89% albert 7.84/s 36% -- -86% browseruk 54.8/s 851% 599% --

      Both ikegami and albert's version will be faster per string when processing 1000 strings instead of 100 (since they both have a big overhead), but not enough to even approach BrowserUk's version.

      Benchmark code:

Re: Speeding permutation counting
by leocharre (Priest) on Jul 18, 2007 at 14:36 UTC
    Is this what you want, as tye pointed out,
    #!/usr/bin/perl -w use strict; use Smart::Comments '###'; my $report ={}; my @strings = qw( 0101 0001 ); for (@strings){ my $string = $_; $report->{$_}++ for $string=~/../g; } ### $report ### should be ### 01 3 ### 00 1
    Or should the result be 01=3x, 10=1x, 00=2x ?
Re: Speeding permutation counting
by moritz (Cardinal) on Jul 18, 2007 at 14:21 UTC
    I haven't tried it, but I'd suggest the following for each string:

    Use unpack to split your string into pairs, and then unpack again, this time discarding the first character (otherwise you would get only pairs where the first index is even).

    Then use either your counting technique, or a hash.

    And benchmark it!

    Update: This posting is irrelevant since I interpreted the original question wrongly.

Re: Speeding permutation counting
by sgt (Deacon) on Jul 18, 2007 at 14:58 UTC

    Have you tried representing your strings as bit-strings. Maybe vec() and related are not the fastest, but Bit::Vector::Minimal or Bit::Vector could be useful.

    You talk about unique pairs. If you don't have duplicated strings (which can be avoided via implicit normalization in your choice of representation or a cure such as 'sort -u') then any two strings make a unique pair. You have n(n+1)/2 pairs or n(n-1)/2 if you don't want the "diagonal" i.e pairs like (s, s).

    cheers --stephan
Re: Speeding permutation counting
by NiJo (Friar) on Jul 19, 2007 at 21:16 UTC
    Most solutions as of yet are working on a char by char basis.

    Treating them as numbers seems to be more natural. Using binary does not handle the 1+1 carry over. Switching to decimal provides the required carry over space:

    Basically you do this:

    1 * 1010 + 2 * 1100 = ---------- 3210

    I'll leave it to others to count 0s 1s 2s and 3s, put some loops around and to benchmark it. This drastically reduces the required operations. I expect it to be some orders of magnitude faster. Math::BigInt also provides a C speedup.

Re: Speeding permutation counting
by Limbic~Region (Chancellor) on Jul 19, 2007 at 15:49 UTC
    albert,
    I found this a neat an interesting problem but I haven't had a chance to try out my idea. I do have a couple of questions first:
    • Do you need the count breakout per position or just total over all strings?
    • With 010 and 101, should the count of "01" and "10" be 2 per position for a total of 6? In other words, do each strings need to be compared twice taking turns being on top?

    I don't think any answer to the questions would invalidate my idea but it might change the implementation a bit. Since I do not have time to try it myself, I am laying it out here.

    I believe you should be able to mathematically determine the counts without doing the comparisons. I believe all you need to do is count how many 0s and 1s there are for each position. To further speed up the process, I might even consider doing the counting using Inline::C.

    Cheers - L~R

        BrowserUk,
        What I am suggesting is that if you know the number of 1s and 0s in all strings in position 1, you should be able to calculate the number of '11', '00', '10', and '01' at position 1 without comparing any of the strings. This would still have to be done for all strings at all positions - I am just trying to eliminate the comparisons.

        Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://627253]
Approved by Corion
Front-paged by fenLisesi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (7)
As of 2021-06-22 17:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)












    Results (108 votes). Check out past polls.

    Notices?