http://www.perlmonks.org?node_id=955712

jsmagnuson has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have 2 arrays. One (@perms) has all possible permutations of a set of symbols of a specified length; this is often a very large array. The second array (@observed) is a list of observed ngrams from a string.

What I want to do is create a 3rd array (@permCounts), which has the same length as @perms. I want to initialize this as all zeroes, and then each time I encounter an item from @perms in @observed, I want to increment the corresponding element in @permCounts. Here's how I'm doing it now, but it is very slow when the ngram order is 4 or greater (it begins to take 2+ seconds per "observed" set, and I have hundreds of thousands to evaluate). If anyone can advise me on how to speed this up, I would be most grateful!

use PDL;
@observed   = ("ab", "ab", "ad", "an", "bd", "bn", "dn");
$ngramOrder = 2;
$alph       = "{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}";
@perms      = glob $alph x $ngramOrder;
foreach $obs (@observed) {
    push @hits, grep { $perms[$_] eq $obs } 0..$#perms;
}
@permCounts = list(zeroes($#perms+1));
foreach $hit (@hits) {
    $permCounts[$hit]++;
}

Replies are listed 'Best First'.
Re: counting instances of one array in another array
by JavaFan (Canon) on Feb 23, 2012 at 11:53 UTC
    Why not use a hash?
    my @observed = ("ab", "ab", "ad", "an", "bd", "bn", "dn");
    my %permCounts;
    $permCounts{$_}++ for @observed;
    The reason your solution is so slow: the number of permutations is exponential in the ngram order. You'll be spending the majority of your time initializing the @permCounts array, and most of its elements will remain 0.

      Thank you very much for your reply. I get the general idea, but my problem is that I need to end up with an array of size @perms that has counts of each item from @perms that occur in @observed. The reason is that what I really will do is generate that coding for each of many words, generate the array giving the counts of observed patterns from the entire set of possible patterns, and then calculate similarity of those resulting vectors as an index of word similarity.

      Your solution provides a hash with counts for each item in @observed, but then I need to associate those counts with the position of the corresponding element in @perms in @permCounts (or %permCounts!).

      So if:

      @observed = ("ab", "ab", "ad", "an", "bd", "bn", "dn");

      and:

      @perms = ("aa", "ab", "ac", "ad", "ae", "af", "ag", "an", "ba", "bd", "bh", "bn", "dn");
      # @perms would never be something like this,
      # but just to give an example

      Then the result I want is that @permCounts would be:

      (0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1)

      Thanks very much,

      jim
        Well, good luck then. Be aware that there are 456976 possible combinations with words of length 4 (using the 26 letters of the English alphabet), and 308915776 words of length 6. The latter will take about 5.75Gb to store. If you want ngrams up to 8 characters, look around for a box with 3.7Tb of memory. And if you need to go to ngrams of length 12, you'd be looking at more than 1.5 Eb.
        jsmagnuson, any chance that Math::Fleximal will allow you to treat your range of possible combinations as a number sequence and do math on the key to find the position? I like the hash-key solution, and treating the keys as fleximals may solve your position question without building a full array of possibilities.
        my %seen;
        $seen{$_}++ for @observed;
        @permCounts = map { $seen{$_} || 0 } @perms;
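Applied to the small example given earlier in this sub-thread, the two-pass hash approach might look like the sketch below. It is a minimal, self-contained illustration: the data are the toy @observed and @perms lists quoted above, not real input, and `//` (defined-or) assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my @observed = qw(ab ab ad an bd bn dn);
my @perms    = qw(aa ab ac ad ae af ag an ba bd bh bn dn);

# Pass 1: count each observed ngram; cost grows with @observed only.
my %seen;
$seen{$_}++ for @observed;

# Pass 2: walk @perms once, in order; ngrams never observed get 0.
my @permCounts = map { $seen{$_} // 0 } @perms;

print "(", join(", ", @permCounts), ")\n";
```

With these inputs it prints (0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1), the result asked for above. The original code scanned all of @perms once per observation; this version touches each list exactly once.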
Re: counting instances of one array in another array
by BrowserUk (Patriarch) on Feb 23, 2012 at 13:03 UTC

    Build a hash that relates the permutation to its position, and use that to decide which count to increment:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use Algorithm::Combinatorics qw[ variations_with_repetition ];
    our $N //= 4;
    my @perms = map join( '', @$_ ), variations_with_repetition( [ 1 .. $N ], $N );
    my %lookup;
    $lookup{ $perms[ $_ ] } = $_ for 0 .. $#perms;
    my @counts = (0) x @perms;
    for( 1 .. 1e3 ) {
        my $observed = join '', map 1+int( rand $N ), 1 .. $N; ## random observation
        ++$counts[ $lookup{ $observed } ];
    }
    printf "%${N}s : %${N}d\n", $perms[ $_ ], $counts[ $_ ] for 0 .. $#perms;

    A run:

    C:\test>junk -N=3
    111 : 41
    112 : 37
    113 : 35
    121 : 27
    122 : 32
    123 : 31
    131 : 43
    132 : 34
    133 : 45
    211 : 43
    212 : 35
    213 : 40
    221 : 47
    222 : 38
    223 : 32
    231 : 46
    232 : 44
    233 : 30
    311 : 33
    312 : 37
    313 : 30
    321 : 30
    322 : 49
    323 : 35
    331 : 42
    332 : 32
    333 : 32


Re: counting instances of one array in another array
by MidLifeXis (Monsignor) on Feb 23, 2012 at 13:59 UTC

    It could be beneficial to place Time::HiRes timestamps before the glob, the first foreach, the @permCounts assignment, the second foreach, run the program with different-sized datasets, and see where your program slows down the most.
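A rough sketch of that kind of instrumentation, using the original algorithm on the sample data. The `timed()` helper and the `%elapsed` hash are names invented for this illustration; only `gettimeofday` and `tv_interval` come from Time::HiRes.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $ngramOrder = 2;
my $alph       = '{' . join(',', 'a' .. 'z') . '}';
my @observed   = qw(ab ab ad an bd bn dn);

# Hypothetical helper: run a code ref and record its wall-clock time.
my %elapsed;
sub timed {
    my ($label, $code) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    $elapsed{$label} = tv_interval($t0);    # seconds, as a float
    return @result;
}

# The four stages of the original program, each timed separately.
my @perms = timed('glob', sub { glob($alph x $ngramOrder) });
my @hits  = timed('scan', sub {
    map { my $obs = $_; grep { $perms[$_] eq $obs } 0 .. $#perms } @observed;
});
my @permCounts = timed('init', sub { (0) x @perms });
timed('count', sub { $permCounts[$_]++ for @hits; () });

printf "%-6s %.6f s\n", $_, $elapsed{$_} for qw(glob scan init count);
```

Raising $ngramOrder to 3 and then 4 and rerunning should show which stage grows fastest; the actual timings depend on the machine, so none are quoted here.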

    But outside of any actual profiling data, I do see a candidate for optimization. How about treating each of your @observed strings as a base-(sizeof $alph) "number", and generating your @hits array from that? You are currently scanning the @perms list scalar(@observed) times, an O(N*m) algorithm (see footnote 1), where N is the size of the @perms list and m is the size of @observed. If you can replace the scan (the first foreach loop) with a function that converts the $obs value to an index directly, each lookup becomes O(1) (see footnote 1), or O(m) overall, which, given the right conditions, can be faster than scanning the @perms list every time.

    For example, your @observed values are (in the example) base-26, two-digit numbers. A basic algorithm for conversion would be something like:

    use Test::More q(no_plan);
    # ngram2number
    #
    # Convert an ngram into a number given a
    # hashref containing the alphabet conversion,
    # and the ngram to convert.
    #
    # An area for improvement would be to cache the
    # $base ** ( $position++ ) results.
    #
    # Untested (ok, now it is), no warranty, blah blah blah
    #
    # Update: error in code - added reverse to correct
    # Update: Multiply current digit, not add; scalar keys %alphabet
    # Update: Added testing commands
    #
    sub ngram2number {
        my ( $alphabet, $ngram ) = @_;
        my $results = 0;
        my $position = 0;
        my $base = scalar( keys %$alphabet );
        for my $c ( reverse split( //, $ngram ) ) {
            $results += ( $base ** ( $position++ ) ) * $alphabet->{ $c };
        }
        $results;
    }
    $hex = { map { ( $_ => hex( $_ ) ) } ( '0'..'9', 'a'..'f' ) };
    is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
        for ( '0'..'9', 'a'..'f' );
    is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
        for ( glob '{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f}'x2 );
    Depending on the size of the alphabet and the size of the ngram, number may need to be a bignum or float.

    One other area (as mentioned above) for optimization could be the assignment of the zeros to the @permCounts array, and even the use of the @perms array. Depending on the size of the alphabet and the order of the ngram, it could grow large enough to start swapping (or even exhausting) memory resources, which will be a performance killer. Since the @perms (and thus the @permCounts) array grows exponentially in the form sizeof( alphabet ) ** $ngramOrder, your memory use grows very fast when using your presented method of calculating the answer.
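One way to sidestep both arrays is to compute each ngram's position arithmetically and store only the positions that actually occur. The sketch below is an illustration under stated assumptions, not code from this thread: `ngram_index()` and `%sparseCounts` are invented names, the alphabet is assumed to be lowercase a-z, and the leftmost character is treated as most significant so that the index matches the order glob produces.

```perl
use strict;
use warnings;

my @alphabet = ('a' .. 'z');
my %digit;                                  # letter => 0 .. 25
@digit{@alphabet} = 0 .. $#alphabet;

# Positional value of an ngram in base scalar(@alphabet),
# accumulated left to right (Horner's scheme).
sub ngram_index {
    my ($ngram) = @_;
    my $index = 0;
    $index = $index * @alphabet + $digit{$_} for split //, $ngram;
    return $index;
}

my @observed = qw(ab ab ad an bd bn dn);

# Only indices that occur are stored; every other position is implicitly 0.
my %sparseCounts;
$sparseCounts{ ngram_index($_) }++ for @observed;

printf "%d => %d\n", $_, $sparseCounts{$_}
    for sort { $a <=> $b } keys %sparseCounts;
```

Memory use here scales with the number of distinct observed ngrams rather than with sizeof( alphabet ) ** $ngramOrder.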

    Footnotes:

    • 1 - I think that I have this correct, but I would hope I would get checked on this. I ignored the effect of iterating across each character of the ngram for the string comparison and the cost of iterating across each character in the ngram2number function as effectively cancelling each other out, within a constant multiplier.

    --MidLifeXis

Re: counting instances of one array in another array
by LanX (Saint) on Feb 23, 2012 at 12:18 UTC
    Sorry, but searching thru an array "of all permutations of length n" doesn't make sense. If the tested string is shorter than n it will be a member by definition (modulo side criteria like repetition). If it's just about counting, use a hash as shown and a logical test.

    Cheers Rolf

      Yep, I know that the patterns in @observed must be members of @perms. I want to get back the list of positions where the items in @observed occur in the ordered list of permutations in @perms. I am certain there is a better way to do this, and I've done a lot of searching on this site and others looking for better solutions, but I haven't found one yet. That's why I'm bringing this to the monks.

      Best wishes,

      jim

        Basically you want to give each pattern a specific id, right?

        So why don't you use a second hash that gives you the position in @perms for each pattern? Since @perms is fixed, this hash is generated once and is then available for the rest of the script's runtime.

        It's hard to believe that you really need the position in the array after switching to hashes.

        But well, let's continue down this road...

        As long as the order isn't random but systematic you should be able to calculate this position.

        So plz search the internet or math books for algorithms generating all permutations by combination of two permutations named sigma and tau and reverse the process to get the index.

        This approach would limit time and space complexity considerably!

        Cheers Rolf

Re: counting instances of one array in another array
by sundialsvc4 (Abbot) on Feb 23, 2012 at 16:46 UTC

    Good grief, people ... you don't even need a hash. You almost don't need a program.

    SELECT PERM, COUNT(*) FROM PERMUTATIONS GROUP BY PERM;

    If you need to know the position of a permutation then you can calculate it. The position of ABC = (1,2,3) = 1 + 2*26^2 + 3*26^3, good enuf.

    If you then need to generate that huge matrix then ... generate it, as an output, using a nested loop, filling in the minuscule fraction of those entries that are not zero.

    You never have to store the entire range of possibilities; you need to store or count what you've actually got.
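A sketch of that idea, under assumptions of my own: the dense vector is written out one element at a time from a hash of observed counts, so nothing of size 26 ** $ngramOrder is held in memory. `index_to_ngram()` is a hypothetical helper, the alphabet is assumed to be a-z, and positions follow the same order that glob would produce.

```perl
use strict;
use warnings;

my @alphabet   = ('a' .. 'z');
my $ngramOrder = 2;

# Store only what was actually observed.
my %seen;
$seen{$_}++ for qw(ab ab ad an bd bn dn);

# Hypothetical helper: turn a position in the ordered list of
# permutations back into the ngram found at that position.
sub index_to_ngram {
    my ($index, $order) = @_;
    my $ngram = '';
    for (1 .. $order) {
        $ngram = $alphabet[ $index % @alphabet ] . $ngram;
        $index = int( $index / @alphabet );
    }
    return $ngram;
}

# Emit the dense vector element by element; zeros are filled in on the fly.
my $size    = @alphabet ** $ngramOrder;
my $nonzero = 0;
for my $i (0 .. $size - 1) {
    my $count = $seen{ index_to_ngram($i, $ngramOrder) } // 0;
    $nonzero++ if $count;
    print $count, ($i == $size - 1 ? "\n" : " ");
}
print STDERR "$nonzero of $size entries are non-zero\n";
```

The loop still visits every position, so run time grows with the size of the full vector; only the memory footprint stays small.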

    In the following post, and within ten minutes flat, BrowserUK will now proceed to demonstrate why this reasoning is totally and utterly wrong and why my brain must be disconnected for saying it ...

      > If you need to know the position of a permutation then you can calculate it. The position of ABC = (1,2,3) = 1 + 2*26^2 + 3*26^3, good enuf.

      well this doesn't efficiently encode permutations, but could be sufficient for the OP.

      (until he understands hashes)

      Cheers Rolf

      Thank you for this approach! This is about 10x faster than the next best approach that was suggested here that I tried implementing (which was this code from an AnonymousMonk):

      my %seen; $seen{$_}++ for @observed; @permCounts = map { $seen{$_} || 0 } @perms;

      This is good enough to get me going, but I will follow the other advice that was offered and learn about fleximal, PDL::Sparse, and hash-fu. I am very grateful for the generous advice from all of you!

      best,

      jim

      ps -- Slight tweak: abc=(0,1,2), and then 0+(1*26^1)+(2*26^2), etc. (i.e., a=0, z=25, and when string position > 0 you multiply the index of the character in that position by 26 raised to the power of the position).
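For illustration, here is a small sketch of the formula in the ps next to a variant with the digit order reversed. Both function names are invented for this example, and a lowercase a-z alphabet is assumed. The formula as written makes the first character the least significant digit; reversing the characters first makes the last character least significant, which is the order in which glob lists the permutations.

```perl
use strict;
use warnings;

my %digit;
@digit{ 'a' .. 'z' } = 0 .. 25;

# As in the ps: the character at string position p contributes
# digit * 26**p, so the FIRST character is least significant.
sub index_first_char_low {
    my ($ngram) = @_;
    my @chars = split //, $ngram;
    my $index = 0;
    $index += $digit{ $chars[$_] } * 26 ** $_ for 0 .. $#chars;
    return $index;
}

# Reversed variant: the LAST character is least significant,
# which matches the position of the ngram in glob order.
sub index_last_char_low {
    my ($ngram) = @_;
    my @chars = reverse split //, $ngram;
    my $index = 0;
    $index += $digit{ $chars[$_] } * 26 ** $_ for 0 .. $#chars;
    return $index;
}

printf "%s: %d vs %d\n", $_, index_first_char_low($_), index_last_char_low($_)
    for qw(abc ab ba);
```

This prints 1378 vs 28 for abc, 26 vs 1 for ab, and 1 vs 26 for ba. Either scheme gives every ngram a unique position, so both should work for comparing count vectors as long as one is used consistently; the difference only matters if the index must agree with positions in @perms.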

        Re: your ps - see my ngram2number routine above. It does exactly that calculation.

        Update: after re-reading my original message, I note that you also need to translate a list of characters into an alphabet to use the function (just in case it was not understood from the original code). You can do that with:

        sub charlist2alphabet {
            my %results;
            for my $index ( 0 .. (@_ - 1) ) {
                $results{ $_[ $index ] } = $index;
            }
            \%results;
        }
        my $alphabet = charlist2alphabet( @charlist );
        or
        my $alphabet = { map { ( $charlist[ $_ ] => $_ ) } ( 0 .. @charlist - 1 ) };

        --MidLifeXis

Re: counting instances of one array in another array
by MidLifeXis (Monsignor) on Feb 23, 2012 at 15:11 UTC

    I also wonder if PDL::Sparse would be helpful in this case.

    --MidLifeXis

      I couldn't find any trace of that module. Possibly helpful would be PDL::Ngrams and/or PDL::CCS (CCS = Collapsed Column Storage).