counting instances of one array in another array

jsmagnuson has asked for the wisdom of the Perl Monks concerning the following question:

Re: counting instances of one array in another array by JavaFan (Canon) on Feb 23, 2012 at 11:53 UTC
Why not use a hash? `my @observed = ("ab", "ab", "ad", "an", "bd", "bn", "dn"); my %permCounts; $permCounts{$_}++ for @observed;` As for why your solution is so slow: the number of permutations is exponential. You'll be spending the majority of your time initializing the `@permCounts` array, and most of its elements will remain 0.
Re^2: counting instances of one array in another array by jsmagnuson (Acolyte) on Feb 23, 2012 at 12:24 UTC
Thank you very much for your reply. I get the general idea, but my problem is that I need to end up with an array of size `@perms` that has counts of each item from `@perms` that occur in `@observed`. The reason is that what I really will do is generate that coding for each of many words, generate the array giving the counts of observed patterns from the entire set of possible patterns, and then calculate similarity of those resulting vectors as an index of word similarity. Your solution provides a hash with counts for each item in `@observed`, but then I need to associate those counts with the position of the corresponding element in `@perms` in `@permCounts` (or `%permCounts`!). So if: `@observed = ("ab", "ab", "ad", "an", "bd", "bn", "dn");` and: `@perms = ("aa", "ab", "ac", "ad", "ae", "af", "ag", "an", "ba", "bd", "bh", "bn", "dn");` (`@perms` would never be something like this, but just to give an example) then the result I want is that `@permCounts` would be: `(0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1)` Thanks very much, jim
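For what it's worth, the similarity step described here may not need dense arrays at all. The post does not say which similarity measure is intended, so the following is only a sketch assuming cosine similarity; `cosine_similarity` and the two sample count hashes are hypothetical names and data made up for illustration. It works directly on the sparse count hashes, treating every pattern that is missing from a hash as a zero.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): cosine similarity between
# two sparse count hashes. Patterns absent from a hash count as 0.
sub cosine_similarity {
    my ( $u, $v ) = @_;
    my ( $dot, $norm_u, $norm_v ) = ( 0, 0, 0 );

    $norm_u += $_**2 for values %$u;
    $norm_v += $_**2 for values %$v;

    # Only patterns present in both hashes contribute to the dot product.
    for my $key ( keys %$u ) {
        $dot += $u->{$key} * $v->{$key} if exists $v->{$key};
    }

    return 0 unless $norm_u && $norm_v;    # avoid dividing by zero
    return $dot / sqrt( $norm_u * $norm_v );
}

# Made-up sample data: pattern counts for two words.
my %word1 = ( ab => 2, ad => 1, an => 1, bd => 1, bn => 1, dn => 1 );
my %word2 = ( ab => 1, an => 1, bn => 1, no => 3 );

printf "similarity: %.3f\n", cosine_similarity( \%word1, \%word2 );
```

Because zero entries contribute nothing to either the dot product or the norms, the result is the same as it would be for the full-length vectors.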
Re^3: counting instances of one array in another array by JavaFan (Canon) on Feb 23, 2012 at 13:22 UTC
Well, good luck then. Be aware that there are 456976 possible combinations with words of length 4 (using the 26 letters of the English alphabet), and 308915776 words of length 6. The latter will take about 5.75 GB to store. If you want ngrams up to 8 characters, look around for a box with 3.7 TB of memory. And if you need to go to ngrams of length 12, you'd be looking at more than 1.5 EB.
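The growth described above is simply 26 raised to the ngram length. A minimal sketch to reproduce the counts (the helper name `ngram_space` is made up for illustration; the memory needed per element depends on the Perl build, so it is deliberately not estimated here):

```perl
use strict;
use warnings;

# Hypothetical helper: number of distinct ngrams of length $n
# over an alphabet of $alphabet_size symbols.
sub ngram_space {
    my ( $alphabet_size, $n ) = @_;
    return $alphabet_size**$n;
}

printf "n=%2d : %.0f possible ngrams\n", $_, ngram_space( 26, $_ )
    for 2, 4, 6, 8, 12;
```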
Re^3: counting instances of one array in another array by jandrew (Chaplain) on Feb 23, 2012 at 17:07 UTC
jsmagnuson, any chance that Math::Fleximal will allow you to treat your range of possible combinations as a number sequence and do math on the key to find the position? I like the hash-key solution, and treating the keys as fleximals may solve your position question without building a full array of possibilities.
Re^4: counting instances of one array in another array by jandrew (Chaplain) on Feb 23, 2012 at 17:52 UTC
Re^3: counting instances of one array in another array by Anonymous Monk on Feb 23, 2012 at 14:42 UTC
`my %seen; $seen{$_}++ for @observed; @permCounts = map { $seen{$_} || 0 } @perms;`
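As a quick check, here is that one-liner run against the example data from earlier in the thread. Only the `use` lines, the `my` declaration on `@permCounts`, and the final `print` are additions.

```perl
use strict;
use warnings;

# Example data from the question above.
my @observed = ( "ab", "ab", "ad", "an", "bd", "bn", "dn" );
my @perms    = ( "aa", "ab", "ac", "ad", "ae", "af", "ag",
                 "an", "ba", "bd", "bh", "bn", "dn" );

# Count each observed pattern once ...
my %seen;
$seen{$_}++ for @observed;

# ... then read the counts back out in @perms order, defaulting to 0.
my @permCounts = map { $seen{$_} || 0 } @perms;

print join( ", ", @permCounts ), "\n";
# prints: 0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1
```

This makes one pass over `@observed` and one over `@perms`, rather than scanning `@perms` once per observation.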
Re^4: counting instances of one array in another array by JavaFan (Canon) on Feb 23, 2012 at 14:50 UTC
Re: counting instances of one array in another array by BrowserUk (Patriarch) on Feb 23, 2012 at 13:03 UTC
Build a hash that relates the permutation to its position, and use that to decide which count to increment:

```perl
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Algorithm::Combinatorics qw[ variations_with_repetition ];

our $N //= 4;

my @perms = map join( '', @$_ ), variations_with_repetition( [ 1 .. $N ], $N );

my %lookup;
$lookup{ $perms[ $_ ] } = $_ for 0 .. $#perms;

my @counts = (0) x @perms;

for( 1 .. 1e3 ) {
    my $observed = join '', map 1+int( rand $N ), 1 .. $N; ## random observation
    ++$counts[ $lookup{ $observed } ];
}

printf "%${N}s : %${N}d\n", $perms[ $_ ], $counts[ $_ ] for 0 .. $#perms;
```

A run:

```
C:\test>junk -N=3
111 :  41
112 :  37
113 :  35
121 :  27
122 :  32
123 :  31
131 :  43
132 :  34
133 :  45
211 :  43
212 :  35
213 :  40
221 :  47
222 :  38
223 :  32
231 :  46
232 :  44
233 :  30
311 :  33
312 :  37
313 :  30
321 :  30
322 :  49
323 :  35
331 :  42
332 :  32
333 :  32
```

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?
Re: counting instances of one array in another array by MidLifeXis (Monsignor) on Feb 23, 2012 at 13:59 UTC
It could be beneficial to place Time::HiRes timestamps before the `glob`, the first `foreach`, the @permCounts assignment, and the second `foreach`, then run the program with different-sized datasets and see where your program slows down the most. But outside of any actual profiling data, I do see a candidate for optimization. How about treating your @observed strings as base-(sizeof $alph) "numbers" and generating your @hits array from that? You are currently scanning the @perms list scalar(@observed) times, an O(Nm)¹ algorithm (where N is the size of the @perms list and m the size of @observed). If you can convert the scan (the first `foreach` loop) to a function that converts the `$obs` value to an index directly, each lookup becomes O(1)¹, which, given the right conditions, can be faster than scanning the @perms list every time. For example, your @observed values are (in the example) base-26, two-digit numbers. A basic algorithm for conversion would be something like:

```perl
use Test::More q(no_plan);

# ngram2number
#
# Convert an ngram into a number given a
# hashref containing the alphabet conversion,
# and the ngram to convert.
#
# An area for improvement would be to cache the
# $base ** ( $position++ ) results.
#
# Untested (ok, now it is), no warranty, blah blah blah
#
# Update: error in code - added reverse to correct
# Update: Multiply current digit, not add; scalar keys %alphabet
# Update: Added testing commands
#
sub ngram2number {
    my ( $alphabet, $ngram ) = @_;
    my $results  = 0;
    my $position = 0;
    my $base     = scalar( keys %$alphabet );

    # Walk the ngram from its last character, so the rightmost
    # character is the least significant digit.
    for my $c ( reverse split( //, $ngram ) ) {
        $results += ( $base ** ( $position++ ) ) * $alphabet->{ $c };
    }
    $results;
}

my $hex = { map { ( $_ => hex( $_ ) ) } ( '0'..'9', 'a'..'f' ) };

is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
    for ( '0'..'9', 'a'..'f' );
is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
    for ( glob '{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f}'x2 );
```

Depending on the size of the alphabet and the size of the ngram, the number may need to be a bignum or float. One other area (as mentioned above) for optimization could be the assignment of the zeros to the @permCounts array, and even the use of the @perms array. Depending on the size of the alphabet and the order of the ngram, it could grow large enough to start swapping (or even exhausting) memory resources, which will be a performance killer. Since the @perms (and thus the @permCounts) array grows exponentially in the form `sizeof( alphabet ) ** $ngramOrder`, your memory use grows very fast when using your presented method of calculating the answer.

Footnotes: 1 - I think that I have this correct, but I would hope I would get checked on this. I ignored the effect of iterating across each character of the ngram for the string comparison and the cost of iterating across each character in the ngram2number function as effectively cancelling each other out, within a constant multiplier.

--MidLifeXis
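One way to address the caching note in the comments is to avoid the exponentiation altogether. The sketch below is a hypothetical variant, not MidLifeXis's code: `ngram2number_horner` is a made-up name, and it uses Horner's rule (multiply the running total by the base, then add the next digit), which should give the same index as the base-N conversion described above. It is applied here to the thread's a..z alphabet.

```perl
use strict;
use warnings;

# Hypothetical Horner-style variant: reads the ngram left to right,
# so no reverse and no ** are needed. The rightmost character still
# ends up as the least significant digit.
sub ngram2number_horner {
    my ( $alphabet, $ngram ) = @_;
    my $base   = scalar keys %$alphabet;
    my $result = 0;
    $result = $result * $base + $alphabet->{$_} for split //, $ngram;
    return $result;
}

# Alphabet for the example in the thread: a => 0 .. z => 25.
my @letters  = ( 'a' .. 'z' );
my %alphabet = map { $letters[$_] => $_ } 0 .. $#letters;

print "$_ => ", ngram2number_horner( \%alphabet, $_ ), "\n"
    for qw( aa ab ad an bd dn );
```

With this alphabet, the two-letter ngrams map to 0 .. 675 in the same order as a sorted list of all two-letter strings.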
Re: counting instances of one array in another array by LanX (Saint) on Feb 23, 2012 at 12:18 UTC
Sorry, but searching thru an array "of all permutations of length n" doesn't make sense. If the tested string is shorter than n it will be a member by definition (modulo side criteria like repetition). If it's just about counting, use a hash like shown and a logical test. Cheers Rolf
Re^2: counting instances of one array in another array by jsmagnuson (Acolyte) on Feb 23, 2012 at 12:31 UTC
Yep, I know that the patterns in `@observed` must be members of `@perms`. I want to get back the list of positions where the items in `@observed` occur in the ordered list of permutations in `@perms`. I am certain there is a better way to do this, and I've done a lot of searching on this site and others looking for better solutions, but I haven't found one yet. That's why I'm bringing this to the monks. Best wishes, jim
Re^3: counting instances of one array in another array by jethro (Monsignor) on Feb 23, 2012 at 14:04 UTC
Basically you want to give each pattern a specific id, right? So why don't you use a second hash that gives you the position in @perms for each pattern? Since @perms is fixed, this hash is generated once and is then available for the rest of the script's runtime.
Re^3: counting instances of one array in another array by LanX (Saint) on Feb 23, 2012 at 13:22 UTC
It's hard to believe that you really need the position in the array after switching to hashes. But well, let's continue down this road... As long as the order isn't random but systematic, you should be able to calculate this position. So plz search the internet or math books for algorithms generating all permutations by combination of two permutations named sigma and tau, and reverse the process to get the index. This approach would limit time and space complexity considerably! Cheers Rolf
Re: counting instances of one array in another array by sundialsvc4 (Abbot) on Feb 23, 2012 at 16:46 UTC
Good grief, people ... you don't even need a hash. You almost don't need a program. `SELECT PERM, COUNT(*) FROM PERMUTATIONS GROUP BY PERM;` If you need to know the position of a permutation then you can calculate it. The position of ABC = (1,2,3) = 1 + 2*26^2 + 3*26^3, good enuf. If you then need to generate that huge matrix then ... generate it, as an output, using a nested loop, filling in the minuscule fraction of those entries that are not zero. You never have to store the entire range of possibilities; you need to store or count what you've actually got. In the following post, and within ten minutes flat, BrowserUK will now proceed to demonstrate why this reasoning is totally and utterly wrong and why my brain must be disconnected for saying it ...
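A minimal sketch of the "store only what you've actually got" idea, in Perl rather than SQL. The helper name `ngram_index` and its digit convention (a=0 .. z=25, leftmost letter most significant) are assumptions made for illustration, not taken from the post. Counts are kept in a hash keyed by the computed position, so only the non-zero entries ever exist.

```perl
use strict;
use warnings;

# Example observations from earlier in the thread.
my @observed = qw( ab ab ad an bd bn dn );

# Hypothetical helper: position of an ngram, assuming a=0 .. z=25
# with the leftmost letter as the most significant base-26 digit.
sub ngram_index {
    my ($ngram) = @_;
    my $index = 0;
    $index = $index * 26 + ( ord($_) - ord('a') ) for split //, $ngram;
    return $index;
}

# Sparse counts: position => number of occurrences.
my %count_at;
$count_at{ ngram_index($_) }++ for @observed;

# Emit only the non-zero entries, in position order.
printf "%d\t%d\n", $_, $count_at{$_} for sort { $a <=> $b } keys %count_at;
```

Any position that is not a key of `%count_at` is implicitly zero, so the full vector can be generated on output if it is ever needed.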
Re^2: counting instances of one array in another array by LanX (Saint) on Feb 23, 2012 at 18:13 UTC
> If you need to know the position of a permutation then you can calculate it. The position of ABC = (1,2,3) = 1 + 2*26^2 + 3*26^3, good enuf. Well, this doesn't efficiently encode permutations, but could be sufficient for the OP. (until he understands hashes) Cheers Rolf
Re^2: counting instances of one array in another array by jsmagnuson (Acolyte) on Feb 24, 2012 at 11:07 UTC
Thank you for this approach! This is about 10x faster than the next best approach that was suggested here that I tried implementing (which was this code from an Anonymous Monk): `my %seen; $seen{$_}++ for @observed; @permCounts = map { $seen{$_} || 0 } @perms;` This is good enough to get me going, but I will follow the other advice that was offered and learn about fleximal, PDL::Sparse, and hash-fu. I am very grateful for the generous advice from all of you! best, jim ps -- Slight tweak: abc=(0,1,2), and then 0+(1*26^1)+(2*26^2), etc. (i.e., a=0, z=25, and when string position > 0 you multiply the index of the character in that position by 26 raised to the power of the position).
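The tweak in the ps can be sketched as follows. The function names `ngram_position` and `position_ngram` are hypothetical, and the code assumes lowercase a..z input only. Under this convention the leftmost character is the least significant digit, so positions will not follow the order of a sorted list of ngrams, but the mapping is one-to-one for a fixed ngram length and can be inverted.

```perl
use strict;
use warnings;

# Hypothetical helper: a=0 .. z=25; the character at string position $p
# contributes (its index) * 26**$p, as described in the ps above.
sub ngram_position {
    my ($ngram) = @_;
    my @chars   = split //, $ngram;
    my $pos     = 0;
    $pos += ( ord( $chars[$_] ) - ord('a') ) * 26**$_ for 0 .. $#chars;
    return $pos;
}

# Hypothetical inverse: recover the ngram of a given length from its position.
sub position_ngram {
    my ( $pos, $length ) = @_;
    my $ngram = '';
    for ( 1 .. $length ) {
        $ngram .= chr( ord('a') + $pos % 26 );
        $pos = int( $pos / 26 );
    }
    return $ngram;
}

my $p = ngram_position('abc');    # 0 + 1*26 + 2*26**2 = 1378
print "abc => $p => ", position_ngram( $p, 3 ), "\n";
```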
Re^3: counting instances of one array in another array by MidLifeXis (Monsignor) on Feb 24, 2012 at 13:53 UTC
Re: your ps - see my ngram2number routine above. It does exactly that calculation. Update: after re-reading my original message, I note that you also need to translate a list of characters into an alphabet to use the function (just in case it was not understood from the original code). You can do that with: `sub charlist2alphabet { my %results; for my $index ( 0 .. (@_ - 1) ) { $results{ $_[ $index ] } = $index; } \%results; } my $alphabet = charlist2alphabet( @charlist );` or `my $alphabet = { map { ( $charlist[ $_ ] => $_ ) } ( 0 .. @charlist - 1 ) };` --MidLifeXis
Re: counting instances of one array in another array by MidLifeXis (Monsignor) on Feb 23, 2012 at 15:11 UTC
I also wonder if PDL::Sparse would be helpful in this case. --MidLifeXis
Re^2: counting instances of one array in another array by etj (Deacon) on May 24, 2022 at 22:24 UTC
I couldn't find any trace of that module. Possibly helpful would be PDL::Ngrams and/or PDL::CCS (CCS = Collapsed Column Storage).
Re^3: counting instances of one array in another array by hippo (Bishop) on May 25, 2022 at 09:49 UTC
There's an archive of it on GitHub and the tarball is available on BackPAN. 🦛

