CaptainF has asked for the wisdom of the Perl Monks concerning the following question:

I have many pairs of arrays that I need to compare for a simple set of criteria element by element. There is a premium on speed.

I have arrays that are something like:
@a = qw(a b c b c); @b = qw(a x c d d);
...and simply need to evaluate the first index position where a simple set of criteria are met (e.g., $a[3] ne $b[3] and $a[3] ne 'x' and b[3] ne 'x'). In my example, the correct answer would be element 3.

I've tried a variety of simple solutions of the form:

for(0..$#a){ if($a[$_] eq $b[$_] and $a[$_] ne 'x' ){ etc. etc. }elsif( ){ } }

etc. but that is absolutely clobbering performance (array sizes are on the order of 1000 elements, but I have approximately 1X10^12 pairs of arrays to compare (but not all in memory!).

Anyone have a suggestion?

Replies are listed 'Best First'.
Re: speedy way to compare elements in pairs of arrays?
by jethro (Monsignor) on May 26, 2009 at 09:21 UTC

    I think we have a XY Problem problem here

    To store 10^12 arrays of size 1000 you would need at least 1000 Terabyte. Since you probably don't have that amount of storage I assume you already multiplied your 1 million arrays to arrive at that number. Whatever you do you will never speed up the comparisions enough to make 10^12 comparisions fast. You have to ask instead how to avoid making 10^12 comparisions.

    I also assume the sets of different elements is relatively small and the criteria are not fixed, i.e. it is not always element 3 you are comparing.

    Then a solution could be to use a database or even simple files where you store the id of all arrays that have element m as the n-th element

    So if you use files, you would in a first step preprocess your arrays and create the files a0,a1,a2...a1000,b1...b1000,c1...c1000,... . In file a0 you would store all array-ids where an a is in position 0. So both of your example arrays above would be listed in a0.

    That would be a few thousand files, but a modern filesystem can handle that easily. Now your example search would work differently: You would simply list all array id-pairs in the files a0+b0,a0+c0,a0+d0...,b0+c0,b0+d0,... except for the files a0+x0,b0+x0... If there are none, resume with a1+b1,a1+c1...

    Note that if you have more criteria to further restrict your solution space, those criteria should be able to prune down the file pairs *before* you have to read any of these files

    This solution might not be the best way to represent your data. Provide more information and a better solution might surface

Re: speedy way to compare elements in pairs of arrays?
by CountZero (Bishop) on May 26, 2009 at 09:21 UTC
    A thousand billion pairs of arrays to compare? Even if comparing thousand pairs of arrays only takes 1 second (which is extremely fast given that you must read them from some form of external storage), it would take you more than 30 years to do so.

    Where are you storing all these arrays? Suppose one array is only 1000 bytes long, you will need 2 x 10**15 bytes of storage at least (that is 2 million billion bytes of storage, or 2 thousand terabytes), not taking into account any overhead. Are you sure about your figures?


    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: speedy way to compare elements in pairs of arrays?
by almut (Canon) on May 26, 2009 at 08:50 UTC

    First, are you absolutely sure it's the comparison operations that's taking the time, and not the reading in of your 1e12 pairs of arrays, etc.?

    If so, and if you need real speed, think about using another language like C instead.

Re: speedy way to compare elements in pairs of arrays?
by holli (Abbot) on May 26, 2009 at 08:55 UTC
    It's hard to tell from your "code" but it looks like you have multiple sets (the elsif part) of criteria and are looping over the complete pair of arrays, checking for each criterium. This is of course straightforward. The only room for optimization probably is in "the first index position where a simple set of criteria are met". The idea is to short-circuit the loop as fast as possible. Two possibilities come to (my) mind:
    1. for each pair of arrays, loop over the criteria and then over the data, returning as early as possible. if the data that is to be found is likely to be early in the array this might save time.
    2. put your criteria sets in an array of subroutines. when you loop over the data arrays and one of the subs returns true, remove it from the list. That way it won't get executed uneccessarily. (Like truncating your if/elsif tree). Once the critera array is empty you can leave the loop completely. The subroutine calls may outweigh that though.


    When you're up to your ass in alligators, it's difficult to remember that your original purpose was to drain the swamp.
Re: speedy way to compare elements in pairs of arrays?
by graff (Chancellor) on May 27, 2009 at 02:50 UTC
    As others have said, you've left out some important information. What form is the data in when you start to work with it? When your script does what it needs to do, what is the nature of the output it should be producing?

    Depending on what the task really is, it might be best to do some sort of preconditioning on all the input data, so that the decision criteria can be applied more efficiently -- e.g. instead of two arrays containing sequences of letters, how about a single list of strings, two characters per string (first character comes from one array, second character comes from the other array); skip the line if it contains an "x", do one thing if the two characters are the same, something else otherwise, etc etc.

    But without more information, there's no point speculating about it.

Re: speedy way to compare elements in pairs of arrays?
by CaptainF (Initiate) on May 28, 2009 at 17:40 UTC
    Thanks everyone for chiming in. Many of you were concerned with the number of arrays I was trying to compare, so I am trying an altogether different approach.(sorry if this seems like a somewhat different problem, but it is an all-together different approach to the above mentioned-issue). Another way to deal with my problem is as follows. I need to evaluate if a pair of strings of the same length are identical after excluding all positions that have an 'N' in one or both strings. Both strings will always consist of the character set /ATGCN/. For example, the comparison of $a and $b would meet my criterion: $a = 'ATGNCNC'; $b = 'ATGACNN'; But $c and $d would not: $c = 'ATGNCNC'; $d = 'TTGNNNC'; (because the first character differs at a position that does not contain an 'N' in either string) Again, there is an extreme premium on efficiency here. Any thoughts?