Re^3: correspondence between two arrays

I'd guess the downvotes (though I agree that may be a harsh response to a solution that could be much better but does work) are probably because he's looping through the entire second array for every item in the first array, instead of using a hash.

Writing the above got me curious: Just how bad is looping through an array and comparing as opposed to checking against a hash's keys? Very, very bad, it turns out. I wrote the script below, based on this task, to benchmark the two methods with a variable number of items in the lists. I only used one iteration for Benchmark (and removed the warnings to save space), because the test takes long enough to do once with larger lists, and I didn't want it to take all day. But the differences are large enough to ignore whatever margin of error that more iterations would smooth out. The hash method also has the overhead of splitting the elements of the second array and creating the hash.

With only 1000 items in each list, the array/array method takes about 2.75 seconds -- fast enough to live with, if you're not running it over and over all day. But the hash method is so fast that Benchmark seems unable to display the time elapsed sensibly. Bumping the lists up to 5000 items each, the array/array method goes up to almost 10 seconds, but the hash method is still down at .01 seconds, or almost 1000 times faster. At 10,000 items, array/array is up to 45 secs, and the hash is still at .02, or 1500 times faster. So not only is the array/array method much slower, but it scales much worse, since the number of loops and comparisons increases by the square of the list size. At 100,000 items in each array, the array/array method took well over an hour, and the hash method is still under a second! I knew the hash would be faster, but wow.

$ perl 939870.pl 1000
Building arrays of 1000 with unique 4-char keys... done.
Arrays have 1000 items each. Benchmarking...
                    Rate  arrays plus regex     using a hash
arrays plus regex   2.70/s             --            -100%
using a hash    1000000000000000/s 37000000000000000%  --             
+ 

$ perl 939870.pl 5000
Building arrays of 5000 with unique 4-char keys... done.
Arrays have 5000 items each. Benchmarking...
                    s/iter arrays plus regex      using a hash
arrays plus regex     9.40                --             -100%
using a hash      1.00e-02            93900%                --

$ perl 939870.pl 10000
Building arrays of 10000 with unique 4-char keys... done.
Arrays have 10000 items each. Benchmarking...
                    s/iter arrays plus regex      using a hash
arrays plus regex     46.3                --             -100%
using a hash      3.00e-02           154267%                --

$ perl 939870.pl 100000
Building arrays of 100000 with unique 4-char keys... done.
Arrays have 100000 items each. Benchmarking...
                  s/iter arrays plus regex      using a hash
arrays plus regex   4110                --             -100%
using a hash       0.330          1245345%                --
[download]

$ cat 939870.pl
#!/usr/bin/perl
use Modern::Perl;
use Benchmark qw(:all);

# create two lists, with the formats:
# list1: (four characters A-Z)
#   RDAE
#   JFCS
#
# list2: (7 digit number, space, four chars A-Z)
#   1234567 JIQZ
#   2345678 EFOP

my( @list1, @list2);
my( %keys1, %keys2);
my $array_size = $ARGV[0] || 100_000;

print "Building arrays of $array_size with unique 4-char keys... ";
for (1..$array_size){
    my($s1, $s2) = ('','');
    $s1 .= ('A'..'Z')[rand 26] for (1..4);
    $s2 .= ('A'..'Z')[rand 26] for (1..4);
    next if $keys1{$s1} or $keys2{$s2};    # make keys unique
    push @list1, $s1;                      # save keys in lists
    push @list2, int(rand(8999999)+1_000_000). " $s2";
}

say "done.\nArrays have ". 
    scalar(@list1).
    " items each. Benchmarking...";

cmpthese(1, {
    'arrays plus regex' => \&array_plus_regex,
    'using a hash'      => \&using_hash,
});

sub array_plus_regex {
    my @results;
  FOUND:
    for my $l1 (@list1){
        for my $l2 (@list2){
            if($l2 =~ / $l1/){
                push @results, "$l1 ---> $`";
                next FOUND;
            }
        }
        push @results, "$l1 ---> null";
    }
}

sub using_hash {
    my @results;
    my %match;
    my $count = 0;
    for my $l2 (@list2){
        my($k, $v) = split ' ', $l2;
        $match{$k} = $v;
    }
    for my $l1 (@list1){
        if($match{$l1}){
            push @results, "$l1 ---> $match{$l1}";
        } else {
            push @results, "$l1 ---> null";
        }
    }
}
[download]

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.

Comment on Re^3: correspondence between two arrays Select or Download Code


Perl-Sensitive Sunglasses
	PerlMonks