http://www.perlmonks.org?node_id=956321

jsmagnuson has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a set of about 30,000 words, and I am using string kernels as a metric of word similarity. The goal is to see whether different kernels are better at predicting how quickly human subjects are able to process words.

I have calculated the string kernels for each word (with help from this marvelous group). So now I have a file with 30,000 lines. The first field in each line is a word, and this is followed by a 676-element vector representing the kernel representation.

Once I read this in, I need to step through and calculate the similarity of each word to every other word using vector cosine. For each word I also need to track the highest similarity value (excluding the word itself) and the set of the X most similar items, since there are reasons to believe these are good predictors of human performance.
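To be explicit about the metric, here is the standard definition of vector cosine. The symbols a and b are just illustrative names for the 676-element kernel vectors of two words; they do not appear in the code.

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{676} a_i b_i}
         {\sqrt{\sum_{i=1}^{676} a_i^{2}} \; \sqrt{\sum_{i=1}^{676} b_i^{2}}}
\]
```

As far as I understand it, this is what inner(norm(...), norm(...)) in the code works out to, assuming PDL's norm rescales each vector to unit Euclidean length. The value is undefined for an all-zero vector.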

Here's the problem: when I start running the code below, it is very fast. It takes 38 msecs to process the first word, but by the time it reaches the 100th it is taking 80 msecs, and by the 400th it is taking 200 msecs.

Memory use by perl stays constant, and I cannot figure out what would make the program slow down so much. I know the code is not well written, as I'm a cognitive neuroscientist and not a very good programmer, but I'm really stumped as to what is causing the slowdown. If I take out the code for tracking the top X items, it doesn't slow down nearly as much, but it still slows down (first item = 38 msecs, 100th = 44, 400th = 85). And I really need to do that tracking...

So if anyone can give me pointers as to what is slowing things down and whether there is a way to avoid it, I would be most grateful.

Thanks!

jim

#!/usr/bin/perl -s
use PDL;
use Time::HiRes qw ( time ) ;

$|=1;
$top = 20;

while(<>){
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = pdl(@data);
    # EXAMPLE LINE (a single input line; wrapped here for readability)
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
    #   0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
}

$nrecs = keys %kernel;
$startAll = time();
$at = 0;

foreach $w1 (sort(keys %kernel)){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 (sort(keys %kernel)) {
        next if($at == $at2); # skip identical item, but not homophones
        $at2++;
        $sim = inner(norm($kernel{$w1}),norm($kernel{$w2}));
        $totalsim+=$sim;
        if($sim > $maxsim){ $maxsim = $sim; }
        # keep the top 20
        if($#topX < $top){
            push @topX, $sim;
        } else {
            @topX = sort { $a <=> $b } @topX;
            if($sim > $topX[0]){ $topX[0] = $sim; }
        }
    }
    $at++;
    $topXtotal = sum(pdl(@topX));
    printf("$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n");
    unless($at % 10){
        $now = time();
        $elapsed = $now - $startAll;
        $thisWord = $now - $startWord;
        $perWord = $elapsed / $at;
        $hoursRemaining = (($nrecs - $at) * $perWord)/3600;
        printf(STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t".
               "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
               $elapsed, $thisWord, $perWord, $hoursRemaining);
    }
}