Mysterious slow down with large data set

by jsmagnuson (Acolyte)
on Feb 26, 2012 at 21:50 UTC
jsmagnuson has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a set of about 30,000 words, and I am using string kernels as a metric of word similarity. The goal is to see whether different kernels are better at predicting how quickly human subjects are able to process words.

I have calculated the string kernels for each word (with help from this marvelous group). So now I have a file with 30,000 lines. The first field in each line is a word, and this is followed by a 676-element vector representing the kernel representation.

Once I read this in, I need to step through and calculate the similarity of each word to every other word using vector cosine, as well as track the highest similarity value (excluding the word itself), and the set of X-most similar items (there are reasons to believe these are good predictors of human performance).
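
In case it helps, here is a minimal sketch of what I mean by vector cosine, using PDL's norm() and inner() on toy vectors rather than my real 676-element rows:

    use PDL;

    # Toy vectors standing in for the real 676-element kernel rows.
    my $u = pdl( 1, 0, 2, 0 );
    my $v = pdl( 0, 1, 2, 1 );

    # cosine(u, v) = (u . v) / (|u| |v|); norm() rescales each vector to unit
    # length and inner() takes the dot product, so together they give the cosine.
    my $cos = inner( norm($u), norm($v) );
    print "$cos\n";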

Here's the problem: when I start running the code below, it is very fast. It takes 38 msecs to process the first word, but by the time it reaches the 100th it is taking 80 msecs, and by the 400th it is taking 200 msecs.

Memory use by perl stays constant, and I cannot figure out what would make the program slow down so much. I know it's not well written, as I'm a cognitive neuroscientist and not a very good programmer, but I'm really stumped as to what is causing the slowdown. If I take out the code for tracking the top X items, it doesn't slow down nearly as much, but it still slows down (first item = 38 msecs, 100th = 44, 400th = 85). But I really need to do that tracking...

So if anyone can give me pointers as to what is slowing things down and whether there is a way to avoid it, I would be most grateful.

Thanks!

jim

#!/usr/bin/perl -s

use PDL;
use Time::HiRes qw ( time ) ;

$|=1;

$top = 20;

while(<>){
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = pdl(@data);
    # EXAMPLE LINE
    # word 0 0 0 1 0 ... 0 0   (a word followed by its 676-element binary kernel vector; full line elided)
}

$nrecs = keys %kernel;

$startAll = time();
$at = 0;
foreach $w1 (sort(keys %kernel)){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 (sort(keys %kernel)) {
        next if($at == $at2); # skip identical item, but not homophones
        $at2++;
        $sim = inner(norm($kernel{$w1}),norm($kernel{$w2}));
        $totalsim += $sim;
        if($sim > $maxsim){ $maxsim = $sim; }
        # keep the top 20
        if($#topX < $top){
            push @topX, $sim;
        } else {
            @topX = sort { $a <=> $b } @topX;
            if($sim > $topX[0]){ $topX[0] = $sim; }
        }
    }
    $at++;
    $topXtotal = sum(pdl(@topX));
    printf("$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n");
    unless($at % 10){
        $now = time();
        $elapsed = $now - $startAll;
        $thisWord = $now - $startWord;
        $perWord = $elapsed / $at;
        $hoursRemaining = (($nrecs - $at) * $perWord)/3600;
        printf(STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t".
            "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
            $elapsed, $thisWord, $perWord, $hoursRemaining);
    }
}

Re: Mysterious slow down with large data set
by BrowserUk (Pope) on Feb 26, 2012 at 22:13 UTC

    I can't see what is causing the slowdown, but I can see one obvious thing that would speed it up a lot: you keep re-sorting the keys of %kernel on every pass even though they never change. Instead of:

    foreach $w1 ( sort( keys %kernel ) ){
        $totalsim = $maxsim = 0;
        @topX = ();
        $at2 = 0;
        foreach $w2 ( sort( keys %kernel ) ) {
        ...

    Using:

    my @sortedKeys = sort( keys %kernel );

    foreach $w1 ( @sortedKeys ){
        $totalsim = $maxsim = 0;
        @topX = ();
        $at2 = 0;
        foreach $w2 ( @sortedKeys ) {

    may speed things up to the point that the slowdown becomes insignificant.

    Also, using a sort to track top N is probably slower than a simple linear insertion and truncate if necessary.



      Yes, I can't believe I did that. Thank you!

      Regarding the top N, I had previously tried this, which worked, but seemed more complicated. Is this what you had in mind?

      } elsif ($sim > min(pdl(@topList))) {
          $theMin = grep { $topX[$_] eq min(pdl(@topX)) } 0..$#topX;
          # replace the smallest
          $topX[$theMin] = $sim;
          # add this one
          push @topX, $sim;
      }
      Thanks!

        I tried it this way:

        @topX = (-1) x 20;

        ...

        $topX[ $_ ] < $sim and splice( @topX, $_, 0, $sim ), pop( @topX ), last
            for 0 .. 19;

        A short-circuited, linear insertion is at worst O(N) rather than O(N log N).

        It speeds things up a little, but doesn't address the slowdown, which is happening exclusively (and inexplicably) inside PDL.

        Unfortunately, the PDL documentation spends more time telling you about their 'philosophy', and indexing the indexes to the documentation, than it does telling you what these functions actually do or how to inspect the results of what they did :(



Re: Mysterious slow down with large data set
by chromatic (Archbishop) on Feb 26, 2012 at 22:16 UTC
    I cannot figure out what would make the program slow down so much...

    Unnecessary work:

    foreach $w1 (sort(keys %kernel)){
        ...
        foreach $w2 (sort(keys %kernel)) {
            ...
        }
    }

    For every word in the hash, you sort all of the other words in the hash (even though you've already sorted them). Then you compare every word in the hash again, even those you've already compared.

    Instead, sort the hash keys and assign them to an array. Then process the array one word at a time. Grab the first word, then compare it to the remaining words in the array. Then do the same for the second and so on. Note that the list of words to compare gets shorter with every word you process. This is exactly what you want.
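
    A minimal sketch of that loop shape, assuming the %kernel hash from the original post and leaving the comparison itself unchanged (illustrative only):

        use PDL;

        my @words = sort keys %kernel;   # sort once, up front

        for my $i ( 0 .. $#words ) {
            # compare word $i only against the words that follow it;
            # earlier pairs were already handled on previous passes
            for my $j ( $i + 1 .. $#words ) {
                my $sim = inner( norm( $kernel{ $words[$i] } ),
                                 norm( $kernel{ $words[$j] } ) );
                # ... use $sim ...
            }
        }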



      Thanks for pointing out the unnecessary work.

      Can I ask you about the idea that I can decrease the size of the list with each word? The problem is that I need to get the total similarity for each word to every other word. So I don't think I can decrease the number of comparisons per step without storing preceding results, but the memory demands are huge, even if I delete items from memory once they are retrieved.

      But if you see another solution, please let me know. Thank you for your help!

        I don't know whether your comparison of two words is direction dependent, that is, whether comparing word1 with word2 gives a different result from comparing word2 with word1. If it is, keep your algorithm. If it is not, follow chromatic's tip: compare the first element to every element in the array except the first one (the element itself), then compare the second element to every element except the first two, because the first has already been compared to the second, and so on. That is what chromatic was pointing to.
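
        Since vector cosine is symmetric (inner(norm(a), norm(b)) equals inner(norm(b), norm(a))), one pass over the unordered pairs can feed both words' running totals, so you only ever keep one number per word rather than storing all the pairwise results. A rough sketch, assuming the %kernel hash from the original script:

            use PDL;

            my @words = sort keys %kernel;
            my %totalsim;                  # one running total per word

            for my $i ( 0 .. $#words ) {
                for my $j ( $i + 1 .. $#words ) {
                    # sclr() pulls the 0-dimensional result out as a plain Perl number
                    my $sim = inner( norm( $kernel{ $words[$i] } ),
                                     norm( $kernel{ $words[$j] } ) )->sclr;
                    # one comparison is credited to both words of the pair
                    $totalsim{ $words[$i] } += $sim;
                    $totalsim{ $words[$j] } += $sim;
                }
            }

        The top-X tracking could be handled the same way with a small per-word list, at the cost of a little more bookkeeping.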
Re: Mysterious slow down with large data set
by BrowserUk (Pope) on Feb 26, 2012 at 22:58 UTC

    The slowdown is being caused by the PDL calculation. You can prove this by substituting $sim = rand() for that calculation and seeing that the loop then runs at a constant speed of 0.023s per iteration, whereas with the PDL calculation the first iteration takes an almost identical time but slows to 0.214s by the time you get to the 500th.

    My understanding of PDL is limited, but from a cursory inspection, you are performing the same calculation on what looks to be the same size piddles at each iteration, so there is no logic to it getting slower.

    Unless, of course, the length or form of the piddles is being changed each time you reuse them. What do inner() and norm() do to the underlying piddles? If they are growing (possibly doubling?) in length each time you use them, that would explain the progression of the slowdown.



Re: Mysterious slow down with large data set
by jwkrahn (Monsignor) on Feb 26, 2012 at 23:08 UTC

    Not related to speed, but if you had warnings enabled you would get this message:

    Name "main::startWord" used only once: possible typo at yourprogram line 42.

    Which means that this line:

         42     $thisWord = $now - $startWord;

    Is actually processed as:

         42     $thisWord = $now;
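
    A minimal fix, assuming per-word timing is what was intended, would be to record $startWord at the top of the outer loop:

        foreach $w1 (sort(keys %kernel)){
            $startWord = time();   # now $thisWord = $now - $startWord times this word's pass
            # ... rest of the loop body as before ...
        }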

    Also, Perl is not C, so this line:

         38   printf("$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n");

    Should be:

         38   print "$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n";

    And these lines:

         45     printf(STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t".
         46        "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
         47        $elapsed, $thisWord, $perWord, $hoursRemaining);

    Should be:

         45     print STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t";
         46     printf STDERR "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
         47        $elapsed, $thisWord, $perWord, $hoursRemaining;
Re: Mysterious slow down with large data set
by BrowserUk (Pope) on Feb 27, 2012 at 01:42 UTC

    You can shed another chunk of time by norm()'ing your vectors once as you load them, rather than repeatedly:

    while(<>){
        chomp;
        my ($wrd, @data) = split;
        $kernel{$wrd} = norm( pdl(@data) );

        ...

    my $sim = inner( $kernel{ $w1 }, $kernel{ $w2 } );

    But the PDL -- now down to just the inner()s -- still exhibits the slowdown behaviour, even though it doesn't do so when run in a tight loop on two similar vectors in a standalone test.

    At this point I think you need some serious PDL knowledge to move forward.



Re: Mysterious slow down with large data set
by BrowserUk (Pope) on Feb 27, 2012 at 02:03 UTC

    Try switching the order of these two statements around and see what happens.

    next if($at == $at2); # skip identical item, but not homophones
    $at2++;
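
    That is, a sketch of the swapped order:

        $at2++;
        next if($at == $at2);   # the counter now advances on every pass of the inner loop

    As posted, the next fires before $at2++ ever executes, so once $at2 catches up with $at it stops advancing and the rest of that word's comparisons are skipped -- which would also explain why the early words appear so much faster than the later ones. (After the swap, the index that gets skipped shifts by one, so it is worth checking that the intended word is still the one excluded.)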


Re: Mysterious slow down with large data set
by sundialsvc4 (Monsignor) on Feb 27, 2012 at 15:29 UTC

    When you start getting into the realm of a “memory footprint” that is large enough to provoke paging activity from the virtual memory manager (as this algorithm certainly would be), you therefore get into the realm where the so-called working-set size of the algorithm becomes important, and so you want so-called locality of reference.   Memory references that are “nearby to” what you have recently “touched” are cheap, but others require a page-fault and that is undoubtedly where your milli-seconds are coming from.   Hence the technical reasoning behind the Wisdoms offered here:   once you have built a structure, keep it, and iterate through it repeatedly.

    There is a somewhat curious artifact of all “admittedly, very convenient” garbage-collection based memory management strategies such as Perl’s:   when you allocate a structure over and over again, the previous copies of that structure are still there, abandoned but not necessarily cleaned-up yet.   Therefore the memory footprint as seen by the operating system can be somewhat bigger than you would intuitively expect it to be.

    I think that it’s pretty much a given that, if you can notice that an algorithm is slowing down, and especially if you notice what looks like a more-than-linear slowdown curve, then “inefficient use of memory” is the root cause.   In the computer world of nano-seconds, disk drives and communications links (and human beings) are about the only source of milli-seconds to be found.

    (Now, in the immediate but inevitable reply to this comment, BrowserUK will now proceed to vilify both it and me.   Maestro, if you please ...)

      From the OP:

      Memory use by perl stays constant,

      When I ran the OP's code, it used a little over 300MB to hold and process a 57,000-word dataset. He mentions a 30,000-word dataset, so maybe 200MB.

      What usage did you see when you ran it?

      Why not just stop guessing? It doesn't help anyone.



      (Now, in the immediate but inevitable reply to this comment, BrowserUK will now proceed to vilify both it and me. Maestro, if you please ...)
      Why assume people are running out of memory when most systems have 3-4GB these days and servers often have a lot more than that? If you don't want him to call you out then stop shooting blanks in the dark!


Re: Mysterious slow down with large data set
by BrowserUk (Pope) on Feb 27, 2012 at 18:23 UTC

    BTW: once you've fixed your code to perform the complete processing you intend, if you find the indication that it is going to take 50+ hours depressing, then come back.

    Because whilst the calculation wouldn't be exactly as you have now, it might be possible to achieve your goal far more quickly.



      I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).

      So it seems it was an interaction: the PDL inner function led to large slowdowns when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, so the whole set can be processed in about 3 hours.

      I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.
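
      The gist of that advice, as I currently understand it, is to stack all the norm()ed vectors into one 2-D piddle and let inner() broadcast one word against the whole set at once, instead of looping over pairs in Perl. A rough sketch only (not the list's exact code, and assuming the %kernel hash built in the code below):

          use PDL;

          my @kernelKeys = sort keys %kernel;
          # cat() glues the norm()ed vectors along a new dimension,
          # giving a piddle of shape (676, nWords).
          my $mat = cat( @kernel{ @kernelKeys } );

          for my $i ( 0 .. $#kernelKeys ) {
              # inner() broadcasts over the extra dimension of $mat, so one call
              # computes this word's cosine against every word in one C-level pass.
              my $sims = inner( $mat, $kernel{ $kernelKeys[$i] } );
              $sims->set( $i, 0 );          # ignore the word's match with itself
              my $sorted = qsort( $sims );
              # mean, max and mean-of-top-20 then come straight from $sorted
          }

      Doing one word per call keeps memory modest; materialising the full 30,000 x 30,000 similarity matrix in one go would also be possible but needs several gigabytes.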

      Thanks, everyone, for your help!

      jim
      #!/usr/bin/perl -s

      use PDL;
      use PDL::NiceSlice;
      use Time::HiRes qw ( time ) ;

      $|=1;

      $top = 20;

      $realStart = time();
      while(<>){
          chomp;
          ($wrd, @data) = split;
          $kernel{$wrd} = norm(pdl(@data));
          # EXAMPLE LINE
          # word 0 0 0 1 0 ... 0 0   (a word followed by its 676-element binary kernel vector; full line elided)
      }

      @kernelKeys = sort( keys %kernel );
      printf STDERR "# read $#kernelKeys words in %.2f seconds\n", time()-$realStart;

      $startAll = time();
      $at1 = 0;
      printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";
      foreach $w1 (@kernelKeys) {
          $startWord = time();
          @allSims = ();
          $at2 = -1;
          foreach $w2 (@kernelKeys) {
              $at2++;
              next if($at1 == $at2); # skip identical item, but not homophones
              push @allSims, inner($kernel{$w1},$kernel{$w2});
              # $sim = inner($kernel{$w1},$kernel{w2});
              # $totalsim+=$sim;
              # if($sim > $maxsim){ $maxsim = $sim; }
              # # keep the top 20
              # if($#topX < $top){
              #     push @topX, $sim;
              # } else {
              #     @topX = sort { $a <=> $b } @topX;
              #     if($sim > $topX[0]){ $topX[0] = $sim; }
              # }
          }
          $at1++;
          $allSim = qsort(pdl(@allSims));
          $now = time();
          printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
              sum($allSim)/$#kernelKeys, max($allSim),
              sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1)))/$top,
              $now - $startWord;
          unless($at1 % 25) {
              $elapsed = $now - $startAll;
              $thisWord = $now - $startWord;
              $perWord = $elapsed / ($at1 + 1);
              $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
              printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
                  $thisWord, $elapsed, $perWord, $hoursRemaining;
          }
      }
        So the whole set can be processed in about 3 hours.

        Glad you found a resolution, though 3 hours still isn't quick.

        But I wonder if you are open to having your methodology questioned? (By someone who has little knowledge of your goals and probably wouldn't understand them if he did :)

        Specifically I'd like to ask you about "word similarity"?


