PerlMonks  

Re: Mysterious slow down with large data set

by BrowserUk (Patriarch)
on Feb 26, 2012 at 22:13 UTC


in reply to Mysterious slow down with large data set

I can't see what is causing the slowdown, but I can see one obvious thing that would speed it up a lot. You keep re-sorting the keys of %kernel, in the outer loop and again on every pass through the inner loop, even though they do not change. Instead of:

foreach $w1 ( sort( keys %kernel ) ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( sort( keys %kernel ) ) {
        ...

Using:

my @sortedKeys = sort( keys %kernel );
foreach $w1 ( @sortedKeys ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( @sortedKeys ) {

may speed things up to the point that the slowdown becomes insignificant.
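
A minimal, self-contained sketch of that change follows. The contents of %kernel and the pair counter are assumptions made up for illustration, not the original poster's data or loop body; the point is only that the keys are sorted once and the same array feeds both loops.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real %kernel; only the keys matter here.
my %kernel = map { $_ => 1 } qw( pear apple fig cherry );

# Sort once, outside both loops; the key set does not change while iterating.
my @sortedKeys = sort keys %kernel;

my $pairs = 0;
for my $w1 ( @sortedKeys ) {
    for my $w2 ( @sortedKeys ) {
        $pairs++;    # placeholder for the per-pair similarity work
    }
}

print "visited $pairs pairs over ", scalar( @sortedKeys ), " keys\n";
```

With K keys, the original form performs K + 1 sorts of K items each; this form performs one.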

Also, using a sort to track the top N is probably slower than a simple linear insertion, truncating the list if necessary.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: Mysterious slow down with large data set
by jsmagnuson (Acolyte) on Feb 26, 2012 at 23:44 UTC

    Yes, I can't believe I did that. Thank you!

    Regarding the top N, I had previously tried this, which worked, but seemed more complicated. Is this what you had in mind?

    } elsif ($sim > min(pdl(@topList))) {
        $theMin = grep { $topX[$_] eq min(pdl(@topX)) } 0..$#topX;
        # replace the smallest
        $topX[$theMin] = $sim;
        # add this one
        push @topX, $sim;
    }
    Thanks!

      I tried it this way:

      @topX = (-1) x 20;
      ...
      $topX[ $_ ] < $sim and splice( @topX, $_, 0, $sim ), pop( @topX ), last for 0 .. 19;

      A short-circuited, linear insertion is at worst O(N) rather than O(N log N).
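
The one-liner above can be unrolled into a small runnable sketch. The helper name insert_top, the list size of 3, and the sample scores are assumptions for illustration only; as in the post, the list is seeded with -1, which presumes every real score is greater than -1.

```perl
use strict;
use warnings;

# Keep the largest values seen so far, in descending order.
# insert_top() is a hypothetical helper name, not part of the original post.
sub insert_top {
    my( $top, $sim ) = @_;
    for my $i ( 0 .. $#$top ) {
        if( $top->[ $i ] < $sim ) {
            splice @$top, $i, 0, $sim;   # insert before the first smaller value
            pop @$top;                   # drop the smallest to keep the length fixed
            return 1;                    # stop scanning: the short-circuit
        }
    }
    return 0;                            # $sim is not larger than any kept value
}

my @topX = ( -1 ) x 3;   # assumes every real score is greater than -1
insert_top( \@topX, $_ ) for 0.4, 0.9, 0.1, 0.7, 0.95;

print "@topX\n";   # prints "0.95 0.9 0.7"
```

Each call scans at most N slots and does a single splice and pop, so no sort is needed to keep the list ordered.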

      It speeds things up a little, but doesn't address the slowdown, which is happening exclusively (and inexplicably) inside PDL.

      Unfortunately, the PDL documentation spends more time telling you about their 'philosophy', and indexing the indexes to the documentation, than it does telling you what these functions actually do, or how to inspect the results of what they did :(


