PerlMonks  

Re^3: Fast data structure..!!!

by kyle (Abbot)
on Apr 15, 2008 at 16:26 UTC (#680561)


in reply to Re^2: Fast data structure..!!!
in thread Fast data structure..!!!

I'd like a sample value for $dp. A Data::Dumper representation of it or some code to cook up a reasonable facsimile would work. When I run this on my computer, it doesn't do anything interesting until I give it a value for that, but what particular value it has could have a big effect on performance.

Thanks.


Re^4: Fast data structure..!!!
by MimisIVI (Acolyte) on Apr 15, 2008 at 16:40 UTC
    Here is the whole code. I should have given it earlier.

    use strict;
    use Devel::Size qw(size);

    my $df=500000;
    my $tf=3;
    my $wektor = '';
    my $packed = '';
    my $nr=0;
    for(0 .. $df)
    {
        vec ($wektor, $nr++, 32) = $_;     # DOC ID......
        vec ($wektor, $nr++, 32) = $tf;    # TF......
        for(0 .. $tf)
        {
            vec ($wektor, $nr++, 32) = $_+10;    # POSITIONS
        }
    }
    print "Vector's size: " . size( $wektor ) . " bytes\n";

    ###################### UNPACK VECTOR2.....
    my %vec;
    my %pos;
    my $docID=0;
    my $tf=0;
    my $index=0;
    my $order=1;
    my $Aa=time();
    for(0 .. $df)
    {
        $docID = vec ($wektor, $index++, 32);
        $tf = vec ($wektor, $index++, 32);
        $vec{$docID}=$tf;
        # print "Doc id: $docID\ttf: $tf\n";
        for(0 .. $tf)
        {
            my $last=vec ($wektor, $index++, 32);
            $pos{$docID}{$last}=$order;
        }
    }
    print "unpack vector in \t",time()-$Aa," secs..\n";
      I still don't get why it takes 15s for you to execute that code when it runs in 3s on my machine. Did your machine swap to disk or something? Is it an ancient machine, or a debugging perl?

      A few thoughts anyway:

      1. Try to pack the vector. You're only accessing it linearly anyway.

      2. Try not to use that vector at all. You can populate your %pos hash in the first place instead.

      3. Store the data structure on disk once, for example in a BerkeleyDB. It seems to be constant, so you don't actually need to calculate it every time your program starts. Or store $wektor in a plain binary file after you created it, and in subsequent runs only retrieve that from disk and generate your hash from it.
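      Points 1 and 3 can be sketched together. This is only an illustration under some assumptions: the sizes are shrunk for a quick run, the temporary file stands in for whatever index file you would really use, and the variable names ($loaded, @cells and so on) are made up for the example. It relies on the documented fact that vec() with 32 bits stores each cell in big-endian order, which is the same layout the 'N' template of unpack reads.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small vector in the same layout as the code above:
# doc id, tf, then tf+1 positions, each in a 32-bit cell.
my ($df, $tf) = (1000, 3);    # assumed small sizes for the demo
my $wektor = '';
my $nr = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;
    vec($wektor, $nr++, 32) = $tf;
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;
}

# Point 3: write the raw bytes once, read them back on later runs.
my ($fh, $file) = tempfile(UNLINK => 1);    # stand-in for a real index file
binmode $fh;
print {$fh} $wektor;
close $fh or die "close: $!";

open my $in, '<:raw', $file or die "open $file: $!";
my $loaded = do { local $/; <$in> };
close $in;

# Point 1: decode every cell in one call instead of one vec() per cell.
my @cells = unpack 'N*', $loaded;

my (%vec, %pos);
my $order = 1;
my $i = 0;
while ($i < @cells) {
    my ($doc_id, $doc_tf) = @cells[$i, $i + 1];
    $i += 2;
    $vec{$doc_id} = $doc_tf;
    # Hash slice: store all positions of this document in one assignment.
    @{ $pos{$doc_id} }{ @cells[$i .. $i + $doc_tf] } = ($order) x ($doc_tf + 1);
    $i += $doc_tf + 1;
}

printf "decoded %d documents\n", scalar keys %vec;
```

      Note that unpack 'N*' builds the whole list of cells in memory at once, so for the full 500000 documents it trades memory for fewer function calls. Whether that is a net win on your machine is something to measure, not assume.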

      I ran this under Devel::NYTProf and found that this loop is the hot spot:

      for(0 .. $tf) { my $last=vec ($wektor, $index++, 32); $pos{$docID}{$last}=$order; }

      I changed it to this, following a suggestion from dvryaboy and also getting rid of the block completely:

      $pos{$docID}{vec($wektor, $index++, 32)}=$order for 0 .. $tf;

      That got somewhat faster.

      I tried taking out the constant reference to $pos{$docID} like this:

      $pos{$docID} ||= {};
      my $did_ref = $pos{$docID};
      $did_ref->{vec($wektor, $index++, 32)}=$order for 0 .. $tf;

      ...but that didn't make much difference to the hot loop, and it was more expensive outside the loop than the savings it got inside.
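      If you want to compare the three variants of the loop yourself, something like the following might do. It is a rough sketch using the core Benchmark module, not the profiling run described above: the sub names are invented for the example, the test vector is much smaller than the real one, and the numbers will differ between machines and perl builds.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Small assumed test vector in the thread's layout:
# doc id, tf, then tf+1 positions, each in a 32-bit cell.
my ($df, $tf) = (2000, 3);
my $wektor = '';
my $nr = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;
    vec($wektor, $nr++, 32) = $tf;
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;
}
my $order = 1;

# Variant 1: the original inner loop with a block and a temporary.
sub with_block {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        for (0 .. $n) {
            my $last = vec($wektor, $index++, 32);
            $pos{$docID}{$last} = $order;
        }
    }
    return \%pos;
}

# Variant 2: statement modifier, no block and no temporary.
sub with_modifier {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        $pos{$docID}{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
    }
    return \%pos;
}

# Variant 3: look up $pos{$docID} once and reuse the reference.
sub with_hoisted_ref {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        my $did_ref = $pos{$docID} ||= {};
        $did_ref->{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
    }
    return \%pos;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    block    => \&with_block,
    modifier => \&with_modifier,
    hoisted  => \&with_hoisted_ref,
});
```

      Each sub rebuilds the whole hash, so the comparison includes the outer loop as well as the hot spot. That matches the observation that what happens outside the inner loop can cancel out savings inside it.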

      This is without really understanding what's going on, though. I wouldn't be surprised if what you're doing would benefit from just using a better algorithm.

        Yes, this loop is the hot spot. Perhaps changing my algorithm would be the best way.
