http://www.perlmonks.org?node_id=680570


in reply to Re^3: Fast data structure..!!!
in thread Fast data structure..!!!

Here is the whole code. I should have posted it earlier.

use strict;
use Devel::Size qw(size);

my $df = 500000;
my $tf = 3;
my $wektor = '';
my $packed = '';
my $nr = 0;

for (0 .. $df) {
    vec($wektor, $nr++, 32) = $_;            # DOC ID
    vec($wektor, $nr++, 32) = $tf;           # TF
    for (0 .. $tf) {
        vec($wektor, $nr++, 32) = $_ + 10;   # POSITIONS
    }
}
print "Vector's size: " . size( $wektor ) . " bytes\n";

###################### UNPACK VECTOR2
my %vec;
my %pos;
my $docID = 0;
$tf = 0;            # reuse the variable declared above instead of redeclaring it
my $index = 0;
my $order = 1;
my $Aa = time();
for (0 .. $df) {
    $docID = vec($wektor, $index++, 32);
    $tf    = vec($wektor, $index++, 32);
    $vec{$docID} = $tf;
    # print "Doc id: $docID\ttf: $tf\n";
    for (0 .. $tf) {
        my $last = vec($wektor, $index++, 32);
        $pos{$docID}{$last} = $order;
    }
}
print "unpack vector in \t", time() - $Aa, " secs..\n";

Re^5: Fast data structure..!!!
by moritz (Cardinal) on Apr 15, 2008 at 16:59 UTC
    I still don't get why that code takes 15s to execute for you; it runs in 3s on my machine. Did your machine swap to disk or something? Is it an ancient machine, or a debugging perl?

    A few thoughts anyway:

    1. Try to pack the vector. You're only accessing it linearly anyway.

    2. Try not to use that vector at all. You can populate your %pos hash in the first place instead.

    3. Store the data structure on disk once, for example in a BerkeleyDB. It seems to be constant, so you don't actually need to calculate it every time your program starts. Or store $wektor in a plain binary file once you have created it, and in subsequent runs just read it back from disk and generate your hash from it.
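
One possible reading of suggestions 1 and 3, as a rough sketch rather than a tested solution: write $wektor to a binary file once, read it back, and decode all 32-bit fields in a single unpack call instead of one vec() call per field. This relies on vec() with 32 bits storing values in the same big-endian layout as pack's 'N' format. The temporary file, the small $df and the names @fields and $loaded are assumptions made for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Small stand-in size (assumption); the original post uses $df = 500000.
my $df = 1000;
my $tf = 3;

# Build the vector the same way as the original post.
my $wektor = '';
my $nr     = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;                    # doc ID
    vec($wektor, $nr++, 32) = $tf;                     # TF
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;    # positions
}

# Suggestion 3 (one reading): store the raw string once ...
my ($fh, $path) = tempfile(UNLINK => 1);   # hypothetical location
binmode $fh;
print {$fh} $wektor;
close $fh or die "close: $!";

# ... and on later runs only read it back.
open my $in, '<:raw', $path or die "open $path: $!";
my $loaded = do { local $/; <$in> };
close $in;

# Suggestion 1 (one reading): decode every 32-bit field in one call.
my @fields = unpack 'N*', $loaded;

my (%vec, %pos);
my $order = 1;
my $i     = 0;
while ($i < @fields) {
    my ($docID, $count) = @fields[$i, $i + 1];
    $i += 2;
    $vec{$docID} = $count;
    # Like the original loop, this reads $count + 1 positions per document.
    $pos{$docID}{ $fields[$i++] } = $order for 0 .. $count;
}

printf "decoded %d documents\n", scalar keys %vec;
```

Whether this is actually faster than the vec() loop depends on the machine and the perl build, so it would need to be timed against the original before drawing conclusions.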

Re^5: Fast data structure..!!!
by kyle (Abbot) on Apr 15, 2008 at 17:03 UTC

    I ran this under Devel::NYTProf and found that this loop is the hot spot:

    for(0 .. $tf) {
        my $last=vec ($wektor, $index++, 32);
        $pos{$docID}{$last}=$order;
    }

    I changed it to this, following a suggestion from dvryaboy and also getting rid of the block completely:

    $pos{$docID}{vec($wektor, $index++, 32)}=$order for 0 .. $tf;

    That got somewhat faster.

    I tried taking out the constant reference to $pos{$docID} like this:

    $pos{$docID} ||= {};
    my $did_ref = $pos{$docID};
    $did_ref->{vec($wektor, $index++, 32)}=$order for 0 .. $tf;

    ...but that didn't make much difference to the hot loop, and it was more expensive outside the loop than the savings it got inside.
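
To compare the three variants of the hot loop side by side, something like the following sketch could be used. It wraps each variant in a sub, checks that all of them build the same %pos, and only runs the timing when asked. The data size, the helper flatten, the %build table and the RUN_BENCH environment variable are all made up for this illustration; cmpthese comes from the core Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Small stand-in data (assumption): 2001 documents, $tf + 1 positions each.
my ($df, $tf, $order) = (2000, 3, 1);
my $wektor = '';
my $nr     = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;
    vec($wektor, $nr++, 32) = $tf;
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;
}

# Hypothetical helper: turn a %pos structure into a comparable string.
sub flatten {
    my ($pos) = @_;
    return join ';', map {
        my $doc = $_;
        "$doc:" . join ',', sort { $a <=> $b } keys %{ $pos->{$doc} };
    } sort { $a <=> $b } keys %$pos;
}

# Each builder walks the vector and returns its own %pos as a hash reference.
my %build = (
    block => sub {          # original inner loop with a block
        my %pos; my $index = 0;
        for (0 .. $df) {
            my $docID = vec($wektor, $index++, 32);
            my $n     = vec($wektor, $index++, 32);
            for (0 .. $n) {
                my $last = vec($wektor, $index++, 32);
                $pos{$docID}{$last} = $order;
            }
        }
        return \%pos;
    },
    modifier => sub {       # statement-modifier form, no inner block
        my %pos; my $index = 0;
        for (0 .. $df) {
            my $docID = vec($wektor, $index++, 32);
            my $n     = vec($wektor, $index++, 32);
            $pos{$docID}{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
        }
        return \%pos;
    },
    hoisted => sub {        # inner hash reference taken once per document
        my %pos; my $index = 0;
        for (0 .. $df) {
            my $docID = vec($wektor, $index++, 32);
            my $n     = vec($wektor, $index++, 32);
            $pos{$docID} ||= {};
            my $did_ref = $pos{$docID};
            $did_ref->{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
        }
        return \%pos;
    },
);

# Sanity check: every variant must produce the same structure.
my %flat = map { $_ => flatten($build{$_}->()) } sort keys %build;
print "$_ matches block\n" for grep { $flat{$_} eq $flat{block} } qw(modifier hoisted);

# Timing is opt-in so that a plain run stays quick.
cmpthese(-1, \%build) if $ENV{RUN_BENCH};
```

Running it with RUN_BENCH=1 set prints a comparison table; the numbers will vary by machine and say nothing certain about the full 500000-document case.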

    This is without really understanding what's going on, though. I wouldn't be surprised if what you're doing would benefit from just using a better algorithm.

      Yes, this loop is the hot spot. Perhaps changing my algorithm would be the best way.