
Re^2: Fast data structure..!!!

by MimisIVI (Acolyte)
on Apr 15, 2008 at 15:36 UTC (#680542)

in reply to Re: Fast data structure..!!!
in thread Fast data structure..!!!

Hi Guys,

Here is the real code.

I've got a bit string in which I saved positive integers (for example, 2 million of them), 4 bytes each. The code looks like this:

my $docID = 0;
my $tf    = 0;
my $index = 0;
my $Aa    = time();
for (1 .. 2000000) {
    $docID = vec($dp, $index++, 32);
    $tf    = vec($dp, $index++, 32);    # TERM FREQUENCY
    $vec{$docID} = $tf;                 # save data
    for (1 .. $tf) {
        my $poss = vec($dp, $index++, 32);
        $pos{$docID}{$poss} = \$order;  # save data
    }
}
print "unpack vector in \t", time() - $Aa, " secs...\n";

That's what I want to speed up. vec is really fast at reading the bit string, but saving the data makes the performance slow. :(

Any suggestions?

Replies are listed 'Best First'.
Re^3: Fast data structure..!!!
by moritz (Cardinal) on Apr 15, 2008 at 15:53 UTC

    Get a faster computer. On my notebook (about 1 year old, 2 GHz CPU and enough RAM) this takes about 1.6 seconds. I don't say that to boast, but rather to tell you that your hardware isn't up to date.

    And are you sure that you actually access all items of that data structure? Right now you just build it up, but don't use it, so there's no way for us to tell.

      It may run that fast because the OP didn't supply a value for $dp. I pasted the code into a file that has use strict at the top, and it blew up immediately.

        Well, he wrote that it's the real code. If it's not, I can't help. Sorry.
Re^3: Fast data structure..!!!
by dvryaboy (Sexton) on Apr 15, 2008 at 16:21 UTC
    Assuming that $pos and $poss are two different things, and $pos is defined somewhere outside of the code snippet, you can save a bundle by doing something like this:
    my $docpos = $pos{$docID} ||= {};   # make sure the inner hash exists before caching the reference
    for (1 .. $tf) {
        $docpos->{ vec($dp, $index++, 32) } = \$order;
    }
    It may seem trivial, but that's about 2000000*avg_tf dereferences.
    Getting rid of an unnecessary "my" variable should also pick up a few extra cycles.
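
One detail worth checking when hoisting the lookup: if `$pos{$docID}` does not exist yet, copying it gives `undef`, and a hash autovivified through that copy is not attached to `%pos`. The sketch below is only an illustration under stated assumptions. The tiny `$dp`, the sub names `build_nested`, `build_hoisted`, and `flatten`, and the sample positions are all made up for the example. It uses `||= {}` to create the inner hash first, then shows that the hoisted version builds the same structure as the original nested lookup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $dp: 3 documents, each with tf = 3
# and positions 11, 12, 13 (docID, tf, then tf positions).
my $dp = '';
my $n  = 0;
for my $doc (0 .. 2) {
    vec($dp, $n++, 32) = $doc;                 # docID
    vec($dp, $n++, 32) = 3;                    # tf
    vec($dp, $n++, 32) = 10 + $_ for 1 .. 3;   # positions
}
my $order = 1;

# Original style: look up $pos{$docID} on every inner iteration.
sub build_nested {
    my ($bits, $docs) = @_;
    my %pos;
    my $index = 0;
    for (1 .. $docs) {
        my $docID = vec($bits, $index++, 32);
        my $tf    = vec($bits, $index++, 32);
        for (1 .. $tf) {
            $pos{$docID}{ vec($bits, $index++, 32) } = \$order;
        }
    }
    return \%pos;
}

# Hoisted style: fetch (and if needed create) the inner hash once per document.
sub build_hoisted {
    my ($bits, $docs) = @_;
    my %pos;
    my $index = 0;
    for (1 .. $docs) {
        my $docID  = vec($bits, $index++, 32);
        my $tf     = vec($bits, $index++, 32);
        my $docpos = $pos{$docID} ||= {};
        $docpos->{ vec($bits, $index++, 32) } = \$order for 1 .. $tf;
    }
    return \%pos;
}

# Render a two-level hash as "doc:pos,doc:pos,..." for easy comparison.
sub flatten {
    my ($h) = @_;
    return join ',', map {
        my $d = $_;
        map { "$d:$_" } sort { $a <=> $b } keys %{ $h->{$d} };
    } sort { $a <=> $b } keys %$h;
}

print flatten(build_nested($dp, 3)),  "\n";
print flatten(build_hoisted($dp, 3)), "\n";
```

The two versions differ in one edge case: for a document with `tf` of 0, the hoisted version leaves an empty inner hash in `%pos`, while the nested version creates no entry at all.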
Re^3: Fast data structure..!!!
by kyle (Abbot) on Apr 15, 2008 at 16:26 UTC

    I'd like a sample value for $dp. A Data::Dumper representation of it or some code to cook up a reasonable facsimile would work. When I run this on my computer, it doesn't do anything interesting until I give it a value for that, but what particular value it has could have a big effect on performance.


      Here is the whole code. I should have given it earlier.

      use strict;
      use Devel::Size qw(size);

      my $df = 500000;
      my $tf = 3;
      my $wektor = '';
      my $packed = '';
      my $nr = 0;
      for (0 .. $df) {
          vec($wektor, $nr++, 32) = $_;     # DOC ID
          vec($wektor, $nr++, 32) = $tf;    # TF
          for (0 .. $tf) {
              vec($wektor, $nr++, 32) = $_ + 10;   # POSITIONS
          }
      }
      print "Vector's size: " . size($wektor) . " bytes\n";

      ###################### UNPACK VECTOR2
      my %vec;
      my %pos;
      my $docID = 0;
      my $tf = 0;
      my $index = 0;
      my $order = 1;
      my $Aa = time();
      for (0 .. $df) {
          $docID = vec($wektor, $index++, 32);
          $tf    = vec($wektor, $index++, 32);
          $vec{$docID} = $tf;
          # print "Doc id: $docID\ttf: $tf\n";
          for (0 .. $tf) {
              my $last = vec($wektor, $index++, 32);
              $pos{$docID}{$last} = $order;
          }
      }
      print "unpack vector in \t", time() - $Aa, " secs..\n";

        I still don't get why it takes 15s for you to execute that code; it runs in 3s on mine. Did your machine swap to disk or something? Or is that an ancient machine, or a debugging perl?

        A few thoughts anyway:

        1. Try to pack the vector. You're only accessing it linearly anyway

        2. Try not to use that vector at all. You can populate your %pos hash in the first place instead

        3. Store the data structure on disk once, for example in a BerkeleyDB. It seems to be constant, so you don't actually need to calculate it every time your program starts. Or store $wektor in a plain binary file after you created it, and in subsequent runs only retrieve that from disk and generate your hash from it.
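
A rough sketch of how suggestions 1 and 3 might fit together, under some assumptions. Since `vec` with 32 bits stores each value as a big-endian 32-bit integer, a single `unpack 'N*'` should decode the whole string in one call. The small sizes, the temporary file, and the variable names `$loaded`, `@ints`, and `$n_tf` are hypothetical choices for the example, not something taken from the posts above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small vector in the thread's layout: docID, tf, then tf+1 positions.
my ($df, $tf) = (4, 3);    # hypothetical sizes, far smaller than 500000
my $wektor = '';
my $nr     = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;
    vec($wektor, $nr++, 32) = $tf;
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;
}

# Suggestion 3, plain-file variant: write the raw bytes once...
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $wektor;
close $fh or die "close: $!";

# ...and on later runs just slurp them back.
open my $in, '<:raw', $path or die "open $path: $!";
my $loaded = do { local $/; <$in> };
close $in;

# Suggestion 1: decode every 32-bit big-endian integer in one call
# instead of one vec() call per value.
my @ints = unpack 'N*', $loaded;

my (%vec, %pos);
my $order = 1;
my $i     = 0;
while ($i < @ints) {
    my $docID = $ints[$i++];
    my $n_tf  = $ints[$i++];
    $vec{$docID} = $n_tf;
    my $docpos = $pos{$docID} ||= {};
    $docpos->{ $ints[$i++] } = $order for 0 .. $n_tf;
}

printf "%d docs, %d positions for doc 0\n",
    scalar(keys %vec), scalar(keys %{ $pos{0} });
```

The trade-off is memory: `unpack 'N*'` materialises every integer as a Perl scalar at once, so for very large vectors it may be worth decoding in chunks instead. Whether it beats `vec` on a given machine is something to measure rather than assume.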

        I ran this under Devel::NYTProf and found that this loop is the hot spot:

        for (0 .. $tf) {
            my $last = vec($wektor, $index++, 32);
            $pos{$docID}{$last} = $order;
        }

        I changed it to this, following a suggestion from dvryaboy and also getting rid of the block completely:

        $pos{$docID}{vec($wektor, $index++, 32)}=$order for 0 .. $tf;

        That got somewhat faster.

        I tried taking out the constant reference to $pos{$docID} like this:

        $pos{$docID} ||= {};
        my $did_ref = $pos{$docID};
        $did_ref->{vec($wektor, $index++, 32)} = $order for 0 .. $tf;

        ...but that didn't make much difference to the hot loop, and it was more expensive outside the loop than the savings it got inside.

        This is without really understanding what's going on, though. I wouldn't be surprised if what you're doing would benefit from just using a better algorithm.
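
For anyone who wants to repeat the comparison of the three loop variants, here is a minimal sketch using the core Benchmark module. The sub names `block_loop`, `modifier_loop`, and `hoisted_ref` and the reduced size of 2000 documents are assumptions made for the example. The results will vary by machine and perl build, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical, much smaller data set than the thread's 500000 documents.
my ($df, $tf) = (2000, 3);
my $wektor = '';
my $nr     = 0;
for my $doc (0 .. $df) {
    vec($wektor, $nr++, 32) = $doc;
    vec($wektor, $nr++, 32) = $tf;
    vec($wektor, $nr++, 32) = $_ + 10 for 0 .. $tf;
}
my $order = 1;

# Variant 1: the original inner block with a "my" variable.
sub block_loop {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        for (0 .. $n) {
            my $last = vec($wektor, $index++, 32);
            $pos{$docID}{$last} = $order;
        }
    }
    return \%pos;
}

# Variant 2: statement-modifier form, no block and no temporary.
sub modifier_loop {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        $pos{$docID}{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
    }
    return \%pos;
}

# Variant 3: hoist the reference to the inner hash out of the loop.
sub hoisted_ref {
    my %pos;
    my $index = 0;
    for (0 .. $df) {
        my $docID = vec($wektor, $index++, 32);
        my $n     = vec($wektor, $index++, 32);
        $pos{$docID} ||= {};
        my $did_ref = $pos{$docID};
        $did_ref->{ vec($wektor, $index++, 32) } = $order for 0 .. $n;
    }
    return \%pos;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    block    => \&block_loop,
    modifier => \&modifier_loop,
    hoisted  => \&hoisted_ref,
});
```

Because `time()` only has one-second resolution, a harness like this is likely to give a clearer picture of small differences than the timing code in the original post.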
