Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Do you really want to use an array there?

by deprecated (Priest)
on Jun 12, 2004 at 05:19 UTC ( [id://363563]=perlmeditation: print w/replies, xml ) Need Help??

Being a perl hacker, you may twitch and feel uncomfortable reading this node. It really is worth your time, so try to sit through it.

I like to start with code, so let's have a look at this:

package Wektor; sub mkwek { my ($cls, $ary) = (@_); my $wektor = ''; my $iter = 0; for (@$ary) { vec ($wektor, $iter++, 8) = $_; } return bless \$wektor; } sub windex { my ($wektor, $windex) = (@_); return vec ($$wektor, $windex, 8); } package Widgin; sub mkwig { my ($cls, $ary) = (@_); my $widgin = $ary; return bless $ary; } sub windex { my ($widgin, $windex) = (@_); return $widgin->[$windex] || 0; }
Take a moment, if you dare, to look over the perldoc for vec. I think, like most perl hackers, that looking at that particular page, you'll see some perl 4 ism's and some particularly unuseful code, and think, gee, this is right up there with Larry using ' instead of ::. I don't need that garbage. Please, read the above again, and see how simple vec is to use.

Recently, I've been working with obscenely large arrays. One of the problems I've bumped into is one of our user-developers has said that they have occasionally an array so large that they cannot fit two copies of it into memory. Additionally, any time you iterate over the array, you find that passing references to the array and accessing elements individually is also time consuming.

Enter the vector.

It is helpful to think of a vector as a joined array (on ''). Essentially, it is as if you had done something like:

# this is oversimplified. a lot. my $vector = join '', map { pack 'N', $_ } @ary
Wouldn't you rather use substr() (note: substr is fast!!)to refer to individual bytes of that "array" than to pass references around and deref them, or point at individual elements of it? String functions can be a lot faster than list operations. In fact, this is so important that many computers now ship with "vector units" (Crays and Macs among them). vec() is essentially a pretty wrapper around the C vector functions (with a little endian assumption. If you've compiled your perl, say, -faltivec, you'll have access to those units from within perl).

Vectors also allow you to perform bitwise operations on entire vectors at once, rather than walking the array (see vec). That is substantially faster.

If you don't think this is a substantial improvement, think about asking perl to my @foo = @{ $bar } and how much cpu that takes to do on a 30 million element array.

Providing benchmarks is kind of silly. I have a bunch of machines here with vector processors. My results will be different than yours. I would hope, though, by looking at the above, you can see that it might be worthwhile to actually try using a vector. In many cases, it is more efficient both cpu and memory-wise, than an array. And not much harder to use.

My own gains with vectors have been between 5 and 20 percent memory savings, and 30 to 500 percent increases in speed. My goal is not to convince people vectors are better -- they are sometimes, and not others -- but rather to convince people to try to use them. I rarely ever find a perl hacker who knows what a vector is, or why it might be much better to use a vector than an array for their application. vec is a perl builtin. Don't be afraid.

happy vectoring.

dep.

ps: if you find that you really find this attractive you may want to check out PDL and Bit::Vector.

--
Tilly is my hero.

Replies are listed 'Best First'.
Re: Do you really want to use an array there?
by gmpassos (Priest) on Jun 12, 2004 at 11:35 UTC
    First of all, very interesting text. I never have looked the vec() function of Perl very well, but can be very important for a big amount of data.

    Well, since vec is a CORE function, that work directly over the bits of the content of a string, it will much more faster, but don't forget that a vector is fixed, each element have the same size, is just an array of chars, and the Perl ARRAY is an array of SCALARS, that are much more complex.

    Graciliano M. P.
    "Creativity is the expression of the liberty".

      There's nothing in principle forcing a vector to have elements all of the same size, you just have to be a bit careful with the indexing if they are not.

      What I do occasionally find frustrating is the lack of certain useful primitives when dealing with vectors, particularly vectors of booleans, since the resulting code can often become somewhat ugly and non-intuitive. Consider for example testing whether all(a_i & b_i) == 0:

      # in an integer unless ($a & $b) { ... } # in a vector unless (($a & $b) =~ /[^\0]/) { ... }

      I look forward to the promise of perl6 - where you could have a vector class, for example, "just like a string" except that it has a different concept of which strings are true. I think such a class would be difficult to achieve cleanly in perl5 without going down the path of tying/overloading, and losing most of the efficiency gains that caused you to turn to vectors in the first place.

      Hugo

Re: Do you really want to use an array there?
by Zaxo (Archbishop) on Jun 12, 2004 at 06:44 UTC

    I'll vouch for PDL in that sort of situation.

    After Compline,
    Zaxo

Re: Do you really want to use an array there?
by demerphq (Chancellor) on Jun 14, 2004 at 13:07 UTC

    Its worth keeping mind Tie::VecArray and Tie::Array::PackedC although as hv says adding the tie interface probably slows things down too much for your taste.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


Re: Do you really want to use an array there?
by MimisIVI (Acolyte) on Apr 14, 2008 at 01:31 UTC
    That was exactly what i was looking for 2 weeks!!!.... I couldnt imagine how fast can be this builtin function...

    Chears!!!

      MimisIVI, if you mean vec, Tachyon-II pointed you at vec 5 days ago in Re^6: Compress positive integers. It's right there in the last line. Maybe you missed it?

      Anyhow, here is Elias Gamma coded in pure Perl:

      sub log2{ int( ( log( $_[0] ) + 1e-15 ) / log( 2 ) ) } sub PP_EliasPack { my $packed = ''; my $out = 0; for my $num ( @_ ) { my $len = log2( $num ); $out += $len; vec( $packed, $out++, 1 ) = 1; vec( $packed, $out++, 1 ) = ( $num & ( 1 << $_ ) ) ? 1 : 0 for 0 .. $len - 1; } return $packed; } sub PP_EliasUnpack { my $packed = shift; my $bits = length( $packed ) * 8; my $in = 0; my @unpacked; while( $in < $bits ) { my( $len, $num ) = ( 0 ) x 2; $len++ while $in < $bits && vec( $packed, $in++, 1 ) == 0; last if $in == $bits; vec( $packed, $in++, 1 ) and $num |= ( 1 << $_ ) for 0 .. $len + - 1; $num |= ( 1 << $len ); push @unpacked, $num; } return @unpacked; }

      And for reference, here is the results of my benchmark with that included:

      C:\test>678848 Run with 15000 unique words in 1000 documents (Ave: 554 words/doc) ASCII uses 4755336 bytes W-BER uses 3196819 bytes Binary uses 3279128 bytes Elias uses 3980063 bytes PP_Elias uses 3980063 bytes 1 trial of Packing ascii (10.203s total) 1 trial of Unpacking ascii (3.159s total) 1 trial of Packing W-BER (18.159s total) 1 trial of Unpacking W-BER (1.516s total) 1 trial of Packing binary (9.910s total) 1 trial of Unpacking binary (1.455s total) 1 trial of Packing Elias (13.613s total) 1 trial of Unpacking Elias (2.739s total) 1 trial of Packing PP_Elias (31.128s total) 1 trial of Unpacking PP_Elias (18.094s total)

      Notice that PP_Elias achieves identical compression to Elias in C (as you'd expect :), but that it's twice as slow at packing, and nearly 10 times slower when unpacking.

      If you want to continue with using Elias, and can't build the C version yourself, I could let you have it pre-built (for win32), but I have to wonder why you would when W-BER is faster and achieves better compression?

      Also, I wonder if you saw Re^8: Byte allign compression in Perl.. where I demonstrated that you can have the DB do the selection for you using the schema I suggested way back when, in 0.312 of a second? For the record, I found a small optimisation in the schema that reduce that by a factor of 10, to 31 milliseconds.

      So the DB does the selection, sends you just the data you need to do your proximity calculations, and does it all faster than you could pack a single integer. Interested?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        No my friend , i didnt missed the Tachyon comment about the vec function..but the deprecated's code help me to understand what is going on with this function..

        For the SQL command that you propose i want to ask you for which server is appropriate because on MySQL there is no command for the intersection( i tried some inner join but the perfomance was very very very slow for 1GB dataset (250000 pages,Average document length : 602 words Number of unique words: 907806)...

        About your code for the Elias technique i have to say that it is 3 times faster than mine (Thanks one more time!!!)...The only reason why i want to use compression in my index is for perfomance reasons..that was my thougths untill now but it seems that i wasnt right..:(..since with the vec function i can decode 3000000 doc ids in 2 seconds and 10 milion in 6 secs!!!) as the below code shows..

        use strict; use Devel::Size qw(size); my $wektor = ''; for(0 .. 10000000) { vec ($wektor, $_, 32) = $_; } print "Vector's size: " . size( $wektor ) . " bytes\n"; my @vec; my $Aa=time(); for(0 .. 10000000) { push @vec,vec ($wektor, $_, 32); } print "unpack vector in \t",time()-$Aa," secs...(oh Yeah!!!)\n";
        Size of vector is 40000032 bytes unpack vector in 6 secs...(oh Yeah!!!)
        In the above code i used 4 bytes for each doc id. I tried with a vector where i save the same number of doc ids but with only 1 byte for each doc id ( i saved only small numbers) and the time was completely the same..I cant understand why..DOes anyone??

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://363563]
Approved by davido
Front-paged by cLive ;-)
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2024-03-19 02:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found