PerlMonks  

Processing values of a piddle (PDL) speedup using 'at' vs. 'index'

by kevbot (Hermit)
on Jul 15, 2012 at 06:13 UTC ( #981884=perlmeditation )

Hello Monks,

I have been using PDL in some data processing scripts that I use at work. One of my modules takes float values from a pdl and processes them before printing them to a text file. Although a text file is very inefficient for the type of data I am handling, I need to output a text file for use with some legacy (proprietary) software.

While profiling the code that writes the text file (thank you Devel::NYTProf), I discovered that there was a lot of time spent in various "PDL" subroutines. It occurred to me that I had made a bad assumption (or perhaps a case of expecting PDL to DWIM but not matching my expectations). I thought this issue might be encountered by others; hence, I'm writing this node.

I have been happy to see that PDL has been getting some love recently (new version released in May 2012). Also, Joel Berger has been making some blog posts recently regarding using Perl and PDL in scientific applications.

OK, back to the problem at hand. I had assumed (incorrectly) that when I placed a single value from a 1-dimensional piddle into a variable, the resulting value would be a perl scalar. For example,

my $val = $my_pdl->index($i);

In reality, $val is a piddle (i.e. ref($val) returns 'PDL'). My code still ran and gave the expected results; however, it ran more slowly than the version using at (see below).
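A minimal sketch of the difference, assuming a small made-up piddle ($my_pdl here is built with sequence purely for illustration): index hands back another PDL object, while at hands back a plain Perl scalar.

```perl
use strict;
use warnings;
use PDL;

# Illustrative data only: [0 1 2 3 4] as floats.
my $my_pdl = sequence( float, 5 );

my $via_index = $my_pdl->index(2);    # still a PDL object
my $via_at    = $my_pdl->at(2);       # a plain Perl scalar

# ref() returns the class name for an object and '' for a non-reference.
print "index: ref = '", ref($via_index), "'\n";
print "at:    ref = '", ref($via_at),    "'\n";
```

Because $via_index is an object, every later comparison or arithmetic operation on it goes through PDL's overloaded operators, which is a plausible reason for the slowdown described here.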

In my use case, I take a pre-existing pdl that contains hundreds of thousands (perhaps millions) of float values and check each element one-by-one (e.g. to see if it equals a special value, etc.) and then change the resulting value before I put it back into a perl array, which I later dump into a text file. It turns out that the code runs significantly faster if I simply change the code to,

my $val = $my_pdl->at($i);
According to the PDL docs, the at method returns a single value inside a piddle as a perl scalar (whereas the index method returns a pdl). Perhaps this is a long write-up for a simple case of confusion, but the speed difference was quite noticeable. The code below gives a benchmark of iterating over all elements of a 10,000 element pdl using both methods (and doing nothing else).

#!/usr/bin/env perl
use strict;
use warnings;
use PDL;
use Benchmark qw(:all);

cmpthese(
    100,
    {
        'pdl_values'  => sub {&pdl_value},
        'perl_values' => sub {&perl_scalar_value},
    }
);
exit;

sub pdl_value {
    my $pdl   = ones( float, 10000 );
    my $nelem = nelem($pdl);
    for ( my $i = 0; $i < $nelem; ++$i ) {
        my $val = $pdl->index($i);
        #Do something
    }
    return;
}

sub perl_scalar_value {
    my $pdl   = ones( float, 10000 );
    my $nelem = nelem($pdl);
    for ( my $i = 0; $i < $nelem; ++$i ) {
        my $val = $pdl->at($i);
        #Do something
    }
    return;
}

              Rate  pdl_values perl_values
pdl_values  16.8/s          --        -74%
perl_values 64.1/s        282%          --

When I add the code to put the processed values into a perl array and to check for a "special value" the difference is even larger.

#!/usr/bin/env perl
use strict;
use warnings;
use PDL;
use Benchmark qw(:all);

cmpthese(
    100,
    {
        'pdl_values'  => sub {&pdl_value},
        'perl_values' => sub {&perl_scalar_value},
    }
);
exit;

sub pdl_value {
    my $pdl = ones( float, 10000 );
    $pdl->index(0) .= 999;
    my $nelem         = nelem($pdl);
    my $special_value = 999;
    my @values;
    for ( my $i = 0; $i < $nelem; ++$i ) {
        my $val = $pdl->index($i);
        if ( $val == $special_value ) {
            $val = undef;
        }
        push @values, $val;
    }
    #Do something with @values
    return;
}

sub perl_scalar_value {
    my $pdl = ones( float, 10000 );
    $pdl->index(0) .= 999;
    my $nelem         = nelem($pdl);
    my $special_value = 999;
    my @values;
    for ( my $i = 0; $i < $nelem; ++$i ) {
        my $val = $pdl->at($i);
        if ( $val == $special_value ) {
            $val = undef;
        }
        push @values, $val;
    }
    #Do something with @values
    return;
}

              Rate  pdl_values perl_values
pdl_values  1.51/s          --        -97%
perl_values 52.1/s       3347%          --

I discovered the at method by reading the PDL Book. I highly recommend this book for those that are just getting started with PDL. It can be a little difficult to find what you are looking for in the PDL docs (compared to the book). The book actually warns against the use of at, since it is "slow"; however, it is a significant improvement in this case. Perhaps there is yet another way that would be faster. Here's the relevant quote from the PDL Book:

Conversion to Perl types: at and list

You can get a PDL scalar out into the Perl world with at, which requires the index of the scalar to pull out:

pdl> $a = xvals(5)*2;   # $a is a PDL
pdl> $a4 = $a->at(4);   # $a4 is a perl scalar

You can also export a whole PDL with list:

pdl> @a = $a->list;
pdl> for($a->list) { print $_, "-"; }
0-2-4-6-8-

Be careful with at, as you almost never want to use it - it is tedious for anything nontrivial, and extremely slow! Particularly if you find yourself placing an at call inside a for loop, you should probably stop and think about how to use threading for your problem - see below.

Well, I'm off to learn a little more about PDL.

UPDATE: I removed an extraneous line of code from the first code example (it didn't affect the benchmark results).

Re: Processing values of a piddle (PDL) speedup using 'at' vs. 'index'
by BrowserUk (Pope) on Jul 15, 2012 at 07:01 UTC

    My (very limited) experience of working with PDL suggests that if you need to manipulate the values in a piddle individually, rather than applying each operation to the entire piddle as a whole, then you should export the piddle, en-masse, to a perl array first. It saves huge amounts of time.

    And if part of the reason you are using piddles is to save memory, and that effectively prevents you from exporting the whole piddle to a perl array, then export it in large chunks 1000 or 10,000 at a time and overwrite the array with the next chunk.
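A rough sketch of the chunked export described above, assuming slice with a "start:end" range string and list for the export. The piddle contents and $chunk_size are made up for illustration; in practice the chunk would be 1000 or 10,000 elements.

```perl
use strict;
use warnings;
use PDL;

my $pdl        = sequence( float, 25 );   # stand-in for a much larger piddle
my $chunk_size = 10;                      # hypothetical; tune to available memory
my $nelem      = $pdl->nelem;
my $seen       = 0;

for ( my $start = 0; $start < $nelem; $start += $chunk_size ) {
    my $end = $start + $chunk_size - 1;
    $end = $nelem - 1 if $end > $nelem - 1;    # last chunk may be short

    # One PDL call per chunk instead of one per element; each pass
    # gets a fresh @chunk, so only one chunk is held as a Perl array.
    my @chunk = $pdl->slice("$start:$end")->list;

    $seen += scalar @chunk;                    # per-element Perl work goes here
}

print "processed $seen values\n";
```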

    Of course, you should make sure that your operation cannot be done using a PDL function before resorting to exporting.

    One example was finding the max and min values of a million doubles. Accessing the values individually from Perl and comparing was more than an order of magnitude slower than exporting the whole lot and performing the operation in Perl; but using the minmax function was (from memory, I don't currently have a working PDL installation) 2 orders of magnitude faster than exporting.
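As a sketch of the two faster approaches in that comparison, assuming PDL's minmax method and List::Util's min and max for the Perl side. The five values are illustrative only; no timings are claimed here.

```perl
use strict;
use warnings;
use PDL;
use List::Util qw(min max);

# Stand-in for a million doubles.
my $pdl = pdl( double, [ 3, -7.5, 12, 0.25, 9 ] );

# Whole-piddle reduction done inside PDL's compiled code.
my ( $pdl_min, $pdl_max ) = $pdl->minmax;

# Export once, then reduce in Perl.
my @values = $pdl->list;
my ( $perl_min, $perl_max ) = ( min(@values), max(@values) );

print "PDL:  min=$pdl_min max=$pdl_max\n";
print "Perl: min=$perl_min max=$perl_max\n";
```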

    The biggest problem was actually finding the appropriate functions, which are often barely mentioned and weirdly named.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thanks for the reply.

      I think your comment,

      My (very limited) experience of working with PDL suggests that if you need to manipulate the values in a piddle individually, rather than applying each operation to the entire piddle as a whole, then you should export the piddle, en-masse, to a perl array first. It saves huge amounts of time,
      and this excerpt from the PDL Book,
      Be careful with at, as you almost never want to use it - it is tedious for anything nontrivial, and extremely slow! Particularly if you find yourself placing an at call inside a for loop, you should probably stop and think about how to use threading for your problem - see below.
      are getting at the same idea. That is, one should avoid processing values in a piddle individually and take advantage of PDL's various commands for manipulating whole vectors or matrices.

      The code example that I gave is a bit simplified compared to my actual use case. In my case, I iterate through a few thousand objects (these objects have attributes that are 1-dimensional pdls). The values from most (or all) of these pdls need to be exported into my text file. I take values from these pdls, check them to see if they are a special value, change the value if needed, and then put them into a perl array. The perl array is eventually written to a text file. Your comments have me thinking that another possibility might be to do something like this:

      • create an n-dimensional pdl (i.e. a pdl of 1-dimensional pdls)
      • then transform and/or take slices of the n-dimensional pdl for output to my text file (using PDL commands)
      • replace the special values (using PDL commands...I am unsure which ones would apply here)
      • create the text file
      It's possible that this type of approach might be faster; however, the current approach using at is working plenty fast for me at the moment (other parts of my code are now the bottleneck).
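One possible shape for the steps listed above, offered only as a sketch: it assumes the 1-dimensional pdls all have the same length, uses cat to stack them and a where assignment to replace the special values. The data, $special and $replacement are hypothetical.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical data: three equal-length 1-D piddles; 999 marks a special value.
my @rows        = ( pdl( 1, 999, 3 ), pdl( 4, 5, 999 ), pdl( 7, 8, 9 ) );
my $special     = 999;
my $replacement = -1;    # assumed stand-in for the "changed" value

# Stack the piddles into one 2-D piddle, one row per input piddle.
my $stack = cat(@rows);

# where() returns a view onto the matching elements, so assigning
# with .= updates $stack itself.
$stack->where( $stack == $special ) .= $replacement;

# Export one row at a time, ready to be written to a text file.
for my $i ( 0 .. $stack->dim(1) - 1 ) {
    print join( ' ', $stack->slice(":,($i)")->list ), "\n";
}
```

Whether this beats the at loop for a few thousand objects would need to be profiled; the sketch only shows that the replacement step can be done without touching elements one by one.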
        I take values from these pdls, check them to see if they are a special value, change the value if needed, and then put them into a perl array.

        I would say that is entirely the wrong way to do it.

        You need to export the piddle in order to print it. So export it first; then search that for your special values; and only access the piddle elements individually if you find the special value in the exported array -- just to update it with the new value.
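A minimal sketch of the export-first approach, assuming list for the export and a plain map for the search. The data mirrors the benchmark in the original post; printing undef as an empty field is an assumption about the output format.

```perl
use strict;
use warnings;
use PDL;

my $pdl = ones( float, 10 );
$pdl->index(0) .= 999;           # plant one special value, as in the benchmark
my $special_value = 999;

# One export for the whole piddle, then ordinary Perl from here on.
my @values = map { $_ == $special_value ? undef : $_ } $pdl->list;

# Assumed output format: undef becomes an empty field.
print join( ',', map { defined $_ ? $_ : '' } @values ), "\n";
```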



        First off: the best place to ask questions about PDL is on the perldl mailing list and the central site for info on all things PDL is http://pdl.perl.org.

        Second, you are using the right strategy here. The key to remember is that calculations with PDL objects (called "piddles") are performed with special C code and are very fast. If your work can be done on the piddle data directly, you will almost always see the best performance.

        In this case, I would suggest using PDL operations to find all the "special values" in the piddle and mark them as BAD. The list() method (or the newer unpdl() method) will then convert the piddle back to a perl list (or list-of-lists structure), with the special values all now having the value 'BAD'.

        A map can be used to substitute undef if that is needed for your algorithm. NOTE: if you don't need the special value elements at all, it is easy to not include them in the list() output via PDL operations.

        Here is a short session with the PDL shell (pdl2) showing some calculations along these lines:

        pdl> apropos bad   # PDL shells have online help
        PDL::Bad        Module: PDL does process bad values
        PDL::BadValues  Manual: Discussion of bad value support
        badflag         getter/setter for the bad data flag
        badinfo         information on the bad-value support
        ...many more...

        pdl> help isbad
        Module PDL::Bad
          isbad
            Signature: (a(); int [o]b())

            Returns a binary mask indicating which values of the input are bad values

            Returns a 1 if the value is bad, 0 otherwise. Similar to isfinite.

             $a = pdl(1,2,3);
             $a->badflag(1);
             set($a,1,$a->badvalue);
             $b = isbad($a);
             print $b, "\n";
             [0 1 0]

            This method works with input piddles that are bad. The output piddle will never contain bad values, but its bad value flag will be the same as the input piddle's flag.

        pdl> $data = rint(10*random(10))
        pdl> p $data
        [5 9 8 3 5 6 7 7 6 10]
        pdl> $special = 7
        pdl> p $data->setvaltobad($special)
        [5 9 8 3 5 6 BAD BAD 6 10]
        pdl> p $data->setvaltobad($special)->list
        5 9 8 3 5 6 BAD BAD 6 10
        pdl> @pdata = $data->setvaltobad($special)->list
        pdl> p "@pdata"
        5 9 8 3 5 6 BAD BAD 6 10
        pdl> foreach (@pdata) { $_ = undef if $_ eq 'BAD' }
        pdl> p "@pdata"
        Use of uninitialized value $pdata[6] ...
        Use of uninitialized value $pdata[7] ...
        5 9 8 3 5 6   6 10
        pdl> p which $data==$special   # calc indices of "special vals"
        [6 7]
        pdl> @ordinary = $data->where($data != $special)
        pdl> p "@ordinary"   # or output just ordinary values
        [5 9 8 3 5 6 6 10]
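The same idea as a standalone script, as a sketch only. It reuses the values from the session above (fixed rather than random so the result is repeatable) and assumes, as the session shows, that list exports bad elements as the string 'BAD'.

```perl
use strict;
use warnings;
use PDL;

# Fixed copy of the session's data so the run is repeatable.
my $data    = pdl( 5, 9, 8, 3, 5, 6, 7, 7, 6, 10 );
my $special = 7;

# setvaltobad returns a copy with the special values flagged BAD;
# list then exports those elements as the string 'BAD', which the
# map turns into undef.
my @pdata = map { $_ eq 'BAD' ? undef : $_ }
    $data->setvaltobad($special)->list;

my $n_undef = grep { !defined $_ } @pdata;
print "$n_undef of ", scalar(@pdata), " values replaced by undef\n";
```

Doing the substitution inside the map avoids the "uninitialized value" warnings seen in the session, since the undef values are never interpolated into a string.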

Node Type: perlmeditation [id://981884]
Approved by moritz
Front-paged by moritz