http://www.perlmonks.org?node_id=791611

jpl has asked for the wisdom of the Perl Monks concerning the following question:

Devel::NYTProf has implicated an XS routine I wrote as a major time sink, so I'd like to see some profiling results for the XS code. I built a perl.gprof, and it yields detailed information about perl itself, but not a peep about the xsub, even when I compiled it with -pg. I also tried the LD_PROFILE ld.so environment variable with non-pg versions of perl and the xsub, but that's not producing a gmon.out to analyze, possibly because of the late binding of the xsub.

All this is on an Ubuntu 9.04 x86-64 system, perl 5.10.1, but I have access to other architectures, if any monk has managed to divine a means of coaxing profiling information out of an xsub.

Replies are listed 'Best First'.
Re: profiling XS routines
by BrowserUk (Patriarch) on Aug 27, 2009 at 17:40 UTC

    One thought. Does your current profiling show that you are spending an excessive amount of time in each call to the XS sub? Or is the time accumulated over many, many relatively quick calls?

    The point being that to make really effective use of XS routines for speeding things up, each call to the XS routine should do as much as possible. This because there is a significant overhead involved in the transfers into and out of an XS subroutine.

    For example, PDL does its things very efficiently, and if you have a million numbers to be processed and pass them to PDL as a single lump, the speed up over a pure Perl equivalent is enormous. But, if your requirements mean that you have to process those numbers as 100,000 sets of 10, then your gains will be far less dramatic--maybe even negative. This because the time saved processing just 10 numbers in C, is outweighted by overhead of the transfer into XS and back.

    A second thought. When it comes to profiling C code, calls to system (clock etc.) routines can also have a significant overhead. And if you combine each call to get the time with another to printf the results somewhere, the overhead can completely distort the code-under-test.

    If you are using a Intel processor, there is a single instruction rdtsc which gets you a high resolution timestamp counter with very little overhead. If your compiler allows in-line assembler--or has a _rdtsc() intrinsic--then this can be combined with a small piece of C--as a macro--that fetches the counter and writes it to a buffer along with a line number (from __LINE__) very efficiently. That allows you to tag various points in your C code and gather some stats without overly distorting the timing.

    I don't have an example to hand. The last time I did this was on my old machine, so the source is archived off on a CD somewhere. But it shouldn't be too hard to recreate it if it would be useful?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If you are using a Intel processor, there is a single instruction rdtsc which gets you a high resolution timestamp counter with very little overhead. If your compiler allows in-line assembler--or has a _rdtsc() intrinsic--then this can be combined with a small piece of C--as a macro--that fetches the counter and writes it to a buffer along with a line number (from __LINE__) very efficiently. That allows you to tag various points in your C code and gather some stats without overly distorting the timing.

      Generally, a nice idea, but it does not work unless perl is the only process running, and the OS never schedules another task, and the OS does not run code outside perl (e.g. in kernel mode). rdtsc does not know about processes and threads in the OS. It just counts CPU clock cycles, including those spent in other applications and the kernel.

      Typical x86-based operating systems (except for DOS) run at least in multi-tasking mode, the better ones can also handle more than one user at a time. On ancient (16 bit) versions of Windows, you could effectively disable task switches by disabling interrupts and not spinning the message loop. But modern operating systems including windows can reschedule a process either when it calls an O/S function or when a timer interrupt occurs. And you can't avoid the latter.

      The rdtsc function is also available on non-Intel CPUs, and according to http://en.wikipedia.org/wiki/Rdtsc, both Intel and AMD CPUs have some "interesting" differences between their TSC implementations.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        but it does not work unless perl is the only process running, and the OS never schedules another task, and the OS does not run code outside perl (e.g. in kernel mode). rdtsc does not know about processes and threads in the OS. It just counts CPU clock cycles, including those spent in other applications and the kernel.

        Whilst all of that is true, it is also true of pretty much every other timing method you want to name. There is no practical way of establishing exact timings with modern, deeply piped-lined, hyperthreading, multi-cored cpus running multi-tasking OSs.

        With processors that perform out-of-order execution; and through hyperthreading, can interleave the instructions of 2 or more separate processes through the same core; cache-line refreshes that can stall a process for relative eons; OSs that can preempt and re-prioritise at will; compilers that re-order and/or eliminate code; there is no practical way of exactly instrumenting any single piece of code. Quite literally: No way!.

        And even if there were--say by booting a highly simplified, single-tasking bootstrap kernal that just loads the CuT and times how long it takes to execute--the results would be so unrepresentative of any real-world usage as to be entirely meaningless. That's why when you read these thesis papers that optimise algorithms for maximum cache coherency and make huge claims for efficiencies, you should reach for the salt cellar.

        As with any benchmarking or profiling technique that you apply on any multi-tasking OS; or any multi-cored or multi-threaded processor, the best you can do is:

        1. set the affinity for the process;
        2. (maybe, set the affinities of every other process to a different core.)
        3. (maybe, disable power-saving features)
        4. run multiple iterations of the CuT;
        5. average the results.

        Repeat the above after each code/algorithm/compile option change to obtain roughly comparible timings.

        Mostly, so long as you don't do one set of timings whilst your virus scanner runs and you're listening to videos whilst playing Grand Turismo 5--and the other set whilst not--averaging an appropriate number of runs will give you a reasonable notion of whether changes you've made are moving things in the right direction or not.

        That said, the three single biggest optimisations I've made in my own XS routines are: a) do as much work in each individual call as possible; b) pass any mildy sizable quantity of data by reference rather than through the stack--including largish scalars; c) if the same data is going to be passed to the XS code on multiple occasions, pack it once and pass by reference and only unpack it when you need to.

        Indeed, second to a dear friends "Algorithm, algorithm, algorithm", the next best method of software optimisation is to do as little as possible! In Perl that means: don't (re)allocate memory; don't copy unnecessarially; don't make unnecessary sub/method calls; and don't convert between formats until you have to.

        Those rules-of-thumb can be emphasised by running a benchmark of two essentially similar pieces of code:

        sub sum1{ my @numbers = @_; my $total = 0; for my $number ( @numbers ){ $total = "$total" + "$number" } return $total };; sub sum2{ my $total = 0; $total += $_ for @{ $_[0] }; return $total };; @n = 1 .. 1e6;; cmpthese -1, { A=>q[ my $total=sum1( @n ); ], B=>q[ my $total=sum2( \@ +n ); ] };; Rate A B A 1.03/s -- -90% B 10.8/s 949% --

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: profiling XS routines
by syphilis (Archbishop) on Aug 27, 2009 at 13:02 UTC
    In the absence of something more sophisticated, I'd be grabbing the clock() values at various points, and using that to work out where the time is being spent.

    The function is defined in time.h and (double)clock()/CLOCKS_PER_SEC should give you the time in seconds. As I understand it, clock() itself returns the number of cpu cycles that have been run since the program began execution.

    But, of course, this is perl and there may well be something "more sophisticated".

    Cheers,
    Rob
Re: profiling XS routines
by creamygoodness (Curate) on Aug 27, 2009 at 14:25 UTC
    If I understand correctly, you want to know which C routines within the XSub are taking too long? For that, I use oprofile.