Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Re^3: Are monks hibernating?

by Withigo (Friar)
on Feb 14, 2007 at 19:12 UTC ( #600045=note: print w/replies, xml ) Need Help??

in reply to Re^2: Are monks hibernating?
in thread Are monks hibernating?

I've been regularly reading on here for years now and this is the first time I've heard that the major bottle neck in Perl is that sub/methods calls are slow. It's common knowledge that Perl is significantly slower than straight C, but are you saying that the bulk of that slow down occurs due to sub/method calls being slow?
I've recently been poking around the perlguts, writing some XS/Inline code, and reading Perl's C source, so I'd love to get a peek at the underlying reasons for the slowdown you mention--even just a cursory peek which is probably all I could comprehend at this point.

Replies are listed 'Best First'.
Re^4: Are monks hibernating?
by BrowserUk (Pope) on Feb 17, 2007 at 01:25 UTC
    ...the first time I've heard that the major bottle neck in Perl is that sub/methods calls are slow It's common knowledge that Perl is significantly slower than straight C, but are you saying that the bulk of that slow down occurs due to sub/method calls being slow?

    I never said "the major bottleneck". And of course it is slower than straight C. It could not be otherwise.

    However, here is a partial breakdown of the costs of subroutine and method calls, and various forms of OO:

    Rate OO_7 OO_8 OO_10 OO_9 OO_6 OO_5 OO_4 OO_3 OO_2 OO_1 sub2 s +ub1 inline OO_7 70.4/s -- -0% -43% -44% -69% -70% -81% -88% -89% -91% -92% - +95% -99% OO_8 70.5/s 0% -- -43% -44% -69% -70% -81% -88% -89% -91% -92% - +95% -99% OO_10 124/s 76% 76% -- -1% -45% -46% -66% -79% -81% -84% -85% - +91% -98% OO_9 126/s 79% 79% 1% -- -44% -46% -66% -79% -81% -84% -85% - +91% -98% OO_6 227/s 222% 222% 83% 80% -- -2% -38% -62% -65% -71% -73% - +84% -96% OO_5 232/s 229% 229% 87% 84% 2% -- -37% -61% -65% -70% -73% - +83% -96% OO_4 367/s 421% 420% 195% 191% 62% 58% -- -39% -44% -53% -57% - +74% -93% OO_3 600/s 752% 752% 383% 376% 165% 159% 64% -- -9% -23% -29% - +57% -88% OO_2 656/s 832% 831% 428% 421% 189% 183% 79% 9% -- -16% -23% - +53% -87% OO_1 782/s1010% 1009% 529% 520% 244% 237% 113% 30% 19% -- -8% - +44% -85% sub2 847/s1103% 1102% 582% 572% 273% 266% 131% 41% 29% 8% -- - +39% -84% sub1 1388/s1871% 1869%1017%1001% 512% 499% 278% 131% 112% 78% 64% + -- -73% inline5179/s7252% 7246%4067%4009%2182%2134%1312% 763% 689% 562% 511% 2 +73% --

    Note the big step changes between

    1. inline code and sub1; the code runs nearly 3 times more slowly.

      The 'body' of the code, a single integer increment is not worth a sub, but it is designed to emphasis the cost of the calls, and OO support code.

    2. sub1 & sub2; name the parameters and your code runs 5 times slower.
    3. OO_3 & OO_4; use getters and setters and your code is now 13 times slower.
    4. OO_6 & OO_9; use Best Practices and your code is 40 times slower.
    5. OO_10 & OO_8; admittedly far better than traditional InsideOut objects which result in code that runs 72 times more slowly.

    Of course this is 'worst case'. If your application is IO-bound, or DB-bound, or only creates a few big objects, with a few large methods and only runs for a few seconds; then none of this is of the slightest concern to you.

    But if your applications are running cpu-bound over large volumes of data. If you create lots of small objects with small methods and call them lots of times. If your program runs for a few hours when coded inline, or as a few large subroutines. Then if you properly OO-ize it using Class::Std, it could end up running for a day or more for every hour the original ran.

    Or to look at it another way, if you take an app that uses properly structured OO approach using Class::Std and you bastardize it to use simple subs and inline code blocks. If the original ran for 6 hours, the destructured version might end up taking 10 minutes!

    The point is that even a few percent improvement in the costs of function calling would benefit every Perl application, and could have profound affect on the performance for some types of applications.

    Benchmark code

    Another way of looking at it is to take a heavily recursive function (Ackermann,) and re-write it to use exactly the same algorithm--repeating all the same motions, stacking and unstacking the same amount of data; building and unwinding the same number of code block frames--but do it without calling a subroutine. Ie. A recursive algorithm coded manually using a loop and a stack.

    This shows that using the identical 'recursive' algorithm, but avoiding the sub calls, Ack( 3, 0 ) runs twice as fast. Ack( 3, 8 ) runs 4 times as fast. And by the time you get to Ack( 3, 9 ) it's a whopping 12 1/2 times faster. Note: There are no tricks here. No memoizations, or reduced recursion levels. When the truely recursive code stacks a parameter, this also stack a parameter. When it pops a parameter, this pops one. When it builds a block (sub) frame, this builds a block (while) frame. In as far as it is possible to emulate recursion through iteration this does so.

    It really shouldn't be possible to emulate what Perl does when it calls a subroutine, in unoptimized, Perl code, and beat it quite so emphatically.

    c:\test> Ack(3,0): 5 Ack(3,0): 5 1 trial of Recursive Ack 3,0 (227us total) 1 trial of Iterative Ack 3,0 (111us total) c:\test>b-recursive-nosub 8 Ack(3,8): 2045 Ack(3,8): 2045 1 trial of Recursive Ack 3,8 (13.698s total) 1 trial of Iterative Ack 3,8 (3.328s total) c:\test>b-recursive-nosub 9 Ack(3,9): 4093 Ack(3,9): 4093 1 trial of Recursive Ack 3,9 (162.104s total) 1 trial of Iterative Ack 3,9 (13.234s total)

    The benchmark

    At this point I had hoped to try and produce an annotated assembler level trace of the transitions into and out of a subroutine. I got part way (a couple of hours), through producing that and had finger trouble and managed to throw it all away. Forgive me for not starting over but it is incredibly tedious and requires a lot of concentration, and I'm simply not ready to face that again just yet.

    What I will say is, based on my inspection (on my Win32/AS 811 combination), of the disassembled C that is run when a Perl level subroutine is called, there appears to be considerable scope for optimisation.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://600045]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (5)
As of 2021-01-20 23:44 GMT
Find Nodes?
    Voting Booth?