Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

FWIW, I didn't really set out to tweak your algorithm as such. It just kind of appalled me that for many combinations of total set/desired subset sizes, the (emperically) quickest algorithm available to the Perl programmer proved to be sort'n'slice!

Even with really quite large datasets (10k & 100k), as soon as the required subset moved above around 10%, the overhead of applying even quite a small number of perl opcodes to each value in the total set--unavoidably O(N)--outweighted the costs of performing an O(NlogN) (or whatever) sort algorithm in C.

So I set out to see how close I could get to the C/sort performance in Perl. Starting with your algorithm was the obvious choice as it outformed everything else offered. It's canny use of short-circuiting puts it head and shoulders above the other algorithms. Then it became a case of seeing how few (and the least expensive) opcodes one could use to fulfill it.

There may be a little more that could be trimmed from my reworking of your algorithm, but it rapidly became obvious that the only way to beat the sort'n'slice method would be to move to C--and implementing your algorithm was the obvious choice again.

Once you start comparing like with like (ie. C with C), then the benefits of your, basically O(N), algorithm shine relative to the O(NlogN) of the sort and it wins in most (though not all) cases over the sort as you would expect.

I'm not yet sure why the sort still wins occasionally with large subsets of large total sets--it probably comes down to the cost of extending the Perl stack to accomodate the return list, combined with the need to splice new high values into the return list as they are discovered. Perhaps someone with more XS experience than I could make it work better.

The upshot as far as I am concerned, is a comfirmation of something I've voiced here on a few occasions. Big O notation, useful as it is, doesn't tell the whole story when (some parts of) one algorithm are performed at the C level and (some parts of) the other algorithm are performed at the Perl level. And even when both are done in C, it is very difficult to incorporate the costs of the housekeeping of an algorithm (memory allocation etc.) ) into the overall O-notation costs.

In this case, whilst it ought to be easy to beat sort'n'slice--and is for smallish subsets of smallish total sets--it proved to be a lot harder to achieve that for the general case.

So, whilst O-notation can give very good insights into the potential comparative costs of algorithms, in the end, a good oldfashioned benchmark of the actual implementations is always required to make the final determination.

I've no doubt that were it possible to consider all the parts of the implementation of both algorithms in great detail, it would be possible to make the O-notation reflect the reality of them, but in my experience, that takes much longer and is much harder to do than simply code them and test it.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

In reply to Re^3: Better mousetrap (getting top N values from list X) by BrowserUk
in thread Better mousetrap (getting top N values from list X) by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-19 19:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found