Perl: the Markov chain saw PerlMonks

### Comment on

 Need Help??
No, no, no. Perl's strategy for growing data structures makes the amortized average insert time O(1). So growing a data structure of size n is O(n), not O(n log(n)).

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of them were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is

n + n/2 + n/4 + ... = (1 + 1/2 + 1/4 + ...) * n = 2n = O(n)
thanks to the well-known geometric series.

With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.

The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.

But (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.

But (very small but) hashes still scale better than arrays. So for very big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do that, you will find that merge sort parallelizes better than hash lookups so arrays still win.

Thereby demonstrating that the theoretical big-O is not always the end of the story.

A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)

Update: Lots of minor edits to improve wording.

In reply to Re^2: mathematical proof by tilly
in thread mathematical proof by donkeykong

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?

Create A New User
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (2)
As of 2018-02-22 07:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When it is dark outside I am happiest to see ...

Results (288 votes). Check out past polls.

Notices?