http://www.perlmonks.org?node_id=1015432


in reply to Array vs. Hash for sparsely integer-indexed data

I do think that I understand what the notion was ... it really does come down to just how many values are actually present to be looked through, not what the domain of possible values is. And that, as far as I read, the OP does not say.

Is “a sparsely-populated set” 10,000 entries, 100,000, or 100 million? It makes a gigantic difference. How many actual entries are there (regardless of the domain of possible values they could take), and how will they be retrieved? Is it ever necessary to iterate through them, or is access purely random? How often do insertions happen, and how do they happen?

You do like to pull “everybody else’s idea is fantastically worse than mine” numbers, e.g. “25 days vs. 1.3 seconds,” but, honestly, I don’t think anyone seriously benefits from such outlandish comparisons. Hash tables are easy but can carry a hidden per-entry cost; bit vectors have a size determined more or less by the domain size; and so on. All of these things the rest of us know, too. Most of the key decision factors of the actual problem to be solved are not in the OP, and at this writing puterboy has not clarified.
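
For instance, here is a minimal sketch (illustrative only, not anyone's posted solution) of that hash-vs-bit-vector trade-off, using Perl's built-in vec():

    use strict;
    use warnings;

    # Illustrative sketch only, not anyone's posted solution. A hash
    # costs memory per entry stored; a vec() bit vector costs one bit
    # per *possible* key, no matter how few keys are actually set.

    my @seen  = ( 17, 42_000, 9_999_999 );
    my $probe = 9_999_999;

    my %present;                            # grows with the entry count
    $present{$_} = 1 for @seen;

    my $bits = '';                          # grows with the key domain:
    vec( $bits, $_, 1 ) = 1 for @seen;      # ~1.25 MB for keys < 10 million

    print "hash: present\n"       if $present{$probe};
    print "bit vector: present\n" if vec( $bits, $probe, 1 );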

Replies are listed 'Best First'.
Re^2: Array vs. Hash for sparsely integer-indexed data
by BrowserUk (Patriarch) on Jan 26, 2013 at 01:39 UTC
    e.g. “25 days vs. 1.3 seconds” but, honestly, I don’t think anyone seriously benefits from such outlandish comparisons.

    What is so "outlandish" about pointing out the difference that using an O(N²) proposal, instead of one of the three O(N) solutions already suggested, would make?
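
    For the avoidance of doubt, a minimal sketch of the difference in question (illustrative only; this is neither the OP's data nor the code posted in the thread):

        use strict;
        use warnings;
        use List::Util qw(first);

        # Minimal sketch, not the thread's actual code: N lookups done by
        # scanning a list are O(N) each, hence O(N^2) overall; N hash
        # probes are O(1) each, hence O(N) overall.

        my @pairs = map { [ $_, "value$_" ] } 1 .. 5_000;
        my %hash  = map { $_->[0] => $_->[1] } @pairs;

        for my $p (@pairs) {        # O(N^2): scan the list per lookup
            my $hit = first { $_->[0] == $p->[0] } @pairs;
        }

        for my $p (@pairs) {        # O(N): one hash probe per lookup
            my $hit = $hash{ $p->[0] };
        }

    At large N, that gap is exactly where "days vs. seconds" comes from.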

    It is exactly these kinds of untested postulations you've become famous for. Perhaps it was ...

    Oh! BTW, after further assessment, that 25-day projection proves to be a gross underestimate.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use every day'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Heh ... that calls to mind this XKCD comic.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

Re^2: Array vs. Hash for sparsely integer-indexed data
by Lotus1 (Vicar) on Jan 26, 2013 at 16:10 UTC

    I have benefited from testing different approaches and comparing the results. In this node, for example, I would have assumed that your approach would perform similarly to BrowserUk's, but the tests showed his approach to be ten times faster. Even the people who post broken code contribute somewhat, since their code can be tested and debugged and can serve as an example of common misunderstandings. But that requires that others actually test the code, so it would be better if the authors tested it in the first place.
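
    As a minimal sketch of that kind of test (illustrative only; it is not the code benchmarked in the node above), the core Benchmark module makes such comparisons easy:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Illustrative only, not the code from the node above: compare
        # lookups on sparse integer keys stored in a hash vs. an array.

        my ( %h, @a );
        for my $i ( 1 .. 10_000 ) {
            $h{ $i * 100 } = 1;
            $a[ $i * 100 ] = 1;
        }

        my $probe = 500_000;        # a key present in both structures

        cmpthese( -1, {
            hash  => sub { my $x; $x = $h{$probe} for 1 .. 1_000 },
            array => sub { my $x; $x = $a[$probe] for 1 .. 1_000 },
        } );

    cmpthese() prints each approach's iteration rate along with the relative percentage differences, which makes this sort of comparison easy to read.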

Re^2: Array vs. Hash for sparsely integer-indexed data
by puterboy (Scribe) on Jan 29, 2013 at 03:55 UTC
    First of all, thanks to *everyone* for all the helpful advice.

    Just some clarification: I am caching hard-disk inodes because I need to look up hard links. The goal is to copy a highly hard-linked but structured tree of folders, which is too slow to do with rsync or 'cp -a' since they (obviously) don't know about the special structure.

    So the key is an integer less than the maximum number of inodes on the filesystem, and the value is a 32-digit hex number expressed as an ASCII string.
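
    Roughly like this, as a simplified sketch (not my actual script; the 32-digit hex value below is just a placeholder):

        use strict;
        use warnings;
        use File::Find;

        # Simplified sketch, not the real script: a hash keyed by inode
        # number spots files whose inode has been seen before, i.e. hard
        # links. The 32-digit hex value stands in for the real one.

        my %by_inode;    # inode number => 32-digit hex string

        find( sub {
            return unless -f $_;
            my $inode = ( stat _ )[1];        # reuse the stat done by -f
            if ( exists $by_inode{$inode} ) {
                # Same inode seen before: this is a hard link.
                print "link: $File::Find::name -> $by_inode{$inode}\n";
                return;
            }
            $by_inode{$inode} = sprintf '%032x', $inode;   # placeholder
        }, $ARGV[0] // '.' );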

    Sometimes the list can be pretty dense, for example if the tree being copied occupies the majority of the filesystem and most of the inodes are used (or at least allocated relatively contiguously); other times it could be on the order of 10% or less.

    Each inode hash entry may be looked up anywhere from once to tens or even thousands of times (presumably no more than the filesystem's maximum hard-link count).

    After doing some profiling, it seems that lookup speed is less important than memory usage, since the rate-limiting step is fstat'ing the files, which is significantly slower.

    So it seems that hashes are best, since, except in the densest of cases, they will require less storage.
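
    A sketch of the kind of measurement involved (Devel::Size is a CPAN module, not core, and exact numbers vary by platform and Perl version):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);    # CPAN module, not core

        # Sketch only: compare memory for integer keys at roughly 10%
        # density, stored in a hash vs. an array, each entry holding a
        # 32-character stand-in for the hex string described above.

        my ( %h, @a );
        for my $i ( 1 .. 100_000 ) {
            my $key = $i * 10;                 # roughly 10% density
            my $val = sprintf '%032x', $i;     # stand-in 32-digit value
            $h{$key} = $val;
            $a[$key] = $val;
        }

        printf "hash:  %d bytes\n", total_size( \%h );
        printf "array: %d bytes\n", total_size( \@a );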

    Again, thanks for all the helpful advice. I learned a lot!