PerlMonks  

Array vs. Hash for sparsely integer-indexed data

by puterboy (Scribe)
on Jan 25, 2013 at 18:41 UTC
puterboy has asked for the wisdom of the Perl Monks concerning the following question:

My perl program frequently needs to retrieve a cached set of data that is indexed by a positive but sparsely populated set of integers.

I am not sure whether it is better to use a hash vs. an array when considering efficiency of storage and efficiency of lookup speed.

I know in languages like 'C' you would need to declare the full range of index possibilities which would be space-inefficient for sparsely indexed data.

On the other hand, I am concerned that maybe a hash would be slower when the keys are known to all be positive integers.

In particular, I am asking:
1. What (if any) storage inefficiency (relative to hashes) is created in perl when using arrays for sparsely indexed data?
2. What (if any) lookup inefficiency (relative to arrays) is created in perl when using hashes for positive integer indexed keys?

Re: Array vs. Hash for sparsely integer-indexed data
by davido (Archbishop) on Jan 25, 2013 at 18:59 UTC

    It's hard to know how sparse your data set is, and how large the range. If you have 100 items with indices between 0 and 3_000_000, the array isn't going to work out so well. On the other hand, if you have 10_000 items with indices between 0 and 32767, the array wouldn't be so bad after all.

    While hash insertions and lookups are both O(1) operations in order of growth, the constant factors are more expensive with hashes than with arrays. But do you know speed efficiency to be an issue?

    Without knowing enough about the specifics of the problem, I would have to recommend a hash. Once specifics are known (how sparse, how wide the range of indices, and how time-critical the code is), that recommendation could change. But the hash is the "general" solution, where the array would be a solution tailored to a specific set of criteria.
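    Davido's "measure it" point can be sketched with the core Benchmark module; the setup below (1e5 entries at stride 20) is illustrative, not taken from the thread:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    # Illustrative sparse data: the same 1e5 entries, stride 20,
    # stored in both an array and a hash.
    my ( @a, %h );
    $a[ $_ * 20 ] = $_ for 1 .. 1e5;
    $h{ $_ * 20 } = $_ for 1 .. 1e5;

    # Compare the constant factors of a single-element lookup.
    cmpthese( 100_000, {
        array => sub { my $x = $a[1_000_000] },
        hash  => sub { my $x = $h{1_000_000} },
    } );
    ```

    The relative rates will vary by perl build, but on common builds the array lookup wins by a constant factor while both stay O(1).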


    Dave

Re: Array vs. Hash for sparsely integer-indexed data
by BrowserUk (Pope) on Jan 25, 2013 at 19:02 UTC
    1. What (if any) storage inefficiency (relative to hashes) is created in perl when using arrays for sparsely indexed data?

    It all depends on 'how sparse':

    use Devel::Size qw( total_size );   # total_size() comes from Devel::Size

    $a[ $_*2 ] = $_ for 1 .. 1e5;
    $h{ $_*2 } = $_ for 1 .. 1e5;
    print total_size( $_ ) for \( @a, %h );
    # 4497312
    # 7393098

    undef( @a ); undef( %h );
    $a[ $_*20 ] = $_ for 1 .. 1e5;
    $h{ $_*20 } = $_ for 1 .. 1e5;
    print total_size( $_ ) for \( @a, %h );
    # 19177376
    # 7493098

    undef( @a ); undef( %h );
    $a[ $_*50 ] = $_ for 1 .. 1e5;
    $h{ $_*50 } = $_ for 1 .. 1e5;
    print total_size( $_ ) for \( @a, %h );
    # 69509024
    # 7526431

    The hash size remains static for a given number of entries, regardless of how sparse they are.

    At 1 in 2, the array uses a little over half the space of the hash; by 1 in 20 it uses almost 3 times as much, and by 1 in 50 nearly 10 times as much.

    And remember, if your integer range starts at 1 billion, the array would require ~12GB of essentially wasted space just to hold the lowest value. Whilst you could wrap that in an API that subtracts 1 billion from each index, the additional subroutine-call overhead would completely negate the array's lookup-speed advantage; and then some.
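    A minimal sketch of that offset wrapper (the sub names and base value are hypothetical), showing the extra sub call per access:

    ```perl
    use strict;
    use warnings;

    # Hypothetical offset wrapper: subtract a fixed base so the array
    # does not need a billion leading empty slots. The sub-call
    # overhead per access is exactly the cost warned about above.
    my $BASE = 1_000_000_000;
    my @cache;

    sub cache_set { my ( $idx, $val ) = @_; $cache[ $idx - $BASE ] = $val }
    sub cache_get { my ( $idx )        = @_; $cache[ $idx - $BASE ] }

    cache_set( 1_000_000_005, 'hello' );
    print cache_get( 1_000_000_005 ), "\n";   # hello
    ```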

    2. What (if any) lookup inefficiency (relative to arrays) is created in perl when using hashes for positive integer indexed keys?

    Test:

    use feature 'say';
    use Time::HiRes qw( time );   # fractional-second timestamps

    $a[ $_ * 20 ] = $_ for 1 .. 1e5;
    $found = 0; say time;
    defined( $a[ $_ ] ) and ++$found for 1 .. 20e5;
    say time; say $found;
    # 1359141040.60138
    # 1359141041.00958
    # 100000

    $h{ $_ * 20 } = $_ for 1 .. 1e5;
    $found = 0; say time;
    exists( $h{ $_ } ) and ++$found for 1 .. 20e5;
    say time; say $found;
    # 1359141122.04252
    # 1359141123.06596
    # 100000

    The array takes 0.4 seconds to do 2 million lookups of which 5% are found and 95% not.

    The hash takes 1.02 seconds to do the same lookups.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Also worth considering: if your lookups are (or can be adapted to) a simple boolean truth, vec can reduce the space requirement to a fraction of that used by either arrays or hashes, and yields lookup times that beat even arrays:

      $vec = '';
      vec( $vec, $_*20, 1 ) = 1 for 1 .. 1e5;
      print total_size \$vec;   # total_size() from Devel::Size
      # 250056

      $found = 0; say time;
      vec( $vec, $_, 1 ) and ++$found for 1 .. 20e5;
      say time; say $found;
      # 1359141761.02608
      # 1359141761.35157
      # 100000

Re: Array vs. Hash for sparsely integer-indexed data
by flexvault (Parson) on Jan 25, 2013 at 19:11 UTC

    puterboy,

      In particular, I am asking: 1. What (if any) storage inefficiency (relative to hashes) is created in perl when using arrays for sparsely indexed data? 2. What (if any) lookup inefficiency (relative to arrays) is created in perl when using hashes for positive integer indexed keys?

    I'm sure you'll get more technical answers, but I use hashes now even when the indexing is sequential. The ability to use 'defined' or 'exists' to see if the key is present greatly improves the speed of lookups. And think of the storage benefits, since if you do the following:

    my @array;
    $array[100] = 246;

    my %hash;
    $hash{"100"} = 246;

    The array will use 101 ( 0..100 ) locations, while the hash will allocate only 8 buckets (hash storage is allocated in powers of 2) for its single entry.
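    A quick core-only check of those element counts (measuring actual bytes would need something like Devel::Size), plus the exists test flexvault mentions:

    ```perl
    use strict;
    use warnings;

    my @array;
    $array[100] = 246;     # array now reports elements 0 .. 100
    my %hash;
    $hash{"100"} = 246;    # hash holds exactly one key

    print scalar(@array), "\n";          # 101
    print scalar( keys %hash ), "\n";    # 1
    print exists $hash{"100"} ? "present\n" : "absent\n";   # present
    ```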

    But why not 'benchmark' it for yourself for your exact situation and know for sure :-)

    Good Luck...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Array vs. Hash for sparsely integer-indexed data
by blue_cowdawg (Monsignor) on Jan 25, 2013 at 19:36 UTC
        My perl program frequently needs to retrieve a cached set of data that is indexed by a positive but sparsely populated set of integers.

    By the way it's Perl. Lower case perl is the command. Oh... never mind.

    Based on what you said I'd go with a hash. If you really want to implement a sparse array you could do this:

    package IndexedSparseArray;    # spelling fixed; must match the bless and the file name

    use strict;
    use warnings;

    sub new {
        my $class = shift;
        return bless { array => [] }, $class;
    }

    sub insert {
        my ( $self, $index, $data ) = @_;
        my @work = @{ $self->{array} };
        if ( $#work == -1 ) {    # Nothing there yet.
            push @work, { index => $index, data => $data };
        }
        else {
            my $i = 0;           # not to be confused with index.
            $i++ while $i < $#work && $work[$i]->{index} < $index;
            if ( $work[$i]->{index} < $index ) {
                # New index sorts after everything present: append.
                push @work, { index => $index, data => $data };
            }
            else {
                # Shift the tail up one slot and drop the new entry in.
                my $j = $#work + 1;
                while ( $j > $i ) {
                    $work[$j] = $work[ $j - 1 ];
                    $j--;
                }
                $work[$i] = { index => $index, data => $data };
            }
        }
        $self->{array} = [@work];
        return 1;
    }

    1;
    Save that file to IndexedSparseArray.pm and in your main program:
    use strict;
    use warnings;
    use IndexedSparseArray;

    my $repo = IndexedSparseArray->new();
    while ( ... ) {    # handwaving: "i am reading in data"
        $repo->insert( $index, $data );
    }
    # etc.
    Big Caveat: I didn't test this and I was interrupted a bunch of times. This should work, but...

    I think I've given you something to chew on, hope it helps.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Array vs. Hash for sparsely integer-indexed data
by Anonymous Monk on Jan 25, 2013 at 20:23 UTC
    Also not to be excluded entirely is ... uhh ... an array of integers that is searched sequentially each time. Seriously. If you've got say 10,000 integers to look through, they're gonna take up 40K bytes, all sitting in memory all the time. You could loop through that. You could even loop through 10 times that and still be, well, okay.

    I'd suggest estimating how many unique values you're going to have, and setting up a few short test-runs with randomly generated data to get a feel for how the various alternatives pan out. Then just pick one and go.
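    For completeness, the sequential scan being proposed looks something like this, sketched with core List::Util (the data is illustrative); each probe walks the list, so it is O(N) per lookup:

    ```perl
    use strict;
    use warnings;
    use List::Util qw( first );

    # Illustrative data set; each lookup scans until a match is found.
    my @ints = ( 3, 17, 42, 99 );

    my $hit = first { $_ == 42 } @ints;
    print defined $hit ? "found\n" : "absent\n";   # found
    ```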
      Also not to be excluded entirely is ... uhh ... an array of integers that is searched sequentially each time.

      That is a surprise winner for my 'Worst Advice of the Month (January, 2013)' prize.

      ... set up a few short test-runs ...

      I'd suggest you take your own advice before doling it out to others.

      NB: The following uses 1/10 the number of values and 1/10 the number of lookups of the tests above for arrays, hashes and bit vectors, because it takes so long to complete:

      @a = map $_*20, 1 .. 1e4;
      say total_size \@a;   # total_size() from Devel::Size
      # 320176

      $found = 0; say time;
      for my $tgt ( 1 .. 20e4 ) {
          $tgt == $_ and ++$found and last for @a;
      }
      say time; say $found;
      # 1359149323.02034
      # 1359149541.80419
      # 10000

      Using that as the basis for estimation, the total size will be 3.2MB -- 12 times that required by the bitvector solution.

      And the lookups that took the hash 1 second, the array 2/5ths of a second, and the bit vector 1/3rd of a second would take: 25 days!

      It is really hard to see any circumstance when this would be a viable solution.


Re: Array vs. Hash for sparsely integer-indexed data
by tobyink (Abbot) on Jan 25, 2013 at 22:49 UTC

    You could tie an array to a hashref for storage. This would be lovely in terms of memory usage, but not especially performant.
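    A minimal sketch of what tobyink describes, assuming a hand-rolled tie class (all names here are hypothetical): array syntax on the outside, hash storage on the inside, so only the touched indices ever allocate anything.

    ```perl
    package SparseTiedArray;
    use strict;
    use warnings;

    # Hypothetical tie class backing an array with a hash.
    sub TIEARRAY  { my $class = shift; bless { data => {}, size => 0 }, $class }
    sub FETCH     { $_[0]{data}{ $_[1] } }
    sub STORE     {
        my ( $self, $i, $v ) = @_;
        $self->{data}{$i} = $v;
        $self->{size} = $i + 1 if $i >= $self->{size};
    }
    sub FETCHSIZE { $_[0]{size} }
    sub STORESIZE { }
    sub EXTEND    { }
    sub EXISTS    { exists $_[0]{data}{ $_[1] } }
    sub DELETE    { delete $_[0]{data}{ $_[1] } }

    package main;
    tie my @sparse, 'SparseTiedArray';
    $sparse[1_000_000] = 'x';         # one hash entry, not a million slots
    print $sparse[1_000_000], "\n";   # x
    ```

    As tobyink says, the per-access method dispatch makes this slow, but the memory profile is that of the hash.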

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: Array vs. Hash for sparsely integer-indexed data
by sundialsvc4 (Abbot) on Jan 26, 2013 at 00:27 UTC

    I do think that I understand what the notion was ... and it really does come down to just how many values are present to be looked-through; not what the domain of possible values is.   Which the OP, as far as I read, does not say.

    Is “a sparsely-populated set” 10,000 entries, 100,000 or 100 million?   It makes a gigantic difference.   How many actual entries (regardless of the possible domain of values they could take) are there, and how are they going to be retrieved?   Is it ever necessary to iterate through them, or is it pure random-access?   How often do insertions happen and how do they happen?

    You do like to pull “everybody else’s idea is fantastically worse than mine,” e.g. “25 days vs. 1.3 seconds” but, honestly, I don’t think anyone seriously benefits from such outlandish comparisons.   Hash tables are easy but can have a hidden cost; bit vectors have a size more-or-less based on domain size, and so on.   All of these things ... all of the rest of us know, too.   Most of the most-key decision factors of the actual problem to be solved are not in the OP.   At this writing, puterboy has not clarified.

      e.g. “25 days vs. 1.3 seconds” but, honestly, I don’t think anyone seriously benefits from such outlandish comparisons.

      What is so "outlandish" about pointing out the difference that using a O(N²) proposal, instead of one of the three O(N) solutions already suggested, would make?

      It is exactly these kinds of untested postulations you've become famous for. Perhaps it was ...

      Oh! BTW, after further assessment, that 25 days projection proves to be a gross underestimate.



        Heh ... calls to mind: this XKCD comic.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

      I have benefited from testing different approaches and comparing the results. In this node for example I would have assumed that your approach would perform similarly to BrowserUK's but the tests showed his approach to be ten times faster. Even the people who post broken code are contributing somewhat since their code can be tested and debugged and can serve as an example of common misunderstandings. But that requires that others actually test their code so it would be better if they test it in the first place.

      First of all thanks to *everyone* for all the helpful advice.

      Just some clarification. I am caching hard disk inodes since I need to look up hard links (it's to copy a highly hard-linked but structured tree of folders which is too slow to do using rsync and 'cp -a' since they (obviously) don't know the special structure).

      So the key is an integer less than max inodes on the filesystem. The value is a 32-digit hex number expressed as an ASCII string.

      Sometimes the list can be pretty dense, for example if the tree being copied occupies the majority of the filesystem and most of the inodes are used (or at least allocated relatively contiguously); other times it could be on the order of 10% or less.

      Each inode hash entry may be looked up anywhere from once to tens or even thousands of times (presumably no more than maxhardlinks).

      After doing some profiling, it seems that the speed of lookup is less important than memory usage since the rate limiting step is fstat'ing files which is significantly slower.

      So, it seems that hashes are best since except in the densest of cases, they will require less storage.
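      For what it's worth, the cache described above might look something like this (the sub and variable names are hypothetical; field 1 of Perl's stat() is the inode number):

      ```perl
      use strict;
      use warnings;

      # Hypothetical sketch of the inode cache: key is the inode from
      # stat(), value is the hex-digest string; each inode is stored once.
      my %digest_for;    # inode => 32-hex-digit string

      sub remember {
          my ( $path, $digest ) = @_;
          my $inode = ( stat $path )[1];    # stat() field 1 is the inode
          $digest_for{$inode} //= $digest;  # first writer wins
          return $digest_for{$inode};
      }
      ```

      On a repeated hard link, stat returns the same inode, so the cached digest is reused rather than recomputed.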

      Again, thanks for all the helpful advice. I learned a lot!

Node Type: perlquestion [id://1015388]
Approved by blue_cowdawg
Front-paged by Lotus1