Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it....

The difference in the memory consumed by the 2 methods you show is that with undef @foo{ 1 .. 1_000_000 ];, the extra memory is consumed by generating the list 1 .. 1_000_000 prior to generating the hash slice.

When you use ...for 1 .. 1_000_000;, no list is generated as for has an optimisation to allow integer ranges to act as iterators rather than list generators.

The method I used was

$hash{ $_ } = undef for 1 .. 1_000_000;
but on my machine,
undef $foo{ $_ } for 1 .. 1_000_000;
results in an identical memory consumption. Prior to the allocation of the hash, the perl process has 3332k allocated to it. After the allocation, it has 98184k allocated. Hence my assertion that the memory required is 95 MB. I can't explain the difference between the 95 MB I am seeing and the 69 MB you reported.

I tried using Devel::Size to verify the memory allocation the OS is reporting is actually being used by the hash rather than some portion of it being intermediate working storage that would be available to the process for other purposes afterwards, but calling either size( \%hash ); or total_size( \%hash ); both doubled the memory consumed by the process, pushing it beyond both my physical and virtual memory capacity before segfaulting.

Using DB_TREE rather than DB_HASH does speed up the insertion considerably.

use DB_File; tie %hash, 'DB_File', 'tree1M.db' , O_RDWR|O_CREAT, 0666, $DB_BTREE or die $!; print time; $hash{ $_ } = undef for 1 .. 1_000_000; print time; 1069482082 1069482433 # 351 secs (5.85 mins) $n=0; print time; exists $hash{ $_ } and $n++ for 1 .. 1_000_000; print time +, $/, $n; 1069482523 1069482799 # 276 secs (4.6 mins) 1000000

It reduces the insertion time from an hour to 5 minutes and the iteration time from 15 minutes to 4 at the cost of increasing the size of the database file from roughly twice the flat file size to 4x.

If you are only using the hash for it's fast "does it exist" property, then using the custom hash mechanism I described contrasts favourably with the insertion taking 16 seconds and the iteration 44.

print time; $hash{ substr $_, 0, 4 } .= $/. $_ for 1 .. 1_000_000; $hash{$_} .= $/ + for keys %hash; print time; 1069484550 1069484566 # 16 seconds. $n=0; print time; for ( 1 .. 1_000_000 ) { next unless exists $hash{ substr $_, 0, 4 } and 1+index( $hash{ substr $_, 0, 4 }, $/ . $_ . $/ ); $n++; } print time, $/, $n; 1069484612 1069484656 # 44 secs. 1000000

With respect to caching. I was recently playing with prime number generators/testers and came across some discussion about techniques for optimising code (C and assembler) to make best use of L1 and L2 caching. I'll admit that much of the discussion was over my head, but my conclusion was that compilers will have to become much clever than they currently are before general code will fully benefit from these.

The current trend to making caches larger and larger will mitigate this somewhat, but the depth of understanding required to fully utilise these caches and tailor algorithms to them is so great, and so processor specific, that I doubt that it will ever be more than a niche exercise.

Whether processors will ever get to the point where the microcode is capable of utilising out-of-order execution, re-ordering the pipelines, or re-writing the instruction caches to optimise cache hit rates at anything beyond the keyhole scale I'm not sure. Given that the disposable processor you get in musical birthday cards has more processing power in its 4mm2 of silicon that ENIAC did in its 200k watt consuming 30 tons, who knows what a "computer" will be capable of in another 60 years :)

It's just a shame I won't be around to see it.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

In reply to Re: Re: A (memory) poor man's hash by BrowserUk
in thread A (memory) poor man's <strike>hash</strike> lookup table. by BrowserUk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (6)
    As of 2017-03-27 05:54 GMT
    Find Nodes?
      Voting Booth?
      Should Pluto Get Its Planethood Back?

      Results (315 votes). Check out past polls.