Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Array of Hash interface to an Array of Array?

by Anonymous Monk
on Sep 01, 2003 at 13:51 UTC ( [id://288128]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

My question is related to memory requirements for a large number of hashes.

I DP reasonably large arrays of identically structured records. For naming convenience, I use a hash for each record with the field names as the hash keys

For example:

my $people = [ { name => 'fred', age => 25, height => 1.5 }, { name => 'sally', age => 20, height => 1.4 }, ]; for my $person (@$people} { print $person->{name},"\n"; }

The record structures are much larger, with typically 80-100 fields and about 4K of data. When dealing with datasets where the data is say 60MB, I am seeing memory usage of 600MB - I believe that this is related to the hash algorith and how many buckets are used etc.

I have searched CPAN, but not found anything - I was wondering if there are any magic Tie modules out there that let me pretend that I am working with an Array of Hash, but that are implemented under the hood as an Array of Array - where the fieldname gets efficiently mapped to an index?

Something like:

my $people = [ [ 'fred', 25, 1.5 ], [ 'sally', 20, 1.4 ], ]; my $accessor = accessor->new( keys => ['name','age','height'], data => $people, ); for my $person (@$people} { print $person->{name},"\n"; }

Thanks in advance,

Jeff

Replies are listed 'Best First'.
Re: Array of Hash interface to an Array of Array?
by PodMaster (Abbot) on Sep 01, 2003 at 14:00 UTC
    The popular technique seems to be
    use constant name => 0; use constant age => 1; use constant height => 2; my $people = [ [ 'fred', 25, 1.5 ], [ 'sally', 20, 1.4 ], ]; for my $person( @$people ) { print $person->[name],"\n"; } __END__
    POE uses it.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      I like 2POE too!

      Unfortunately consts dont quite hit the spot in this case - for example I have several arrays, with 'name' columns at different offsets, and the AofH and H's get passed around a number of modules, so there are scoping issues.

      I was wondering more about some sort of class that I could overlay on an AoA to make the thingies pretend they were Hashes even though there would only be one hash mapping the names to an array offset...

      Maybe there is a clue in Abigail's posting... off to google pseudo-hash...

      Regards

      Jeff

Re: Array of Hash interface to an Array of Array?
by Abigail-II (Bishop) on Sep 01, 2003 at 14:16 UTC
    Every data structure in Perl, whether it's a scalar, an array or a hash takes some memory overhead (in addition to the memory needed to store the data itself). For a scalar, this is about 24 bytes. For hashes and arrays, this is even 2 or 3 times more. And note that if you have an array with 10 elements, you pay overhead for one array and 10 scalars.

    In Perl. it's a lot cheaper to have 2 arrays with 10000 elements each, than to have 10000 arrays with 2 elements each. So, if you have lots and lots of data, you should give it some thought on how you are going to organize it. Arrays take less memory than hashes, but I it's probably more worthwhile to reduce the number of aggregates than to replace hashes with pseudo-hashes (which are dead anyway).

    You also might want to look into the Devel::Size CPAN module.

    Abigail

      Some quick testing by simply changing from Array of Hash to Array of Array, and using Top to compare total process mem:

      Input data: 40MB

      Array of 67,000 Hash: 350MB

      Array of 67,000 Array: 190MB

      So there appears to be a reasonable memory saving to be had. My biggest production datasets are 7 to 10 times larger - what process memory usage this will translate into is hard to anticipate, and difficult to test at the moment, but I would make working assumptions of 2G vs 3G - and I would rather the 160M - 1G was used for the DB cache than the Perl Hash.

      Regards,

      Jeff

Re: Array of Hash interface to an Array of Array?
by NetWallah (Canon) on Sep 01, 2003 at 13:59 UTC
    If you are dealing with that much data, you obviously have some means of storing it - most likely a database.... or something that can be viewed as a database - there are several modules that let you view a file as a database - I have heard of Berkleydb, but have not used that myself.

    I suggest that you leverage the database methods, particularly DBI to manage accessing and storing this information. This will make your code easier to read, and a lot more scalable.

      Thanks for the suggestion - I am familiar with databases, stored procs, set-relational theory/practice and various flavours of SQL, and the DBI interface.

      Unfortunately the DP in this instance cannot be expressed in SQL - without going into detail, it is complex and iterative.

      Regards,

      Jeff

      PS My favourite SQL shortcoming is:

      select name, sum(value) as value from mytable group by name order by abs(value) desc

      I only know of a single SQL engine that will perform this query correctly

        My favourite SQL shortcoming is:

        OK, at the risk of making a fool of myself, I'll bite.

        I believe that that statement is syntactically incorrect. If it works in one engine, I would say it is broken. (Which one are we talking about, by the way)?

        When you say sum(value) as value, in your order by clause, you can't use value (i.e. the right hand side of the selected item clause) as the aggregator. You must use sum(value) (the left hand side of the selected item clause).

        I just whipped up a sample table in Postgresql. It chokes on your example, but renders the correct results on mine. Consider:

        select * from foo; name | value ------+------- foo | -10 foo | -6 foo | 12 foo | 34 bar | 6 bar | 23 bar | 76 rat | -6 rat | -60 (9 rows)

        and the result:

        select name, sum(value) as value from foo group by name order by abs(s +um(value)) desc; name | value ------+------- bar | 105 rat | -66 foo | 30 (3 rows)

        Looks good to me. Now, it may well be that you can use what I term "the right hand side" in group by, but I think I'll continue using the left hand side. Even if it is clunky and redundantly verbose, at least I know what's going on.

        And I have been known to write Perl scripts that generate selected aggregates paired up with corresponding group by clauses just to ensure that it's all correct and avoid breaking the DRY principle.

Re: Array of Hash interface to an Array of Array?
by DrHyde (Prior) on Sep 01, 2003 at 15:14 UTC
    It sounds like you think you want a pseudohash. However, from the perlref manpage ...
    "WARNING: This section describes an experimental feature. Details may change without notice in future versions."
    They are deprecated, I believe, in perl 5.8 and will be removed in 5.10. IOW, you really don't want to use 'em. The fields pragma is the new implementation which should be safe to use. Whether fields will help conserve memory or not I don't know.

    However, you probably don't want to do that either. I think you should look at tie()ing the hash to a dbm file. I prefer using GDBM_File in most situations, but you should look at the handy comparison chart in the AnyDBM_File docs before choosing a dbm implementation.

    In combination with MLDBM you should be able to work with your data, with a remarkably low memory requirement.

      Thanks for the pointers...

      Pseudo-hashes only sound like half of what I want - is an array of 350,000 PHs really much more efficient than an array of 350,000 Hashes? Abigail alluded to more efficient use of memory arising from reducing the number of structures.

      I think that what I really want is a single something that stores data efficiently - eg an AofA or even an array of packed strings, or even a single huge string, or chunked pieces of string... possibly qubit encoded on chromium atoms for compactness...(slaps head twice, breathes deeply, zen-like calm returns...)

      ...that provides a transparent Array like interface for accesing collection elements and a Hash like interface to access properties of individual elements.... mmmm, Yet Another Database And Yes Another Database Approach (YADAYADA) is about to be born?... not 8-)

      I can see how tieing this into a flat-file db could further save memory, I already use Tie::File arrays to process incoming flat-files.

      scurries off to check it all out...

      Thanks again to all,

      Jeff

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://288128]
Approved by valdez
Front-paged by arthas
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2024-03-28 08:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found