Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Monks,
My question is related to memory requirements for a large number of hashes.
I do data processing (DP) on reasonably large arrays of identically structured records. For naming convenience, I use a hash for each record, with the field names as the hash keys.
For example:
my $people = [
{ name => 'fred', age => 25, height => 1.5 },
{ name => 'sally', age => 20, height => 1.4 },
];
for my $person (@$people) {
print $person->{name},"\n";
}
The record structures are much larger, with typically 80-100 fields and about 4K of data. When dealing with datasets where the data is say 60MB, I am seeing memory usage of 600MB - I believe this is related to the hash algorithm, how many buckets are allocated, and so on.
I have searched CPAN, but not found anything - I was wondering if there are any magic Tie modules out there that let me pretend that I am working with an Array of Hash, but that are implemented under the hood as an Array of Array - where the fieldname gets efficiently mapped to an index?
Something like:
my $people = [
[ 'fred', 25, 1.5 ],
[ 'sally', 20, 1.4 ],
];
my $accessor = accessor->new(
keys => ['name','age','height'],
data => $people,
);
for my $person (@$people) {
print $person->{name},"\n";
}
Thanks in advance,
Jeff
Re: Array of Hash interface to an Array of Array?
by PodMaster (Abbot) on Sep 01, 2003 at 14:00 UTC
The popular technique seems to be
use constant name => 0;
use constant age => 1;
use constant height => 2;
my $people = [
[ 'fred', 25, 1.5 ],
[ 'sally', 20, 1.4 ],
];
for my $person( @$people ) {
print $person->[name],"\n";
}
__END__
POE uses it.
MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | ** The third rule of perl club is a statement of fact: pod is sexy. |
I like POE too!
Unfortunately constants don't quite hit the spot in this case - for example, I have several arrays with 'name' columns at different offsets, and the AoHs and hashes get passed around a number of modules, so there are scoping issues.
I was wondering more about some sort of class that I could overlay on an AoA to make the thingies pretend they were Hashes even though there would only be one hash mapping the names to an array offset...
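Something along those lines can be sketched with Perl's tie mechanism: one shared %index hash maps field names to offsets, and each record is a tiny tied hash holding only a reference to that index and to its data array. The class name FieldTie and its interface are illustrative, not an existing CPAN module, and DELETE/NEXTKEY/CLEAR are omitted for brevity:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One shared field-name-to-offset map serves every record; each tied
# hash object holds just two references, so the per-record cost stays
# close to that of a plain array.
package FieldTie;

sub TIEHASH {
    my ($class, $index, $row) = @_;
    return bless { index => $index, row => $row }, $class;
}
sub FETCH  { my ($s, $k) = @_; $s->{row}[ $s->{index}{$k} ] }
sub STORE  { my ($s, $k, $v) = @_; $s->{row}[ $s->{index}{$k} ] = $v }
sub EXISTS { my ($s, $k) = @_; exists $s->{index}{$k} }
# DELETE, FIRSTKEY, NEXTKEY, CLEAR omitted in this sketch

package main;

my %index  = ( name => 0, age => 1, height => 2 );
my $people = [
    [ 'fred',  25, 1.5 ],
    [ 'sally', 20, 1.4 ],
];

for my $row (@$people) {
    tie my %person, 'FieldTie', \%index, $row;
    print $person{name}, "\n";    # hash-style access, array-backed storage
}
```

Note that tie magic adds a method-call cost to every FETCH/STORE, so this trades CPU for memory.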
Maybe there is a clue in Abigail's posting... off to google pseudo-hash...
Regards
Jeff
Re: Array of Hash interface to an Array of Array?
by Abigail-II (Bishop) on Sep 01, 2003 at 14:16 UTC
Every data structure in Perl, whether it's a scalar, an array or a hash, takes some memory overhead (in addition to the memory needed to store the data itself). For a scalar, this is about 24 bytes. For hashes and arrays, it is even 2 or 3 times more. And note that if you have an array with 10 elements, you pay overhead for one array and 10 scalars.
In Perl, it's a lot cheaper to have 2 arrays with 10,000 elements each than to have 10,000 arrays with 2 elements each. So, if you have lots and lots of data, you should give some thought to how you are going to organize it. Arrays take less memory than hashes, but it's probably more worthwhile to reduce the number of aggregates than to replace hashes with pseudo-hashes (which are dead anyway).
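One way to reduce the number of aggregates, as a sketch: store the data column-wise, one array per *field* rather than one aggregate per *record*. For N records and F fields that allocates F arrays instead of N, so the per-aggregate overhead is paid F times rather than N times. Field names here are illustrative:

```perl
use strict;
use warnings;

# Column-oriented layout: each field is one long array, indexed by
# record number. Only 3 arrays exist no matter how many records.
my %col = (
    name   => [ 'fred', 'sally' ],
    age    => [ 25, 20 ],
    height => [ 1.5, 1.4 ],
);

my $n_records = scalar @{ $col{name} };
for my $i ( 0 .. $n_records - 1 ) {
    print $col{name}[$i], " is ", $col{age}[$i], "\n";
}
```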
You also might want to look into the Devel::Size CPAN module.
Abigail
Some quick testing, simply changing from Array of Hash to Array of Array and using top to compare total process memory:
Input data: 40MB
Array of 67,000 Hash: 350MB
Array of 67,000 Array: 190MB
So there appears to be a reasonable memory saving to be had. My biggest production datasets are 7 to 10 times larger. What process memory usage this will translate into is hard to anticipate, and difficult to test at the moment, but I would make working assumptions of 2G vs 3G - and I would rather the 160M to 1G was used for the DB cache than for the Perl hashes.
Regards,
Jeff
Re: Array of Hash interface to an Array of Array?
by NetWallah (Canon) on Sep 01, 2003 at 13:59 UTC
If you are dealing with that much data, you obviously have some means of storing it - most likely a database, or at least something that can be viewed as a database. There are several modules that let you view a file as a database - I have heard of BerkeleyDB, but have not used it myself.
I suggest that you leverage database methods, particularly DBI, to manage accessing and storing this information. This will make your code easier to read, and a lot more scalable.
Thanks for the suggestion - I am familiar with databases, stored procs, set-relational theory/practice and various flavours of SQL, and the DBI interface.
Unfortunately the DP in this instance cannot be expressed in SQL - without going into detail, it is complex and iterative.
Regards,
Jeff
PS My favourite SQL shortcoming is:
select
name, sum(value) as value
from mytable
group by name
order by abs(value) desc
I only know of a single SQL engine that will perform this query correctly.
My favourite SQL shortcoming is:
OK, at the risk of making a fool of myself, I'll bite.
I believe that statement is syntactically incorrect. If it works in one engine, I would say that engine is broken. (Which one are we talking about, by the way?)
When you say sum(value) as value, you can't then use value (i.e. the right-hand side of the select-list item) as the aggregator in your order by clause. You must use sum(value) (the left-hand side of the select-list item).
I just whipped up a sample table in PostgreSQL. It chokes on your example, but renders the correct results on mine. Consider:
select * from foo;
name | value
------+-------
foo | -10
foo | -6
foo | 12
foo | 34
bar | 6
bar | 23
bar | 76
rat | -6
rat | -60
(9 rows)
and the result:
select name, sum(value) as value from foo group by name order by abs(sum(value)) desc;
name | value
------+-------
bar | 105
rat | -66
foo | 30
(3 rows)
Looks good to me. Now, it may well be that you can use what I term "the right hand side" in group by, but I think I'll continue using the left hand side. Even if it is clunky and redundantly verbose, at least I know what's going on.
And I have been known to write Perl scripts that generate selected aggregates paired up with corresponding group by clauses just to ensure that it's all correct and avoid breaking the DRY principle.
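Such a generator can be as simple as building both clauses from the same pieces, so the aggregate expression is never repeated by hand. Table and column names here are illustrative:

```perl
use strict;
use warnings;

# Build the select list and the order by clause from one definition of
# the aggregate, keeping the repeated sum(value) expressions in sync.
my @group_cols = ('name');
my $aggregate  = 'sum(value)';

my $sql = sprintf(
    'select %s, %s as value from mytable group by %s order by abs(%s) desc',
    join(', ', @group_cols), $aggregate,
    join(', ', @group_cols), $aggregate,
);
print $sql, "\n";
```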
Re: Array of Hash interface to an Array of Array?
by DrHyde (Prior) on Sep 01, 2003 at 15:14 UTC
It sounds like you think you want a pseudohash. However, from the perlref manpage ...
"WARNING: This section describes an experimental feature. Details may change without notice in future versions."
They are deprecated, I believe, in perl 5.8 and will be removed in 5.10. IOW, you really don't want to use 'em. The fields pragma is the new implementation which should be safe to use. Whether fields will help conserve memory or not I don't know.
However, you probably don't want to do that either. I think you should look at tie()ing the hash to a dbm file. I prefer using GDBM_File in most situations, but you should look at the handy comparison chart in the AnyDBM_File docs before choosing a dbm implementation.
In combination with MLDBM you should be able to work with your data, with a remarkably low memory requirement.
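What MLDBM does under the hood can be hand-rolled as a sketch using only core modules (SDBM_File plus Storable): serialize each record and store the frozen string in a tied dbm file, so only the records currently in use occupy process memory. Note that SDBM_File limits each key/value pair to roughly 1K, so for 4K records you would pick GDBM_File or DB_File from the AnyDBM_File comparison chart instead. The file name below is illustrative:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;                  # core dbm implementation
use Storable qw(freeze thaw);   # core serializer

# Tie a hash to an on-disk dbm file; values are Storable-frozen records.
my $dbfile = "/tmp/people_demo_$$";
tie my %db, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0644
    or die "tie failed: $!";

my $record = { name => 'fred', age => 25, height => 1.5 };
$db{fred} = freeze($record);            # record now lives on disk

my $back = thaw( $db{fred} );           # pulled into memory on demand
print $back->{age}, "\n";               # prints 25

untie %db;
```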
Thanks for the pointers...
Pseudo-hashes only sound like half of what I want - is an array of 350,000 PHs really much more efficient than an array of 350,000 Hashes? Abigail alluded to more efficient use of memory arising from reducing the number of structures.
I think that what I really want is a single something that stores data efficiently - eg an AofA or even an array of packed strings, or even a single huge string, or chunked pieces of string... possibly qubit encoded on chromium atoms for compactness...(slaps head twice, breathes deeply, zen-like calm returns...)
...that provides a transparent Array-like interface for accessing collection elements and a Hash-like interface to access properties of individual elements.... mmmm, Yet Another Database And Yet Another Database Approach (YADAYADA) is about to be born?... not 8-)
I can see how tying this into a flat-file DB could further save memory; I already use Tie::File arrays to process incoming flat files.
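The "array of packed strings" idea above, for what it's worth, can be sketched with pack/unpack: each record becomes one scalar built from a shared template, so a record costs one scalar's overhead instead of 80-100. The field layout here is illustrative (name in 16 bytes, age as a signed long, height as a double):

```perl
use strict;
use warnings;

# One template and one field list shared by every record; each record
# is a single packed scalar in the array.
my $template = 'A16 l d';
my @fields   = qw(name age height);

my $people = [
    pack( $template, 'fred',  25, 1.5 ),
    pack( $template, 'sally', 20, 1.4 ),
];

for my $rec (@$people) {
    my %person;
    @person{@fields} = unpack( $template, $rec );   # 'A' strips pad spaces
    print $person{name}, "\n";
}
```

The obvious cost is that every access pays for an unpack, and updates need a repack, so this suits read-mostly processing.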
scurries off to check it all out...
Thanks again to all,
Jeff