It makes no sense to read the entire file into an array if you are then going to process that array element-by-element into a different data structure. Reading it line by line and building your data structure as you go makes much more sense: it saves a large amount of memory and will almost certainly speed the whole thing up considerably.
As tilly pointed out, pushing to a pre-extended array just extends it further, so there is no benefit from doing things that way.
However, I think that your data structure is unnecessarily complicated for the data you wish to store. You're creating an array of hashes of hashes, but a large proportion of the data you are storing is constant: 'class' is a constant, 'fv' is a constant, and the value of 1 for every member of the inner hash is a constant. Not only is this redundant information chewing up a large amount of memory, it also (potentially) makes accessing and iterating your data more complex and slower than it need be. E.g.
#! perl -slw
use strict;
use Devel::Size qw[size total_size];
my @cases = (
    { class=>'classA', fv=>{ featureA=>1, featureB=>1, featureC=>1, featureD=>1 } },
    { class=>'classB', fv=>{ featureA=>1, featureB=>1, featureE=>1, featureF=>1 } },
    { class=>'classC', fv=>{ featureB=>1, featureC=>1, featureD=>1, featureE=>1 } },
    { class=>'classD', fv=>{ featureC=>1, featureD=>1, featureE=>1, featureF=>1 } },
    { class=>'classE', fv=>{ featureA=>1, featureD=>1, featureE=>1, featureF=>1 } },
    { class=>'classF', fv=>{ featureD=>1, featureE=>1, featureF=>1, featureG=>1 } },
    { class=>'classG', fv=>{ featureA=>1, featureC=>1, featureD=>1, featureG=>1 } },
    { class=>'classH', fv=>{ featureA=>1, featureB=>1, featureD=>1, featureG=>1 } },
    { class=>'classI', fv=>{ featureA=>1, featureC=>1, featureE=>1, featureF=>1 } },
    { class=>'classJ', fv=>{ featureB=>1, featureD=>1, featureF=>1, featureG=>1 } },
);
use constant FEATURE_A=>0;
use constant FEATURE_B=>1;
use constant FEATURE_C=>2;
use constant FEATURE_D=>3;
use constant FEATURE_E=>4;
use constant FEATURE_F=>5;
use constant FEATURE_G=>6;
my %cases = (
classA=>[ 1, 1, 1, 1, 0, 0, 0 ],
classB=>[ 1, 1, 0, 0, 1, 1, 0 ],
classC=>[ 0, 1, 1, 1, 1, 0, 0 ],
classD=>[ 0, 0, 1, 1, 1, 1, 0 ],
classE=>[ 1, 0, 0, 1, 1, 1, 0 ],
classF=>[ 0, 0, 0, 1, 1, 1, 1 ],
classG=>[ 1, 0, 1, 1, 0, 0, 1 ],
classH=>[ 1, 1, 0, 1, 0, 0, 0 ],
classI=>[ 1, 0, 1, 0, 1, 1, 0 ],
classJ=>[ 0, 1, 0, 1, 0, 1, 1 ],
);
print 'Array of hash of hash : ', total_size( \@cases );
print 'Hash of array : ', total_size( \%cases );
__END__
C:\test>test2
Array of hash of hash : 4088
Hash of array : 2534
The two data structures above represent exactly the same information, but the latter requires nearly 40% less space (2534 bytes versus 4088). Multiply that out across 300,000 records and you have a substantial waste. Depending upon how you intend to use the data structure, it may not lend itself to your needs, but it is worth considering.
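Purely as a sketch of combining the two ideas above, here is how the hash-of-arrays could be built directly while reading line by line, rather than slurping first. The input format here is made up for illustration (the OP's real format isn't shown): each line is assumed to be a class name followed by the features present in it.

```perl
use strict;
use warnings;

# Fixed feature list; map each feature name to its array index.
my @features = qw[ featureA featureB featureC featureD featureE featureF featureG ];
my %index;
@index{ @features } = 0 .. $#features;

# Stand-in for reading a file; in real code this would be while( <$fh> ) { chomp; ... }
my @lines = (
    'classA featureA featureB featureC featureD',
    'classB featureA featureB featureE featureF',
);

my %cases;
for my $line ( @lines ) {
    my( $class, @present ) = split ' ', $line;
    # Create the zero-filled vector on first sight of this class.
    my $vector = $cases{ $class } ||= [ (0) x @features ];
    $vector->[ $index{ $_ } ] = 1 for @present;
}

print "$_ : @{ $cases{ $_ } }\n" for sort keys %cases;
```

This builds exactly the `%cases` hash-of-arrays shown above, one record at a time, so the per-line buffers are the only transient memory used.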
Of course, you could save another large chunk of memory by representing the presence or absence of a feature in each class as a '0' or '1' in a string and accessing it using substr:
# If $class has feature A
if( substr( $cases{ $class }, FEATURE_A, 1 ) ) { ...
# Add feature G to $class
substr( $cases{ $class }, FEATURE_G, 1 ) = '1';
Hiding the substr with an lvalue function would be even cleaner.
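A minimal sketch of that lvalue wrapper, assuming the string-per-class `%cases` layout above (the `feature` accessor name is made up for illustration):

```perl
use strict;
use warnings;

use constant FEATURE_A => 0;
use constant FEATURE_G => 6;

# Each class maps to a string of '0'/'1' flags, one per feature.
my %cases = ( classA => '1111000' );

# Hypothetical lvalue accessor hiding the substr; usable on both
# sides of an assignment.
sub feature :lvalue {
    my( $class, $bit ) = @_;
    substr $cases{ $class }, $bit, 1;
}

feature( 'classA', FEATURE_G ) = '1';                # set feature G
print "classA has feature A\n" if feature( 'classA', FEATURE_A );
```

The callers never see the substr arithmetic, so the representation could later be swapped for vec without touching them.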
You could also go a step further and save even more memory by packing the flags into the bits of a bit-string accessed with vec, but that's probably a step too far for most purposes. Maybe as you have a gig of RAM you feel the need to use it all :)
I absolutely hate it when bloated applications steal all my RAM or processor and stop me from running other stuff at the same time. Use as much as you need to, but don't waste it :)
Then again, I also hate it when restaurants serve me more food than I can eat.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller