Re: Slowness when inserting into pre-extended array

by BrowserUk (Patriarch)
on Jul 19, 2003 at 21:00 UTC [id://275937]


in reply to Slowness when inserting into pre-extended array

It makes no sense to read the entire file into an array if you are then going to process that array element by element into a different data structure. Reading it line by line and building your data structure as you go makes much more sense: it would save you a large amount of memory and almost certainly speed the whole thing up considerably.
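For example, something along these lines (the input layout -- a class name followed by its feature names on each line -- and the filename variable are just assumptions for illustration):

use strict;
use warnings;

my %records;
open my $fh, '<', $file or die "Cannot open '$file': $!";
while( my $line = <$fh> ) {
    chomp $line;
    my( $class, @features ) = split ' ', $line;
    # Build the structure for this record immediately,
    # rather than keeping the raw line around.
    $records{ $class }{ $_ } = 1 for @features;
}
close $fh;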

As tilly pointed out, pushing to a pre-extended array just extends it further, so there is no benefit to doing things that way.

However, I think that your data structure is unnecessarily complicated for the data you wish to store. You're creating an array of hashes of hashes, but a large proportion of the data you are storing is constant: 'class' is a constant key, 'fv' is a constant key, and the value of 1 for every member of the inner hash is a constant. Not only is this redundant information chewing up a large amount of memory, it also (potentially) makes accessing and iterating your data more complex and slower than it needs to be. E.g.

#! perl -slw
use strict;
use Devel::Size qw[size total_size];

my @cases = (
    { class=>'classA', fv=>{ featureA=>1, featureB=>1, featureC=>1, featureD=>1 } },
    { class=>'classB', fv=>{ featureA=>1, featureB=>1, featureE=>1, featureF=>1 } },
    { class=>'classC', fv=>{ featureB=>1, featureC=>1, featureD=>1, featureE=>1 } },
    { class=>'classD', fv=>{ featureC=>1, featureD=>1, featureE=>1, featureF=>1 } },
    { class=>'classE', fv=>{ featureA=>1, featureD=>1, featureE=>1, featureF=>1 } },
    { class=>'classF', fv=>{ featureD=>1, featureE=>1, featureF=>1, featureG=>1 } },
    { class=>'classG', fv=>{ featureA=>1, featureC=>1, featureD=>1, featureG=>1 } },
    { class=>'classH', fv=>{ featureA=>1, featureB=>1, featureD=>1, featureG=>1 } },
    { class=>'classI', fv=>{ featureA=>1, featureC=>1, featureE=>1, featureF=>1 } },
    { class=>'classJ', fv=>{ featureB=>1, featureD=>1, featureF=>1, featureG=>1 } },
);

use constant FEATURE_A=>0;
use constant FEATURE_B=>1;
use constant FEATURE_C=>2;
use constant FEATURE_D=>3;
use constant FEATURE_E=>4;
use constant FEATURE_F=>5;
use constant FEATURE_G=>6;

my %cases = (
    classA=>[ 1, 1, 1, 1, 0, 0, 0 ],
    classB=>[ 1, 1, 0, 0, 1, 1, 0 ],
    classC=>[ 0, 1, 1, 1, 1, 0, 0 ],
    classD=>[ 0, 0, 1, 1, 1, 1, 0 ],
    classE=>[ 1, 0, 0, 1, 1, 1, 0 ],
    classF=>[ 0, 0, 0, 1, 1, 1, 1 ],
    classG=>[ 1, 0, 1, 1, 0, 0, 1 ],
    classH=>[ 1, 1, 0, 1, 0, 0, 0 ],
    classI=>[ 1, 0, 1, 0, 1, 1, 0 ],
    classJ=>[ 0, 1, 0, 1, 0, 1, 1 ],
);

print 'Array of hash of hash : ', total_size( \@cases );
print 'Hash of array : ', total_size( \%cases );

__END__
C:\test>test2
Array of hash of hash : 4088
Hash of array : 2534

The two data structures above represent exactly the same information, but the latter requires only about 60% of the space (2534 bytes versus 4088). Multiply that out across 300,000 records and you have a substantial waste. Depending upon how you intend to use the data structure, it may not lend itself to your needs, but it is worth considering.
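For comparison, testing whether a record has featureC looks like this in each form ($i and $class here stand in for whatever loop variables you would actually use):

# Array-of-hash-of-hash form:
if( exists $cases[ $i ]{ fv }{ featureC } ) { ...

# Hash-of-array form:
if( $cases{ $class }[ FEATURE_C ] ) { ...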

Of course, you could save another large chunk of memory by representing the presence or absence of a feature in each class by a '0' or '1' in a string and accessing it using substr:

# (Here each $cases{ $class } is a string such as '1111000'.)

# If $class has feature A
if( substr( $cases{ $class }, FEATURE_A, 1 ) ) { ...

# Add feature G to $class
substr( $cases{ $class }, FEATURE_G, 1 ) = '1';

Hiding the substr with an lvalue function would be even cleaner.
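For instance, something like this (the accessor name feature() is hypothetical, and it assumes the %cases of '0'/'1' strings above is in scope):

# An lvalue accessor: the returned substr can be both read and assigned to.
sub feature :lvalue {
    my( $class, $index ) = @_;
    substr( $cases{ $class }, $index, 1 );
}

# Read:
print "$class has feature A\n" if feature( $class, FEATURE_A );

# Write:
feature( $class, FEATURE_G ) = '1';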

You could also go a step further, saving yet more memory, and use bits in a bit-string via vec, but that's probably a step too far for most purposes. Then again, since you have a gig of RAM, maybe you feel the need to use it all :)
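If you did want to go that far, it would look something like this (%bitcases is a hypothetical bit-string version of %cases, one bit per feature instead of one byte):

my %bitcases;
$bitcases{ classA } = '';

# Set features A and D for classA (the third argument to vec is the bit width).
vec( $bitcases{ classA }, FEATURE_A, 1 ) = 1;
vec( $bitcases{ classA }, FEATURE_D, 1 ) = 1;

# Test for a feature.
print "classA has feature A\n" if vec( $bitcases{ classA }, FEATURE_A, 1 );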

I absolutely hate it when bloated applications steal all my RAM or processor time and stop me from running other things at the same time. Use as much as you need to, but don't waste it. :)

Then again, I also hate it when restaurants serve me more food than I can eat.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re: Re: Slowness when inserting into pre-extended array
by ryangabbard (Initiate) on Jul 19, 2003 at 21:54 UTC
    "It makes no sense to read the entire file into an array .... and almost certainly speed the whole thing up considerably." Oh, certainly. I had temporarily rewritten it this way to see if the slowness was due to readng the file or processing it. Unfortunately the sparse representation I'm using is necessary for the way the data structure will be applied; each record has about 30 features, but which features each record has varies among a set of about 150,000, making any representation which stores the values of all features for all records impractical. I also unfortunately must have all the data in memory at once in order to calculate various infromation statistics for each attribute and to do extensive swapping of records (I'm building a decision-tree learner). :-(
      Have you considered using data structures which are backed by files, like BerkeleyDB, or putting the data in a relational database like MySQL? Of course they will have to go to disk for data sometimes, but they should be able to make intelligent caching choices, so that most of the time you can access data without going to disk...

      Yes, it is slower than having it all in RAM if you have the memory for that, but it is likely to make better file-access choices than your OS will make when paging randomly accessed RAM back and forth to virtual memory...
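      For example, a disk-backed hash along these lines using the DB_File module that ships with most perls (the file name and the packed feature strings are just assumptions for illustration):

      use strict;
      use warnings;
      use Fcntl;          # for O_RDWR and O_CREAT
      use DB_File;        # ties a hash to a Berkeley DB file on disk

      # Keys and values live in 'features.db' rather than in RAM.
      my %features;
      tie %features, 'DB_File', 'features.db', O_RDWR|O_CREAT, 0666, $DB_HASH
          or die "Cannot tie features.db: $!";

      $features{ classA } = '1101000';     # store a packed feature string
      print $features{ classA }, "\n";     # read it back; the DB handles caching

      untie %features;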

        hm - that could be useful. I'll look into that. Thanks!
