Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Slowness when inserting into pre-extended array

by ryangabbard (Initiate)
on Jul 19, 2003 at 18:52 UTC ( [id://275917]=perlquestion: print w/replies, xml ) Need Help??

ryangabbard has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks - I need to read in a very large amount of data (~100 MB) from a file, process it a bit, and store it in an array of references to hashes. This is what I am doing at the moment (I'm a bit of a Perl newbie, so I doubtlessly have many inefficiencies here; suggestions appreciated):
my @cases=(); my @FILE=<DATA>; close(DATA); my $num_lines=scalar(@FILE); $#cases=$num_lines; #pre-extend array foreach my $line (@FILE) { if (($dot % 1000) == 0) { print STDERR "."; } $line=~/^(\S*) [0-9.]* (.*)$/o; my ($class, $feature_vector) = ($1, $2); my %case; $case{'class'}=$class; foreach my $feature (split /\s+/, $feature_vector) { $case{'fv'}{$feature}=1; } push @cases, \%case; $dot++; }
This is very fast for the first ~20,000 lines (out of a total of ~300,000), then suddenly slows down dramatically. Lack of memory is not the problem - at the point it slows down I still have upwards of 700 MB free. At first I thought the processing of each line into the case hash with its attendent splits and regular expressions was the problem, but if I alter the above code to:
my @FILE=<DATA>; close(DATA); my $fred; foreach my $line (@FILE) { if (($dot % 1000) == 0) { print STDERR "."; } $line=~/^(\S*) [0-9.]* (.*)$/o; my ($class, $feature_vector) = ($1, $2); my %case; $case{'class'}=$class; foreach my $feature (split /\s+/, $feature_vector) { $case{'fv'}{$feature}=1; } $fred= \%case; $dot++; }
then the entire file is processed on the order of 100 times more quickly. I've tried using something like $cases[$dot]=\%case or even making cases a hash indexed by case number, but both approaches exhibit a similar slow-down. Any ideas on why this slow-down occurs? (Perl version 5.6.1 being run under a Windows XP system with 1 GB RAM) Thanks, Ryan Gabbard

Replies are listed 'Best First'.
Re: Slowness when inserting into pre-extended array
by tilly (Archbishop) on Jul 19, 2003 at 19:24 UTC
    Well, first of all your pre-extending is not doing what you think. When you pre-extend an array, you fill it with undefs. When you push onto that, you are then adding to the end of the array, after the undefs. You need to insert by index.

    Personally I am inclined to forget about pre-extending. The doubling logic that Perl uses for allocating memory means that in building a large array it only moves pointers an average of once each. Adding extra logic in Perl to avoid this expense can't save much, and usually loses.

    Based on both theory and experiments that I just ran, your code should not have any significant algorithmic inefficiencies unless your data is slow. Given the overhead of Perl data structures, the amount of data, RAM, etc that you have, and your described performance profile I am fairly confident that your performance problem is memory pressure. I don't know the details of how Windows XP handles memory, but it would not surprise me in the least for it to either have complications in how it reports memory that mislead you, or for it to choose to page memory earlier than you think.

    My strong advice is to rework this script so that you don't need to have all of this data in RAM at once. (Without seeing the rest of the script I can't give good advice on how to do that.) Similarly you can save a big chunk of RAM by not storing the file in an array, process it as you read it instead.

      Some remarks on the code in the while loop:
      $line=~/^(\S*) [0-9.]* (.*)$/o;
      The /o modifier is not necessary, as the string to match is already a constant and is therefor already compiled at compile time. From perldoc perlop:

      PATTERN may contain variables, which will be interpolated (and the pattern recompiled) every time the pattern search is evaluated, except for when the delimiter is a single quote. (Note that $(, $), and $| are not interpolated because they look like end-of-string tests.) If you want such a pattern to be compiled only once, add a "/o" after the trailing delimiter. This avoids expensive run-time recompilations, and is useful when the value you are interpolating won't change over the life of the script. However, mentioning "/o" constitutes a promise that you won't change the variables in the pattern. If you change them, Perl won't even notice.

      my ($class, $feature_vector) = ($1, $2);
      Are you sure that the regex will always match? If not, you may introduce duplicates for each time there's no match. From perldoc perlre:

      NOTE: failed matches in Perl do not reset the match variables, which makes easier to write code that tests for a series of more specific cases and remembers the best match.

      Liz

        re: /o
        Thanks!
        re: will regexp always match?
        Yes - it's taken directly from some other code that reads the same file format with no problem.
Re: Slowness when inserting into pre-extended array
by Elian (Parson) on Jul 19, 2003 at 19:42 UTC
    Pre-extending can be very worthwhile if you undo it first. As tilly has pointed out, preextending the array gives you a bigger array filled with undefs and pushing on the end just makes it even bigger, since you're not overwriting any of those undefs. What you should do instead is:
    $#foo = 1_000_000; $#foo = 0;
    This will make the array big enough to hold 1 million elements (use whatever size you really need) then set it to be holding zero elements but.... perl doesn't release the memory it allocated to the array--it's still there, attached to the array, as a ready cache. When you push onto the array perl will reuse this space until you run out past where you pre-extended.
Re: Slowness when inserting into pre-extended array
by BrowserUk (Patriarch) on Jul 19, 2003 at 21:00 UTC

    It makes no sense to read the entire file into an array if you are then going to process that array element-by-element into a different data structure. Reading it line by line and building you data structure as you go makes much more sense, would save you a large amount of memory and almost certainly speed the whole thing up considerably.

    As tilly pointed out, pushing to a pre-extended array, is just extending it further, so there is no benefit from doing things that way.

    However, I think that your data structure is unnecessarially complicated for the data you wish to store. Your creating an array of hashes of hashes, but a large proportion of the data you are storing are constants. 'class' is a constant, 'fv' is a constant, the value of 1 for every member of the inner hash are all constants. Not only is this information redundantly chewing up a large amount of memory, it also (potentially) makes accessing and iterating your data more complex and slower than it need be. Eg.

    #! perl -slw use strict; use Devel::Size qw[size total_size]; my @cases = ( { class=>'classA', fv=>{ featureA=>1, featureB=>1, featureC=>1, fe +atureD=>1 } }, { class=>'classB', fv=>{ featureA=>1, featureB=>1, featureE=>1, fe +atureF=>1 } }, { class=>'classC', fv=>{ featureB=>1, featureC=>1, featureD=>1, fe +atureE=>1 } }, { class=>'classD', fv=>{ featureC=>1, featureD=>1, featureE=>1, fe +atureF=>1 } }, { class=>'classE', fv=>{ featureA=>1, featureD=>1, featureE=>1, fe +atureF=>1 } }, { class=>'classF', fv=>{ featureD=>1, featureE=>1, featureF=>1, fe +atureG=>1 } }, { class=>'classG', fv=>{ featureA=>1, featureC=>1, featureD=>1, fe +atureG=>1 } }, { class=>'classH', fv=>{ featureA=>1, featureB=>1, featureD=>1, fe +atureG=>1 } }, { class=>'classI', fv=>{ featureA=>1, featureC=>1, featureE=>1, fe +atureF=>1 } }, { class=>'classJ', fv=>{ featureB=>1, featureD=>1, featureF=>1, fe +atureG=>1 } }, ); use constant FEATURE_A=>0; use constant FEATURE_B=>1; use constant FEATURE_C=>2; use constant FEATURE_D=>3; use constant FEATURE_E=>4; use constant FEATURE_F=>5; use constant FEATURE_G=>6; my %cases = ( classA=>[ 1, 1, 1, 1, 0, 0, 0 ], classB=>[ 1, 1, 0, 0, 1, 1, 0 ], classC=>[ 0, 1, 1, 1, 1, 0, 0 ], classD=>[ 0, 0, 1, 1, 1, 1, 0 ], classE=>[ 1, 0, 0, 1, 1, 1, 0 ], classF=>[ 0, 0, 0, 1, 1, 1, 1 ], classG=>[ 1, 0, 1, 1, 0, 0, 1 ], classH=>[ 1, 1, 0, 1, 0, 0, 0 ], classI=>[ 1, 0, 1, 0, 1, 1, 0 ], classJ=>[ 0, 1, 0, 1, 0, 1, 1 ], ); print 'Array of hash of hash : ', total_size( \@cases ); print 'Hash of array : ', total_size( \%cases ); __END__ C:\test>test2 Array of hash of hash : 4088 Hash of array : 2534

    The two data structures above represent exactly the same information, but the latter requires 60% less space. multiply that out across 300,000 records and you have a substantial wastage. Depending upon how you intend to use the data structure, it may not lend itself to your needs, but it is worth considering.

    Of course, you could save another large chunk of memory by representing the presence or absence of a feature in each class by a '0' or '1' in a string and access it using substr

    # If $class has feature A if( substr $cases{ $class }, FEATURE_A ) ) {... # Add feature G to $class substr $cases{ $class }, FEATURE_G = '1';

    Hiding the substr with an lvalue function would be even cleaner.

    You could also go a step further, saving more memory and use bits in a bit-string and vec, but thats probably a step too far for most purposes. Maybe as you have a gig of RAM you feel the need to use it all :)

    I absolutely hate it when bloated applications steal all my ram or processor and stop me from running other stuff at the same time. Use as much as you need to, but don't waste it:)

    Then again, I also hate it when resturants serve me more food than I can eat.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      "It makes no sense to read the entire file into an array .... and almost certainly speed the whole thing up considerably." Oh, certainly. I had temporarily rewritten it this way to see if the slowness was due to readng the file or processing it. Unfortunately the sparse representation I'm using is necessary for the way the data structure will be applied; each record has about 30 features, but which features each record has varies among a set of about 150,000, making any representation which stores the values of all features for all records impractical. I also unfortunately must have all the data in memory at once in order to calculate various infromation statistics for each attribute and to do extensive swapping of records (I'm building a decision-tree learner). :-(
        Have you considered trying to use data structures which are backed by files, like BerkeleyDB or in a relational database like MySQL? Of course they will have to go to disk for data sometimes, but they should be able to make intelligent caching choices so that most of the time you can access data without going to disk...

        Yes, it is slower than having it all in RAM if you have the memory for that, but it is likely to make better file access choices than will be made by your OS paging randomly accessed RAM back and forth into virtual memory...

Re: Slowness when inserting into pre-extended array
by Taulmarill (Deacon) on Jul 19, 2003 at 19:33 UTC
    donīt know if it helps (iīm kind of newbie too) but what do you think about

    while (<DATA>) { $line = $_; ...do some stuff... }
    instead of storing all data in @FILE.
      This would solve some of the bloat, but probably not too much of the speed decrease that you've seen.

      What I think you're running into is a basic shortcoming of hashes in general, and Perl in particular. What happens is that each hash key is internally "hashed" into a 4-byte integer value. And these keys are then put in a sorted list, so you can quickly perform a binary search on it to check whether the key already exists or not.

      Ideally every different key should have a different hash value. Reality however shows that with huge amounts of keys, the chance of two different key values getting the same hash value, is actually quite large. This situation is then handled by creating a list of all the keys that have that same hash value. In some "worst case" scenarios, you might actually wind up with a single hash value for all keys given.

      The search in the linked list is linear and therefore needs more and more time for each key added.

      Now, the distribution of the hash keys in your dataset may be have some worst cases in it.

      So how can this be fixed? Good question. You could try changing the keys by adding some other, fixed string to the keys. This should change the distribution of the hash key values, hopefully reducing the worst case.

      If you're not stuck with the current Perl version, you might want to try out 5.8.1-RC1, as this has a new feature that will randomize the distribution of the hash keys. This in itself will not reduce the worst case scenario, but will make it a moving target, so you won't have that bad performance all of the time.

      This randomization was introduced because the delay that you experienced, is actually a way to create a Denial of Service attack for CGI-scripts.

      Hope this helps.

      Liz

        I strongly suspect that you have fallen into the common trap of taking a fact that you know and attempting to apply it semi-randomly where it doesn't apply. This trap is particularly bad because you have enough half-digested knowledge to sound authoritative, which can help give rise to myths.

        Therefore let me engage in some rumor control.

        First let me demonstrate that your theory cannot apply in this case. The worst-case hashing behaviour that you describe shows up very rarely when someone isn't doing something deliberately to cause it - else hashes would not be useful. Furthermore the original code has many small hashes being created, and unless the nature of the data changes sharply, there is absolutely no reason to believe that hashes created out of lines past line 20,000 are any different than hashes created out of lines before then. Therefore this cannot apply.

        Secondly allow me to explain why we don't want your meme spread. It is bad because you make people scared to use large hashes. This is inappropriate - hashes normally work just fine until you run out of memory. That is not guaranteed behaviour - there are cases where a given hashing algorithm won't work - but unless you fear someone being malicious to you, there is no need to worry about using them.

        Thirdly allow me to explain some fundamental misunderstandings that you seem to have. Hashes assign keys pseudo-randomly to buckets. Perl dynamically reallocates buckets so that the number of keys per bucket stays in a fixed range. Therefore if the hashing algorithm works (Perl's generally will unless the keys are maliciously chosen) the odds of a collision stays the same no matter how many keys there are. There are always a few collisions. Searching through buckets with them is part of the average amortized cost of doing a hash lookup. Large or small plays no significant role.

        But if someone is malicious, then keys are going to all go into the same bucket. And with Perl's linked-list implementation for buckets (alternatives to that exist - they have worse average performance though so they are not used) causing a quadratic performance problem. However quadratic performance with only a small amount of data performs just fine. In order to find a performance problem you have to both be malicious and you have to throw lots of data in.

        Which is why when you heard about this attack you heard about lots of data. Not because lots of data is normally bad for hashes. But because this attack will only work if, among other things, you throw lots of data into it.

        In summary, this performance attack is real. Go ahead and warn people about it. But it isn't something that you should look for right away when you run into performance problems.

        This doesn't seem to have anything at all to do with hashes, as the only difference between the slow and fast versions of the program is that the slow version pushes onto an array while the fast version doesn't. Given perl's behaviour with array extension, I'd pretty much guarantee that the problem is the array push, with a lesser possibility being a process memory freelist fragmentation from all the hash allocations, but I'd doubt that one.
        If you're concerned that this might be the case, try printing scalar(%hash). From perldata(1):
        If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals ""1/16"", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen.
Re: Slowness when inserting into pre-extended array
by sgifford (Prior) on Jul 20, 2003 at 05:18 UTC
    You could verify what part of your code is being slow either by using the Time::HiRes module to time different parts of your code:
    use Time::HiRes qw(time); use vars qw($then); sub timepart { my($partname)=@_; my $now = time; if ($partname) { $time{$partname} += $now - $then; } $then = $now; } timepart(); foreach my $line (@FILE) { timepart('loop'); if (($dot % 1000) == 0) { print STDERR "."; } timepart('mod'); $line=~/^(\S*) [0-9.]* (.*)$/o; my ($class, $feature_vector) = ($1, $2); timepart('regex'); my %case; $case{'class'}=$class; timepart('createcase'); foreach my $feature (split /\s+/, $feature_vector) { $case{'fv'}{$feature}=1; } timepart('feature'); push @cases, \%case; timepart('pushcase'); $dot++; } print "\n"; while(my($k,$v)=each(%time)) { printf "%10s: %10.6f\n",$k,$v; }
    or by reducing your code to a very small amount and seeing if it's still slow:
    for $i in (1..2_000_000) { my %case; push(@r,\%case); }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://275917]
Approved by sauoq
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-07-25 03:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.