
Fast seeking in a large array of hashes to generate a report.

by jbrugger (Parson)
on Jun 23, 2005 at 06:55 UTC

jbrugger has asked for the wisdom of the Perl Monks concerning the following question:

Hi coding Gods.

I have the following challenge (I hope I explain it clearly enough; if not, please tell!).
I have a data object that contains the following structure of data, filled from a database:

@Persons = (
    {
        firstname => "firstname",
        surname   => "lastname",
        age       => 25,
        place     => "place",
    },
    {
        firstname => "another firstname",
        surname   => "another lastname",
        age       => 25,
        place     => "place",
    },
    # etc. etc. ...
);

Next, we have a config file in which we can describe what we want to select (show) from the dataset:
e.g. all persons of age 25 with the surname aSurName.

<!-- this is just one of many expressions -->
<expr type="match">
    <field type="equal" key="age">25</field>
    <field type="equal" key="surname">aSurName</field>
</expr>

At the moment, we filter the content by looping over the array and looking up the values of the hashes by the key given in the XML file:

foreach $field (@Fields) {    # <- age and surname in this case
    scalar(@Persons) == 0 && return 0;
    for ($i = 0; $i < scalar(@Persons); $i++) {
        ($Persons[$i]->{$field->{key}} eq $field->{content}) && next;
        # When a match field cannot be found in a dataset element, the element
        # can be removed from the dataset, since all match fields have to be
        # found in the element; this way the element does not have to be
        # considered in the next iteration (next match field).
        splice(@Persons, $i, 1);
        $i--;
    }
}
# return...
Now this last function takes about 95% of the program's runtime, and I'm searching for a better (faster!) way to look up the data, where I have to match multiple fields of a hash against another hash in the array, and where the next iteration may look at different keys, as given by the XML input file.
Any ideas would be appreciated!

Thanks for the replies. It basically works like this:

- Get DB dataset (in a regular and optimized fashion, speeding up the DBI)
- Fill Dataobject
- Pass the Dataobject around the entire program (by reference); the different parts (objects) can perform a custom filter on the object to select the information they need from it. The expressions used to filter can be eq, <, >, regexp etc., as described in the XML file
- Clear the reaction object
- Next DB dataset...
If I build a query, I have to set selections in about 100 places in a huge query for about 10,000,000 record sets.

The second option given (a lookup table) did not speed it up. I've tried that, but I would have to recreate the hash table 10,000,000 times and search in it.

Zaxo, your idea is interesting, but I don't know the matching fields beforehand. How would you fill the grep without using eval(), which slows everything down?

"We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.

Replies are listed 'Best First'.
Re: Fast seeking in a large array of hashes to generate a report.
by Zaxo (Archbishop) on Jun 23, 2005 at 07:29 UTC

    Your splice is one obvious inefficiency. You could push onto another array or grep. The resulting array of indexes could be used in a slice of @Persons.
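    For example, a grep that keeps only matching records (a minimal sketch reusing the @Persons and @Fields names from the question):

        # Keep a person only if no filter field fails to match, instead of
        # splicing non-matches out of @Persons in place.
        my @keep = grep {
            my $person = $_;
            ! grep { $person->{ $_->{key} } ne $_->{content} } @Fields;
        } @Persons;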

    Where is @Persons populated from? If a file, you could do your filtering as you read it. If a database, you can make it do all the work.

    To make executable filters from form data, be cautious. You'll be tempted to eval, but that is inviting trouble. Try setting up a dispatch table (a hash of code references) of functions to call via the field names, with field name and value as arguments. Where you have several fields, grep is again your friend.
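    A dispatch table along those lines might look like this (a sketch; it assumes each field hashref also carries the XML type attribute, e.g. type => 'equal', which the question's @Fields snippet does not show):

        # Hypothetical dispatch table keyed on the field's type attribute.
        my %op = (
            equal => sub { $_[0] eq $_[1] },    # string equality
            match => sub { $_[0] =~ $_[1] },    # treat content as a regex
        );

        my @keep = grep {
            my $person = $_;
            ! grep {
                my $test = $op{ $_->{type} } or die "unknown operator $_->{type}";
                ! $test->( $person->{ $_->{key} }, $_->{content} );
            } @Fields;
        } @Persons;

    No eval of user data is involved: the config only selects which prebuilt code reference runs.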

    After Compline,

Re: Fast seeking in a large array of hashes to generate a report.
by demerphq (Chancellor) on Jun 23, 2005 at 08:26 UTC

    One thing to do is to construct a subroutine to do the processing logic. Your XML input file presumably isn't changing for the duration of the run, so there's no need to loop over the possibilities for each input record. Instead, your loop over the XML file should happen once, manufacturing Perl code that does the condition logic. This is then wrapped up in a subroutine and eval()ed into existence. The resulting sub will be considerably faster than doing all of that processing over and over.

    IMO you are encountering a common problem for Perl programmers: using low-level coding tactics inside of a scripting language. Consider that you are in essence writing an interpreter, and you aren't doing it particularly efficiently. Not only that, you are writing an interpreter in a language that is itself interpreted (although at an opcode level). So in order to bypass the problem of an interpreter running inside an interpreter, what you should be doing is exploiting the Perl compiler/interpreter to do this for you.

    IOW, what you want to be writing is a translator that converts from your mini-language to Perl. Then let Perl rip. Also, since the code you write will be writing code, some of the normal style and good-practice rules don't apply in the generated code, which allows you to do things like strip out all of the unnecessary code and do things very efficiently. For instance, by unrolling the field-processing loop you might end up with hundreds of statements that are almost the same but not quite, the sort of thing one would normally advise against, but since it's generated code, who cares? The code doing the generation doesn't look like that, so it's not really a problem.

    I've seen approaches like this cut per-record processing times massively. Make the processing loop bare; unroll every loop, method call and subroutine inside of it that you can. IOW, treat subroutines as macros and do the looping in the code generation if you possibly can. Reuse lexical vars when you know you can, precompute as much as you can outside of the subroutine, and bind it in via Perl's closure mechanism. Even things like using hashes as records can be sacrificed. Instead, in the generated code use fetchrow_array and put the results into lexicals, and then where your logic normally would have been $rec->{field}, replace it with $field to avoid the unnecessary hash lookup (i.e., where you can preresolve a hash lookup at compile time, do so). All of this stuff can make a really serious difference to how fast your code runs.
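    A minimal sketch of that idea, using the @Fields structure from the question (it handles only the string-equality case, and the values are interpolated naively, so quoting would need care in real code):

        # Translate the filter description into Perl source once ...
        my $src = join " && ",
            map { "\$_[0]{'" . $_->{key} . "'} eq '" . $_->{content} . "'" } @Fields;

        # ... compile it once ...
        my $filter = eval "sub { $src }" or die $@;

        # ... and apply the compiled sub per record, with no per-record
        # interpretation of the filter description.
        my @keep = grep { $filter->($_) } @Persons;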

    Update: A note about using a DB: if you can possibly do your filtering in a DB, then you should. But it's worthwhile learning techniques for when using the DB is not possible or practical. For instance, if the filtering rules were complex, it might mean a fetch per rule, which on a large table or one without relevant indexes could be quite problematic. Likewise, the data source could be a flat file, or some other medium where queries against pre-established indexes weren't possible.


Re: Fast seeking in a large array of hashes to generate a report.
by Tanalis (Curate) on Jun 23, 2005 at 07:20 UTC
    If you're populating the dataset from a database, then restricting that dataset from a config file, why not simply generate some SQL to restrict the dataset you're downloading from the DB in the first place?

    The database can almost certainly perform this search-and-restrict quicker server-side than you can by downloading all data and then restricting what you display.
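    For example, the WHERE clause could be generated from the same @Fields structure the OP already builds (a sketch; it handles only the equality case, the table name persons is hypothetical, and the keys should be checked against a whitelist of column names since they come from the config):

        # Build a parameterised WHERE clause plus its bind values.
        my $where = join ' AND ', map { "$_->{key} = ?" } @Fields;
        my @bind  = map { $_->{content} } @Fields;

        my $sql = "SELECT firstname, surname, age, place FROM persons WHERE $where";
        # $sql and @bind can then be handed to DBI (prepare/execute or selectall_arrayref).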

    Edited for clarity.
Re: Fast seeking in a large array of hashes to generate a report.
by barrachois (Pilgrim) on Jun 23, 2005 at 07:37 UTC
    First, you might consider using a relational database (SQLite, MySQL, ...) - that's what they're for, eh?

    Second, if you really want to code it yourself, and depending on how large the menu of possible selections is, you could generate lookup hashes ahead of time. If, for example, you have 200 possible surnames and 50 possible ages, then you could generate a hash with keys 'surname-age' (e.g. 'smith-25') whose values are something like a reference to a list of indices into @Persons (e.g. [ 10, 102, 145, ... ] ). Seems like a classic "space-vs-speed" trade off.
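    A sketch of such a precomputed index, assuming the same @Persons array (the 'surname-age' key format is just one possible choice):

        # Build the index once ...
        my %by_surname_age;
        for my $i (0 .. $#Persons) {
            my $key = lc($Persons[$i]{surname}) . '-' . $Persons[$i]{age};
            push @{ $by_surname_age{$key} }, $i;
        }

        # ... then each lookup is a single hash fetch plus an array slice.
        my @matches = @Persons[ @{ $by_surname_age{'smith-25'} || [] } ];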

    An intermediate approach might be to somehow cache the results of each filter loop, so that the slow part only happens once.

Re: Fast seeking in a large array of hashes to generate a report.
by rev_1318 (Chaplain) on Jun 23, 2005 at 07:23 UTC
    Why would you store all your records in a hash and then drop all you don't need? Wouldn't it be faster to query the database based on the match-requirements? That saves you the trouble of walking through the array.

    If the database queries aren't an option, you could look for a unique key and transform the array into a hash, based on that key. That can save you a lot of time.
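    For example, keyed on surname (a sketch; it assumes surname happens to be unique per record, which the question does not guarantee):

        # Index the records by the chosen key for O(1) lookups.
        my %by_surname = map { $_->{surname} => $_ } @Persons;
        my $rec = $by_surname{'aSurName'};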


Re: Fast seeking in a large array of hashes to generate a report.
by robartes (Priest) on Jun 23, 2005 at 08:49 UTC

    Perhaps you could change the way you build your Dataobject. Instead of one array of hashes, you could build four hashes of (arrays of) hashes, each with a different field as the key. So, you could for example build a HoAoH with surname as key:

    %Persons_by_surname = (
        'Schwartz' => [
            { ..person 1.. },
            { ..person 2.. },
        ],
        'Wall' => [
            { ..person 1.. },
        ],
    );

    And the same for the other fields (except perhaps for age, where there will be a limited number of discrete values). Instead of looping over an array, this changes your algorithm to a hash lookup, which should be faster.

    To facilitate your later lookup phase, you could put these data hashes in a wrapper hash like this:

    my %data = (
        'surname'   => \%Persons_by_surname,
        'firstname' => \%Persons_by_firstname,
        # ... and so on
    );

    Doing this, you'll have an initial hit on performance when building the extra hashes, but you should have faster lookups in the end. You'll have to benchmark your scripts to see whether this is worth it.
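    A lookup against that structure might then look like this (a sketch; it assumes the wrapper hash above has been built):

        # All persons whose surname is 'Wall', via a single hash lookup
        # instead of a scan over the whole array.
        my @walls = @{ $data{surname}{Wall} || [] };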


      Correct, but this only works for the "eq" operator, not for a regexp (I tried).
      Update: I'll try it; I did not know about Tie::Hash::Regex. I'll let you know the result.

      "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.

        Have you looked at Tie::Hash::Regex -- that allows you to look for hash keys with a regular expression. I have no idea as to the performance impact of this though, so you'll have to benchmark it.
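        From memory, usage is roughly like this (an unverified sketch; check the module's documentation before relying on the exact fetch semantics):

            use Tie::Hash::Regex;

            tie my %h, 'Tie::Hash::Regex';
            $h{Schwartz} = 'persons named Schwartz';
            $h{Wall}     = 'persons named Wall';

            # An exact key behaves normally; a key with no exact match is
            # treated as a regex against the stored keys, as I recall.
            print $h{Wall}, "\n";
            print $h{'^Sch'}, "\n";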


Re: Fast seeking in a large array of hashes to generate a report.
by themage (Friar) on Jun 23, 2005 at 11:34 UTC
    Hi ppl, hi jbrugger. I created a small script that generates an array with 1,000,000 records as listed in your example, as follows:
    use strict;
    use Time::HiRes qw(time);

    my @data = ();
    my @names    = qw(albert arthur bernard bill charlie david jack jonh joseph mark michael peter steven);
    my @surnames = qw(bell brown canfield cornwell devlin doyle golden hoffman maclean powell rowling tovard twain warlick);
    my @places   = qw(amsterdan athens belfast berlin bern brussels copenhagen helsinki lisbon london luxenbourg madrid oslo paris rome stockholm vaduz vienna);

    for (my $i = 0; $i <= 1000000; $i++) {
        my $name    = $names[int(rand() * $#names)];
        my $surname = $surnames[int(rand() * $#surnames)];
        my $place   = $places[int(rand() * $#places)];
        my $age     = int(rand() * 50) + 25;
        my $rec = { name => $name, surname => $surname, place => $place, age => $age };
        push @data, $rec;
    }
    Copied @data to @Persons and @Person2, and used the @Fields array as follows:
    my @Fields = (
        { key => 'age',   content => 35 },
        { key => 'place', content => 'london' },
    );

    my @Persons = @data;
    my @Person2 = @data;

    print "Persons: $#Persons\nPerson2: $#Person2\n\n";

    my $field;
    my $person;
    my $i = 0;
    Timed your implementation with:
    my $ta = time();

    foreach $field (@Fields) {    # <- age and surname in this case
        scalar(@Persons) == 0 && last;
        for ($i = 0; $i < scalar(@Persons); $i++) {
            ($Persons[$i]->{$field->{key}} eq $field->{content}) && next;
            splice(@Persons, $i, 1);
            $i--;
        }
    }

    my $tb = time();
    And then made a small change in your filtering algorithm, so the main iteration is over the data, not over the filters, as follows:
    my $use  = 1;
    my @Rset = ();

    foreach $person (@Person2) {
        $use = 1;
        foreach $field (@Fields) {
            if ($person->{$field->{key}} ne $field->{content}) {
                $use = 0;
                last;
            }
        }
        if ($use) {
            push @Rset, $person;
        }
    }

    my $tc = time;
    And then printed some stats:
    my $iab = $tb - $ta;
    my $ibc = $tc - $tb;

    print "TIMES:\nab: $iab\nbc: $ibc\n";
    print "COUNTS:\nab: ", $#Persons, "\nbc: ", $#Rset, "\n";
    I was expecting some difference (not knowing whether better or worse; it was just a test), and got some. Unexpected, I admit. On my P4 3.2GHz, 1GB memory machine, these are some of the results:
    mpneves@voyager perl$ ./
    Persons: 1000000
    Person2: 1000000

    TIMES:
    ab: 17.5640239715576
    bc: 1.96199607849121
    COUNTS:
    ab: 1153
    bc: 1153

    mpneves@voyager perl$ ./
    Persons: 1000000
    Person2: 1000000

    TIMES:
    ab: 18.5665609836578
    bc: 1.95595502853394
    COUNTS:
    ab: 1185
    bc: 1185

    mpneves@voyager perl$ ./
    Persons: 1000000
    Person2: 1000000

    TIMES:
    ab: 24.4221110343933
    bc: 3.58428502082825
    COUNTS:
    ab: 1213
    bc: 1213
    Which means that my version is performing better than yours by a factor of at least three. Obviously, this is a test on my notebook, with random data; with your computers, with real data and filters, it may perform differently. All the code I used is here: if you join all the code segments as they are presented, you will have my complete code.
Re: Fast seeking in a large array of hashes to generate a report.
by srdst13 (Pilgrim) on Jun 23, 2005 at 10:43 UTC

    It seems to me that what is missing here is a comparison of the relative speed of using direct database queries versus using the "hash-in-memory" method. Since you seem most worried about speed, it seems reasonable to just code up the database query version and see if it is faster or not. I'm guessing that it will be, presuming you have indices on all your lookup columns and that you use appropriate DBI calls (sticking to arrayrefs, fetchall, etc.). But there really isn't any way to know unless you try, is there? Perl is a wonderful tool, but modern relational databases are also. Using both to their fullest is the goal.
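    Letting the database do the restriction and fetching the rows in one call might look like this (a sketch; the DSN, credentials, and table/column names are hypothetical stand-ins based on the example data):

        use DBI;

        # Hypothetical connection details.
        my $dbh = DBI->connect('dbi:mysql:database=people', 'user', 'password',
                               { RaiseError => 1 });

        # The database applies the filter; { Slice => {} } returns each row as
        # a hashref, matching the shape of the original @Persons entries.
        my $persons = $dbh->selectall_arrayref(
            'SELECT firstname, surname, age, place FROM persons WHERE age = ? AND surname = ?',
            { Slice => {} },
            25, 'aSurName',
        );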

    Also, note that you may not have to fetch all records in every case. If you have queries that only generate summary statistics, the DB can usually do those quickly as well, so you never have to retrieve those records directly at all.


      Yes, I know how to use DBI and a database.
      As stated before, the problem I get when I use SQL is that I create a monstrous query with inner and outer left and right joins, IF statements (not fast) and hundreds of WHERE clauses.
      The database is big and the structure complex; the problem as described here is only a simple representation of the real situation.

      "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.
        Sorry. Despite your explanation above, I didn't really grasp the complexity of the problem. Thanks for clarifying....

