PerlMonks  

How to improve this data structure?

by fiddler42 (Beadle)
on May 21, 2013 at 14:31 UTC
fiddler42 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I recently made the rookie mistake of authoring a Perl script against a sample data input file that was not realistic in size. Now I am running the script with much larger input files and it is painfully slow. Here is the root cause of the problem:-

1) A data input file is parsed, and each line of the file is added to an array like so:-

push (@StatsArray, {RegionNum=>$RegionNum, AR=>$AR[$RegionNum], BCR=>$BCR[$RegionNum]});
I like the data structure above because it is easy to understand and I can add new elements to the array as the script becomes more sophisticated. I included only 3 of 15 variables, and I expect the number of variables to increase a little more.

2) Once the array is fully populated (millions of entries) and I need to pluck data from it, I need to do two things: a) *always* sort numerically by RegionNum first, and then b) sort numerically by some other key. Here is an example:-
foreach $RegionCoords (@AllRegionCoords) {
    $RegionNum++;
    foreach ( sort { $$a{RegionNum} <=> $$b{RegionNum}
                     or $$a{AR} <=> $$b{AR} } @StatsArray ) {
        $RegionKey = $$_{RegionNum};
        $ARKey     = $$_{AR};
        if ( $RegionKey == $RegionNum ) {
            # do stuff with sorted AR data in region n
        }
    }
}
Again, easy to follow. The problem is there are many different region numbers and I am not breaking down the data into smaller chunks. For example, although there may be 1 million entries in @StatsArray, all of the entries are made up of, say, 10 region numbers with 100,000 entries in each one. So instead of sorting only 100,000 entries at a time, I am always sorting 1 million entries at a time. A lot of separate sorts are issued, hence the runtime problem.

I have two questions:-

1) How can I reformat the StatsArray to be dependent upon region numbers and make sorting faster? 10 StatsArrays instead of 1 would be fine.

2) If a new data structure is proposed, how would I numerically sort something like the AR key? This is really what I am struggling with the most: ways to sort only 1 of n keys in an array when its data structure gets more complicated.

I do not use arrays like this very often, so any suggestions would be much appreciated.

Thanks!

-fiddler42

Re: How to improve this data structure?
by choroba (Abbot) on May 21, 2013 at 14:41 UTC
    A simple fix which might help a bit is to grep the array before running sort on it:
    for ( sort { $a->{RegionNum} <=> $b->{RegionNum}
                 or $a->{AR} <=> $b->{AR} }
          grep $_->{RegionNum} == $RegionNum, @StatsArray ) {
        $ARKey = $_->{AR};
        # do stuff, no need to check the RegionKey.
    }

    Untested.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks a lot...this simple grep reduced runtimes by about 35%.
Re: How to improve this data structure?
by hdb (Parson) on May 21, 2013 at 14:47 UTC

    From your code snippet it looks to me as if the AR (and BCR) entry is always the same for a given $RegionNum. This can't be correct. Can you clarify?

      Some pre-processing happens before @StatsArray is populated, so the AR and BCR values (and others) are always unique.
Re: How to improve this data structure?
by davido (Archbishop) on May 21, 2013 at 14:50 UTC

    Are the region numbers sequential, or discontiguous?

    If sequential:

    my @StatsArray;
    push @{ $StatsArray[$RegionNum] }, {
        AR  => $AR[$RegionNum],
        BCR => $BCR[$RegionNum],
    };

    If sparse, this:

    my %Stats;
    push @{ $Stats{$RegionNum} }, { ... };

    Then you could do this:

    my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $StatsArray[$RegionNum] };

    ...for example.

    It might be that you're getting to the point where a database would scale better though. Another approach might be to just build a binary tree, maintaining nodes in sorted order. This allows for relatively inexpensive inserts and searches. But it does sound like you might need the scalability of a DB approach.


    Dave

      Agreed on the database -- you also then have the advantage that you can ingest the numbers once, and then run whatever processing needs to be done on it, iterating as you go.
      Most excellent: runtime reduction of 70%! BTW, there was a typo in the original post. Need...

      @{$StatsArray{$RegionNum}}

      ...no square brackets. Region numbers are sequential. Definitely need to think about building a DB, though...thanks for the help.

        If they're sequential, it should be @StatsArray, in which case @{$StatsArray[$RegionNumber]} would be appropriate, and probably even a little faster, since array index lookups carry less constant-time overhead than hash lookups.

        Here's a sort of loose and dirty explanation of why you get such a good speedup here. Let's assume that your original @StatsArray had 1_000_000 entries, and that there are ten regions, each of which has 100_000 entries.

        Your original approach was sorting 1_000_000 entries. Sort is an O(n log n) operation, so we can say that there were approximately 1M * log(1M) units of work going on.

        The grep approach helps because grep is an O(n) operation. So you walk through the million item list one time, and pull out 100_000 entries. Then you sort the 100_000 entries. So you have 1M + ( 100K * log(100K) ) units of work, approximately.

        My approach eliminates the need for the grep. So you do away with the "1M" units of work, and are left with 100K * log(100K) units of work.

        This is really a rough approximation of what's going on, but fits fairly well, and I think should help to explain why you see such an improvement.

        The database approach would still scale better, so that you don't have to rewrite the code when 1_000_000 entries becomes 100_000_000. ;)
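        The rough arithmetic above can be checked with a small Benchmark sketch. The data here is invented (scaled down from the thread's 1M-record example so it runs quickly); the 35% and 70% figures reported in this thread came from the real workload:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build 100_000 fake records spread across 10 regions.
my @StatsArray = map { +{ RegionNum => $_ % 10, AR => rand() } } 1 .. 100_000;

# Pre-bucketed variant: one array of records per region.
my @ByRegion;
push @{ $ByRegion[ $_->{RegionNum} ] }, $_ for @StatsArray;

my $RegionNum = 3;
cmpthese( 10, {
    sort_all  => sub {    # original approach: sort everything, test RegionNum later
        my @s = sort { $a->{RegionNum} <=> $b->{RegionNum}
                       or $a->{AR} <=> $b->{AR} } @StatsArray;
    },
    grep_sort => sub {    # choroba's suggestion: grep first, then sort the subset
        my @s = sort { $a->{AR} <=> $b->{AR} }
                grep { $_->{RegionNum} == $RegionNum } @StatsArray;
    },
    bucketed  => sub {    # davido's suggestion: sort only the pre-built bucket
        my @s = sort { $a->{AR} <=> $b->{AR} } @{ $ByRegion[$RegionNum] };
    },
} );
```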


        Dave

Re: How to improve this data structure?
by sundialsvc4 (Monsignor) on May 21, 2013 at 15:38 UTC

    ++ on using a database

    SQLite files are wonderful for this sort of thing ... there is no “server.” There are useful utilities for importing text files and so forth. The only gotcha, when you finally get down to programming that updates things, is a rather important one: use transactions. SQLite is specifically designed to commit data to disk and to re-read the data to verify it, every single time, unless a transaction is in progress. (If one is, it buffers the data much more sensibly.) But, with that being said, it is an excellent and robust tool ... and free. (It is actually in the public domain.)

    You’ve got millions of records to deal with, and you can’t be writing a Perl program every single time . . . You might find you need to get at this data in all sorts of ways – reports, spreadsheets, who-knows. SQLite can take you there.
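    A minimal DBI + DBD::SQLite sketch of the workflow described above. The table and column names are invented for illustration; note the explicit transaction around the bulk insert, which is the "gotcha" mentioned:

```perl
use strict;
use warnings;
use DBI;

# In-memory database for illustration; use a filename to persist.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do("CREATE TABLE stats (region_num INTEGER, ar REAL, bcr REAL)");

# Bulk insert inside one transaction -- without this, SQLite commits
# every single row to disk, which is dramatically slower.
$dbh->begin_work;
my $sth = $dbh->prepare("INSERT INTO stats VALUES (?, ?, ?)");
$sth->execute( $_ % 10, rand(), rand() ) for 1 .. 10_000;
$dbh->commit;

$dbh->do("CREATE INDEX stats_region_ar ON stats (region_num, ar)");

# The per-region numeric sort becomes a query; the index does the heavy lifting.
my $rows = $dbh->selectall_arrayref(
    "SELECT ar, bcr FROM stats WHERE region_num = ? ORDER BY ar",
    undef, 3 );
printf "region 3 has %d rows\n", scalar @$rows;
```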

      I assume you wanted to say something like:

      Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        Yeah, more or less.

        “A million records” is a volume that is “vaguely interesting” to SQLite ... it will take a few minutes’ one-time cost to import the data (and might not require a program). A couple of minutes more to add some indexes. From that point on, you can use any tool or combination of tools that has a DB interface (including Perl, of course ...) to get the job done, and now the query engine is the one doing all the heavy lifting. So long as the “transactions” caveat is carefully adhered to, especially when doing updates, it’s really quite a remarkable piece of software engineering. (It’s rare that a piece of software genuinely blows me away. Perl/CPAN did it. So did this.) I suspect that most of the things the O.P. is right now “writing programs to do” can probably be reduced to a query (and perhaps a now-trivial program to digest the results). Furthermore, a huge bonus is that you can put results into a different table, which of course is a self-describing data structure. (A single SQLite file can contain any number of tables.)

Re: How to improve this data structure?
by BrowserUk (Pope) on May 21, 2013 at 19:11 UTC

    Why an array of hashes? If you used a hash of arrays of hashes:

    push @{ $statHash{$RegionNum} }, { AR => $AR[$RegionNum], BCR => $BCR[$RegionNum], ... };

    You only need to sort the subset of hashes in each region (and avoid the grep).
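    Put together, the hash-of-arrays-of-hashes approach might look like this (variable names follow the thread; the sample data is invented for the sketch):

```perl
use strict;
use warnings;

my %statHash;

# Populate: one bucket of records per region (fake data for illustration).
for my $i ( 1 .. 1000 ) {
    my $RegionNum = $i % 10;
    push @{ $statHash{$RegionNum} }, { AR => rand(), BCR => rand() };
}

# Sort each region's records by AR independently -- no grep needed,
# and each sort touches only that region's bucket.
for my $RegionNum ( sort { $a <=> $b } keys %statHash ) {
    my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $statHash{$RegionNum} };
    # do stuff with sorted AR data in region $RegionNum
}
```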


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to improve this data structure?
by space_monk (Chaplain) on May 21, 2013 at 23:14 UTC

    Everyone else has offered good advice on the array processing; however, it may also be worth asking how you are reading the data from the file. At one million lines of data, improvements in transferring the data from file to memory may also cut your runtime.
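    For example, reading line by line through a lexical filehandle with a single split per line is usually much cheaper than running several regex matches per line. A sketch, assuming a whitespace-separated RegionNum/AR/BCR format (the actual input format was not shown in the thread):

```perl
use strict;
use warnings;

# A scalar-ref filehandle keeps this sketch self-contained; in the real
# script this would be:  open my $fh, '<', $path or die "$path: $!";
my $sample = "0 1.5 2.5\n0 1.2 2.1\n1 0.9 3.3\n";
open my $fh, '<', \$sample or die $!;

my @StatsArray;
while ( my $line = <$fh> ) {
    # One split per line; split ' ' also discards the trailing newline.
    my ( $RegionNum, $AR, $BCR ) = split ' ', $line;
    push @StatsArray, { RegionNum => $RegionNum, AR => $AR, BCR => $BCR };
}
close $fh;
```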

    If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: How to improve this data structure?
by TomDLux (Vicar) on May 22, 2013 at 20:13 UTC

    If you want to preserve the data, a database is definitely an interesting solution. I haven't tried SQLite and similar tools when I have only a transient need for the data --- traditional extract / transform / load situations.

    If you know the number of region numbers is significantly less than the number of records, you could use multiple StatsArrays, one for each region number. Since you don't know how many there will be, the obvious solution is to use an array of arrays ... the outer array for the region, the inner array for records in that region. I only wish your regions were numbered 0, 1, 2, 3, ... but that's stretching things a bit, so let's just use a hash that maps region number to array element.

    I haven't run it, so there may be bugs. When a new RegionNum comes along, you detect it isn't in the hash, so you add a new entry with the next array slot, and provide an anonymous array for the records to go in. Then, when you need the records sorted, it is far easier to sort the subset with the same RegionNum. Sorting N buckets of M records each costs N*(M log M), which is faster than one sort of NM log(NM).

    package RegionRecords;

    my %regionNums;
    my @all_regions;
    my $N = 0;

    sub add {
        my ($record) = @_;
        if ( !exists $regionNums{ $record->{RegionNum} } ) {
            $regionNums{ $record->{RegionNum} } = $N;
            $all_regions[$N] = [];
            $N++;
        }
        push @{ $all_regions[ $regionNums{ $record->{RegionNum} } ] }, $record;
    }

    sub allRecords { return @all_regions }

    1;

    package main;

    RegionRecords::add( { RegionNum => $RegionNum,
                          AR  => $AR[$RegionNum],
                          BCR => $BCR[$RegionNum] } );
    print RegionRecords::allRecords();

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Node Type: perlquestion [id://1034553]
Front-paged by Arunbear