http://www.perlmonks.org?node_id=481551

clearcache has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing some Perl code to do some statistical analysis on two data files which contain records that may or may not be related. I have some code that creates a scoring system to rank matches between records - all records from file a are matched with those records from file b that are possible matches. Each possible match is assigned a score. I have approx. 10 million possible combinations.

My data is stored like this:

$rMatch->{$File1RecordID}->{$File2RecordID} = $ranking;
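For concreteness, a structure like that might be populated along these lines (the record IDs and scores here are invented for illustration):

```perl
use strict;
use warnings;

# Build the nested ranking hash from hypothetical score data:
# each (File1 record, File2 record) pair maps to its score.
my $rMatch = {};
my @pairs = (
    [ 101, 201, 5  ],    # File1 record 101 vs File2 record 201, score 5
    [ 101, 202, 12 ],
    [ 102, 201, 9  ],
);
for my $pair (@pairs) {
    my ( $id1, $id2, $score ) = @$pair;
    $rMatch->{$id1}{$id2} = $score;
}
```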

My first step in matching the data is to find all those records in file 2 that are a top ranked match with only one record in file 1. I consider those "strong matches" and want to eliminate them from the pool of records before matching the remaining data.

The second step is to create combinations of probable matches between all remaining records from file 1 and all records from file 2. The lowest overall "score" for that match will be the recordset alignment that I go with for the remaining records.

My problem? The process of identifying the "strong matches" is too CPU intensive, running at 100% usage for a long time. My CPU is actually overheating and my machine is shutting down. I can mitigate that with a sleep(1) statement, but that's not very elegant...it slows this massive task down...and I suspect the problem is the number of times that I sort my hash in the program to identify strong matches. In other words, inefficiency in my code ;) Is there a better way to do this? This sort is the only way I know to find the lowest value in my hash.

In the code below, id1 is my id from the first file, id2 is the id from the second file, and $rC holds the ranking for the combination of id1 and id2. Basically, I want to check all other combinations of records for the existence of id2 as a top rank. If I do not find any other combinations, I'm reasonably confident that those 2 records should be matched.

sub IsStrongMatch {
    # Return true if id2 is only top ranked match for id1
    my $id1 = shift;
    my $id2 = shift;
    my $rC  = shift;
    for my $i1 ( keys %{$rC} ) {
        next if $i1 == $id1;
        foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                         keys %{$rC->{$i1}} ) {
            if ( $id2 == $i2 ) { return 0; }
            last;
        }
    }
    return 1;
}

2006-08-31 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation


Original title: 'Efficiency'

Replies are listed 'Best First'.
Re: How can I improve the efficiency of this very intensive code?
by sk (Curate) on Aug 06, 2005 at 21:23 UTC
    I feel a hash might not be required for your task. You have a recordID which can act as an index into an array, so why put the records in a hash and mess up the order? You get direct access by index in an array anyway, with none of the hash-table overhead.

    That said, I would use an n x n matrix (square is not a requirement; the dimensions might change based on the number of records, of course) to keep track of scores. Consider the following table:

    File1rec \ File2rec |  1   2   3   4   5   6
    --------------------+------------------------
                      1 |  3   7   8   9   9  10
                      2 |  4   8   3   1   1   6
                      3 |  1   4   9   4   9   7
                      4 |  4   3  10   7   2   3
                      5 |  4   2   5   9   9   5
                      6 |  5   6   2   5   6   9
    The values inside the cells are the scores. Now if you want the best matching score (max value), then an O(n) max will provide the answer for one record, and you have to do that n times, once for each record in your first file.

    Sorting to find a max/min is overkill. I might be missing your problem, so please correct me if I am wrong.
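sk's point - a single O(n) pass per row instead of an O(n log n) sort - can be sketched with List::Util (the row values are taken from the example table above; `max` is used here because sk ranks by highest score):

```perl
use strict;
use warnings;
use List::Util qw( max );

# One row of scores for a File1 record against File2 records 1..6
# (values from the first row of the example table).
my @row = ( 3, 7, 8, 9, 9, 10 );

# O(n) single pass instead of sorting the whole row:
my $best_score = max @row;

# Index of the best score, if the matching File2 record is needed:
my ($best_idx) = grep { $row[$_] == $best_score } 0 .. $#row;
```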

    cheers

    SK

      I was thinking about the use of arrays...my ids are pretty big numbers so I wouldn't use them alone as array indices. I could always use $., however, when I read in the file rather than the id.

      My ranking is based on # of seconds from last log entry in one file to first log entry in the second file. So I create scoring by looking at # of seconds between each record. My ability to identify a "strong match" comes from the rate of concurrent users in the application that my data comes from. Low concurrent users, I'll have lots of strong matches - records that clearly line up. If I have high concurrent users with lots of log file entries, then I've got to get a little creative.

      I was sorting b/c my hash is being used to store # of elapsed seconds...not a true "rank" in terms of 1, 2, 3, etc.

      I'm considering the use of arrays, but don't want to lose the elapsed seconds as data quite yet b/c that will be used in the next step to figure out the best match from the remaining data.
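One way to get array indexing without losing the elapsed-seconds data is to map the big record IDs to dense indices as they are first seen, keeping the seconds in an array-of-arrays. A minimal sketch (the sub and variable names are invented, as are the IDs):

```perl
use strict;
use warnings;

my ( %idx1, %idx2, @ids1, @ids2 );
my @score;    # $score[$i][$j] = elapsed seconds for pair (i, j)

# Assign each big ID a small, dense array index on first sight,
# then store the elapsed seconds at that position.
sub add_score {
    my ( $id1, $id2, $seconds ) = @_;
    unless ( exists $idx1{$id1} ) {
        push @ids1, $id1;
        $idx1{$id1} = $#ids1;
    }
    unless ( exists $idx2{$id2} ) {
        push @ids2, $id2;
        $idx2{$id2} = $#ids2;
    }
    $score[ $idx1{$id1} ][ $idx2{$id2} ] = $seconds;
}

add_score( 987654321, 123456789, 42 );    # made-up IDs and seconds
```

The @ids1/@ids2 arrays let you translate an index back to the original record ID when reporting matches.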

Re: How can I improve the efficiency of this very intensive code?
by polettix (Vicar) on Aug 06, 2005 at 21:50 UTC
    This sort is the only way I know to find the lowest value in my hash
    Heh - finding the lowest value in a set (according to a given cost function) requires traversing the set only once - no sort needed! I don't have the will to read exactly what your code is doing, but this should solve the specific ordering problem using List::Util:
    use List::Util qw( reduce );

    # ... later in the code

    sub IsStrongMatch {
        # Return true if id2 is only top ranked match for id1
        my $id1 = shift;
        my $id2 = shift;
        my $rC  = shift;
        for my $i1 ( keys %{$rC} ) {
            next if $i1 == $id1;
            my $href  = $rC->{$i1};
            my $minid = reduce { $href->{$a} < $href->{$b} ? $a : $b }
                        keys %$href;
            return 0 if defined $minid && $id2 == $minid;
        }
        return 1;
    }
    Untested!!!

    Update: I realised that this answer is a little cryptic. reduce in List::Util allows you to visit each element in an array and do "something" based on the values you encounter along the path. What this code does (more or less) is probably best explained by mrborisguy in this post.
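As a tiny standalone illustration of what reduce does here - picking the key with the smallest value in one pass (toy data, invented for this example):

```perl
use strict;
use warnings;
use List::Util qw( reduce );

my %seconds = ( a => 30, b => 7, c => 19 );

# reduce keeps whichever of $a/$b wins the comparison, so after one
# pass $min_key holds the key with the smallest value.
my $min_key = reduce { $seconds{$a} < $seconds{$b} ? $a : $b }
              keys %seconds;
```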

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: How can I improve the efficiency of this very intensive code?
by hv (Prior) on Aug 06, 2005 at 23:36 UTC

    It may be that sorted arrays are the way to go, as mentioned elsewhere. But it might make sense to go from the data you have now to an inverted list of best matches, so as to sort each list only once:

    my $inv = {};
    for my $index (keys %$rc) {
        my $subhash = $rc->{$index};
        my @sorted = sort { $subhash->{$a} <=> $subhash->{$b} }
                     keys %$subhash;
        if ($subhash->{$sorted[0]} < $subhash->{$sorted[1]}) {
            # this is the best match
            push @{ $inv->{$sorted[0]} }, $index;
        }
    }
    and now the strong match test becomes:
    return (@{ $inv->{$id2} } == 1) ? 1 : 0;
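With invented toy data, the two pieces fit together like this ($inv is built once, after all scores are loaded; a guard against records with a single candidate is added):

```perl
use strict;
use warnings;

# Toy scores: File1 record => { File2 record => score }.
my $rc = {
    101 => { 201 => 2, 202 => 9 },
    102 => { 201 => 1, 202 => 8 },
    103 => { 202 => 3, 203 => 6 },
};

# Invert once: best-matching File2 record => [ File1 records ].
my $inv = {};
for my $index ( keys %$rc ) {
    my $subhash = $rc->{$index};
    my @sorted  = sort { $subhash->{$a} <=> $subhash->{$b} }
                  keys %$subhash;
    if ( @sorted == 1
         or $subhash->{ $sorted[0] } < $subhash->{ $sorted[1] } ) {
        push @{ $inv->{ $sorted[0] } }, $index;
    }
}

# A File2 record is a strong match if exactly one File1 record
# ranks it first.
sub is_strong_match {
    my ( $inv, $id2 ) = @_;
    return ( $inv->{$id2} && @{ $inv->{$id2} } == 1 ) ? 1 : 0;
}
```

Here 202 is the top match of exactly one record (103), while 201 is the top match of two (101 and 102) and 203 of none.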

    I'm not sure that I fully understand the problem, so the code might need some tweaking ...

    Hugo

Re: How can I improve the efficiency of this very intensive code?
by EvanCarroll (Chaplain) on Aug 06, 2005 at 21:58 UTC
    Yes you can do quite a few things to that sub:
    sub IsStrongMatch {
        # Return true if id2 is only top ranked match for id1
        my $id1 = shift;
        my $id2 = shift;
        my $rC  = shift;
        for my $i1 ( keys %{$rC} ) {
            next if $i1 == $id1;
            foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                             keys %{$rC->{$i1}} ) {
                if ( $id2 == $i2 ) { return 0; }
                last;
            }
        }
        return 1;
    }

    A. You can remove the temporary variable $i2; it isn't used anywhere in a subscope, and $_ can be used instead, so you don't have to copy into a temp var.
    B. You can read @_[0..2] directly rather than shifting into temp variables, which costs more operations than needed.
    C. You can return undef rather than 0; the rumor mill says this is a minor optimization.
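Micro-optimizations like B and C are easy to measure rather than take on faith; the core Benchmark module can compare variants directly. A minimal sketch (toy subs, not the real matching code):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Compare shifting @_ into lexicals against reading @_ directly.
sub with_shift { my $x = shift; my $y = shift; return $x + $y }
sub direct_args { return $_[0] + $_[1] }

cmpthese( 100_000, {
    shift  => sub { with_shift( 1, 2 ) },
    direct => sub { direct_args( 1, 2 ) },
} );
```

cmpthese prints a rate table showing each variant's relative speed, so you can see whether the difference matters at 10 million calls.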


    Evan Carroll
    www.EvanCarroll.com
      sub IsStrongMatch {
          # Return true if id2 is only top ranked match for id1
          my $id1 = shift;
          my $id2 = shift;
          my $rC  = shift;
          for my $i1 ( keys %{$rC} ) {
              next if $i1 == $id1;
              foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                               keys %{$rC->{$i1}} ) {
                  if ( $id2 == $i2 ) { return 0; }
                  last;
              }
          }
          return 1;
      }
      On second look, I have to wonder what the purpose of the 'last' is inside the for loop, outside of any conditional. If that is what you wanted, you could just take the first element of the list returned by sort, and run the if {} on that.


      Evan Carroll
      www.EvanCarroll.com

        Update: I just read frodo72's post, which makes the point better than this code.

        In that case, if you are just going to use the first one, wouldn't finding the first one be faster than sort?

        sub IsStrongMatch {
            # Return true if id2 is only top ranked match for id1
            my $id1 = shift;
            my $id2 = shift;
            my $rC  = shift;
            for my $i1 ( keys %{$rC} ) {
                next if $i1 == $id1;
                my $i2;
                for ( keys %{ $rC->{ $i1 } } ) {
                    $i2 = $_ if not defined( $i2 );
                        # if there's a good way to find a starting $i2
                        # before beginning, then take this out, and
                        # don't check every time
                    $i2 = $_ if $rC->{ $i1 }->{ $_ } < $rC->{ $i1 }->{ $i2 };
                }
                if ( $id2 == $i2 ) { return 0; }
            }
            return 1;
        }

            -Bryan

        Yes, certainly. Thank you.