http://www.perlmonks.org?node_id=481551

clearcache has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing some Perl code to do some statistical analysis on two data files which contain records that may or may not be related. I have some code that creates a scoring system to rank matches between records - all records from file a are matched with those records from file b that are possible matches. Each possible match is assigned a score. I have approx. 10 million possible combinations.

My data is stored like this:

$rMatch->{$File1RecordID}->{$File2RecordID} = $ranking;
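For concreteness, a structure like that might be populated along these lines (the record IDs and scores here are invented for illustration):

```perl
use strict;
use warnings;

# Build the nested ranking hash from hypothetical score data:
# each (File1 record, File2 record) pair maps to its score.
my $rMatch = {};
my @pairs = (
    [ 101, 201, 5  ],    # File1 record 101 vs File2 record 201, score 5
    [ 101, 202, 12 ],
    [ 102, 201, 9  ],
);
for my $pair (@pairs) {
    my ( $id1, $id2, $score ) = @$pair;
    $rMatch->{$id1}{$id2} = $score;
}
```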

My first step in matching the data is to find all those records in file 2 that are a top ranked match with only one record in file 1. I consider those "strong matches" and want to eliminate them from the pool of records before matching the remaining data.

The second step is to create combinations of probable matches between all remaining records from file 1 and all records from file 2. The lowest overall "score" for that match will be the recordset alignment that I go with for the remaining records.

My problem? The process of identifying the "strong matches" is too CPU intensive, running at 100% usage for a long time. My CPU is actually overheating and my machine is shutting down. I can mitigate that with a sleep(1) statement, but that's not very elegant...it slows this massive task down...and I suspect the problem is the number of times that I sort my hash in the program to identify strong matches. In other words, inefficiency in my code ;) Is there a better way to do this? This sort is the only way I know to find the lowest value in my hash.

In the code below, id1 is my id from the first file, id2 is the id from the second file, and $rC holds the ranking for the combination of id1 and id2. Basically, I want to check all other combinations of records for the existence of id2 as a top rank. If I do not find any other combinations, I'm reasonably confident that those 2 records should be matched.

sub IsStrongMatch {
    # Return true if id2 is only top ranked match for id1
    my $id1 = shift;
    my $id2 = shift;
    my $rC  = shift;
    for my $i1 ( keys %{$rC} ) {
        next if $i1 == $id1;
        foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                         keys %{$rC->{$i1}} ) {
            if ( $id2 == $i2 ) { return 0; }
            last;
        }
    }
    return 1;
}

2006-08-31 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation


Original title: 'Efficiency'

Replies are listed 'Best First'.
Re: How can I improve the efficiency of this very intensive code?
by sk (Curate) on Aug 06, 2005 at 21:23 UTC
    I feel a hash might not be required for your task. You have a recordID which can act as an index into an array, so why put the records in a hash and mess up the order? You get direct access by index in an array anyway, with none of the hash-table overhead.

    That said, I would use an n x n matrix (square is not a requirement; the dimensions might change based on the number of records, of course) to keep track of scores. Consider the following table:

    File1rec \ File2rec |  1   2   3   4   5   6
    --------------------+------------------------
                      1 |  3   7   8   9   9  10
                      2 |  4   8   3   1   1   6
                      3 |  1   4   9   4   9   7
                      4 |  4   3  10   7   2   3
                      5 |  4   2   5   9   9   5
                      6 |  5   6   2   5   6   9
    The values inside the cells are the scores. Now if you want the best matching score (max value), then an O(n) max will provide the answer for one record, and you have to do that n times, once for each record in your first file.

    Sorting to find a max/min is overkill. I might be missing your problem, so please correct me if I am wrong.
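sk's point - a single O(n) pass per row instead of an O(n log n) sort - can be sketched with List::Util (the row values are taken from the example table above; `max` is used here because sk ranks by highest score):

```perl
use strict;
use warnings;
use List::Util qw( max );

# One row of scores for a File1 record against File2 records 1..6
# (values from the first row of the example table).
my @row = ( 3, 7, 8, 9, 9, 10 );

# O(n) single pass instead of sorting the whole row:
my $best_score = max @row;

# Index of the best score, if the matching File2 record is needed:
my ($best_idx) = grep { $row[$_] == $best_score } 0 .. $#row;
```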

    cheers

    SK

      I was thinking about the use of arrays...my ids are pretty big numbers so I wouldn't use them alone as array indices. I could always use $., however, when I read in the file rather than the id.

      My ranking is based on # of seconds from last log entry in one file to first log entry in the second file. So I create scoring by looking at # of seconds between each record. My ability to identify a "strong match" comes from the rate of concurrent users in the application that my data comes from. Low concurrent users, I'll have lots of strong matches - records that clearly line up. If I have high concurrent users with lots of log file entries, then I've got to get a little creative.

      I was sorting b/c my hash is being used to store # of elapsed seconds...not a true "rank" in terms of 1, 2, 3, etc.

      I'm considering the use of arrays, but don't want to lose the elapsed seconds as data quite yet b/c that will be used in the next step to figure out the best match from the remaining data.
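One way to get array indexing without losing the elapsed-seconds data is to map the big record IDs to dense indices as they are first seen, keeping the seconds in an array-of-arrays. A minimal sketch (the sub and variable names are invented, as are the IDs):

```perl
use strict;
use warnings;

my ( %idx1, %idx2, @ids1, @ids2 );
my @score;    # $score[$i][$j] = elapsed seconds for pair (i, j)

# Assign each big ID a small, dense array index on first sight,
# then store the elapsed seconds at that position.
sub add_score {
    my ( $id1, $id2, $seconds ) = @_;
    unless ( exists $idx1{$id1} ) {
        push @ids1, $id1;
        $idx1{$id1} = $#ids1;
    }
    unless ( exists $idx2{$id2} ) {
        push @ids2, $id2;
        $idx2{$id2} = $#ids2;
    }
    $score[ $idx1{$id1} ][ $idx2{$id2} ] = $seconds;
}

add_score( 987654321, 123456789, 42 );    # made-up IDs and seconds
```

The @ids1/@ids2 arrays let you translate an index back to the original record ID when reporting matches.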

Re: How can I improve the efficiency of this very intensive code?
by polettix (Vicar) on Aug 06, 2005 at 21:50 UTC
    This sort is the only way I know to find the lowest value in my hash
    Heh - finding the lowest value in a set (according to a given cost function) requires traversing the set only once - no sort needed! I don't have the will to read exactly what your code is doing, but this should solve the specific ordering problem using List::Util:
    use List::Util qw( reduce );

    # ... later in the code

    sub IsStrongMatch {
        # Return true if id2 is only top ranked match for id1
        my $id1 = shift;
        my $id2 = shift;
        my $rC  = shift;
        for my $i1 ( keys %{$rC} ) {
            next if $i1 == $id1;
            my $href  = $rC->{$i1};
            my $minid = reduce { $href->{$a} < $href->{$b} ? $a : $b }
                        keys %$href;
            return 0 if defined $minid && $id2 == $minid;
        }
        return 1;
    }
    Untested!!!

    Update: I realised that this answer is a little cryptic. reduce in List::Util allows you to visit each element in an array and do "something" based on the values you encounter along the path. What this code does (more or less) is probably best explained by mrborisguy in this post.
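As a tiny standalone illustration of what reduce does here - picking the key with the smallest value in one pass (toy data, invented for this example):

```perl
use strict;
use warnings;
use List::Util qw( reduce );

my %seconds = ( a => 30, b => 7, c => 19 );

# reduce keeps whichever of $a/$b wins the comparison, so after one
# pass $min_key holds the key with the smallest value.
my $min_key = reduce { $seconds{$a} < $seconds{$b} ? $a : $b }
              keys %seconds;
```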

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: How can I improve the efficiency of this very intensive code?
by hv (Prior) on Aug 06, 2005 at 23:36 UTC

    It may be that sorted arrays are the way to go, as mentioned elsewhere. But it might make sense to go from the data you have now to an inverted list of best matches, so as to sort each list only once:

    my $inv = {};
    for my $index (keys %$rc) {
        my $subhash = $rc->{$index};
        my @sorted = sort { $subhash->{$a} <=> $subhash->{$b} }
                     keys %$subhash;
        if ($subhash->{$sorted[0]} < $subhash->{$sorted[1]}) {
            # this is the best match
            push @{ $inv->{$sorted[0]} }, $index;
        }
    }
    and now the strong match test becomes:
    return (@{ $inv->{$id2} } == 1) ? 1 : 0;
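With invented toy data, the two pieces fit together like this ($inv is built once, after all scores are loaded; a guard against records with a single candidate is added):

```perl
use strict;
use warnings;

# Toy scores: File1 record => { File2 record => score }.
my $rc = {
    101 => { 201 => 2, 202 => 9 },
    102 => { 201 => 1, 202 => 8 },
    103 => { 202 => 3, 203 => 6 },
};

# Invert once: best-matching File2 record => [ File1 records ].
my $inv = {};
for my $index ( keys %$rc ) {
    my $subhash = $rc->{$index};
    my @sorted  = sort { $subhash->{$a} <=> $subhash->{$b} }
                  keys %$subhash;
    if ( @sorted == 1
         or $subhash->{ $sorted[0] } < $subhash->{ $sorted[1] } ) {
        push @{ $inv->{ $sorted[0] } }, $index;
    }
}

# A File2 record is a strong match if exactly one File1 record
# ranks it first.
sub is_strong_match {
    my ( $inv, $id2 ) = @_;
    return ( $inv->{$id2} && @{ $inv->{$id2} } == 1 ) ? 1 : 0;
}
```

Here 202 is the top match of exactly one record (103), while 201 is the top match of two (101 and 102) and 203 of none.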

    I'm not sure that I fully understand the problem, so the code might need some tweaking ...

    Hugo

Re: How can I improve the efficiency of this very intensive code?
by EvanCarroll (Chaplain) on Aug 06, 2005 at 21:58 UTC
    Yes you can do quite a few things to that sub:
    sub IsStrongMatch {
        # Return true if id2 is only top ranked match for id1
        my $id1 = shift;
        my $id2 = shift;
        my $rC  = shift;
        for my $i1 ( keys %{$rC} ) {
            next if $i1 == $id1;
            foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                             keys %{$rC->{$i1}} ) {
                if ( $id2 == $i2 ) { return 0; }
                last;
            }
        }
        return 1;
    }

    A. You can remove the temporary variable $i2; it isn't used anywhere in a subscope, and $_ can be used instead, so you don't have to copy into a temp var.
    B. You can read @_[0..2] directly rather than shifting into temp variables, which costs more operations than needed.
    C. You can return undef rather than 0; the rumor mill says this is a minor optimization.
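Micro-optimizations like B and C are easy to measure rather than take on faith; the core Benchmark module can compare variants directly. A minimal sketch (toy subs, not the real matching code):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Compare shifting @_ into lexicals against reading @_ directly.
sub with_shift { my $x = shift; my $y = shift; return $x + $y }
sub direct_args { return $_[0] + $_[1] }

cmpthese( 100_000, {
    shift  => sub { with_shift( 1, 2 ) },
    direct => sub { direct_args( 1, 2 ) },
} );
```

cmpthese prints a rate table showing each variant's relative speed, so you can see whether the difference matters at 10 million calls.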


    Evan Carroll
    www.EvanCarroll.com
      sub IsStrongMatch {
          # Return true if id2 is only top ranked match for id1
          my $id1 = shift;
          my $id2 = shift;
          my $rC  = shift;
          for my $i1 ( keys %{$rC} ) {
              next if $i1 == $id1;
              foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} }
                               keys %{$rC->{$i1}} ) {
                  if ( $id2 == $i2 ) { return 0; }
                  last;
              }
          }
          return 1;
      }
      On second look, I have to wonder what the purpose of the 'last' is inside the for loop, outside of any conditional. If that is what you wanted, you could just take the first element of the list returned by sort, and run the if {} on that.


      Evan Carroll
      www.EvanCarroll.com

        Update: I just read frodo72's post, which makes the point better than this code.

        In that case, if you are just going to use the first one, wouldn't finding the first one be faster than sort?

        sub IsStrongMatch {
            # Return true if id2 is only top ranked match for id1
            my $id1 = shift;
            my $id2 = shift;
            my $rC  = shift;
            for my $i1 ( keys %{$rC} ) {
                next if $i1 == $id1;
                my $i2;
                for ( keys %{ $rC->{ $i1 } } ) {
                    $i2 = $_ if not defined( $i2 );
                        # if there's a good way to find a starting $i2
                        # before beginning, then take this out, and
                        # don't check every time
                    $i2 = $_ if $rC->{ $i1 }->{ $_ } < $rC->{ $i1 }->{ $i2 };
                }
                if ( $id2 == $i2 ) { return 0; }
            }
            return 1;
        }

            -Bryan

        Yes, certainly. Thank you.