|Just another Perl shrine|
How can I improve the efficiency of this very intensive code?by clearcache (Sexton)
|on Aug 06, 2005 at 21:07 UTC||Need Help??|
clearcache has asked for the
wisdom of the Perl Monks concerning the following question:
I'm writing some Perl code to do some statistical analysis on two data files which contain records that may or may not be related. I have some code that creates a scoring system to rank matches bewtween records - all records from file a are matched with those records from file b that are possible matches. Each possible match is assigned a score. I have approx. 10 million possible combinations.
My data is stored like this:
My first step in matching the data is to find all those records in file 2 that are a top ranked match with only one record in file 1. I consider those "strong matches" and want to eliminate them from the pool of records before matching the remaining data.
The second step is to create combinations of probable matches all remaining records from file 1 and all records from file 2. The lowest overall "score" for that match will be the recordset alignment that I go with for the remaining records.
My problem? The process of identifying the "strong matches" is too CPU intensive, running at 100% usage for a long time. My CPU is actually overheating and my machine is shutting down. I can mitigate that with a sleep(1) statement, but that's not very elegant...it slows this massive task down...and I suspect the problem is the number of times that I sort my hash in the program to identify strong matches. In other words, inefficiency in my code ;) Is there a better way to do this? This sort is the only way I know to find the lowest value in my hash.
In the code below, id1 is my id from the first file, id2 is the id from the second file, and $rC holds the ranking for the combination of id1 and id2. Basically, I want to check all other combinations of records for the existence of id2 as a top rank. If I do not find any other combinations, I'm reasonably confident that those 2 records should be matched.