Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I'm writing some Perl code to do some statistical analysis on two data files which contain records that may or may not be related. I have some code that creates a scoring system to rank matches bewtween records - all records from file a are matched with those records from file b that are possible matches. Each possible match is assigned a score. I have approx. 10 million possible combinations.

My data is stored like this:

$rMatch->{$File1RecordID}->{$File2RecordID} = $ranking;

My first step in matching the data is to find all those records in file 2 that are a top ranked match with only one record in file 1. I consider those "strong matches" and want to eliminate them from the pool of records before matching the remaining data.

The second step is to create combinations of probable matches all remaining records from file 1 and all records from file 2. The lowest overall "score" for that match will be the recordset alignment that I go with for the remaining records.

My problem? The process of identifying the "strong matches" is too CPU intensive, running at 100% usage for a long time. My CPU is actually overheating and my machine is shutting down. I can mitigate that with a sleep(1) statement, but that's not very elegant...it slows this massive task down...and I suspect the problem is the number of times that I sort my hash in the program to identify strong matches. In other words, inefficiency in my code ;) Is there a better way to do this? This sort is the only way I know to find the lowest value in my hash.

In the code below, id1 is my id from the first file, id2 is the id from the second file, and $rC holds the ranking for the combination of id1 and id2. Basically, I want to check all other combinations of records for the existence of id2 as a top rank. If I do not find any other combinations, I'm reasonably confident that those 2 records should be matched.

sub IsStrongMatch { # Return true if id2 is only top ranked match for id1 my $id1 = shift; my $id2 = shift; my $rC = shift; for my $i1 ( keys %{$rC} ) { next if $i1 == $id1; foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} +} keys %{$rC->{$i1}} ) { if ( $id2 == $i2 ) { return 0; } last; } } return 1; }

2006-08-31 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation

( keep:0 edit:14 reap:0 )

Original title: 'Efficiency'


In reply to How can I improve the efficiency of this very intensive code? by clearcache

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others wandering the Monastery: (8)
    As of 2020-05-30 19:24 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      If programming languages were movie genres, Perl would be:















      Results (173 votes). Check out past polls.

      Notices?