Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Algorithm To Select Lines Based On Attributes

by ~~David~~ (Hermit)
on Jan 15, 2009 at 16:44 UTC ( #736582=perlquestion: print w/ replies, xml ) Need Help??
~~David~~ has asked for the wisdom of the Perl Monks concerning the following question:

I would like to know if there is a more efficient way of doing the following, or recommendations on how to make this more robust, smaller, faster, etc... I have a list of up to 1M lines. Each line contains a DEFECT number, followed by up to 30 attributes.
DEFECTID ATTR1 ATTR2 ... ATTR30
I need to be able to select DEFECTIDs that match certain criteria, select a random sample of those, and then add them to a data set.
I currently have a rules hash that I read in from a local file that has the following structure: $hash->{RULENUMBER}->{RULETYPE} = value I want all the rules to be additive ( so if I get 10 defects from rule1, and 8 defects from rule2, my final set would contain 18 defects ). I also need to be able to negate the rule by adding an ! in the value
I have the following algorithm / code working, but I was wondering if it could be better ( one caveat is that I can't generalize and loop through some genereric rule because some are text rules, some are ranges, some are equivalencies, etc. ).
DEFECT: foreach my $defect ( @$list ){ # $summary is my pseudo object, # this line converts each defect line into a hash # whose keys are the attributes desired to be selected my $line = parseLine( $summary, $defect ); if ( ! $line ){ next DEFECT }; RULE: foreach my $rulenum ( keys %$rulelist ){ # one example rule, but there are many.. if ( defined $rulelist->{$rulenum}->{REGION} ){ my $rule = $rulelist->{$rulenum}->{REGION}; if ( $rule =~ s/!// ){ # handles negation if ( $line -> {REGIONID} == $rule ){ next RULE }; } elsif ( $line -> {REGIONID} != $rule) { next RULE }; } } # i create a filtered list in the $summary hash push @{$summary->{FILTEREDLIST}->{$rule}}, $defect; }
The problem is that I have a ton of rules, and looping through each rule for every defect seems to be overkill. I was just wondering if there is a more efficient way of doing this using some kind of parallel approach...
Any help would be greatly appreciated.

Update: Fixed typo pointed out by hbm

Comment on Algorithm To Select Lines Based On Attributes
Select or Download Code
Re: Algorithm To Select Lines Based On Attributes
by JavaFan (Canon) on Jan 15, 2009 at 17:08 UTC
    Without knowing more about the data and the rules, I don't think you can get away with comparing each defect with all the rules.

    Having said that, if you have many rules of the form "specific_attribute equals X" or "specific_attribute falls between Y and Z", you can preprocess the rule data. For exact matches, build hash; for ranged matches, use something like a segment tree. That would reduce your algorithm from O(A*R) to O((A+R)*log2(R)), where A is the total number of attributes (of all defects together) and R the number of rules.

Re: Algorithm To Select Lines Based On Attributes
by xdg (Monsignor) on Jan 15, 2009 at 17:13 UTC

    Some quick thoughts:

    • #1: Use Devel::NYTProf or another profiler to see what the actual hot-spots in your code are!

    • Read lines from disk one at a time rather than slurping into @lines

    • Consider defining rules as subroutines acting on an argument and then use Memoize to cache results (assuming attributes re-occur frequently)

    • If you're re-running this against the same set of lines and rules frequently, cache the rule test results in a file or DB so you have DEFECTID and a list of rules it matches.

    • Perhaps reorganize the rules (if you can): $hash->{RULETYPE}->{RULENUMBER} = value. Then iterate the list of rules for each attribute, rather than (as you have it), iterating the attributes for each rule. I think that saves a lot of if ( defined $rulelist->{$rulenum}->{REGION} ) comparisons.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      Thanks for the suggestions. I have one question about bullet #2:
      I need to read to the end of the file before I enter this subroutine because I need to know how some information at the bottom of the file before I decide which rule set to use. I figured it would be better to cache that DEFECTLIST into memory rather than re-reading the file again. Is that best? Or, is there someway I could store the position in the file of the beginning and the end of the defect list, and always ensure that all characters between it are the DEFECTLIST? I don't have experience with stuff like that...
      I will definately think about using Memoize and see if I can implement it.
      Thanks again.
        I need to know how some information at the bottom of the file before I decide which rule set to use

        Maybe you can use File::ReadBackwards to find the information you need, then jump back to the start of the file and read forwards. If memory isn't an issue, then it may not matter, but anytime I see a file that large being slurped for a linear scan, I wonder if it could be done line by line instead.

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Algorithm To Select Lines Based On Attributes
by mr_mischief (Prior) on Jan 15, 2009 at 17:31 UTC
    Update: nevermind about my solution. I missed your caveat about what types of rules you have. The advice given at the bottom still stands. Examples are better than prose for many things.

    A sample of input and a sample of output like this is very helpful in determining whether we're talking about the same spec. If I've made any incorrect assumptions about your spec, please give your own sample input and output so a monk can write a program to match.

      Input Example:
      DefectRecordSpec 17 DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE DEFEC +TAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBER FINEBINNUMB +ER REVIEWSAMPLE IMAGECOUNT IMAGELIST ; DefectList 1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 3 1469.965 4591.523 -1 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 4 7082.505 4283.913 -1 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0 6 376.807 3904.135 -1 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 7 5345.841 3877.818 0 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0
      I could make a rule XINDEX,-2 ( give me X-index of only -2. Negating this give me all values without XINDEX = -2 ). Output:
      DefectRecordSpec 17 DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE DEFEC +TAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBER FINEBINNUMB +ER REVIEWSAMPLE IMAGECOUNT IMAGELIST ; DefectList 1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0

        You said the rules should be additive. If you're applying two rules, say XINDEX == -2, and XSIZE == 2, do you want to select defect 1 once or twice ? If twice, then I you're stuck with N * R work, where N is the number of records and R is the number of rules. If once, then one trick would be to run easy and/or most likely to match rules first.

        If the rules weren't additive, I'd suggest looking for ways to quickly exclude defects which definitely would not match any rule -- the idea being that you may be able to quickly hack away at the search space.

        So, as others have said, the problem is how to take your rules and render them into something that Perl can execute as quickly as possible. Here I think the basic trick is something along these lines:

        use strict ; use warnings ; my @DefectRecordSpec = qw(DEFECTID XREL YREL XINDEX YINDEX XSIZE YSIZE + DEFECTAREA DSIZE CLASSNUMBER TEST CLUSTERNUMBER ROUGHBINNUMBE +R FINEBINNUMBER REVIEWSAMPLE IMAGECOUNT IMAGELIST) ; my @DefectList = ( '1 466.458 4741.229 -2 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 +', '2 5329.992 4264.499 -2 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 +', '3 1469.965 4591.523 -1 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 +', '4 7082.505 4283.913 -1 24 2.000 1.725 3.451440 2.000 0 1 0 0 0 0 0 0 +', '5 777.809 2623.219 -2 24 1.000 0.862 0.862860 1.000 0 1 0 0 0 0 0 0 +', '6 376.807 3904.135 -1 24 1.500 0.862 1.294290 1.500 0 1 0 0 0 0 0 0 +', '7 5345.841 3877.818 0 24 0.500 0.431 0.215715 0.500 0 1 0 0 0 0 0 0 +', ) ; my @rules = ( '$XINDEX == -2', '($XSIZE > 1.75) && ($XSIZE <= 2.15)', ) ; my $fields = '$'. join(', $', @DefectRecordSpec) ; my $rules = join("\n || ", map "($_)", @rules) ; my $sub = join("\n", 'sub {', ' my ($line) = @_ ;', ' $line =~ s/^\s+// ;', # Trim leading spaces, so split is not foo +led # NB: the split will discard trailing spac +es and newline ' my ('. $fields .') = split(/\s+/, $line) ;', ' return '. $rules .' ;', '} ;' ) ; my $test = eval $sub ; print $sub, "\n" ; foreach my $d (@DefectList) { print "$d\n" if $test->($d) ; } ;
        where your rules are each small Perl expressions, so can be arbitrarily complicated.

Re: Algorithm To Select Lines Based On Attributes
by hbm (Hermit) on Jan 15, 2009 at 17:36 UTC
    If I'm following your code correctly, shouldn't the following line be using my $rulenum?
    foreach my $rule ( keys %$rulelist ){

    And some formatting suggestions, particularly as your list gets longer and more nested. Perhaps change these:

    if ( defined $rulelist->{$rulenum}->{REGION} ){ my $rule = $rulelist->{$rulenum}->{REGION}; if ( $rule =~ s/!// ){ # handles negation if ( $line -> {REGIONID} == $rule ){ next RULE }; } elsif ( $line -> {REGIONID} != $rule) { next RULE }; ...
    To this:
    if ( defined (my $rule = $rulelist->{$rulenum}->{REGION} )){ next RULE if ( $rule =~ s/!// && $line->{REGIONID} == $rule); next RULE if ( $line->{REGIONID} != $rule); ...

    Update:Fixed a typo, I had omitted the close-paren in the first next RULE statement.

      Thanks for the formatting suggestion. Much clearer...
Re: Algorithm To Select Lines Based On Attributes
by jethro (Monsignor) on Jan 15, 2009 at 18:00 UTC

    This looks like you might be able to generate some sort of parallel comparision. If you wanted only lines where all rules matched it might even have been possible to use a combination of anding strings together and a regex match, used twice for non-negated and negated.

    But since you want or, the fastest way I came up is to construct the comparision code into a string and eval that for each line. Sort of compiling the search pattern.

    For example, if you have two rules that compare attr2 with 'n' and attr4 with '!x', then you could construct the following code string:

    return 1 if ($a[2] eq 'n' or $a[4] neq 'x'); return 0;
    Now just eval this for each line, having your lines split into @a first. If you need to know which rules matched, construct this instead:

    my @return=(); if ($a[2] eq 'n') { push @return,2; } if ($a[2] neq 'x') { push @return,4; } return @return;

    The only thing left is to eval your generated code for each line and check the return value of the eval

Re: Algorithm To Select Lines Based On Attributes
by perlesque (Initiate) on Jan 15, 2009 at 18:53 UTC
    David, One method I've used to speed up something similar in the past is to generate a function containing your rule code in advance of your main processing loop. The advantage of this is all of your rule code is compiled and this should reduce the time taken on each DEFECT iteration. Each DEFECT would then simply be passed to your generated rule function. A very simple example would be:
    my $rule_code = ""; foreach my $rulenum ( keys %$rulelist ){ # one example rule, but there are many.. if ( defined $rulelist->{$rulenum}->{REGION} ){ my $rule = $rulelist->{$rulenum}->{REGION}; if ( $rule =~ s/!// ){ $condition = "!="; } else { $condition = "=="; } # add the rule to the generated rule code $rule_code .= 'if ( $line -> {REGIONID} ' . $condition . "$rule + ) { return \"$rule\" };\n"; } } eval <<EOT; sub match_rule { my \$line = shift; # generated code goes here $rule_code return undef; # we did not match any rules!! }; EOT
    Your main loop would then become something like:
    DEFECT: foreach my $defect ( @$list ){ # $summary is my pseudo object, # this line converts each defect line into a hash # whose keys are the attributes desired to be selected my $line = parseLine( $summary, $defect ); if ( ! $line ){ next DEFECT }; # i create a filtered list in the $summary hash my $rule = match_rule( $line ); if ( defined $rule ) { push @{$summary->{FILTEREDLIST}->{$rule}}, $defect; } }
    You may find the generation technique also helpful in making production of your rules simpler/more generic. Kind Regards Frank Perlesque
Re: Algorithm To Select Lines Based On Attributes
by BrowserUk (Pope) on Jan 15, 2009 at 19:39 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://736582]
Approved by andye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (9)
As of 2014-07-23 22:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (154 votes), past polls