PerlMonks  
Update: never mind my solution. I missed your caveat about what types of rules you have. The advice given at the bottom still stands: examples are better than prose for many things.
Instead of scanning each line once per defect, scan each line once for all defects, using a regular expression built from an alternation. You then make only one pass over each line, and the highly optimized regex engine does much of the work.

I'm not even sure where %rulelist or $rulenum are supposed to be set in the above.

Do negated defects just not add up, or do they actually remove a defect from the final count? Here I'll assume they just don't get added in.

If I'm not misunderstanding your spec, this does everything you need short of reading which defects interest you from another file:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );

my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: $!\n";

my $total_defects = 0;
while ( <$df> ) {
    next unless /^DEFECTID/;
    my @defects_found = $_ =~ m/$regex/g;
    $total_defects += scalar @defects_found;
    print "defects found this line: ", (join ', ', @defects_found), "\n";
    print "total defects so far: $total_defects\n";
}
close $df;

Given this input file for defects:

DEFECTID ATTR1 ATTR7 ATTR4
DEFECTID ATTR3 !ATTR1
DEFECTID ATTR2 ATTR5 ATTR3
DEFECTID ATTR4
DEFECTID ATTR3

it produces this output:

defects found this line: ATTR1, ATTR7
total defects so far: 2
defects found this line: ATTR3
total defects so far: 3
defects found this line: ATTR3
total defects so far: 4
defects found this line:
total defects so far: 4
defects found this line: ATTR3
total defects so far: 5
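
The one piece left out, reading which defects interest you from another file, could be sketched roughly like this. It assumes a hypothetical list with one attribute name per line; an in-memory filehandle stands in for the real file, and the sub names read_defect_names and build_regex are made up for illustration. The \b anchors and quotemeta are hardening I'm adding on the assumption that names like ATTR10 might exist alongside ATTR1, so treat it as a starting point rather than a drop-in.

```perl
use strict;
use warnings;

# Hypothetical list of interesting defects, one name per line.
# An in-memory filehandle stands in for a real file here.
my $wanted_list = "ATTR1\nATTR3\n\nATTR7\n";

# Read attribute names from a filehandle, skipping blank lines.
sub read_defect_names {
    my ($fh) = @_;
    my @names;
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        next unless length $line;    # skip blank lines
        push @names, $line;
    }
    return @names;
}

# Build one regex that matches any listed name not preceded by '!'.
sub build_regex {
    my @names = @_;
    # quotemeta guards against regex metacharacters in the names
    my $alternation = join '|', map { quotemeta($_) } @names;
    # (?<!!) rejects negated names; \b keeps ATTR1 from matching inside ATTR10
    return qr/(?<!!)\b(?:$alternation)\b/;
}

open( my $fh, '<', \$wanted_list ) or die "can't open in-memory list: $!\n";
my @defects_to_check = read_defect_names($fh);
close $fh;

my $regex = build_regex(@defects_to_check);

my @found = 'DEFECTID ATTR3 !ATTR1 ATTR10 ATTR7' =~ m/$regex/g;
print join( ', ', @found ), "\n";    # prints "ATTR3, ATTR7"
```

To use a real file, open it by name in place of \$wanted_list and pass that handle to read_defect_names.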

Now, with a million lines, I'd probably not print the new defects found and the new total for every line. If you need to know which defects had what subtotals, you could accomplish that with a hash:

use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );

my $alternation = join '|(?<!!)', @defects_to_check;
# previous and next lines use negative look-behind to ensure
# only defects listed without '!' preceding them get matched
my $regex = qr/(?<!!)$alternation/;

open ( my $df, '<', 'defects_file' ) or die "can't read defects_file: $!\n";

my $total_defects = 0;
my %defect_subtotals;
while ( <$df> ) {
    next unless /^DEFECTID/;
    my @defects_found = $_ =~ m/$regex/g;
    $total_defects += scalar @defects_found;
    $defect_subtotals{ $_ }++ for @defects_found;
}
close $df;

print "Found $total_defects total defects.\nDefect breakdown follows:\n";
print $_ . ":\t\t" . $defect_subtotals{$_} . "\n" for sort keys %defect_subtotals;

Given the same input file as above, it produces this output:

Found 5 total defects.
Defect breakdown follows:
ATTR1:          1
ATTR3:          3
ATTR7:          1

A sample of input and a sample of output like this is very helpful in determining whether we're talking about the same spec. If I've made any incorrect assumptions about your spec, please give your own sample input and output so a monk can write a program to match.
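
For instance, if negated defects are actually meant to remove a defect from the final count (the other reading of the question I raised earlier), a sketch might look like the following. This is only my guess at that alternate spec: the regex captures an optional leading '!' and the loop subtracts one for each negated match. The sample lines are the same ones used above, held in an array instead of a file so the snippet runs on its own.

```perl
use strict;
use warnings;

my @defects_to_check = qw( ATTR1 ATTR3 ATTR7 );
my $alternation = join '|', @defects_to_check;

# capture an optional leading '!' together with the defect name
my $regex = qr/(!?)\b($alternation)\b/;

# same sample data as the defects file above
my @lines = (
    'DEFECTID ATTR1 ATTR7 ATTR4',
    'DEFECTID ATTR3 !ATTR1',
    'DEFECTID ATTR2 ATTR5 ATTR3',
    'DEFECTID ATTR4',
    'DEFECTID ATTR3',
);

my $total_defects = 0;
for my $line (@lines) {
    next unless $line =~ /^DEFECTID/;
    while ( $line =~ /$regex/g ) {
        my ( $negated, $name ) = ( $1, $2 );
        # assumption: a negated defect subtracts one from the running total
        $total_defects += $negated ? -1 : 1;
    }
}

print "net defects: $total_defects\n";    # prints "net defects: 4"
```

With this reading the !ATTR1 on the second line cancels one defect, so the total is 4 rather than 5.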


In reply to Re: Algorithm To Select Lines Based On Attributes by mr_mischief
in thread Algorithm To Select Lines Based On Attributes by ~~David~~
