
Re^3: Remove blank lines from REGEX output

by Laurent_R (Canon)
on Feb 05, 2014 at 22:28 UTC (#1073620)

in reply to Re^2: Remove blank lines from REGEX output
in thread Remove blank lines from REGEX output

I think that you are wrong on that. The approach based on next is very human-readable and, in many cases, a very efficient way to build a decision tree. Suppose that you have a set of business rules specifying which lines of a file you want to process and which you want to discard. You can do it this way:
    while (<$IN>) {
        chomp;
        next if /^#/;                      # discard line, it is a comment (starts with #)
        next if /^\s*$/;                   # discard line, contains only spaces
        next if length($_) < $min_length;  # line is too short (parens needed: a bare "length <" is parsed as the start of a <...> operator)
        next if /^REM/;                    # another form of comment
        next unless /^.{3}\d{4}/;          # lines of interest have 4 digits from position 4 to 7
        # now the real processing ...
    }
This is much cleaner and much more readable than a long series of nested if ... elsif ... elsif ... It is also often quite efficient, because as soon as you discard a line for one reason, none of the subsequent tests has to run (of course, it will be more efficient if you are able to put first the most common causes for exclusions and last the rare ones). There are other ways of achieving similar results. For example, you could have:
    while (<$IN>) {
        chomp;
        next if /^#/ or /^\s*$/ or length($_) < $min_useful_length or /^REM/
            or not /^.{3}\d{4}/ ...
This is more concise, and any condition evaluating to TRUE will also short-circuit the subsequent conditions, so the performance will be similar, but it removes the opportunity to document the business rules that led to exclusion. I might use either technique, depending on the situation, but if the business rules are somewhat complicated or numerous, I prefer the first one.
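To check that the two forms really discard the same lines, here is a minimal self-contained sketch. The sample records, the threshold of 8 for `$min_length`, and the helper names `keep_stepwise` and `keep_combined` are assumptions made up for the illustration; none of them come from the thread.

```perl
use strict;
use warnings;

my $min_length = 8;    # assumed threshold, for illustration only

# Form 1: one documented "next" per business rule.
sub keep_stepwise {
    my @kept;
    for (@_) {
        next if /^#/;                      # comment line
        next if /^\s*$/;                   # blank or whitespace only
        next if length($_) < $min_length;  # too short
        next if /^REM/;                    # another form of comment
        next unless /^.{3}\d{4}/;          # need 4 digits at positions 4 to 7
        push @kept, $_;
    }
    return @kept;
}

# Form 2: the same rules combined with low-precedence "or".
sub keep_combined {
    my @kept;
    for (@_) {
        next if /^#/ or /^\s*$/ or length($_) < $min_length or /^REM/
            or not /^.{3}\d{4}/;
        push @kept, $_;
    }
    return @kept;
}

# Hypothetical sample input, one record per rule plus two that should survive.
my @sample = (
    '# a comment',
    '   ',
    'short',
    'REM1234 remark',
    'ABC1234 payload',
    'ABCDEFG payload',
    'XYZ0007 another',
);

print "stepwise: $_\n" for keep_stepwise(@sample);
print "combined: $_\n" for keep_combined(@sample);
```

With this sample, both subs keep only 'ABC1234 payload' and 'XYZ0007 another'. Note that 'REM1234 remark' would satisfy the digit rule, so it is rejected only because the /^REM/ test comes first.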

Re^4: Remove blank lines from REGEX output
by AnomalousMonk (Chancellor) on Feb 06, 2014 at 00:22 UTC
    next if /^#/ or /^\s*$/ or length < $min_useful_length or  /^REM/ or not /^.{3}\d{4}/ ...
    ... that removes the opportunity to document the business rules ...
    next if /^#/                               # document this
         or /^\s*$/                            # document this too
         or length($_) < $min_useful_length    # and this one
         or /^REM/                             # and so on ...
         or not /^.{3}\d{4}/                   # ...
         ...
         ;

    ... but I am sometimes very (unpleasantly) surprised by interactions between the very lowest-precedence logical operators and other expressions (but I can't seem to come up with a good example), so maybe individual statements are better after all.
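A commonly cited instance of this kind of surprise, which may or may not be what the poster had in mind, is mixing `or` with assignment. The sketch below uses a hypothetical `%opt` hash and a hypothetical `default_level` sub purely for illustration.

```perl
use strict;
use warnings;

my %opt = (verbose => 0);          # hypothetical option hash

sub default_level { return 3 }     # hypothetical fallback value

# "||" binds tighter than "=", so the fallback is applied as intended.
my $level_a = $opt{verbose} || default_level();

# "or" binds looser than "=": this parses as
#   (my $level_b = $opt{verbose}) or default_level();
# so $level_b gets 0 and the fallback's value is discarded.
my $level_b = $opt{verbose} or default_level();

print "level_a = $level_a, level_b = $level_b\n";   # level_a = 3, level_b = 0
```

In a `next if A or B or C` statement every operand is a simple test, so this trap does not arise; it shows up once an assignment or a `return` is placed in front of `or`.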

      Well, yes, sure, you can still document your rules this way, but then you lose the concision. In that type of situation, I usually prefer a series of individual next statements (as you said, less risk of operator precedence mistakes). I would use the more concise form with several or operators only when the rules are more or less obvious or self-documenting. For example, in a procedure where I wish to validate a date format, I could possibly have something like this:
          next if $day_in_month !~ /^\d\d?$/ or $month_nr !~ /^\d\d?$/;   # one or two digits
          next if $day_in_month < 1 or $day_in_month > 31 or $month_nr < 1 or $month_nr > 12;
          next if ...
      This is admittedly a somewhat silly example, since it is certainly not a good way to validate a date (it would not reject 31 Feb., for example). It could still be useful in some cases to check the date format (do I have, as expected, dd/mm/yyyy, or is it something else such as mm/dd/yyyy or yyyy/mm/dd?), or to check that a piece of data is really likely to be a date and not some numbers representing something else, such as a phone number or an IP address.

        One can also take another tack and use a series of statements like

            next if not proper_year($year);
            next if not proper_month($month_nr);
            next if not proper_day($day_in_month, $month_nr, $year);

        which might be said to offer the best of both concision (it's self-documenting, or can be made so) and encapsulation (annual, monthly and daily proprieties defined in one place in their respective validation functions). This sort of approach is what I usually prefer if I have the opportunity to take a second pass at a program and consolidate.
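The thread names these three validation functions but does not show their bodies. As a rough sketch only, they might look something like the following; the four-digit year rule, the dd/mm/yyyy sample strings, and the `is_leap_year` helper are all assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical bodies: the thread names these functions but never shows them.
sub proper_year {
    my ($year) = @_;
    return ($year =~ /^\d{4}$/) ? 1 : 0;    # assumed rule: exactly four digits
}

sub proper_month {
    my ($month) = @_;
    return 0 unless $month =~ /^\d\d?$/;    # one or two digits
    return ($month >= 1 && $month <= 12) ? 1 : 0;
}

# Helper introduced for this sketch: Gregorian leap-year rule.
sub is_leap_year {
    my ($year) = @_;
    return (($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0) ? 1 : 0;
}

# Assumes month and year have already passed their own checks.
sub proper_day {
    my ($day, $month, $year) = @_;
    return 0 unless $day =~ /^\d\d?$/;
    my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $max = $days_in_month[$month - 1];
    $max = 29 if $month == 2 && is_leap_year($year);
    return ($day >= 1 && $day <= $max) ? 1 : 0;
}

# Made-up dd/mm/yyyy inputs to exercise the three checks.
for my $date ('05/02/2014', '31/02/2014', '29/02/2016', '12/13/2014') {
    my ($day_in_month, $month_nr, $year) = split m{/}, $date;
    next if not proper_year($year);
    next if not proper_month($month_nr);
    next if not proper_day($day_in_month, $month_nr, $year);
    print "plausible date: $date\n";
}
```

With these inputs only 05/02/2014 and 29/02/2016 are printed: 31 Feb. is rejected by `proper_day`, and month 13 by `proper_month`. For real work, a date module from CPAN would likely be a safer choice than hand-rolled calendar rules.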

        Then there's the OO approach...
