Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Common Perl Pitfalls

by Joe_ (Beadle)
on Apr 09, 2012 at 21:52 UTC ( [id://964216]=perlmeditation: print w/replies, xml ) Need Help??

This node in no way means that I claim to be an expert on Perl. I hardly consider myself at an intermediate level (I'm still making my way through the Alpaca!). These are just some of the most common ways I've managed to shoot myself in the foot. I thought I would share them here in the hope that they would benefit others and in the hope that I may receive enlightenment from other, more experienced monks on how to better handle these issues. Most of them have to do with regex (go figure!). Here they are in no particular order:

Regex in a loop

This one is probably due to how much Perl sounds like regular, spoken English. I'm always tempted to go:
while($some_data=~ m/some_regex/) { # Do some stuff }
This just sounds natural: as long as the data you're reading matches your regex, keep doing some stuff. Only, the above code (should the regex match) would yield an infinite loop. The correct semantics for that (as I too often forget) are:
while($some_data=~ m/some_regex/g) { # Do some stuff }
The reason is that the first snippet would always match the same portion of $some_data, there would be nothing to advance the regex engine to other parts of the string. The 'g' modifier is exactly what's needed in this case. Its iterative matching capabilities mean that it keeps trying to match at the position where the last successful match ended, which makes for correct semantics.

A variable as part of a regex

I can't count how many times I wrote the following:
$to_replace='some_string'; $my_string=~ s/$to_replace/$better_data/;
While seemingly innocuous, the above code is actually catastrophically unsafe. To see why, imagine if I had written:
$to_replace='a caret looks like ^'; $my_string=~ s/$to_replace/$better_data/;
See the problem here? The semantics I was going for, here, were that I had this string whose occurrence I wanted to replace in another string. What Perl sees is that the variable $to_replace contains a regex metacharacter (^) and would interpret it as an anchor for a regex match. To get the correct semantics, you should make it a habit to quote your variables in regex substitution:
$to_replace='a caret looks like ^'; $my_string=~ s/\Q$to_replace\E/$better_data/;
Of course if you had meant the string $to_replace is an actual regex to match against, you're better off using the qr operator:
$to_replace=qr/^your_regex_here$/; $my_string=~ s/$to_replace/$better_data/;
The \Q and \E would have been wrong in this case, of course.

Regex in a loop 2

OK. So I learned about the 'g' modifier's use in a loop and I used it correctly as follows:
while($string=~ m/(reg)(ex)/g) { $string=~ s/$1$2/$1ister/; }
or so I thought! The problem with the above is that you're modifying the $string variable WHILE you're matching it against your regex. The 'g' modifier keeps track of the position where the last successful match left off, but that position will most likely refer to a totally different place after the substitution goes through. The above example is, of course, a bit contrived because I could just say:
$string=~ s/(reg)ex/$1ister/g;
If I do need to substitute in a loop, I usually match my regex against a dummy copy of the data:
$dummy=$string; while($dummy=~ m/(reg)(ex)/g) { $string=~ s/$1$2/$1ister/; }
As pointed out by Jenda, substitution in a loop may be done by omitting the 'g' modifier:
while($string=~ m/(reg)(ex)/) { $string=~ s/$1$2/$1ister/; }
This is no longer an infinite loop because you're modifying the string inside the loop before retrying the match.

Deleting some array elements

I've struggled a lot with the concept of deleting array elements at given indices. The two most common approaches I used were:
for($i=0;$i<@array;$i++) { if(&should_delete($i)) { delete $array[$i]; } }
and
for($i=0;$i<@array;$i++) { if(&should_delete($i)) { splice @array,$i,1; } }
The problem with the first approach is that none of the remaining elements have their index changed: 'delete' simply replaces the element(s) at the given position(s) with 'undef'. The second approach is a total disaster! When you splice an element from the array, the array shrinks. So, while the first splice may work correctly, all the subsequent ones will delete the wrong elements since the indices of all elements would have changed after the first splice. I found a very simple solution for this:
for($i=0;$i<@array;$i++) { if(&should_delete($i)) { splice @array,$i,1; $i--; } }
This works because whenever an element is spliced, the indices of all the elements that come after it will be decremented by one.
Thanks to an amazing one-liner courtesy of JavaFan, I now know better:
@array = @array[grep {!should_delete($_)} 0..$#array];
An amazing use of grep and (see his analysis in the replies) a more efficient approach than mine.

Slurping gone wrong

I'm a big fan of slurping input files and matching them against a regex in multiline mode. I believe the following is common idiom:
$rec_sep=$/; undef $/; $slurp=<INPUTFILE>; # Regex matching here ...
The problem with this is that it's all too common (for me at least) to forget to return the record separator to its old state (this is usually a newline unless you've changed it for your purposes). Why is this a problem? Imagine the following scenario (that has occurred to me before):
$rec_sep=$/; undef $/; $slurp=<INPUTFILE>; print "Enter something: "; $something=<STDIN>;
So what happens? The user, used to ending his/her input by pressing "Return" will be unpleasantly surprised to find that it won't work! Perl completely ignores that newline because it's not the record separator anymore. The solution: redefine $/ right after your slurp:
$rec_sep=$/; undef $/; $slurp=<INPUTFILE>; $/=$rec_sep; print "Enter something: "; $something=<STDIN>;
An even better solution (thanks again to Jenda) would be:
my $data = do {local $/; <INPUTFILE>};

Waiting on a pipe

This is a less obvious one. I once did something close to the following:
open COMMAND,'-|','some_command'; $input=<COMMAND>; #... bla bla $pid=fork #....bla bla wait if $pid #... bla bla
All the bla bla's represent sections of code I don't actually remember. The point is, the wait returned way before the forked process terminated. You know why? Because that pipe I had just opened to some command was actually another child process that wait perceived to have terminated correctly. Two better approaches are as follows:
open COMMAND,'-|','some_command'; $input=<COMMAND>; close COMMAND; #... bla bla $pid=fork #....bla bla wait if $pid #... bla bla
or
open COMMAND,'-|','some_command'; $input=<COMMAND>; #... bla bla $pid=fork #....bla bla waitpid $pid,0; #... bla bla
Of course, it's always good practice to close your pipes as soon as you don't need them.

True is 1, false is...?

Many, many, many times I've fallen for this one. While debugging, I would print out the value of some logical test to see whether it tested true. When it tests true, it prints out "1" as expected. When it tests false, however, Perl prints nothing. I used to go around in circles thinking there's something wrong with my debugging procedure, that my code never actually reached the print statement for some reason. The thing is, Perl treats the result of a false logical test as the empty string (in scalar context) and (I'm guessing) the empty list in list context; this is why nothing was being printed.

Shooting yourself in the foot on purpose

Finally, I leave you with a very short code snippet that me and a fellow Perl enthusiast once came up with in order to test the limits of our system:
undef $_ for keys %SIG; fork while 1;
A little piece of advice: close anything important you're working on before you run this one :D

Conclusion

I know that most of these will probably look too obvious to the veterans. Please feel free to suggest better solutions that the ones I provide (or give me another approach altogether). Also, if you can share some of the stories on ways you've managed to shoot yourself in the foot with Perl, that would be awesome, too!

EDIT: Made some changes to the proposed solutions above according to some keen insights from JavaFan and Jenda. Thank you for your constructive criticism.

Replies are listed 'Best First'.
Re: Common Perl Pitfalls
by JavaFan (Canon) on Apr 09, 2012 at 23:11 UTC
    The solution: redefine $/ right after your slurp:
    No. That's just another potential problem. The solution is:
    my $slurp; {local $/; $slurp = <INPUTFILE>};
    Or:
    my $slurp = do {local(@ARGV, $/) = "inputfile"; <>};
    Or even:
    my $slurp = `cat inputfile`;

      Great stuff. Thanks for the feedback.
      I really like the second solution.
      The third solution isn't as portable, though.
      Why do you think redefining $/ as I did is a potential problem?

        The third solution isn't as portable, though.
        cat is probably available on more platforms than perl is. Of course, Windows rules the world, and both cat and perl are ported to Windows -- and, AFAIK, neither comes standard with the OS. Unlike cat, perl is not included in the POSIX standard for shell utilities.

        Why do you think redefining $/ as I did is a potential problem?
        Well, you consider someone modifying the code to be a potential problem. Would if someone modifies your code, and adds a return after the first assignment to $/, but before the second? Would if someone wraps the code in an eval, and the read triggers an exception?

      Rather than mess around with $INPUT_RECORD_SEPARATOR (AKA: $/), and restore it afterwards, a better method would be to use the File::Slurp module.

      It is another common Perl pitfall to write new code for a common problem when you should have looked on CPAN. There is a very good chance that you will find a fully debugged implementation that considers all the edge cases. It never ceases to amaze me that people would prefer to spend half a day writing and debugging code, instead of 15 minutes finding and installing a module from CPAN.

        It always amazes me people prefer downloading a CPAN module, and using it, over writing a one-liner. I'm even more amazed that people think that just because there's a module on CPAN, it automatically is fully debugged and covers all the edge cases.

        I do wonder though, if it takes half a day to write:

        my $slurp = do {local $/; <HANDLE>};
        how long do you need to type in:
        use Some::Module::From::CPAN; my $slurp = Some::Module::From::CPAN->some_API(some_argument);
        Twice the number of lines, so, a full work day?
      First solution is not equal to last two, as it implies that INPUTFILE is already open. I would say that the correct idiom looks like this:
      my $slurp = do { open my $fh, '<', "inputfile"; local $/; <$fh> };

      P.S. I really like the second one, thanks. Not for production use, of course. :)

      Regards,
      Alex.

        Considering that Joe_'s example uses the handle INPUTFILE, I don't have any problems with the implication, and I really don't see the need to come up with the snobby term correct idiom. (You consider a piece of code with error handling to be correct idiom? You're fired).
        I really like the second one, thanks. Not for production use, of course. :)
        Why not? It's not significant different from your correct idiom. It misses error handling (but then, so does your correct idiom), but that's easily handled: just add a // die "slurp: $!";.
Re: Common Perl Pitfalls
by Jenda (Abbot) on Apr 09, 2012 at 23:55 UTC

    Re: Regex in a loop The first while statement is perfectly fine ... if you intend to modify the variable. The first loop reads "while the variable still matches the regular expression do something with the variable" while the second reads "while there's still something more to match in the variable do something with the match". The rule is that in the first case you SHOULD modify the variable within the loop, while in the second case you SHOULD NOT modify it.

    Re: Deleting some array elements You should use grep():  @filtered = grep {whatever('test', $you, want_with($_, 'aliased to an array element'))} @all;

    Re: Slurping gone wrong my $data = do {local $/; <INPUTFILE>}; or use File::Slurp

    Re: True is 1, false is...? Don't print just the value, print some more info to make sure you are looking at the result of the print statement you think you are and always put some kind of quotes around the variable: print "Computed the number of angels: >$angel_count<\n";

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I agree on almost all counts.
      I remember having used the option you talked about (modifying the regex and matching in the loop without 'g') before. I just don't like it, though. I feel that it's quite unstable and will turn into an infinite loop the second you're not looking...

Re: Common Perl Pitfalls
by JavaFan (Canon) on Apr 09, 2012 at 23:16 UTC
    Deleting some array elements
    I'd write that as:
    @array = @array[grep {!should_delete($_)} 0..$#array];
    If only because your splice solution can be quadratic worst case, while the above is linear (assuming should_delete has a running time bounded by a constant).

      That's a really great one, too.
      I've only recently started coming to grips with grep and map. I've always felt that this problem can be tackled by a one-liner but I just couldn't put my hands on it. Thanks for finally providing it :)

      Care to elaborate on that "quadratic" comment? How do you figure? I'm not that good with complexity theory, I'm afraid...

        Care to elaborate on that "quadratic" comment?
        Say you want to delete all elements in the second half of the array. The first N/2 iterations of your loop, no splicing happens. But on the N/2 + 1st iteration, the splicing takes at least N/2 - 1 steps, as that many array elements need to be moved. On the N/2 + 2nd iteration, the splicing takes at least N/2 - 2 steps. In total, you will be moving

        ΣN/2-1i=1(i)

        array elements. If I've done my math correctly, the above sum equals (N2 - 2N + 4)/8. Which means your algorithm runs in Ω(N2) time.

Re: Common Perl Pitfalls
by JavaFan (Canon) on Apr 09, 2012 at 23:28 UTC
    The thing is, Perl treats the result of a false logical test as the empty string (in scalar context)
    It's actually a dual (triple) var:
    $ perl -MDevel::Peek -wE '$x = 1 < 0; Dump $x' SV = PVNV(0x97f45a8) at 0x9805ad0 REFCNT = 1 FLAGS = (IOK,NOK,POK,pIOK,pNOK,pPOK) IV = 0 NV = 0 PV = 0x9801438 ""\0 CUR = 0 LEN = 4

      I will have to RTFM on that one :)

Re: Common Perl Pitfalls
by Anonymous Monk on Apr 09, 2012 at 23:56 UTC

      Thanks for the links. They seem really interesting.

Re: Common Perl Pitfalls
by JavaFan (Canon) on Apr 10, 2012 at 06:49 UTC
    Of course if you had meant the string $to_replace is an actual regex to match against, you're better off using the qr operator:
    I don't get this point. You started off that section with:
    $to_replace='some_string'; $my_string=~ s/$to_replace/$better_data/;
    and doomed this catastrophically unsafe, because $to_replace may actually contain characters that have a special meaning.

    But if $to_replace is actually a regexp, the premises is gone -- any special characters are intentional. In fact, it's quite fine in that case to use the above.

      I actually meant to say that one shouldn't use a scalar as a regex anyway. I meant to say that, even if your correct semantics didn't require the use of \Q and \E (i.e. you actually needed the metacharacters) then you're better off using the qr// operator instead of building your regex as a literal string.

        Well, that's nice you want to say that, but can you back up your statement with an argument?
Re: Common Perl Pitfalls
by sundialsvc4 (Abbot) on Apr 12, 2012 at 13:33 UTC

    I wish that this thread had not promptly become “threaded” so much, thereby diluting its content such that now someone would have to wade through a lot of back-and-forth conversation to glean the “final” meaning out of it -- some of those conversations seeming to be fairly nit-picking anyhow.   Threads, particularly in the Meditations section, are going to be referred-to for many years to come.   I’d therefore suggest that they ought to be built-up in this way, when possible ... you are building a sort of reference article when you write in this particular section.   If you want to debate fine-points, do it over in Seekers, then put a summarization or edit over here, with appropriate hyperlinks to the relevant discussions.   If you want to do major edits, come back up to the first level of reply-nesting.   I think that would make for better final-content, IMHO.   The <strike> strike tag is great to explicitly show edits.   Someone ought to be able to read just the top-article and perhaps the first-level replies and come away with an accurate reply (incorporating all of the various back-and-forths which they didn’t have to wade through) with a minimal amount of reading.   I just think that would be better... it reflects how I use this resource (constantly), anyhow, and what I personally would prefer as a reader.

Re: Common Perl Pitfalls
by girarde (Hermit) on Apr 11, 2012 at 18:06 UTC
    This post would have been imporoved by the use of <continue> tags.

      Thanks for the advice. I do agree. The <continue> tags didn't work so I used spoilers instead.

Re: Common Perl Pitfalls
by muba (Priest) on Apr 29, 2012 at 19:32 UTC

    while($string=~ m/reg(ex)/) { $string=~ s/$1/ister/; }

    That frightens me a bit. Your substitution replaces the first occurance "ex" with "ister" in the string, no matter whether "ex" is part of "regex" or something else:

    my $string = "Sometimes there are extra effects you didn't foresee whe +n using a regex."; while($string=~ m/reg(ex)/) { $string=~ s/$1/ister/; } print $string;

    Output:

    Sometimes there are istertra effects you didn't foresee when using a r +egister.

    Istertra?

      You are quite right. It's a terrible mistake on my part. I will edit it.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://964216]
Approved by ww
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2024-03-29 12:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found