Good-bye Unix filter idiom

by martin (Pilgrim)
on Sep 06, 2012 at 19:12 UTC (#992168, perlmeditation)

One of the perl idioms I have been using hundreds of times is the diamond operator in a while loop condition:

    while (<>) {
        # ... process a chunk of input ...
        # ... print result ...
    }

This idiom will make my program act like a Unix-style filter program. It will take its input from files named on the command line or, in the absence of arguments, standard input, and it will write to standard output. A dash as a filename also means standard input.
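
A minimal sketch of that behaviour, for illustration only: the helper names make_temp_file and run_filter are hypothetical, and the file names are placed in @ARGV by hand (via local) to stand in for command line arguments. The lines are collected rather than printed so that the result can be inspected.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write some text to a temporary file, return its name.
sub make_temp_file {
    my ($text) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $name;
}

# Hypothetical helper: run the classic filter loop over the given file
# names and collect the upper-cased lines instead of printing them.
sub run_filter {
    local @ARGV = @_;    # <> reads the files named in @ARGV, one after another
    my @out;
    while (<>) {
        push @out, uc $_;
    }
    return @out;
}

my @result = run_filter(
    make_temp_file("one\ntwo\n"),
    make_temp_file("three\n"),
);
print @result;
```

Both files come out as one continuous stream, which is exactly what makes the idiom so convenient.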

What I like about idioms like this is that they put boring stuff behind the scenes and leave the essence of my code, the parts that are specific to the task at hand, in the limelight.

This use of the diamond operator, however, is flawed. It makes programs inflexible and insecure.

Flexibility

More often than not, I need a filehandle for my input stream. For example, I may need to configure IO layers. The diamond operator provides a filehandle, ARGV, but only after having started to read from each file, which would be too late. The open pragma does not help me there either, as behind-the-scenes open is out of my scope and thus not affected by a pragma.

I can work around this by forcing the next file to be opened without consuming the first chunk of input, like this:

    use strict;
    use warnings;

    my $input_layers = ':encoding(utf-8)';

    if (!eof()) {
        binmode ARGV, $input_layers or die $!;
    }
    while (<>) {
        # ... process a chunk of input ...
        # ... print result ...
        if (eof && !eof()) {
            binmode ARGV, $input_layers or die $!;
        }
    }

The eof builtin function with empty parentheses checks whether there is more input available for the diamond operator, which means it must open the next file and try to read from it, thereby updating the ARGV handle. This way I can sneak in a binmode command before the diamond operator itself can grab some of the file content. The eof without parentheses prevents me from doing this more than once for each file, as it checks whether the file currently being read has already been exhausted.
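
To check that the layer really is in place before the first read of each file, the workaround can be wrapped in a small test. This is only a sketch: make_temp_file and read_decoded are hypothetical helpers, the two input files are temporary files created on the spot, and @ARGV is set by hand instead of coming from the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: store raw bytes in a temporary file, return its name.
sub make_temp_file {
    my ($bytes) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "close failed: $!";
    return $name;
}

# Hypothetical helper: read all lines via <>, applying $layers to each
# file before the diamond operator reads from it.
sub read_decoded {
    my ($layers, @files) = @_;
    local @ARGV = @files;
    my @lines;
    if (!eof()) {                   # eof() opens the first file
        binmode ARGV, $layers or die $!;
    }
    while (<>) {
        chomp;
        push @lines, $_;
        if (eof && !eof()) {        # current file exhausted, next one now open
            binmode ARGV, $layers or die $!;
        }
    }
    return @lines;
}

# "\xC3\xA4" and "\xC3\xB6" are the UTF-8 byte sequences for U+00E4 and U+00F6.
my @lines = read_decoded(
    ':encoding(UTF-8)',
    make_temp_file("\xC3\xA4\n"),
    make_temp_file("\xC3\xB6\n"),
);
printf "%d line(s), character counts: %s\n",
    scalar @lines, join ',', map { length } @lines;
```

If the layer is applied to both files, each line holds one decoded character; without the binmode calls each line would hold two undecoded bytes.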

To summarize, there is a way to do such things, but the idiomatic conciseness will go out of the window.

Security

I am concerned that my programs should only do things I intended them to. For example, a typical filter program should not run arbitrary code fed to it by the user.

That is why I avoid using open with two-argument syntax. Two-argument open will interpret its second argument, chopping whitespace from it, determining access modes from leading or trailing special characters, and even running external programs to set up an input or output pipeline. This can be hazardous.

The diamond operator, however, happily uses two-argument open to process the contents of the @ARGV array, which by default contains the command line arguments supplied by the user. This means that an argument "< foo" makes my program read a file named "foo", an argument "> foo" makes it clobber a file named "foo", and an argument "rm * |" makes it run a program that on my platform happens to remove every file in the current directory.
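
The pipe case can be reproduced harmlessly. The sketch below assumes a Unix-like system with an echo command on the PATH; read_via_diamond is a hypothetical helper, and the argument is assigned to @ARGV by hand in place of a real command line.

```perl
use strict;
use warnings;

# Hypothetical helper: feed the given "file names" to the diamond operator
# and return whatever it reads.
sub read_via_diamond {
    local @ARGV = @_;
    my @lines;
    while (<>) {
        push @lines, $_;
    }
    return @lines;
}

# The argument ends in a pipe character, so two-argument open treats it as
# a command to run and reads that command's output. The command here is a
# harmless echo; a hostile argument could name any program.
my @lines = read_via_diamond('echo this text came from a command |');
print @lines;
```

No file is opened at all: the text that reaches the loop is the output of the command.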

In order to squash this type of exploit, I need to sanitize @ARGV before I use the diamond operator. The CPAN module ARGV::readonly will try to do this for me, though not in a platform-independent way. Running perl in taint mode will prevent opening user-supplied filenames with access modes other than readonly, but on some platforms the only way to turn on taint mode is a command line flag supplied by the same user I am trying to guard myself against. This means I can't enforce it.

As an alternative, the CPAN module Iterator::Diamond looks promising. Unfortunately, it has some wrinkles wanting to get ironed out before it can act as a true replacement for the builtin construct (or indeed pass its own test suite under recent perl versions). This should be a matter of only some minor edits, though. In fact, I have just posted a patch to the module's bug tracker queue with a couple of suggestions.

With a working version of Iterator::Diamond, the old idiom could be replaced by this:

    use strict;
    use warnings;
    use Iterator::Diamond;

    my $iterator = Iterator::Diamond->new(magic => 'stdin');
    while ($iterator->has_next) {
        my $line = $iterator->next;
        # ... process a chunk of input ...
        # ... print result ...
    }

As there is now a constructor call to set things up, there is also a good place where future expansions could provide additional features such as an option to specify input layers for all files, say. I am looking forward to using this module or something similar.

Meanwhile, I'll use explicit three-argument open when I can afford it, and while (<STDIN>) when I am in a hurry.
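
What the explicit variant might look like, as a rough sketch rather than a finished replacement: for_each_line is a hypothetical helper, the choice to keep "-" as a name for standard input is an assumption carried over from the diamond operator, and the demonstration assumes a file system that allows spaces and "|" in file names (as Unix-like systems do).

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: call $callback->($line, $name) for every input line.
# Names are opened with three-argument open, so characters such as '<', '>'
# and '|' are part of the file name and never select a mode or a command.
sub for_each_line {
    my ($callback, $layers, @names) = @_;
    @names = ('-') if !@names;          # no names given: read standard input
    for my $name (@names) {
        my $fh;
        if ($name eq '-') {
            $fh = \*STDIN;
            binmode $fh, $layers or die "Cannot set layers on STDIN: $!\n";
        }
        else {
            open $fh, "<$layers", $name
                or die "Cannot open '$name' for reading: $!\n";
        }
        while (my $line = <$fh>) {
            $callback->($line, $name);
        }
        close $fh if $name ne '-';
    }
    return;
}

# Demonstration: a file whose name would be run as a command by <>.
my $dir  = tempdir(CLEANUP => 1);
my $name = File::Spec->catfile($dir, 'echo oops |');
open my $out, '>', $name or die "Cannot create '$name': $!\n";
print {$out} "plain file content\n";
close $out or die "close failed: $!";

my @seen;
for_each_line(sub { push @seen, $_[0] }, ':encoding(UTF-8)', $name);
print @seen;
```

In a real filter the last call would pass @ARGV instead of the demonstration file. The input layers are now an ordinary parameter, which also covers the flexibility concern raised earlier.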

Good-bye, diamond operator.

Re: Good-bye Unix filter idiom
by BrowserUk (Pope) on Sep 06, 2012 at 22:10 UTC
    The diamond operator, however, happily uses two-argument open to process the contents of the @ARGV array, which by default contains the command line arguments supplied by the user. This means that an argument "< foo" makes my program read a file named "foo", an argument "> foo" makes it clobber a file named "foo", and an argument "rm * |" makes it run a program that on my platform happens to remove every file in the current directory.

    I don't get this.

    You (or someone) are using a "unix filter" perl script at the command line.

    You are concerned that you (or they) might accidentally (or maliciously) type a dangerous command in place of a filename as input to that script.

    What is to stop you (or them) from entering that command directly into the command prompt you (or they) are using?


      What is to stop you (or them) from entering that command directly into the command prompt you (or they) are using?

      The danger is that you can have a file named "rm -rf / |", so running ./foobar * ends up executing that command (and deleting files) instead of opening the file.

      A workaround is to run perl -MARGV::readonly foobar *

      Quoth BrowserUk,

      I don't get this. You (or someone) are using a "unix filter" perl script at the command line. You are concerned that you (or they) might accidentally (or maliciously) type a dangerous command in place of a filename as input to that script. What is to stop you (or them) from entering that command directly into the command prompt you (or they) are using?

      The program might run with privileges other than those of the user. This is a fairly common scenario, and the "Unix-like filter program" pattern is intended to fit into it.

      Writing to standard output, e.g., is also part of the strategy to avoid overwriting arbitrary files. If the user employs redirection to write the output to a file, this file will be (or fail to be) created according to the user's privileges. Conversely, if the user just told the program where to put the output, the program would have to worry about doing this safely. This would add complexity.

Re: Good-bye Unix filter idiom
by sedusedan (Monk) on Sep 07, 2012 at 03:06 UTC

    In my module Data::Unixish, I am using this idiom:

    while (my ($index, $item) = each @ary) { ... }

    I picked each() over $iterator->next() as the former is much faster and works with Perl arrays. This way, my data can be a Perl array, or a file handle (accessed via array using Tie::File), or STDIN/files in command-line arguments (the diamond operator, accessed via array using Tie::Diamond) and I don't need to care.

    So, long live the Unix filter idiom :)
