Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

"Correct" program style questions

by Ovid (Cardinal)
on Oct 23, 2002 at 02:34 UTC ( #207263=perlquestion: print w/ replies, xml ) Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Bit of an odd question, but figured I would toss this out to some people. I'm preparing some material for people who may be new to the CGI module and to Web programming in general and I realized that some of my "style" in programming might seem a bit odd to some. As I am already familiar with my style, it seems natural to me. However, I thought I should toss this out to my fellow monks for some feedback.

#!/usr/bin/perl -wT use strict; use CGI qw/:standard/; use HTML::Entities; my $_name = param('name') || ''; my $_color = param( 'color' ) || ''; my ( $name ) = $_name =~ /^([[:alpha:][:punct:][:space:]]+)$/; my ( $color ) = $_color =~ /^([[:alpha:]]+)$/; encode_entities( $name ); print header, <<"END_HTML"; <html> <head><title>Test page</title></head> <body> <p>Your name is "$name" and the color you listed was "$color"< +/p> </body> </html> END_HTML

First, you'll notice that I tend to use two different variables for grabbing parameter values. I prefer two as it's much less likely that one will accidentally read bad data into the untainted variable. Also, the following popular construct can generate a warning if the parameter is undefined:

my ($action) = param('action') =~ /^([[:alpha:]]+)$/;

As I don't like spurious warnings in my error log, it seems cleaner this way.

Also, I am using POSIX character classes. I feel that they are easier on the eyes, but many don't care for them. I could also switch to Unicode if you think that would be more general purpose or go back to the traditional regex characters if that is what the audience will be more familiar with.

The Web sites that I build are usually, but not always for US audiences. As a result, I haven't had to worry too much about Unicode and friends. Would the program be less "correct" if I switched [:alpha:] to [a-zA-Z]? Or would it be better if I used \p{IsAlpha}? What are the pros and cons of using traditional regex characters versus POSIX or Unicode?

Any other comments welcome.

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Comment on "Correct" program style questions
Select or Download Code
Re: "Correct" program style questions
by sauoq (Abbot) on Oct 23, 2002 at 02:55 UTC
    Also, the following popular construct can generate a warning if the parameter is undefined:
    my ($action) = param('action') =~ /^([[:alpha:]]+)$/;
    As I don't like spurious warnings in my error log, it seems cleaner this way.

    Why would you consider those warnings spurious? If they were a result of an incorrect parameter name in a form, they could help with debugging. If they are the result of someone purposefully trying to submit incorrect parameters it might be interesting to know that...

    The code looks fine, by the way. I don't use POSIX character classes very often myself but I don't have any problem reading them.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: "Correct" program style questions
by Kanji (Parson) on Oct 23, 2002 at 03:05 UTC
    Would the program be less "correct" if I switched [:alpha:] to [a-zA-Z]? Or would it be better if I used \p{IsAlpha}? What are the pros and cons of using traditional regex characters versus POSIX or Unicode?

    I don't think it would be any less correct, but one of the pros of traditional character classes is compatability with older versions of Perl.

    Whether that's enough a pro to warrant change depends on how prevalent 5.005_03 is in your circles, but perhaps a require 5.006; would be in order if it isn't?

        --k.


Re: "Correct" program style questions
by perrin (Chancellor) on Oct 23, 2002 at 05:52 UTC
    Pros and cons of POSIX character classes? Con: I have never ever seen anyone use them before. Pro: they are portable.

    I would suggest naming your variables $tainted_name and $tainted_color instead of $_name and $_color, to make it more obvious what you're doing.

      ... or maybe something more terse like $t_name and t_color. That's just a question of style really though; I agree with your point.
        $t_name doesn't mean the same thing as $tainted_name. In general I suggest avoiding the impulse to abbreviate. It may make the code shorter, but it also makes it harder to read.
Re: "Correct" program style questions
by BrowserUk (Pope) on Oct 23, 2002 at 10:12 UTC

    Not so much a suggestion as a question arising... Is there any particular reason for using temporary vars as opposed to say

    #!/usr/bin/perl -wT use strict; use CGI qw/:standard/; use HTML::Entities; my ( $name ) = (param('name') || '') =~ /^([[:alpha:][:punct:][:s +pace:]]+)$/; my ( $color ) = (param('color') || '') =~ /^([[:alpha:]]+)$/; encode_entities( $name ); print header, <<"END_HTML"; <html> <head><title>Test page</title></head> <body> <p>Your name is "$name" and the color you listed was "$color"</p> </body> </html> END_HTML

    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
      my ( $color ) = (param('color') || '') =~ /^([[:alpha:]]+)$/;

      That's an interesting line. My only immediate problem with it is that we come closer to obfuscating the code. People will likely understand the intent, but not necessarily see the subtleties. For instance, $color will be undef and not the empty string, if param('color') doesn't return anything. However, if the regular expression was stripped from the end of that, then $color would indeed default to the emptry string. Of course, I can't say that mine is much clearer in that regard.

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

        it's worse than obfustication, it's out right betrayal worthy of the machiavellian-est of perl programers. if param('color') returns a false value (be it undef or the empty string) the regex against '' is going to always return undef.

        I assume you simply want to provide a default value for $color if a) it's not present (param('color') returns empty string or undef) or b) it isn't a sequence of only [:alpha:] chars.

        use locale; my $color = param('color') =~ /^\w+$/ ? param('color') : "default color";
        or you could write it as:
        my $color = "default color"; $color = param('color') if param('color') =~ /^\w+$/;

        of course, the real solution to this is:

        use locale; sub is_alpha { my ($datum) = @_; # return "" so we don't get unitialized value errors $datum =~ /^(?:\w+)$/ ? $datum : ""; } my $color = is_alpha(param('color')) || "default value";

        Or why not sub class CGI and add these checks to CGI::param? Have MyCGI::param return an object with color, name, etc methods so you could just write $cgi->param->color;

        Of course, I can't say that mine is much clearer in that regard.

        It isn't. I felt ickier the more I looked at that code for handling params too.

        There is no reason to try to differentiate between undef and the empty string in this case. Interpolated, printed, or otherwise stringified, undef is an empty string. So,

        # Cargo-cultish version with a temp variable. my $_name = param('name') || ''; my ($name) = $_name =~ /^([[:alpha:]]+)$/;
        is equivalent to
        # BrowserUk's cargo-cultish only version. my ($name) = (param('name') || '') =~ /^([[:alpha:]]+)$/;
        which is also equivalent to
        # My preferred version. my ($name) = param('name') =~ /^([[:alpha:]]+)$/;

        Using the construct param('name') || '' at all is questionable. It throws away information. CGI returns an empty string in the case that the parameter was submitted with no value and undef in the case that it wasn't submitted at all. If you need to differentiate between those cases then you can check for undef. If you don't need to, you can ignore the difference (as above.)

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: "Correct" program style questions
by joe++ (Friar) on Oct 23, 2002 at 18:15 UTC
    Hi Ovid,

    I would personally rather not import anything in my namespace, but use the OO interface instead. This is a matter of taste, but anyway this would be "my coding style":

    #!/usr/bin/perl -wT use strict; use CGI; my $cgi = new CGI; my $_name = $cgi->param('name') || ''; ...
    I know, this isn't going to work with HTML::Entities, but I made the lucky choice to use UTF-8 as encoding for our webpages, so no character escaping necessary (unless your input charset isn't UTF-8, but that's another story about Text::Iconv).

    Otherways, I like the POSIX character classes (even more when we're talking UTF-8 and international users). I still use those old fashioned ranges like [a-zA-Z] however, 'cause we are still on 5.00_03 on our production servers blush ;-)

    --
    Cheers, Joe
    Do I need a Disclaimer?

      joe++ wrote: I know, [not importing anything] isn't going to work with HTML::Entities....

      Actually, it can work with HTML::Entities. Merely specify an empty import list and fully qualify the function name.

      use HTML::Entities (); HTML::Entities::encode_entities( $name );

      Alternatively, you could simply require HTML::Entities and not call the import method.

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Re: "Correct" program style questions
by l2kashe (Deacon) on Oct 23, 2002 at 18:44 UTC

    I've never used the POSIX char classes, but my suggestion would be...

    1) Somewhere at the beginning of the write up state something like "For regular expressions I will be using [:alpha:] to match alpha-numerical strings, which is generally matched via [a-zA-Z0-9], so on and so forth for each of the char classes

    2) Within the first few snippets give a few quick in regards to TIMTOWTDI with references to camels, llamas, and panthers (o my!), but then state that henceforth your code will always be in this form to maintain consistency across the text.

    Also I would show the reasoning behind using multiple variables, even though you don't *have* to, so they at least have a feeling of the thought process behind creating the code. I've heard I don't know how many times.. "Oh yeah, well, I can read it, but writing it is still way too beyond me."... Generally if they can follow the why it was used, the how/when will come as a natural extension

    P.s: sorry about nesting the first comment in code blocks, but I'm not familiar with HTML, don't really want to be, and I couldn't figure out how to get a literal bracket, which I will now figure out how to do.. :P

    /* And the Creator, against his better judgement, wrote man.c */

Re: "Correct" program style questions
by adrianh (Chancellor) on Oct 23, 2002 at 21:39 UTC

    The Once And Only Once bigot in me would force me to rewrite

    my $_name = param('name') || ''; my $_color = param( 'color' ) || '';

    as

    my ($_name, $_color) = map {param($_) || ''} qw(name color);

    My illogical fear that there would be at least one person in the world called '0' would then force me to rewrite the above as:

    my ($_name, $_color) = map {defined($_) ? $_ : ''} map {param($_)} qw(name color);

    As for the regex style... Personally I would go for traditional regexes if I was running a tutorial, since this is what most students would come across if they tried to follow up their perl work elsewhere. However that's just my preference - and depends on the audience and what they're expected to do with what their taught ;-)

    I would also use the OO style of using CGI since I dislike namespace pollution - but that's just me.

    I would also run encode_entities over $color as well as $name. I met a totally evil proxy a few years back that stripped HTML content to 7 bits and have been paranoid ever since!

    AFAIK \p{IsAlpha} is exactly the same as using :alpha:.

    (and are we ignoring uninitialised value warnings from the print if $name or $color don't match? :-)

Re: "Correct" program style questions
by particle (Vicar) on Oct 24, 2002 at 01:34 UTC

    i have changed my regular expression style significantly as of late. my list of reasons includes training regex newbies and preparing for perl 6.

    here are some of the changes

    • never use a bare regex: // is always written as m//
    • always use the x modifier, and put it up front (more like perl6):
      m/(?x) ((?:blah){3}) ## match w/ capture: blahblahblah /
    • replace all ^ and $ anchors with the safer \A and \z. once bitten, twice shy ;)
      if i read the apocalypses correctly, perl 6's ^ and $ anchors will DWIM better than the current implementation

    my take on your regexes might go something like:

    ... =~ m/(?x) \A ## match w/ capture: one or more alpha, punct, or space posix charac +ters ( [[:alpha:][:punct:][:space:]]+ ) \z /; ## and ... =~ m/(?x) \A ## match w/capture: one or more alpha posix characters ( [[:alpha:]]+ ) \z /;

    the comments are a little verbose. i'd consider them extraneous for this example, but neccessary in more complex patterns.

    ~Particle *accelerates*

Re: "Correct" program style questions
by kshay (Beadle) on Oct 24, 2002 at 15:15 UTC
    This isn't one of the specific points you asked about, but I've developed a habit of not printing HTML code directly. I almost always build up a variable containing the result page, and then print it at the end.

    Obviously it doesn't make any difference in this case, but in most scripts that do anything non-trivial, there are opportunities along the way for something to go wrong, in which case you want to bail out and print an error page instead of what you'd planned to return. If you've been printing pieces of your page as you go along, you can't do that.

    --Kevin

Re: "Correct" program style questions
by blssu (Pilgrim) on Oct 24, 2002 at 15:25 UTC

    I've just gone through about 300 CGI scripts helping people clean up code and convert to Apache::Registry. Here's some of what I've learned.

    • Beginners need structure more than anything. I like your code for parameters because it is easy to add a new one correctly. We had serious problems with tainting and your code should eliminate that.

      The only thing I would change with your param code is to reformat it like this:

      my $_name = param('name') || ''; my ( $name ) = $_name =~ /^([[:alpha:][:punct:][:space:]]+)$/; my $_color = param('color') || ''; my ( $color ) = $_color =~ /^([[:alpha:]]+)$/;
      That's easier to cut and paste. Beginners always cut and paste regardless of how many times I warn them it's a bad habit. "Use loops!" "Use subroutines!" "Sigh." "Ok, but at least fix the indentation."

    • We had major problems with duplicated constants (between and within scripts). I recommend creating an application-setup module and then importing it into every script. Once again, beginners will look for how it was done before and then try to copy it. Something simple like this:

      package MyApp; use strict; $MyApp::data_dir = "/usr/local/data"; $MyApp::db_user = "app"; $MyApp::db_password = "secret"; ... 1;
      The repeated use of the package name is ugly, but it cuts and pastes more easily. People will also automatically use the full package name in the CGI code, so confusion with lexical variables is reduced.

    • Put a few utility subroutines in the application-setup module, just so that people will see how that works. For example, create a sub that untaints a parameter used in many different scripts. The people I worked with had trouble thinking ahead -- they never made decisions that reduced maintenance work -- but they did recognize things that were easier "now".

    • If you think you might use mod_perl some day, wrap all your scripts in a sub and then call the sub. Like this:

      #!perl handler(CGI->new); sub handler { my($q) = @_; ... }
      The sub must be removed when you convert to mod_perl, but it will prevent subtle problems and it will get people thinking of the CGI script in terms of a "request handler" instead of "running a script".

Re: "Correct" program style questions
by princepawn (Parson) on Oct 24, 2002 at 16:31 UTC
Re: "Correct" program style questions
by Aristotle (Chancellor) on Oct 26, 2002 at 00:00 UTC
    Here's a condensed form that can correctly differentiate all cases as per sauoq's requirements elsewhere in this thread. It certainly isn't something for the beginner's class though.
    my ($foo, $bar, $baz, $quux); defined && /^([[:alpha:]]+)$/ ? $foo = $1 : taint_fail('foo') for scal +ar param('foo'); defined && /^([[:alpha:]]+)$/ ? $bar = $1 : taint_fail('bar') for scal +ar param('bar'); defined && /^([[:alpha:]]+)$/ ? $baz = $1 : taint_fail('baz') for scal +ar param('baz'); defined && /^([[:alpha:]]+)$/ ? $quux = $1 : taint_fail('quux') for sc +alar param('quux');
    But any remotely hubristic programmer will of course write this:
    my %check = ( foo => qr/^([[:alpha:]]+)$/, bar => qr/^([[:alpha:]]+)$/, baz => qr/^([[:alpha:]]+)$/, quux => qr/^([[:alpha:]]+)$/, ); my %f; for my $pname (keys %check) {; defined && /$check{$pname}/ ? $f{$pname} = $1 : untaint_fail($pnam +e) for scalar param($pname); }

    This will call untaint_fail only when the regex failed, and preserve undef vs empty string where they're valid input.

    Update: Oops. The code would previously leave things tainted since it was assigning $1 back to the tainted variable. Shifting things around minorly fixed that.

    Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://207263]
Approved by mdillon
Front-paged by footpad
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (5)
As of 2014-10-25 01:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (139 votes), past polls