Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

An exercise in "That's not the way I'd do it" (TNTWIDI)

by jcwren (Prior)
on Mar 18, 2001 at 05:45 UTC ( #65223=perlmeditation: print w/replies, xml ) Need Help??

This idea came about as a result of a chatterbox conversation (one of many, in reality) on commenting code, idioms, semantics, and how we all have different coding habits.

I'd like to propose that we all look at xluke_repwalker.pl, and pick it to pieces.

Ground rules:

  • I will not get offended at anything anyone says about this code.

  • It's not a golfing contest, at least to the degree of "I can reduce this line by two characters, and still claim that it will run"

  • Claiming that whether leading and/or trailing spaces around parantheses or not makes it more readable doesn't count, nor does whether the opening brace goes on the same line as the if/for/while verb.

  • Comments about readability are encouraged. Do variable names make sense? Could we improve the program flow? (Let's skip over the "We could make this a module, and permit other programs to have repwalking capability", however.)

  • Certain people saying that've already done that in their WebTechniques column doesn't count <G>

I can't think of any other rules that are really necessary, so let's start with that. Doesn't matter what your skill level is particularly, all are encouraged to comment.

Be sure to reply to this node, and not the xluke_repwalker.pl node, please.

--Chris

e-mail jcwren
  • Comment on An exercise in "That's not the way I'd do it" (TNTWIDI)

Replies are listed 'Best First'.
(dkubb) Re: (2) An exercise in "That's not the way I'd do it" (TNTWIDI)
by dkubb (Deacon) on Mar 18, 2001 at 10:22 UTC

    I have a few suggestions for improvement to your code, I'll outline the general ones then get more specific.

    These are based on my personal opinion, please feel free to review my review.

    • IMHO, a general rule is that the data type you use should closely represent the way you want to use the data in your code. In your code, I would would "use vars" for variables that are global in nature, but that are still going to be modified by your code. Constants are things which should never change, and will stay the same through-out the execution of your program. I believe you should "use constant" for this type of data. For example, your defaults at the top of the script should be constants, not global variables..

    • The data source is the root of all data structures. The data structure you use has different names than the data source. Why bother to remember that "last" maps to "LastReputation" in the database? If you change the names in your data structure to match your database column names, your code becomes simpler.

      Calling the same thing by two different names is sure to cause confusion later on.

    • I noticed you asked for comments regarding readability. Often when I am not sure, I refer to perlstyle for style guidelines. It's not perfect, but it seems to cover the general consensus of the perl community.

    • Also, another tip that I use when trying to name my object variables, I refer to the module's docs for their naming conventions. For example, say $sth to anyone who's worked with DBI and they'll instantly know what you are talking about. Say $r and anyone who's work with mod perl will tell you it's the Apache Request object. About the only place I break this is with CGI, I don't like using $q and prefer to use $cgi, I think it's more self describing.

    • It's usually safer to do checks on hashes using exists and/or defined, rather than just doing the equivalent of a if($hash{key}) {}. You'll eliminate some possibility of unitilized value warnings.

    • When defining a hash or array, and you mean for it to be empty, it's redundant to do this: my @array = ();, when you can just do: my @array;. This is more personal preference, but I believe it's more readable without this redundancy.

    • I notice that you tend to use if(! $test) {} instead of unless ($test) {}. I usually try to stay away from these types of constructs, in general. It can get you into some nasty logic mind-benders, where you have code that's doing the equivalent of an english "double negative".

      All through grade school my English teachers always told me not to do this, but I never really quite understood what they meant, until I began coding..

    • Inside your code there is a temporary variable, that has the prefix $out, which is the output variable that gets emailed and/or printed to STDOUT. I think it might be a better idea to do your formatting all in a single place, rather than mixing your logic and presentation code. You could have a hash that represents state, and place your data into it, then right before you send the mail or print, you run it through a filter to format it. I've used HTML::Template in an almost indentical situation as this, and it works wonderfully. This won't work very well is if the amount of data is large, but I believe it should in your case.

      This seperation of logic and presentation allows you to output the data in any format you like, even HTML. And the resulting code becomes much cleaner and easier to understand.

    • You should change the code so that it either uses a database or a flat file. Right now your code is always using a flat file, and sometimes using a MySQL database. Why not use one or the other? It would simplify your code somewhat, as you'd create a single $dbh database handle, and pass it around to your read_file(), write_file(), and db_update() routines. Your -d flag should allow people to decide between csv or MySQL, with the default being csv of course.

    Keep in mind the above suggestions are snippets from my own personal style. While it's not the one true way, I've found it usually does not conflict with the general consensus on perlmonks and CPAN. As well, I've never been burned by the suggestions above or below - that usualy happens only when I ignore my own guidelines. =)

    • The place where you check $arg{d} you create a synthetic variable called @nodelist, then proceed to push the results of a loop onto it. You could have done a direct assignment, or even better not created the variable in the first place.

      In the call to update_replist() I think it would be clearer if it returned a hash reference, equivalent to %dbreplist. IMHO, passing in a reference to something into a subroutine, only to have it be modified in the subroutine can have the same unintended side effects that people face when using global variables. In general, I think if you can help it, don't do it. In this case, there are a few places where it might be required, so place the hash reference as the final parameter to this subroutine, so it can be optionally passed in.

      Here's the block using the suggestions above:

      if(exists $arg{I}) { my $hreplist = initialize_rep_file($username, $password, $filename); db_update( update_replist('I', $hreplist, [keys %$hreplist]) ) if exists $arg{d}; exit; }
    • read_file/write_file:

      Both of these routines do things that are already in DBD::CSV. Granted, there is a little bit of increased over-head, but the code inside the module is documented and standardized.

      Using DBD::CSV read_file becomes: (Note, the name should probably be changed to read_source or something similar)

      sub read_file { @_ == 2 or croak 'Incorrect number of parameters'; my $dbh = shift; my $source = shift; my $statement = sprintf(q{ SELECT NodeId , Title , Date , LastReputation , Reputation FROM %s }, $source, ); my $sth = $dbh->prepare($statement); $sth->execute; my %nodehash; while(my $row = $sth->fetchrow_hashref) { croak "Node ID $row->{NodeId} is duplicated!" if exists $nodehash{ $row->{NodeId} }; $nodehash{ $row->{NodeId} } = $row; } return \%nodehash; }

      Now if it is changed to using DBD::CSV, it might even be possible to factor out write_file(), and use dbupdate() (below) for both functions, but I will leave that for jcwren to decide.

    • dbupdate:

      • The SQL query here should be written to use placeholders and be prepared outside of the loop.
      • You can simulate the the or croak behaviour using the RaiseError parameter in DBI.

      Here's an example of the resulting code:

      sub db_update { @_ == 2 or croak 'Incorrect number of parameters'; my $dbh = shift; my $hreplist = shift; my $statement = sprintf(q{ INSERT INTO %s (%s) VALUES (%s) }, DEF_DBTABLE, join(', ', @{ &DB_COLUMNS }), join(', ', ('?') x @{ &DB_COLUMNS }), ); my $sth = $dbh->prepare($statement); foreach my $nodeid (sort keys %$hreplist) { $sth->execute(@{ $hreplist->{$nodeid} }{ @{&DB_COLUMNS} }; } }
      Your comments were quite educational for me. However, I did not fully understand your point quoted below. What is the danger of modifying the reference in a subroutine, and what should you do instead? Thanks.

      "IMHO, passing in a reference to something into a subroutine, only to have it be modified in the subroutine can have the same unintended side effects that people face when using global variables. In general, I think if you can help it, don't do it."

        The claim about changing things by reference being similar to globals is one I understand but only halfway agree on. What is similar is action at a distance. However while globals by definition create that, modifying by reference can either increase or decrease.

        Remember that the goal of programming well is to do the job, within your set constraints, while maximizing the maintainability of the code. From the point of view of getting something simple done, globals are always easiest. However they also open you to global side-effects, which increases the amount the programmer needs to keep in mind. In general if you have n sections of code, any of which can interact with any other, that is order n*n interactions to keep in mind. Which is bad.

        However many people hear this and say, Oh, I won't use globals. But I will use a reference to something that I pass around everywhere and set and change as I need. Voila! No more globals!

        But while you have removed the label, you have not changed the problem very much. The data is still being shared everywhere, it is just being shared by a different mechanism. You still have action at a distance. You still have the maintainability issue.

        This is the sense in which I agree with dkubb's criticism.

        However I only halfway agree. As a counter example in object oriented programming you everywhere pass around a reference to something, and you can modify it by reference anywhere. Yet good programmers who avoid globals are quite willing to use OO programming. What is different?

        The difference is that OO puts the code that requests from the data structure together with the code that places it. Therefore all of the information about mechanics, internal counters, etc is encapsulated in a relatively small section of code. While requests may come from many places for many reasons, the mechanics of how to do it no longer suffer as much from action at a distance.

        Now in practice that might not be true. And certainly it takes experience to produce good OO designs. But that is the principle.

        But I should qualify all of this with one observation. Perl as a language has a ton of global variables. Good programmers avoid them. However, despite having a ton of globals, Perl is a rather effective language. How does that work?

        IMHO what is going on is simple. The maintainance problem that globals cause is due to the increase in how much you need to keep in mind. Globals are almost always useful in terms of getting things working, but they bite you down the road. Therefore there is a balance between the theoretical evil and the practical utility. Every global is another thing to be constantly kept in mind, and only applies when working on that code. Therefore the cost of learning Perl's globals can be amortized across many programs. While the cost of keeping a global in your scripts can only be amortized across that script. So the tradeoff comes at a different point.

        Even so I find that for me, Perl has gone too far. Most of the special variables in perlvar I do not use. My brain is too small to keep them in mind. At least they have names that cannot be confused with regular variables, so if someone else uses one I can look it up. But I don't use them. I prefer to use that mental energy for general principles, known bugs, and remembering my grocery list. :-)

Re (tilly) 1: An exercise in "That's not the way I'd do it" (TNTWIDI)
by tilly (Archbishop) on Mar 19, 2001 at 07:20 UTC
    First of all I really like the idea of this thread. I think that the process of reviewing code is very instructive. This is particuarly true when experienced programmers review each other's code.

    I hope you don't mind, but I picked a simpler piece of code, xstatswhore.pl to work with. It is at least related...

    1. You have documentation aimed at users embedded in the code using #. If you put the same documentation in pod format it can be extracted with the perldoc utility.
    2. You have key variables like $pm_site which are scoped for the whole program basically so that you can issue a request in one function and then do error handling in another. I prefer to put the error handling where the action happens, and if I can move that variable to a tighter scope, that is fine by me.
    3. You have a known problem in your data - the home node is considered an article and you don't want it to be. Again there is action at a distance with the fix for the issue appearing nowhere near where the problem is. Were vroom to fix the bug on his end, you would also now be correcting for a non-existent bug and you would now get the wrong answer. I would try to find the home node and delete it.
    4. The indenting on your XML parsing goes very deep. If you are going to leave the logic inline, using a standard small indent for each level would make it more readable for me. I keep to 80 columns.
    5. On a related note, I know mirod will dislike me for saying this, but every time I look at the XML approach to a problem I have to wonder why it is so hard to work with. I am seriously tempted to handroll a solution to kill that dependency. I have had bad luck in the past on XML::* upgrades, and many have (as you note) trouble installing them. Plus I bet it would be faster.
    6. In get_article_list, why do you pull the node in and then call it through @_ rather than using the variable that you have in scope?
    7. I agree with you that max and min would be nice to have as native functions. But you can get them from List::Util, and that is good enough for me. Plus the way that it implements them is able to take a list as an argument.
    8. I don't like the hard-coded constants in your summarize function. Also Perl is list-oriented. If you take the approach of first extracting an array of reps and then taking summary statistics on that, then you can simplify that code without running into bugs on users with only negative reps.
    9. The histogram code is buggy in at least 2 ways. First of all the maximum rep has nothing to do with the widest bin. Secondly the home node is an extra article with a rep of 0. The second one shows, again, the dangers of fixing a bug at a distance from where it enters your code. Then you have to always remember it and fix it at a distance in multiple places.
    10. In show_histogram I don't like the naming of the hash. On page 8 of Camel 2 there is the advice that a hash lookup is linguistically "of". So you say that $wife{Adam} = "Eve";. Given that the hash is has the size of the bin for its value, it deserves a name like %bin_count.
    11. I do not like the way that show_histogram loops. In my experience looping by explicit incrementing values (particularly if you are incrementing several in parallel) is easier to get wrong than looping over a range.
    12. A matter of personal taste. I tend to leave out tests on numbers of arguments. Then again I tend to like APIs with variable number of arguments. And if I want to be sure that the caller knows my API, I am likely to hack up a named argument approach - in my experience I am less likely to forget how many arguments I have than I am to forget what order they go in. So I omitted them, but that is very much a YMMV item.
    And I guess that if I am going to say all of that, I should show what the revised version looks like:
    #!/usr/local/bin/perl -w use LWP::Simple; use XML::Twig; use URI::Escape; use HTML::Entities; use List::Util qw(min max sum); use Carp; use Getopt::Std; use POSIX qw(ceil floor); use strict; { my %args; getopts('u:p:b:', \%args); # Replace these dies with username/password my $user = $args{u} || # You can put your name here die("You need to specify the user with -u"); my $pass = $args{p} || # You can put your password here die("You need to specify the password with -p"); my $bin_size = $args{b} || 5; my $articles = get_article_list($user, $pass); my $summary = summarize ($user, $articles); show_reps($summary); if (0 < $bin_size) { show_histogram($bin_size, $articles, $summary); } } # Takes user name/password, returns a hash of articles # Sorry mirod, every time I look at this kind of thing, # I have to wonder *why* programming for XML always # seems so much harder than it needs to be. sub get_article_list { my ($user, $pass) = @_; my %node_info; my $twig = new XML::Twig( TwigRoots => { NODE => sub { my ($t, $node) = @_; my $title = decode_entities($node->text()); $title =~ s/&apos;/'/g; # Why is this missed? my %rec = ( id => $node->att('id'), rep => $node->att('reputation'), title => $title, date => $node->att('createtime'), ); if (exists $node_info{$rec{id}}) { confess("Duplicated id '$rec{id}'"); } else { $node_info{$rec{id}} = \%rec; } $t->purge(); }, } ); $twig->parse( get_pm_node("User nodes info xml generator", $user, $pass) ); unless (%node_info) { confess("Couldn't fetch article info for $user with pass $pass"); } # Vroom includes your home node as an article. :-( my $home_id = min (keys %node_info); if ($node_info{$home_id}{title} eq $user) { delete $node_info{$home_id}; } else { warn("Is $home_id your home id? Presuming not.\n"); } return \%node_info; } # Takes a node name, plus optional name/password, and fetches the page sub get_pm_node { my $node = uri_escape(shift); my $url = "http://www.perlmonks.org/index.pl?node=$node"; if (@_) { my $user = uri_escape(shift); my $passwd = uri_escape(shift); $url .= "&op=login&user=$user&passwd=$passwd"; } # Why doesn't LWP::Simple preserve the error? return get($url) or confess("Cannot fetch '$url'"); } sub show_histogram { my $bin_size = shift; my $articles = shift; my %bin_count; foreach my $rep (map {$_->{rep}} values %$articles) { my $bin = floor( ($rep + 0.5) / $bin_size); ++$bin_count{$bin}; } # Try to keep on one page my $width = 50; my $max_count = max (values %bin_count); my $scale = ceil($max_count / $width); print " Reputation Article Count\n"; print "------------- -------", "-" x 50, "\n"; my @bins = sort {$a <=> $b} keys %bin_count; foreach my $bin (min(@bins)..max(@bins)) { my $count = $bin_count{$bin} || 0; my $extra = ($count % $scale) ? '.' : ''; my $start = $bin * $bin_size; my $end = $start + $bin_size - 1; printf "%4d .. %4d \[%4d\] %s$extra\n", $start, $end, $count, '#' x floor ($count / $scale); } print "\n Scale: #=$scale\n\n" if $scale > 1; } # This is essentially untouched. sub show_reps { my $hsummary = shift; print "\n"; printf (" User: %s\n", $hsummary->{username}); printf (" Total articles: %d\n", $hsummary->{articles}); printf (" Total reputation: %d\n", $hsummary->{reputation}); printf (" Min reputation: %d\n", $hsummary->{repmin}); printf (" Max reputation: %d\n", $hsummary->{repmax}); printf ("Average reputation: %3.2f\n", $hsummary->{average}); print "\n"; } sub summarize { my $user = shift; my $articles = shift; my @reps = map {$_->{rep}} values %$articles; unless (@reps) { confess("No articles found."); } my %summary = ( articles => scalar(@reps), repmax => max(@reps), repmin => min(@reps), reputation => sum(@reps), username => $user, ); $summary{average} = $summary{reputation} / $summary{articles}; return \%summary; } __END__ =head1 NAME t_statswhore.pl - get summary statistics =head1 SYNOPSIS t_statswhore.pl C<-u> <user> C<-p> <password> [C<-b> <bin size>] =head1 DESCRIPTION This program goes to www.perlmonks.org and generates summary statistics about your posts. This only works for your own account since reps are proprietary. If you wish you can embed your name and password into the code instead of passing them on the command line. =head1 DEPENDENCIES Quite a few. LWP::Simple, XML::Twig, List::Util, URI::Escape, and HTML::Entities. =head1 TIPS An interested developer should be able to fairly easily remove the dependency on XML::Twig. URI::Escape is largely gratuitous, and the main purpose of HTML::Entities is to make sure that people with strange names can have their home node identified and removed from the article list. =head1 AUTHOR AND COPYRIGHT Written by Ben Tilly based on the original by J.C.Wren. Copyright 2000, 2001 J.C.Wren jcwren@jcwren.com Copyright 2001 by Ben Tilly No rights reserved. But jcwren likes hearing about people using it.
I could only come up with 3 items
by spaz (Pilgrim) on Mar 18, 2001 at 08:56 UTC
    • Your MySQL table structure is pretty hard to read. Align the datatypes with tabs (like you would align a group of assignments) (btw, do that many of them need to be NOT NULL?)
    • There's a couple assignments that aren't aligned, you seem to be trying to keep that consistant
    • I try to format my code for an 80 character width display. Personal preference because it's a typical LCD
    -- Dave

      I prefer 72 columns... because there are times when it's good to print out code in hard copy, and even today, most teletypes^H^H^H^H^H^H^H^H^H printers are 72 columns wide.

      Besides, it'll make your source code fit on punch-cards.

Re: An exercise in "That's not the way I'd do it" (TNTWIDI)
by ColonelPanic (Friar) on Mar 18, 2001 at 10:07 UTC
    you use a lot of trailing control statements, such as:
    do_something() if (condition); do_something($_) foreach (@array);
    I use these frequently, too, but I would try to restrict their use in a program where maximum readability is a goal. People who are using Perl as their non-primary language can get thrown off by these. (I know, because I was in that category until recently.)

    The basic trailing if statements are straightforward, but a trailing foreach in conjunction with some fairly complex code can make things get messy. A regular foreach or if statement would allow things to spread out onto multiple lines, making things more readable.

    When's the last time you used duct tape on a duct? --Larry Wall
Re: An exercise in "That's not the way I'd do it" (TNTWIDI)
by chipmunk (Parson) on Mar 18, 2001 at 19:53 UTC
    I guess that the 'x' in xstatswhore.pl and xluke_repwalker.pl stands for XML. To me an 'x' prepended to the name of a program generally stands for the X Windows system (xemacs, xbiff, etc.). I would suggest renaming the new versions of these scripts to avoid confusion.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://65223]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2023-01-27 21:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?