http://www.perlmonks.org?node_id=501762

jimbus has asked for the wisdom of the Perl Monks concerning the following question:

OK,

Over my last several data processing scripts, I've been refining my style and noticed several patterns. Now I'm in the process of building a template script that should speed up the writing of future scripts. I'll update the older scripts too, so things stay consistent and easily maintained. And to that end I have a question or two.

Yesterday on the CB, we were discussing strict and the rule that globals are verboten except for those declared in packages... is that to say that it is acceptable to create a package for the purpose of housing global variables? I was standing in the shower thinking that declaring all of my general-use variables at the top would make the scripts look like my old Pascal programs, and it occurred to me that maybe I could put some of the things I use in every script (ftp handle, dbi handle, machine name, remote path, local path, etc.) in a package and neaten things up a bit.

All of my scripts pretty much do the same thing:

  • ftp connect to a remote machine
  • get a listing
  • download the files to the incoming directory
  • collate the data if necessary
  • establish DB connection
  • close connections
  • exit
  Does it make sense to build a generic script and force the differences out to functions in packages with overloaded names? For example, the template might call the sub "process_line", which would be a different sub based on which network element's package was loaded (NE::WSB, NE::SMSC, etc.).

    I haven't really thought this entirely through... but I've been thinking and that can be a dangerous thing :)

    Thanks

    Jimbus

    Never moon a werewolf!

    Re: Of strict, globals, packages and style
    by dragonchild (Archbishop) on Oct 20, 2005 at 19:06 UTC
      First off, you speak of "some of the things I use in every script". To me, that screams "config file". There are literally dozens of config file handlers on CPAN. My current favorite is Config::Std, but I've also used others, like Config::ApacheFormat, to good benefit. All that matters is that you pick one and are consistent.
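      To make the "config file" suggestion concrete, here is a minimal sketch using Config::Std (one of the modules named above). The file name, section names, and keys are all hypothetical, and this assumes Config::Std is installed from CPAN:

```perl
use strict;
use warnings;
use Config::Std;

# Write a tiny demo config file so the sketch is self-contained.
# In practice this file would live alongside your scripts.
open my $out, '>', 'jobs.cfg' or die $!;
print $out <<'CFG';
[ftp]
host: ftp.example.com
remote_path: /data/incoming

[db]
dsn: dbi:mysql:reports
CFG
close $out;

# Read it back; sections become first-level hash keys.
read_config 'jobs.cfg' => my %config;

my $host = $config{ftp}{host};    # 'ftp.example.com'
my $dsn  = $config{db}{dsn};      # 'dbi:mysql:reports'
```

      Each script then pulls its host, paths, and DSN from the one file instead of hard-coding them.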

      Second off, you say "my scripts pretty much do the same thing". Then you very nicely list each step and ask "Does it make sense to build a generic script ...?"

      My answer to that question is "No." You don't want to build a generic script, you want to build a general-purpose library. For example, you always set up an ftp connection first. Ok, create a connect_ftp() function and house it in some library.
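      A connect_ftp() in such a library might look like the sketch below. The argument names and the error-handling policy are hypothetical; only the Net::FTP calls themselves are real:

```perl
use strict;
use warnings;
use Net::FTP;

# One possible connect_ftp() for the shared library. Argument
# names are illustrative; adjust the error policy to taste.
sub connect_ftp {
    my (%args) = @_;

    my $ftp = Net::FTP->new( $args{host}, Timeout => $args{timeout} || 30 )
        or die "Cannot connect to $args{host}: $@";

    $ftp->login( $args{user}, $args{password} )
        or die "Login failed: " . $ftp->message;

    if ( $args{remote_path} ) {
        $ftp->cwd( $args{remote_path} )
            or die "cwd failed: " . $ftp->message;
    }
    return $ftp;
}

# A one-off script then composes only the pieces it needs, e.g.:
# my $ftp   = connect_ftp( host => 'ftp.example.com', user => 'reports' );
# my @files = $ftp->ls;
```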

      Why do it that way? Well, there's a lot of reasons. The biggest reason from a software engineering standpoint is testability of your library. You will want to verify that each piece works correctly on its own. You can't do that without a well-defined interface. But, for these little scripty-doos, you may not have the time for all that jazz.

      The most important reason for you is that you will be asked to write a quick one-off script to do something just like what your generic script would do, but ever so slightly different. Maybe, it needs to process files that have already been downloaded. Or, you might want to run the file processing without the DB connection. Or, you want to create a report from the DB. I don't know and, more importantly, neither do you. If you have all your different actions in a tinker-toy/lego type of setup, it becomes easy to write that one-off script.

      So, here's what I'd recommend:

      • Pick a config file module from CPAN and use it religiously.
      • Create a repository of useful functions. Call it "Useful::Stuff" (filename of Useful/Stuff.pm). It will look something like:
        package Useful::Stuff;

        use 5.6.0;
        use strict;
        use warnings;

        our $VERSION = 0.01;

        use base 'Exporter';
        our @EXPORT_OK = qw( connect_ftp );

        sub connect_ftp {
        }

        1;
        __END__

        =head1 NAME

        =cut
      • Within this library, use as many CPAN modules as you can. You don't want to write anything more than you have to.
      • Note the POD at the end. Document the crap out of this thing.
      • Note the $VERSION variable at the top. You will want this thing in version control.
      • Don't export variables. Only export functions.
      • Use @EXPORT_OK, but provide useful %EXPORT_TAGS so you can do something like use Useful::Stuff ':standard'; and have a standard import list.
      • I'd put this into a distribution suitable for CPAN (though you won't upload it). That way, when you install it on a new box, it will take care of all the prerequisites (like DBI, DBD::whatever, Net::FTP, etc). I'd use Module::Build and avoid ExtUtils::MakeMaker like the plague.
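      The %EXPORT_TAGS bullet above can be sketched like this (the function names are hypothetical; the Exporter mechanics are real):

```perl
package Useful::Stuff;
use strict;
use warnings;
use base 'Exporter';

# Everything exportable goes in @EXPORT_OK; tags group them so
# callers can pull in a standard set with one import.
our @EXPORT_OK   = qw( connect_ftp connect_db read_cfg );
our %EXPORT_TAGS = (
    standard => [ qw( connect_ftp connect_db ) ],
    all      => \@EXPORT_OK,
);

sub connect_ftp { }
sub connect_db  { }
sub read_cfg    { }

package main;
# Equivalent to `use Useful::Stuff ':standard';` when the module
# lives in its own file:
Useful::Stuff->import(':standard');
# connect_ftp() and connect_db() are now imported; read_cfg() is not.
```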

      Please ask if you want more information on any point above or if you have questions about what you're doing. :-)


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
        First off, you speak of "some of the things I use in every script". To me, that screams "config file".
        You don't want to build a generic script, you want to build a general-purpose library.

        And, after he has done that, he wants to build a generic script using that library. Such a tool would be the sensible place to make use of the aforementioned config file, by the way.

        Generally you want to avoid creating a multitude of tools that all do essentially the same thing, for all the same reasons you avoid duplicating code elsewhere. It's often easier to cp mytool mynewtool and start making changes, but in the long run you end up with a messy toolbox filled with buggy, poorly documented, outdated, narrow-purpose tools.

        In other words, create power tools. When it comes to software, it doesn't make sense to build the equivalent of 10 different screwdrivers and a hand-drill when you can build a power drill that takes options like --bit-size, --masonry, and --phillips-head. When you do build multiple tools, build them for different purposes. In the OP's case, for example, it might make sense to have one tool that retrieves the files and another to collate the data and insert it into the DB.
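        The "power drill" idea amounts to one tool whose behavior is chosen by options. A sketch with Getopt::Long, using the option names from the analogy (the demo input line is only there to make the sketch self-running):

```perl
use strict;
use warnings;
use Getopt::Long;

# One tool, many behaviors: options select what the run does,
# instead of ten near-identical copies of the script.
@ARGV = ( '--bit-size=10mm', '--masonry' ) unless @ARGV;   # demo input

GetOptions(
    'bit-size=s'    => \my $bit_size,
    'masonry'       => \my $masonry,
    'phillips-head' => \my $phillips,
) or die "usage: drill [--bit-size=N] [--masonry] [--phillips-head]\n";

print "drilling with $bit_size bit\n" if defined $bit_size;
print "masonry mode\n"                if $masonry;
```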

        -sauoq
        "My two cents aren't worth a dime.";
        
    Re: Of strict, globals, packages and style
    by jdporter (Paladin) on Oct 20, 2005 at 19:00 UTC

      I think it's an excellent idea. What you've hit upon is called the Strategy design pattern. I like this pattern and I use it a lot. The fact that you invented it independently just proves that it's a real pattern. :-)

      The outline you illustrated is a framework. The dbi handle, ftp handle, etc. exist as Singletons within that framework. When you call the strategized methods, they're being called in the context of the framework, so you should probably pass a handle to that context to the called method. Using that handle, the method can get at the singleton objects that live in the framework. When you've achieved this, you won't need any real global data at all. (The "global" data can easily be limited to lexicals in the main program file.)
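      A bare-bones sketch of that context-handle idea, with every name hypothetical (the handles here are strings standing in for real DBI/Net::FTP objects):

```perl
use strict;
use warnings;

# The framework owns the shared (singleton) handles; strategized
# methods receive the framework object and reach them through it,
# so no package globals are needed.
package Framework;

sub new {
    my ($class, %args) = @_;
    my $self = { dbh => $args{dbh}, ftp => $args{ftp} };
    return bless $self, $class;
}
sub dbh { $_[0]{dbh} }
sub ftp { $_[0]{ftp} }

package NE::WSB;

sub process_line {
    my ($class, $ctx, $line) = @_;
    my $dbh = $ctx->dbh;   # shared resource via the context handle
    return uc $line;       # stand-in for real per-element logic
}

package main;

my $ctx = Framework->new( dbh => 'fake-dbh', ftp => 'fake-ftp' );
my $out = NE::WSB->process_line( $ctx, 'cdr record' );
```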

      We're building the house of the future together.
    Re: Of strict, globals, packages and style
    by saberworks (Curate) on Oct 20, 2005 at 18:36 UTC
      It sounds like you have a pretty good idea. If all your scripts follow the same format, and the only differences are config (which server to connect to, which files to process, which database tables to insert into) and the single process_line() function, it sounds like you need a parent class, which of course sets up the db/ftp connections and whatnot (possibly from command-line options or config files) and does most of the rest of the work. Each of your child modules will simply inherit from the base class and define a process_line() function. So that will be the only difference, module to module.

      It doesn't even sound like you need separate scripts; you can simply have one script that takes an argument for which child module it will activate, like:
      my_script.pl --type=WSB
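      A sketch of that parent class with --type dispatch, all package and method names hypothetical:

```perl
use strict;
use warnings;

# The base class does the common work; each child overrides
# only process_line().
package Reports::Base;

sub new { bless {}, shift }

sub run {
    my ($self, @lines) = @_;
    # ... in real code: connect ftp/db, fetch files, etc. ...
    return map { $self->process_line($_) } @lines;
}

sub process_line { die "subclass must override process_line" }

package Reports::WSB;
our @ISA = ('Reports::Base');

sub process_line {
    my ($self, $line) = @_;
    return "WSB:$line";
}

package main;

# Dispatch on --type=WSB by building the class name:
my $type  = 'WSB';
my $class = "Reports::$type";
my @out   = $class->new->run( 'a', 'b' );
```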
    Re: Of strict, globals, packages and style
    by amw1 (Friar) on Oct 20, 2005 at 18:38 UTC
      I think it does. We do similar things here.
      example code of the type of thing we do:
      # All of the packages are Foo::<something>
      my $BASE_PACKAGE = "Foo";

      # Type is used as the package name
      $type_string =~ /^TYPE:\s*(.+)$/;
      my $type = $1;

      my $sub = LoadPlugin($type, "process_line");
      &{$sub}($arg1, $arg2, ...);

      sub LoadPlugin {
          my $package = shift();
          my $routine = shift();

          # construct Foo::$type
          my $plugin = $BASE_PACKAGE . "::" . $package;

          # try to load it
          eval "require $plugin";
          if ($@) {
              die "Couldn't load $plugin";
          }

          # if we've loaded successfully, return a
          # ref to the function
          my $sub = \&{$plugin . '::' . $routine};
          return $sub;
      }
      This isn't the actual code we use, but the important points are here. It has worked pretty well for us.
    Re: Of strict, globals, packages and style
    by ioannis (Abbot) on Oct 20, 2005 at 23:26 UTC
      In my experience, the best tools are not the elegant tools. Keep it simple! It might not seem elegant now, but six months later you will still be able to quickly understand your code. Who wants to review (more) lengthy code just because it was written 'the right way'?
        In my experience, the elegant tools are simple. Convoluted is never elegant.

        Caution: Contents may have been coded under pressure.
        Simple is all nice and good as long as you really need to do only simple things. In the long term, if you don't do it "the right way", you'll end up with a mess of spaghetti code and tons of bugs. I know because I did it ;)
    Re: Of strict, globals, packages and style
    by Moron (Curate) on Oct 21, 2005 at 09:58 UTC
      I am not saying you will automatically suffer from the following problems, but these kinds of downsides to the template approach occur with such regularity that they long since influenced me to start every program from a completely blank sheet and only later go through my standards checklist:

      1) What may often seem a good idea, even to the extent that it deserves biblical reverence from users of this site, may still be inappropriate in some (even common) situations. Familiar cases of this include:

      a) a one-liner perl -e used in a shell script is usually too simple to be improved by the techniques you'd apply to a fully-fledged script. The following example from a real system (a quick XML out-filter/data counter), if padded out with 'use strict' and 'use warnings', would only tend to frighten any non-Perl-proficient colleagues who might have to support it into quickly concluding that Perl must be really dangerous and that perhaps a reversion to awk would be a better idea, just in case!

      #!/bin/sh
      # ... intervening code ...
      export CountDataLines=`perl -e 'while(<>){/^\s*</ or $notXML++;} print $notXML;' < $FileName`
      b) use/require Module/File (!!!)

      Rather than just using everything that has ever been used somewhere in your project's (or even your entire department's) history, better performance and fitness for purpose can be achieved by consciously selecting libraries to match the actual usage of the file you are working on.

      2) Cut-and-paste-o-mania. It might appear superficially that reusing tested code must be a good approach, but the downside to look out for comes in two deadly forms:

      a) it encourages the use of an algorithm without understanding it - a common recipe for disaster.

      b) it often creates more bugs, both at parse time and at runtime, than if you'd started from scratch, because in reality more can go wrong than you might imagine (different names, different dependencies, different looping/storage methods) when cutting code from one file and pasting it into another - even when the functional purpose is identical and even when the first file is as 'clean' as anything ever written!

      (Update: I just realised that using my in the main program doesn't qualify as a global and have had to modify the following...) In regard to globals, only when their need is (update: on the face of it) detected do I introduce the first and last of them (update: actually I realise I don't!) - a global (update: my-scoped, in main) hash that can serve as a global dictionary whose reference can be passed around - in other words it is (er, actually not) declared global and never used as such apart from near the top of the main program...

      my %gdd;
      Init( \%gdd );
      ...so that the subroutine Init begins (update: I have fleshed this out with some examples now):
      sub Init {
          # example of using a reference to a unique global dictionary
          my $gref = shift;

          open my $lh, ">>$ENV{LOGFILE}" or Die( "$!: $ENV{LOGFILE}" );
          $gref->{FH}{LOGFILE} = $lh;
          # ...

          # example of initialising specific hash levels of a gdd
          # using a locally scoped reference:
          for my $streamType ( qw( GSM SMS MMS ESP GPRS ) ) {
              $gref->{ST}{$streamType}{BILLABLE} =
                  ( $streamType eq 'ESP' ) ? 0 : 1;
              $gref->{ST}{$streamType}{BYDURATION} =
                  ( $streamType eq 'GSM' ) ? 1 : 0;
              $gref->{ST}{$streamType}{BYVOLUME} =
                  ( $streamType eq 'MMS' )  ? 1
                : ( $streamType eq 'GPRS' ) ? 1
                :                             0;
          }
      }

      # example of using the FH for a logfile somewhere else in codeland...
      sub SomewhereSomeModule {
          my $gref = shift;
          # ...
          Log( $gref, 'description of functional phase' );
          # ...
      }

      sub Log {
          my ( $gref, $msg ) = @_;
          my $lh = $gref->{FH}{LOGFILE};
          print $lh MyTimeStamp() . ": $msg\n";
      }
      This is more manageable than having explicit global usages, which you have to remember, mixed up with other scoped identifiers wherever you are in your sources. Update: and I subsequently realised it isn't even global.

      More Update: I suppose I should say how it is then possible to remain optimally efficient and effective:

      1) the template can safely include a module/program comment heading plus use strict and use warnings.

      2) the creation of package(s) to keep common functionality is far better than using templates for functional purposes - if there is a single sequence of actions required to do something in several places, why not have a single parameter-driven method that goes through the sequence in just one code location?

      -M

      Free your mind

    Re: Of strict, globals, packages and style
    by jimbus (Friar) on Oct 21, 2005 at 14:58 UTC

      Roy Johnson opened a whole other can of worms for me on the ChatterBox; he suggested that I was starting to think in an object-oriented way...

      I've done a bit of OO programming with Java, and I've always struggled because I try to make monolithic objects that do everything and end up frustrated when my patterns don't work.

      For example, in this instance I was focusing on the data I was processing, and I could probably make an object model out of it, but that would blow away all of my existing logic. But if I make my processes the objects and pass the data between them in a standardized format (hashes or arrays, depending on the situation), it seems to work really well:

    • $fp = Reports::XXX::FileProcessor->new
    • $fp->getConfig
    • $fp->getFiles
    • $lp = Reports::XXX::LineProcessor->new
    • while ($file = $fp->next)
      • $lp->process($file)
    • $dp = Reports::XXX::DataProcessor->new
    • while ($data = $dp->next)
      • $dp->store($data)
    • $dp->cleanup
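      The bullet outline above, sketched as straight-line Perl. The Reports::XXX::* classes below are throwaway stubs standing in for the real modules, so every name here is hypothetical:

```perl
use strict;
use warnings;

# Stub classes so the flow below actually runs; real versions
# would do the ftp/parse/DB work.
{
    package Reports::SMSC::FileProcessor;
    sub new       { bless { files => [ 'a.cdr', 'b.cdr' ] }, shift }
    sub getConfig { }
    sub getFiles  { }
    sub next      { shift @{ $_[0]{files} } }

    package Reports::SMSC::LineProcessor;
    sub new     { bless {}, shift }
    sub process { push @main::processed, $_[1] }

    package Reports::SMSC::DataProcessor;
    sub new     { bless { data => [ @main::processed ] }, shift }
    sub next    { shift @{ $_[0]{data} } }
    sub store   { push @main::stored, $_[1] }
    sub cleanup { }
}

my $ne = 'SMSC';    # network element, e.g. from a --type option

my $fp = "Reports::${ne}::FileProcessor"->new;
$fp->getConfig;
$fp->getFiles;

my $lp = "Reports::${ne}::LineProcessor"->new;
while ( my $file = $fp->next ) {
    $lp->process($file);
}

my $dp = "Reports::${ne}::DataProcessor"->new;
while ( my $data = $dp->next ) {
    $dp->store($data);
}
$dp->cleanup;
```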

      XXX being the NetworkElement I'm working on that would be passed in like one of the suggestions above.

      Anyhow, thanks for all the input, there is a lot of data and varying opinions to process here. Unfortunately, I don't have a lot of time to tarry on this (I'm actually operations, not dev), so I'll probably close my eyes and pick one :)

      thanks again

      Never moon a werewolf!