http://www.perlmonks.org?node_id=440730

pboin has asked for the wisdom of the Perl Monks concerning the following question:

I've written modules for my own use, but I'm making the leap to something publicly available on CPAN. This brings a lot more complexity and responsibility, so I'd appreciate some advice.

Ultimately, this code will a) take a specially formatted input file and break pieces of it out to separate files, and b) offer a reverse function to bundle many text files into one.[1] For this discussion, just think of tar/untar and you'll basically be right on track.

Now, the question:

I'm planning on offering a pair of command-line scripts with my module in addition to the callable module. (Much like /usr/bin/module-starter that comes with Module::Starter.) However, when people call my routines from Perl, I don't really want to do any real I/O. That seems overzealous, and frankly, I don't want the responsibility of opening and writing to someone else's filesystem.

So, given that I'm passed an input filename and an output directory name, should I:

  • open the necessary files and write to them, assuming the caller knows what he's doing
  • return a hashref with output names as keys, and file contents as data values (caller can write them out himself)
  • something else?
[1] Technically, I'm working on replicas of IEBUPDTE and IEBPTPCH. They are utilities to let me work with mainframe PDS members locally. This is OS/390, z/OS stuff.
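
For concreteness, the two candidates might look roughly like this (module and function names are invented for illustration, not a settled API):

    use Your::Unbundler;    # hypothetical module name

    # a) the module does the I/O itself
    unbundle_to_dir( 'input.pds', '/some/output/dir' );

    # b) the module only parses; the caller writes the files out
    my $members = unbundle( 'input.pds' );    # { 'MEMBER1' => "...contents...", ... }
    for my $name ( keys %$members ) {
        open my $fh, '>', "/some/output/dir/$name" or die $!;
        print {$fh} $members->{$name};
        close $fh;
    }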

Replies are listed 'Best First'.
Re: Should Modules Do I/O?
by scmason (Monk) on Mar 18, 2005 at 16:30 UTC
    I think that you should offer both. If you know what you are doing, then make it so. If the user trusts you, they will let you write. If they do not trust you, they will write it themselves.

    Passing the data (or references to it) around can mean more lines of code and more chances for error and confusion. Also, if your module is supposed to handle this data, then the user expects that you will handle it. What would you think of tar if it only untarred the archive in memory and then gave you memory references? It would be less useful.

    Of course, you may also want to provide references to the data, so that the user can manipulate it before writing it to disk.

    This is my humble opinion. Hope that it helps.

Re: Should Modules Do I/O?
by BrowserUk (Patriarch) on Mar 18, 2005 at 18:07 UTC

    I think you asked a very good question and are thinking along the right lines.

    As counterpoint to other responses, I hate modules that insist on writing stuff to disk, and force me to re-open the output file* in order to get the data back into my program.

    Maybe

    • I only need 1 of the 100,000 files in the archive.
    • I don't have enough disk space to expand the whole archive and want to process the files one at a time.
    • I want to change the names of the files before they are written.
    • Or only write those that contain a given string.
    • Or I want to write 'head' and/or 'tail' utilities to apply to the files inside the archive.
    • Or I want to process a huge archive from an ftp stream and stop transferring when I find the file I want.
    • I need to create the file with special permissions or security attributes, or on a different machine.
    • Or...

    So, I'd infinitely prefer an interface that allowed me to supply an open filehandle for the archive file (new or existing), and methods for adding and retrieving files from the archive:

    Reading

    use Your::Module;

    open my $arc, 'ftp ftp://some.dot.com/pub/3GB.archive |' or die;
    local $/ = \65536;    ## Could your module handle this?

    my $arcObj = Your::Module->new( $arc );

    while( my( $name, $dataRef ) = $arcObj->next ) {
        if( $name =~ m[^file(\d+.type)$] and $$dataRef =~ m[this|that] ) {
            open my $out, '>', localtime . $1 or die $!;
            print $out $$dataRef;
            last;
        }
    }

    Writing

    use Win32API::File qw[ :all ];
    use Your::Module;

    my $hObject = CreateFile(
        '//?/UNC/Server/Share/Dir/File.Ext',
        FILE_READ_EA,
        FILE_SHARE_READ,
        pack( "L P i", 12, $pSecDesc, $bInheritHandle ),
        TRUNCATE_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN|FILE_FLAG_WRITE_THROUGH,
        SECURITY_IDENTIFICATION|SECURITY_IMPERSONATION,
        0
    ) or die $^E;

    OsFHandleOpen( FILE, $hObject, $sMode ) or die $^E;

    my $arcObj = Your::Module->new( \*FILE );

    opendir DIR, '//SERVER/DIR/';
    while( my $file = readdir DIR ) {
        open my $fh, '<', $file or die $!;
        $arcObj->addFile( "/DIR/$file", do{ local $/; <$fh> } );
    }
    close DIR;

    Providing that kind of flexibility for the users, combined with the reduction in code in your module, would make your module more powerful and useful.

    Especially as you're providing 'arcit.pl' and 'unarcit.pl' scripts for the simple case, thereby avoiding the "boilerplate code" charge.

    (*usually after patching the module to provide a way of finding out the filename)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
      I hate modules that insist on writing stuff to disk, and force me to re-open the output file in order to get the data back into my program

      That's why other Monks suggested that both interfaces be supported. Your post would make it three. But that's fine, as long as each one is documented and maintained. Choice is good. False dilemmas are bad.

        both interfaces be supported

        Providing both seems like YALORC to me: Yet Another Lump Of Redundant Code.

        Your post would make it three

        With the interface I describe, the other two interfaces can be trivially derived through subclassing, or just simple procedural wrappers.

        That could be done as a part of the module, but I see no value-add in that, as it is equally trivial for the user to do it themselves, and they can tailor it to their exact requirements instead of having to work around the supplied interface.

        Neither of the other two interfaces can easily be wrapped to provide the other, nor the one I described.
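
        For instance, the hashref-returning interface is only a few lines on top of the stream interface sketched above (a hypothetical wrapper, not something the module itself would need to ship):

        sub slurp_archive {
            my $fh = shift;
            my $arcObj = Your::Module->new( $fh );
            my %files;
            while( my( $name, $dataRef ) = $arcObj->next ) {
                $files{$name} = $$dataRef;
            }
            return \%files;    # output names as keys, contents as values
        }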


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?
Re: Should Modules Do I/O?
by Mugatu (Monk) on Mar 18, 2005 at 17:08 UTC
    return a hashref with output names as keys, and file contents as data values (caller can write them out himself)

    The only thing this would do is force most of your users to use the same boilerplate every time they used your module:

    while( my( $file, $data ) = each %returnedhash ) {
        open my $f, ">", $file or die "$!";
        print $f $data;
    }

    And we all know that boilerplate is generally bad. Modules are written to avoid boilerplate, not create more of it.

    If you're afraid that your module might occasionally generate invalid data, and you don't want to clobber their files with it, I can sort of understand your reluctance. But believe me, if you provide them with data and a filename, they will clobber their own files happily, and still blame you for any problems that may occur. The only difference is that they will be cursing you for an inconvenient interface as well. :-)

    If, on the other hand, you're confident in your module, then I don't see a problem. As long as the module is clearly documented to clobber files, you should have no reason to be concerned. To that end, your test suite should be heavily loaded towards pounding on the file IO code. This can at least provide a minimum of reassurance.
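
    One way to pound on the file I/O without risking anything real is to point the tests at a throwaway directory (a sketch using the core File::Temp and Test::More modules; the unbundle() call and file names are made up):

    use Test::More tests => 1;
    use File::Temp qw( tempdir );
    use Your::Module;

    my $dir = tempdir( CLEANUP => 1 );    # scratch directory, removed on exit
    Your::Module->unbundle( 't/data/sample.input', $dir );    # hypothetical call

    ok( -s "$dir/MEMBER1", 'first member was written and is non-empty' );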

    Update: of course, I agree with all the other fine Monks who recommend having both methods available.

      You know, that's exactly right (re: the boilerplate).

      I'd already thought that's what I'd do if I were the caller, but I didn't extend that thought to realize it's what everyone would do, so why make them all write it?

      Good point, and thanks. I'm working on the test suite now, just as an exercise to see how many ways I might want to call this thing.

Re: Should Modules Do I/O?
by duct_tape (Hermit) on Mar 18, 2005 at 16:33 UTC

    I personally do not see a problem with the module doing the I/O as long as it does proper error handling. Having an interface that is easy to use will make it more useful to people.

    One idea is that your routines could expect different parameters depending on how the user wants to handle the I/O. For example, someone could pass in a string with the file name if they don't mind you doing the I/O. Or they could pass in existing filehandles if they want to handle it themselves.
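
    A rough sketch of that sort of dispatch (a hypothetical constructor; the only point is the ref() check to tell a filehandle from a file name):

    sub new {
        my ( $class, $source ) = @_;
        my $fh;
        if ( ref $source eq 'GLOB' or ref \$source eq 'GLOB' ) {
            $fh = $source;      # caller handed us an existing filehandle
        }
        else {
            open $fh, '<', $source or die "Can't open $source: $!";
        }
        return bless { fh => $fh }, $class;
    }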

    And finally, one other thing that could be done is to have the method(s) return the data, but also provide methods that deal with reading and writing the data. That way users are not forced to use your I/O handling, but can add an extra line or two of code to use it if they want.

Re: Should Modules Do I/O?
by brian_d_foy (Abbot) on Mar 18, 2005 at 20:24 UTC

    I try to keep I/O out of my modules unless that is the main purpose of the particular function or method. If I have to issue a warning, I use one of the carp functions.

    If you want to do I/O, try to design it so the argument can deal with either a file or a filehandle (whether input or output). It's nice to be able to get the results back into a scalar sometimes, but if you don't get that far, somebody can use IO::Scalar.
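
    On reasonably recent perls the "results back into a scalar" case needs nothing extra if the module accepts a filehandle, since a handle can be opened on a scalar directly (a sketch; extract_to() and $obj are made up for illustration):

    my $output = '';
    open my $fh, '>', \$output or die $!;   # filehandle backed by a scalar (perl 5.8+)
    $obj->extract_to( $fh );                # made-up method that writes to any handle
    close $fh;
    # $output now holds the extracted data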

    I think most people will have an answer that parallels their thought about die-ing inside a module. I've decided that it's one of those landmine questions where most people already know what they think and will force the users of their code to do it their way. Their way usually optimizes the process for their initial use of it, and everything else has to work around that. It's mildly annoying, but that's life.

    --
    brian d foy <bdfoy@cpan.org>
Re: Should Modules Do I/O?
by Anonymous Monk on Mar 18, 2005 at 16:43 UTC
    If modules wouldn't do I/O, you wouldn't have DBI.pm, Tk.pm or CGI.pm, to name a few well known modules that do I/O.
Re: Should Modules Do I/O?
by Joost (Canon) on Mar 18, 2005 at 22:18 UTC
    One advantage of having direct access to the data from the API (i.e. not forcing users to use the filesystem) is that it's easier to write tests.

    Also, in my experience, writing tests for your module generally gives you a good indication of the quality of the API: if your tests look straightforward, the API is probably good. Update: or rather, if your tests look clumsy, the API is probably bad.

    For your specific instance, I would probably write a test like so (method and class-names made up):

    # using Test::More...
    my $input_string = "some input data";
    my $proc = Data::Processor->new( string => $input_string );
    my @out_filenames = $proc->names;
    is_deeply( \@out_filenames, [qw(some list of filenames)] );
    is( $proc->data($some_filename), "data for filename" );
    $proc->close();
    Then you can expand to direct-to-disk methods:
    my $proc = Data::Processor->new( file => "some filename" );    # or handle => \*DATA

    for my $name ( $proc->names ) {
        $proc->write( $some_filename, file => $name );    # or handle => \*STDOUT
    }

    I wouldn't return a hashref with all the filenames and data in it - chances are that users who want to access the data directly aren't interested in all the files, plus, if your archive is big, it would consume a lot of memory.

Re: Should Modules Do I/O?
by NateTut (Deacon) on Mar 18, 2005 at 17:49 UTC
    Do both. You never know what someone might want to do with the intermediate data, so leave the option of returning a hashref in there.

    However if you know most of the time the data will be written to a file go ahead and do it. Users of your module should test with it before using it on real data anyway. If they don't is that your fault?
Modules should do I/O when needed
by inq123 (Sexton) on Mar 19, 2005 at 14:32 UTC
    IMHO, modules should do I/O when needed. In your current case, returning a hashref with the file contents is doable because the expected file contents are small. However, there are situations where a module has to process files as big as several GB, and then it simply cannot return the file contents (unless it returns them bit by bit, which just adds more hassle for the calling program).
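
    If the bit-by-bit route is ever needed, a callback-style API keeps memory flat without much extra hassle for the caller (method and parameter names here are invented for the sketch):

    open my $out_fh, '>', 'huge.out' or die $!;
    $proc->extract(
        'HUGE.MEMBER',                # hypothetical member name
        on_chunk => sub {
            my ($chunk) = @_;         # e.g. 64KB at a time
            print {$out_fh} $chunk;   # the caller decides where each piece goes
        },
    );
    close $out_fh;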

    I also feel that the worry in the OP is somewhat contradictory: if one assumes that the caller knows how, whether, and where to open files and write out the contents of a hashref with the correct permissions, then how can one not assume that they know to send in a correct filename/path with the right permissions for your module to write to (after all, they would be doing the same thing in their calling program, right)?

    In all, I see very little justification for providing a hashref with file contents just so that users can deal with the I/O themselves (unless there are special reasons unnamed in your question).

Re: Should Modules Do I/O?
by nothingmuch (Priest) on Mar 23, 2005 at 15:09 UTC
    This kind of dilemma normally raises questions about how the problem could be broken down into better interfaces.

    Let's pretend I'm writing your module. This is what I would do:

    • figure out what I need to do - how are files broken up?
    • figure out how I want it to look on the outside - @new_files = break($file_name)?
    • try to find out where else this could be used - split on different strings? split in a different way? split filehandles? split arrays? split...? Brainstorm, and find out what your data really looks best as, in the context of well-known and well-accepted abstractions
    • figure out the lowest level at which I can deal with this
    • write the easy interface I wanted in bullet 2, and implement it using the low-level interface, writing the low-level interface at the same time. This serves as a sanity check for the low-level interface, and also gets me what I want
    What this yields: I usually give 120% effort for writing a simple thing, but when we look at it a while later, the first solution is about 300% more work, while additions to this solution take about 150%, a pretty good figure (imaginary, too... I don't know how big the difference actually is, I just feel that there is one).

    Anyway, I hope this helps.

    My conclusion: make two modules. One with an easy interface, and one with a flexible one. =)
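
    Roughly (module names and the names()/data() methods are invented for the sketch; the easy module is just a thin layer over the flexible one):

    package Your::Module::Simple;
    use strict;
    use Your::Module;    # the flexible, low-level interface

    sub break_file {     # @new_files = break_file( $file_name, $out_dir )
        my ( $file_name, $out_dir ) = @_;
        my $low = Your::Module->new( file => $file_name );
        my @written;
        for my $name ( $low->names ) {
            open my $fh, '>', "$out_dir/$name" or die $!;
            print {$fh} $low->data( $name );
            close $fh;
            push @written, "$out_dir/$name";
        }
        return @written;
    }

    1;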

    Update: I'd like to say that one of the things that /reaaally/ annoys me, btw, is that most of the parsing and archiving tools on CPAN take care of files for you so that it'd be "easy". Most of the stuff that handles that is done, and my typical usage is for things that haven't been done. I needed to parse many formats, and compress on the fly, and what not, on streams, and scalars, with callbacks, without, and so on. Normally I just look into the guts of the module and wait till my code breaks when the module is updated. Sometimes I just give up.

    Modularity is not worth anything without reusability, and in fact, encapsulating decisions which are not yours to encapsulate (but rather a FileManager module's responsibility) in a module makes things much harder than if they were not modularized.

    -nuffin
    zz zZ Z Z #!perl