Improving reproducibility and record-keeping with Log::Reproducible (create or re-run archive of script parameters, git snapshot, etc.)

by frozenwithjoy (Curate)
on Mar 25, 2014 at 06:08 UTC (#1079643=CUFP)

I wrote a Perl module, Log::Reproducible, to help improve reproducibility (and record-keeping) of analyses. The code, README, and (some) tests are hosted on GitHub. I'm posting the README below. It contains relevant code snippets and describes how to use Log::Reproducible. If interested, please take a look at the code on GitHub. I'm aiming to submit to CPAN and would really appreciate any feedback.

EDIT: Added code for module, too (at very bottom)

EDIT #2: Thanks to a suggestion by DrHyde in a GitHub issue, I've added some Perl-related info to the archives: version, path to the perl binary that was used, and @INC. It has already been pushed to the develop branch. Next, I'll update the module so that current vs archived Perl info is compared when reproducing an archive. In the event that the info doesn't match, the script will bail or the user will be prompted whether or not to continue.
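As a rough illustration of the Perl-related info mentioned in EDIT #2, the sketch below collects the interpreter version, the path to the perl binary, and @INC using Perl's built-in variables. The function name and the tag names (PERLVERSION, PERLPATH, PERLINC) are hypothetical and are not taken from the module's develop branch.

```perl
use strict;
use warnings;

# Collect details about the running interpreter as archive-style lines.
# The tag names used here are made up for this sketch.
sub perl_info_lines {
    my @lines = (
        sprintf( "#PERLVERSION: %vd", $^V ),    # e.g. 5.x.y
        "#PERLPATH: $^X",                       # perl binary in use
    );
    push @lines, "#PERLINC: $_" for @INC;       # one line per @INC entry
    return @lines;
}

print "$_\n" for perl_info_lines();
```

Comparing these lines against the ones stored in an older archive would be one way to detect a changed Perl environment before reproducing a run.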

Note: I've tested this module alongside other modules that use and/or manipulate @ARGV and have not found any conflicts as long as Log::Reproducible is imported before the other modules.


TAG LINE: Increase your reproducibility with the Perl module Log::Reproducible. Set it and forget it... until you need it!

MOTIVATION: In science (and probably any other analytical field), reproducibility is critical. If an analysis cannot be faithfully reproduced, it was arguably a waste of time. Log::Reproducible provides effortless record keeping of the conditions under which scripts are run and allows easy replication of those conditions.

Usage

Creating Archives

Just add a single line near the top of your Perl script before accessing @ARGV, calling a module that manipulates @ARGV, or processing command line options with a module like Getopt::Long:

use Log::Reproducible;

That's all!
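To see why the import has to come before option processing, here is a minimal sketch that does not use Log::Reproducible at all. It simulates a command line, takes a snapshot of @ARGV, and then lets Getopt::Long parse it. The option names (--alpha, --beta) and the hard-coded @ARGV are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use Getopt::Long;

# Simulated command line; a real script would receive this from the shell.
@ARGV = qw(--alpha 1 --beta 2 OTHER ARGUMENTS);

# Snapshot taken first, while every option is still present.
my @snapshot = @ARGV;

# Getopt::Long removes the options it recognizes from @ARGV.
GetOptions( 'alpha=i' => \my $alpha, 'beta=i' => \my $beta )
    or die "Bad options\n";

print "Before parsing: @snapshot\n";
print "After parsing:  @ARGV\n";
```

Anything that records the command line after GetOptions has run would only see the leftover arguments, which is why the logging module needs to look at @ARGV first.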

Now, every time you run your script, the command line options and other arguments passed to it will be archived in a simple log file whose name reflects the script and the date/time it began running.

Other Archive Contents

Also included in the archive are (in order):

  • custom notes, if provided (see Adding Archive Notes, below)
  • the date/time
  • the working directory
  • the directory containing the script
  • git repository info, if applicable (see Git Repo Info, below)

For example, running the script sample.pl would result in an archive file named rlog-sample.pl-YYYYMMDD.HHMMSS.

If it was run as perl bin/sample.pl -a 1 -b 2 -c 3 OTHER ARGUMENTS, the contents of the archive file would be:

sample.pl -a 1 -b 2 -c 3 OTHER ARGUMENTS
#WHEN: YYYYMMDD.HHMMSS
#WORKDIR: /path/to/working/dir
#SCRIPTDIR: bin (/path/to/working/dir/bin)
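If you want to inspect archives programmatically, something like the following sketch might be a starting point. It assumes the layout used by the module code in this post: the command on the first line, followed by one "#KEY: value" comment per line. The function name parse_archive is hypothetical and is not part of Log::Reproducible.

```perl
use strict;
use warnings;

# Split archive text into the command (first line) and "#KEY: value" metadata.
sub parse_archive {
    my ($text) = @_;
    my ( $cmd, @rest ) = split /\n/, $text;
    my %meta;
    for my $line (@rest) {
        next unless $line =~ /^#([A-Z]+): ?(.*)$/;
        push @{ $meta{$1} }, $2;    # keys such as GITSTATUS can repeat
    }
    return ( $cmd, \%meta );
}

my $archive = <<'END';
sample.pl -a 1 -b 2 -c 3 OTHER ARGUMENTS
#WHEN: YYYYMMDD.HHMMSS
#WORKDIR: /path/to/working/dir
#SCRIPTDIR: bin (/path/to/working/dir/bin)
END

my ( $cmd, $meta ) = parse_archive($archive);
print "Command: $cmd\n";
print "$_: @{ $meta->{$_} }\n" for sort keys %$meta;
```

Values are stored as array references because multi-line fields (notes, git status, diffs) appear as several lines sharing the same key.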

Reproducing an Archived Analysis

To reproduce an archived run, all you need to do is run the script followed by --reproduce and the path to the archive file. For example:

perl sample.pl --reproduce rlog-sample.pl-YYYYMMDD.HHMMSS

This results in:

  1. The script being executed with the command line options and arguments used in the original archived run
  2. The creation of a new archive file identical to the older one (except with an updated date/time in the archive filename)
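Reproducing a run boils down to reading the first line of the archive and turning it back into an argument list. The sketch below shows one way to do that, loosely modeled on the regex in the module code. Note that stripping the surrounding quotes is a choice made for this sketch (the posted module keeps them), and split_archived_command is a hypothetical name.

```perl
use strict;
use warnings;

# Break an archived command line into program name and arguments.
# Quoted chunks stay together; the surrounding quotes are removed here.
sub split_archived_command {
    my ($cmd) = @_;
    my @tokens = $cmd =~ /('[^']*'|"[^"]*"|\S+)/g;
    s/^(['"])(.*)\1$/$2/s for @tokens;
    my ( $prog, @args ) = @tokens;
    return ( $prog, @args );
}

my ( $prog, @args )
    = split_archived_command(q{sample.pl -a 1 --name 'two words' OTHER});
print "Program: $prog\n";
print "Arg: [$_]\n" for @args;
```

This is deliberately simple: it does not handle escaped quotes or quotes embedded in the middle of an argument.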

Adding Archive Notes

Notes can be added to an archive using --repronote:

perl sample.pl --repronote 'This is a note'

If the note contains spaces, it must be surrounded by quotes.

Notes can span multiple lines:

perl sample.pl --repronote "This is a multi-line note:
The moon had
a cat's mustache
For a second
from Book of Haikus by Jack Kerouac"
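Judging from the module code in this post, each line of a note ends up as its own "#NOTE: ..." line in the archive. The following sketch reproduces that idea in isolation; comment_lines is a hypothetical helper, not the module's own function.

```perl
use strict;
use warnings;

# Turn possibly multi-line text into one "#TITLE: ..." line per input line.
sub comment_lines {
    my ( $title, $text ) = @_;
    return () unless defined $text;    # no note given, nothing to write
    return map { sprintf '#%s: %s', $title, $_ } split /\n/, $text;
}

my $note = "This is a multi-line note:\nSecond line\nThird line";
print "$_\n" for comment_lines( 'NOTE', $note );
```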

Where are the Archives Stored?

When creating or reproducing an archive, a status message gets printed to STDERR indicating the archive's location. For example:

Reproducing archive: /path/to/repro-archive/rlog-sample.pl-20140321.144307
Created new archive: /path/to/repro-archive/rlog-sample.pl-20140321.144335

Default

By default, runs are archived in a directory called repro-archive that is created in the current working directory (i.e., whichever directory you were in when you executed your script).

Global

You can set a global archive directory with the environment variable REPRO_DIR. Just add the following line to ~/.bash_profile:

export REPRO_DIR=/path/to/archive

Script

You can set a script-level archive directory by passing the desired directory when importing the Log::Reproducible module:

use Log::Reproducible '/path/to/archive';

This approach overrides the global archive directory settings.

Via Command Line

You can override all other archive directory settings by passing the desired directory on the command line when you run your script:

perl sample.pl --reprodir /path/to/archive
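The order of precedence described in this section (command line, then script-level import argument, then REPRO_DIR, then the default) can be sketched as a small function. The name resolve_archive_dir and its named parameters are hypothetical; the inputs are passed in explicitly here so the logic is easy to test, whereas the module reads @ARGV and %ENV directly.

```perl
use strict;
use warnings;

# Pick the archive directory: command line first, then the import
# argument, then REPRO_DIR, and finally ./repro-archive as the default.
sub resolve_archive_dir {
    my (%opt) = @_;
    for my $source (qw(cli import env)) {
        return $opt{$source} if defined $opt{$source};
    }
    return "$opt{cwd}/repro-archive";
}

print resolve_archive_dir(
    cli    => undef,
    import => '/script/level',
    env    => '/global',
    cwd    => '/work',
), "\n";
```

In this call no command-line directory is given, so the script-level directory wins over the global one.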

Git Repo Info

PSA: If you are writing, editing, or even just using Perl scripts and you are at all concerned about reproducibility, you should be using git (or another version control system)!

If git is installed on your system and your script resides within a git repository, a useful collection of info about the current state of the git repository will be included in the archive:

  • Current branch
  • Truncated SHA1 hash of most recent commit
  • Commit message of most recent commit
  • List of modified, added, removed, and unstaged files
  • A summary of changes to previously committed files (both staged and unstaged)

An example of the git info from an archive:

#GITCOMMIT: develop f483a06 Awesome commit message
#GITSTATUS: M  staged-modified-file
#GITSTATUS:  M unstaged-modified-file
#GITSTATUS: A  newly-added-file
#GITSTATUS: ?? untracked-file
#GITDIFFSTAGED: diff --git a/staged-modified-file b/staged-modified-file
#GITDIFFSTAGED: index ce2f709..a04c0f6 100644
#GITDIFFSTAGED: --- a/staged-modified-file
#GITDIFFSTAGED: +++ b/staged-modified-file
#GITDIFFSTAGED: @@ -1,3 +1,3 @@
#GITDIFFSTAGED:  An unmodified line
#GITDIFFSTAGED: -A deleted line
#GITDIFFSTAGED: +An added line
#GITDIFFSTAGED:  Another unmodified line
#GITDIFF: diff --git a/unstaged-modified-file b/unstaged-modified-file
#GITDIFF: index ce2f709..a04c0f6 100644
#GITDIFF: --- a/unstaged-modified-file
#GITDIFF: +++ b/unstaged-modified-file
#GITDIFF: @@ -1,3 +1,3 @@
#GITDIFF:  An unmodified line
#GITDIFF: -A deleted line
#GITDIFF: +An added line
#GITDIFF:  Another unmodified line

If you are familiar with git, you will be able to figure out that the git repository is on the develop branch and the most recent commit (f483a06) has the message: "Awesome commit message".

In addition to a newly added file and an untracked file, there are two previously-committed modified files. One modified file has subsequently been staged (staged-modified-file) and the other is unstaged (unstaged-modified-file). Both modified files have had A deleted line replaced with An added line.

For most purposes, you might not require all of this information; however, if you need to determine the conditions that existed when you ran a script six months ago, these details could be critical!
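For readers less familiar with the short status format, the sketch below decodes lines as printed by git status --short. Each line starts with a two-character code: the first character describes the index (staged changes) and the second the working tree (unstaged changes). The function classify_status is hypothetical, and it simplifies things by reporting a file with both staged and unstaged changes as staged only.

```perl
use strict;
use warnings;

# Classify a `git status --short` line by its two-character XY code:
# X describes the index (staged), Y the working tree (unstaged).
sub classify_status {
    my ($line) = @_;
    my ( $x, $y, $file ) = $line =~ /^(.)(.) (.+)$/ or return;
    return ( untracked => $file ) if "$x$y" eq '??';
    return ( staged    => $file ) if $x ne ' ';
    return ( unstaged  => $file );
}

for my $line (
    'M  staged-modified-file',
    ' M unstaged-modified-file',
    'A  newly-added-file',
    '?? untracked-file',
    )
{
    my ( $kind, $file ) = classify_status($line);
    print "$kind: $file\n";
}
```

When reading these codes back out of an archive, remember that the "#GITSTATUS: " prefix has to be removed first, and the leading space of an unstaged entry must be preserved.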


The Code

package Log::Reproducible;
use strict;
use warnings;
use autodie;
use feature 'say';
use Cwd;
use File::Path 'make_path';
use File::Basename;
use POSIX qw(strftime);

# TODO: Add verbose (or silent) option
# TODO: Standalone script that can be used upstream of any command line functions
# TODO: Allow customization of --repronote/--reprodir/--reproduce upon import (to avoid conflicts or just shorten)

# Runs at compile time when the calling script says "use Log::Reproducible"
sub import {
    my ( $pkg, $dir ) = @_;
    reproduce($dir);
}

sub _first_index (&@) {    # From v0.33 of the wonderful List::MoreUtils
    my $f = shift;         # https://metacpan.org/pod/List::MoreUtils
    foreach my $i ( 0 .. $#_ ) {
        local *_ = \$_[$i];
        return $i if $f->();
    }
    return -1;
}

sub reproduce {
    my $dir = shift;

    $dir = _set_dir($dir);
    make_path $dir;

    my ( $prog, $prog_dir, $cmd, $note ) = _parse_command();
    my ( $repro_file, $now ) = _set_repro_file( $dir, $prog );

    if ( $cmd =~ /\s-?-reproduce\s+(\S+)/ ) {
        my $old_repro_file = $1;
        $cmd = _reproduce_cmd( $prog, $old_repro_file, $repro_file );
    }
    _archive_cmd( $cmd, $repro_file, $note, $prog_dir, $now );
}

# Precedence: --reprodir, then import argument, then REPRO_DIR, then default
sub _set_dir {
    my $dir     = shift;
    my $cli_dir = _get_repro_arg("reprodir");

    if ( defined $cli_dir ) {
        $dir = $cli_dir;
    }
    elsif ( !defined $dir ) {
        if ( defined $ENV{REPRO_DIR} ) {
            $dir = $ENV{REPRO_DIR};
        }
        else {
            my $cwd = getcwd;
            $dir = "$cwd/repro-archive";
        }
    }
    return $dir;
}

sub _parse_command {
    my $note = _get_repro_arg("repronote");

    # Re-quote arguments containing whitespace so they survive archiving
    for (@ARGV) {
        $_ = "'$_'" if /\s/;
    }
    my ( $prog, $prog_dir ) = fileparse $0;
    my $cmd = join " ", $prog, @ARGV;
    return $prog, $prog_dir, $cmd, $note;
}

# Remove "--$repro_arg VALUE" from @ARGV and return VALUE (if present)
sub _get_repro_arg {
    my $repro_arg = shift;
    my $arg;
    my $arg_idx = _first_index { $_ =~ /^-?-$repro_arg$/ } @ARGV;
    if ( $arg_idx > -1 ) {
        $arg = $ARGV[ $arg_idx + 1 ];
        splice @ARGV, $arg_idx, 2;
    }
    return $arg;
}

sub _set_repro_file {
    my ( $dir, $prog ) = @_;
    my $now = strftime "%Y%m%d.%H%M%S", localtime;
    my $repro_file = "$dir/rlog-$prog-$now";
    return $repro_file, $now;
}

sub _reproduce_cmd {
    my ( $prog, $old_repro_file, $repro_file ) = @_;

    die "Reproducible archive file ($old_repro_file) does not exist.\n"
        unless -e $old_repro_file;
    open my $old_repro_fh, "<", $old_repro_file;
    my $cmd = <$old_repro_fh>;    # the command is the first line of the archive
    close $old_repro_fh;
    chomp $cmd;

    my ( $old_prog, @args )
        = $cmd =~ /((?:\'[^']+\')|(?:\"[^"]+\")|(?:\S+))/g;
    @ARGV = @args;
    say STDERR "Reproducing archive: $old_repro_file";
    _validate_prog_name( $old_prog, $prog, @args );
    return $cmd;
}

sub _archive_cmd {
    my ( $cmd, $repro_file, $note, $prog_dir, $now ) = @_;

    my ( $gitcommit, $gitstatus, $gitdiff_cached, $gitdiff )
        = _git_info($prog_dir);

    my $cwd = cwd;
    my $full_prog_dir = $prog_dir eq "./" ? $cwd : "$cwd/$prog_dir";
    $full_prog_dir = "$prog_dir ($full_prog_dir)";

    open my $repro_fh, ">", $repro_file;
    say $repro_fh $cmd;
    _add_archive_comment( "NOTE",          $note,           $repro_fh );
    _add_archive_comment( "WHEN",          $now,            $repro_fh );
    _add_archive_comment( "WORKDIR",       $cwd,            $repro_fh );
    _add_archive_comment( "SCRIPTDIR",     $full_prog_dir,  $repro_fh );
    _add_archive_comment( "GITCOMMIT",     $gitcommit,      $repro_fh );
    _add_archive_comment( "GITSTATUS",     $gitstatus,      $repro_fh );
    _add_archive_comment( "GITDIFFSTAGED", $gitdiff_cached, $repro_fh );
    _add_archive_comment( "GITDIFF",       $gitdiff,        $repro_fh );
    close $repro_fh;
    say STDERR "Created new archive: $repro_file";
}

sub _git_info {
    my $prog_dir = shift;
    return if `which git` eq '';

    my $gitbranch = `cd $prog_dir; git rev-parse --abbrev-ref HEAD 2>&1;`;
    return if $gitbranch =~ /fatal: Not a git repository/;
    chomp $gitbranch;

    my $gitlog         = `cd $prog_dir; git log -n1 --oneline;`;
    my $gitcommit      = "$gitbranch $gitlog";
    my $gitstatus      = `cd $prog_dir; git status --short;`;
    my $gitdiff_cached = `cd $prog_dir; git diff --cached;`;
    my $gitdiff        = `cd $prog_dir; git diff;`;
    return $gitcommit, $gitstatus, $gitdiff_cached, $gitdiff;
}

# Write each line of $comment to the archive as "#TITLE: line"
sub _add_archive_comment {
    my ( $title, $comment, $repro_fh ) = @_;
    if ( defined $comment ) {
        my @comment_lines = split /\n/, $comment;
        say $repro_fh "#$title: $_" for @comment_lines;
    }
}

sub _validate_prog_name {
    my ( $old_prog, $prog, @args ) = @_;
    die <<EOF if $old_prog ne $prog;
Current ($prog) and archived ($old_prog) program names don't match!
If this was expected (e.g., filename was changed), please re-run as:

    perl $prog @args

EOF
}

1;

Re: Improving reproducibility and record-keeping with Log::Reproducible
by zentara (Archbishop) on Mar 25, 2014 at 10:41 UTC
    Hi, nice work, but I think you should add Git, somewhere in the node title. I just wanted to mention that I think I saw choroba mention in the chatterbox that he had developed some way of generating a color-coded output of all transactions on a Git node. I think maybe a Tk or Gtk3 interface would be a cool feature to add, to visualize it all.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      This is the tool I mentioned in the CB: Screencast on YouTube. I plan to write a Meditation on it once I have some more time.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Thanks. I'll take a look at it.
      Thanks. Went ahead and updated the node title.
Re: Improving reproducibility and record-keeping with Log::Reproducible
by DrHyde (Prior) on Mar 25, 2014 at 11:56 UTC
    Looks very useful.
