Clarity in parsing fixed length data.

rodion has asked for the wisdom of the Perl Monks concerning the following question:

Our small group of programmers will be parsing (and sometimes writing), a fair amount of fixed-width-field records, and we want to settle on a local convention for specifying the format string used by unpack(), as well as the array of field names for the record. We will do the unpacking into a hash slice with something like

my %test_vals;
@test_vals{ @TEST_FLD } = unpack $TEST_REC_PACK, $rec;
[download]

to load up the test_vals hash . Below is our first pass on specifying the two constants needed, $TEST_REC_PACK and @TEST_FLD.

I've looked at Data::FixedFormat, Text::FixedLength and Parse::FixedLength to see if they would simplify things, but they don't appear offer enough to justify the mental overhead of having to refer to documentation in a separate module. Initially at least, plain old Perl looks clearer.

Update: Note that the correspondence with the fat commas, "=>" is loading an array, rather than a hash. This is so that the field specs and field names are available in order. A hash would lose this order.

I have two questions, at different levels, about our first pass approach below.

Does this make sense to the monks as a clear way to present the spec? ("Looks fine to me." is just as useful a response as "I suggest you do this differently...")
Are there suggestions for what we use to split the array of fld-name => pack-spec pairs? Does the cleaner and more compartmentalized approach of the sub-call version outweigh the compactness of the one-line-ish idiom, or the other way round. Do you have something clearer to suggest? More elegant?

Your considered wisdom, advice and opinions are much appreciated.

my @TEST_FLD = ();
my $TEST_REC_PACK = q{}; 
my @TEST_REC_PACK_FLDS = (
   status => 'n  ',   #   0
   time   => 'n  ',   #   2
   date   => 'N  ',   #   4
   code   => 'a16',   #   8   key
   msid   => 'a10',   #  24   key);
my $TEST_REC_LEN =       34;

# split pack-spec & fld names (one-line-ish)
my $odd=0;
($odd^=1) ? push @TEST_FLD,$_ : ($TEST_REC_PACK.=$_) 
                                for (@TEST_REC_PACK_FLDS);

# split pack-spec & fld names (using a sub)
($TEST_REC_PACK,@TEST_FLD) = 
               Split_pack_flds_spec(@TEST_REC_PACK_FLDS);

sub Split_pack_flds_spec {
   my $spec;
   my @fld;
   my $odd = 0;
   for (@_) { 
      if ($odd = !$odd) {
         push @fld,$_;
      }
      else {
         $spec .= $_; 
      }
   }
   return ($spec,@fld);
}
[download]

Testing Code:

print 'fields    =  (',join(',',@TEST_FLD),")\n";
print "pack spec =  \"$TEST_REC_PACK\"\n";

my $rec = "\x{00}\x{7B}\x{06}\x{B3}\x{01}\x{32}\x{1A}\x{83}"
         .'code_value------msid_value';

my %test_vals;
@test_vals{ @TEST_FLD } = unpack $TEST_REC_PACK, $rec;

my ($key,$val);
while (($key,$val)=each(%test_vals)) {
   printf " %6s => %s\n",$key,$val;
}
[download]

Comment on Clarity in parsing fixed length data. Select or Download Code

Replies are listed 'Best First'.
Re: Clarity in parsing fixed length data. by ikegami (Patriarch) on Aug 10, 2006 at 22:11 UTC
Both of your alternatives (one-line-ish, and using a sub) are very hard to read, and require lots of duplicate code for each message type. Hide the gory details, leaving a clean interface (such as those below). This process is called encapsulation. A procedural interface: `my $TEST_REC_FORMAT = get_msg_format [ status => 'n ', # 0 time => 'n ', # 2 date => 'N ', # 4 code => 'a16', # 8 key msid => 'a10', # 24 key ]; my %test_vals = parse_msg($TEST_REC_FORMAT, $rec);` [download] An OO interface: `my $TEST_REC_FORMAT = MessageFormat->new([ status => 'n ', # 0 time => 'n ', # 2 date => 'N ', # 4 code => 'a16', # 8 key msid => 'a10', # 24 key ]); my %test_vals = $TEST_REC_FORMAT->parse($rec);` [download] What exactly `get_msg_format` or `new` returns is not important. How exactly it is calculate is not important either.	[reply] [d/l] [select]
Re^2: Clarity in parsing fixed length data. by runrig (Abbot) on Aug 10, 2006 at 22:17 UTC
It's so very hard to read. What's important is to hide the gory details and provide a neat interface And that's why I like Parse::FixedLength. The interface is very similar to what you have. I used to rewrite the same gory details over and over (similar to what the OP has) until they were stuffed into that module.	[reply]
Re^2: Clarity in parsing fixed length data. by rodion (Chaplain) on Aug 10, 2006 at 22:42 UTC
I take it the initial reaction then is that someone reading the code won't mind going elsewhere to understand what's going on, or won't need to do so because the function names are clear enough. (Or if they do mind a little, this is out-weighed by greater ease in writing the code.)	[reply]
Re^3: Clarity in parsing fixed length data. by GrandFather (Saint) on Aug 11, 2006 at 00:02 UTC
In general you should aim to localise the understanding of the code to the current context. By that I mean, the bit of code you are looking at now should be written so its purpose is clear without having to refer to other materials. Generally that means rewriting something like: `setOptions (1, 0, 'e');` [download] as: `setOptions (send_log => 1, suppress_warnings => 0, max_level => 'e');` [download] A little more verbose, but now (even without the context) it's pretty clear what the parameters do and what setOption is likely to achieve. Or to paraphrase a great man: make your code as clear as possible, but no clearer :). DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^3: Clarity in parsing fixed length data. by runrig (Abbot) on Aug 10, 2006 at 23:30 UTC
That's my opinion. Even if the function is simple, I still wouldn't copy it over and over, because I believe it's a distraction from the rest of the code. If you're going to go this way, stuff the function in in a module called OurCompany::FixedLengthUtils or something, and then no one has to think about how it works, they can just use it, or look up the docs for it. And, I would probably create a separate module/class for each format especially if you'll be reusing the same formats in different scripts (which Parse-FixedLength has instructions on doing -- disclaimer: I'm the author :-)	[reply]
Re^4: Clarity in parsing fixed length data. by rodion (Chaplain) on Aug 13, 2006 at 11:49 UTC
Re^5: Clarity in parsing fixed length data. by runrig (Abbot) on Aug 16, 2006 at 19:19 UTC
Re: Clarity in parsing fixed length data. by hv (Prior) on Aug 11, 2006 at 10:52 UTC
I find unclear the way the two lists are interleaved. In a style similar to that hinted at in GrandFather's comment, I'd tend to write this as list of hashrefs: `my @TEST_REC_PACK_FIELDS = ( { name => 'status', data => 'n' }, { name => 'time', data => 'n' }, { name => 'date', data => 'N' }, { name => 'code', data => 'a16' }, { name => 'msid', data => 'a10' }, );` [download] At this point you can just as easily construct the list of fieldnames and the unpack string, but it is more self-documenting, and you can easily extend the structures later if you need to start adding validation callbacks and the like. Hope this helps, Hugo	[reply] [d/l]
Re: Clarity in parsing fixed length data. by imp (Priest) on Aug 10, 2006 at 22:59 UTC
I would recommend using one of the existing modules for this job. It isn't a large amount of work to duplicate the functionality, but the time it takes to document your new module and the time it takes to put together a comprehensive test suite can be substantial. Someone else has already done this work, so it would be better to use their existing code and test suite. They might have thought of rare cases that you would have missed. And if you know of a case they don't test for, I'm sure the maintainer would appreciate a well written test case for it.	[reply]
Re: Clarity in parsing fixed length data. by Ieronim (Friar) on Aug 11, 2006 at 08:59 UTC
As i can see, your problem can be limited to finding a better way to create and document 'unpack' templates. I myself would do it in an OO way (let's call the class for format templates 'Unpack::Format'): `my $format = Unpack::Format->new( spec =>[ status => 'n', # 0 time => 'n', # 2 date => 'N', # 4 code => 'a16', # 8 key msid => 'a10', # 24 key ] ); #overloaded @test_vals{ @$format } = unpack "$format", $rec; # pure OO style @test-vals{$format->fieldnames) = unpack $format->templatestr, $rec;` [download] It's easy to implement, and i think it's a much more clear way to create templates than using constants for field names and unpack string. UPD: according to Hofmator's note, name-format pairs must be passed as an array reference. It seems to be a very easy typo to make :) UPDATE 2: Working implementation (tested) Read more... (2 kB) `s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print`	[reply] [d/l] [select]
Re^2: Clarity in parsing fixed length data. by Hofmator (Curate) on Aug 11, 2006 at 11:17 UTC
In the way it is written, your solution suffers from the same problems as BrowserUK's solution: The order of the fields gets lost by using a hashref as the spec-argument to `Unpack::Format::new`. -- Hofmator	[reply] [d/l]
Re: Clarity in parsing fixed length data. by BrowserUk (Patriarch) on Aug 10, 2006 at 22:01 UTC
Update: As pointed out below, this doesn't work! Sorry, A rush of blood I think. If you used a hash to contain your spec: `my %TEST_REC_PACK_FLDS = { status => 'n ', # 0 time => 'n ', # 2 date => 'N ', # 4 code => 'a16', # 8 key msid => 'a10', # 24 key };` [download] Then your field names becomes just `use constant @TEST_FLDS => keys %TEST_REC_PACK_FLDS;` [download] and you build your format string using `use constant TEST_REC_PACK => $join ' ', values %TEST_REC_PACK_FLDS;` [download] You might want to uppercase the field names. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Clarity in parsing fixed length data. by runrig (Abbot) on Aug 10, 2006 at 22:12 UTC
If you need your keys to be in a certain order, then I wouldn't use a hash in this way (unless it's one of those tied hash modules). An array is the appropriate thing here.	[reply]
Re^2: Clarity in parsing fixed length data. by ikegami (Patriarch) on Aug 10, 2006 at 22:12 UTC
Field order is lost by using a hash.	[reply]
Re^2: Clarity in parsing fixed length data. by rodion (Chaplain) on Aug 10, 2006 at 22:18 UTC
Thanks, I meant to mention that the order of the fields is important for the list of field names. (I've added an update mentioning that.) I considered using tied hashes that preserve order, but thought that would be less clear. If I just needed to preserve the order of the unpack spec, I could also use the "@0 n @2 n @4 N @8 a16 ..." form of unpack() specs, but the fieldname order is also useful to us.	[reply]


Do you know where your variables are?
	PerlMonks